HP DL380 G7 won't power up after a power cord removal/power loss

Greetings,

     If you came here, you are probably seeing what I saw: a once-healthy DL380 G7 that, after its power cord was removed, sits with an orange power LED and does not react to the power button, or the same server with a red health light (blinking or solid) and a bizarre combination of orange LEDs on the health panel indicating various nonexistent failures (e.g. CPU failure, fan solution not sufficient, DIMM error, etc.).
     Likewise, if you're here, you've likely read all the musings about this same situation on the HPE forums, Server Fault and Stack Exchange, and none of them helped your recovery. You might even have contacted HPE to get a solid quote on a motherboard replacement. Been there, seen that. In fact, we almost trashed one of our own servers, but then we decided to do some deep and thorough troubleshooting.
     Now, before you decide to trash your server(s) or turn to HPE services, there's a small but important step you might have missed, because it's not included in all those cheat sheets and service diagrams. What you really need to do is remove the PSUs, then remove the CMOS battery and check it with a multimeter. If it reads anything less than 2.9 volts, trash that battery and get a new one.
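     If you are checking a pile of batteries, the rule above is easy to write down. This is only a minimal sketch of the 2.9 V rule of thumb from this post; the function name and the sample readings are made up for illustration, and the readings still have to come from your own multimeter.

```python
# Threshold from this post: a CMOS battery reading below 2.9 V gets replaced.
REPLACE_BELOW_VOLTS = 2.9


def battery_verdict(measured_volts: float) -> str:
    """Return 'replace' or 'ok' for a multimeter reading of the CMOS battery.

    Hypothetical helper: it only encodes the rule of thumb from the post.
    """
    return "replace" if measured_volts < REPLACE_BELOW_VOLTS else "ok"


if __name__ == "__main__":
    # Illustrative readings in volts.
    for reading in (3.05, 2.9, 1.9, 0.35):
        print(f"{reading:.2f} V -> {battery_verdict(reading)}")
```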
    I'm pretty sure that if you do so, your server will power back up, provided the battery has been out for more than a minute and is replaced with a fresh one. However, even if it does, you're not completely safe yet. What you need to do next is:

  1. Clear the NVRAM: remove the PSUs, move DIP switch #6 on the System Maintenance Switch to the On position, reinsert the PSUs and power the server on until the reset completes (this is indicated on screen), then return the System Maintenance Switch to its operating (all off) state.
  2. At your convenience, repeat the same reset sequence with DIP switches #1, #5 and #6 set to On.
  3. Apply the iLO3 firmware update to version 1.65 and pay attention to this HPE Advisory
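     The two reset passes above differ only in which switch positions are set to On, so they can be printed as a checklist to keep next to the server. This is a rough sketch, not an HPE tool: the names `RESET_PASSES` and `checklist_for` are made up, and removing AC power again before flipping the switches back is our assumption, since the steps above do not spell it out.

```python
from typing import FrozenSet, List

# Switch positions set to On for each pass, as listed in the steps above.
RESET_PASSES = [
    ("NVRAM clear", frozenset({6})),
    ("second pass, at your convenience", frozenset({1, 5, 6})),
]


def checklist_for(switches_on: FrozenSet[int]) -> List[str]:
    """Build the manual steps for one reset pass (hypothetical helper)."""
    on = ", ".join(f"#{n}" for n in sorted(switches_on))
    return [
        "Remove the PSUs.",
        f"Set System Maintenance Switch position(s) {on} to On.",
        "Reinsert the PSUs and power the server on.",
        "Wait for the on-screen message that the reset has completed.",
        # Assumption: AC power is removed again before touching the switches.
        "Remove AC power again.",
        "Return all switch positions to Off (normal operation).",
    ]


if __name__ == "__main__":
    for name, switches in RESET_PASSES:
        print(f"== {name} ==")
        for number, step in enumerate(checklist_for(switches), start=1):
            print(f"  {number}. {step}")
```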

   OK, I'm sure that by now you have questions as to why this happens and how we came up with the above. So read on:

     It's pretty simple and at the same time quite complicated. To make a long story short, it all boils down to NVRAM corruption that happens when AC power is removed and the battery is not in good shape. According to the above-mentioned advisory, iLO3 attempts to record an event when this happens and, with insufficient power, it writes garbage to random memory addresses.
   If you did not reset your NVRAM after replacing the battery, you might see strange characters (e.g. umlauts) in BIOS setup in place of the server's serial number, model, inventory tags, or even MAC addresses, and you might experience random automatic power-up delays of up to 15 minutes. All of these have the same cause and are completely cured by a good battery and an NVRAM reset (however, you may need to re-enter your serial number and server model manually).
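   If you collect these identity fields from several servers (by hand or from an inventory export), a quick sanity check can flag the garbled ones. This is a minimal sketch under our own assumptions: that a healthy serial number or model string is short printable ASCII, and that MAC addresses are written as six colon- or dash-separated hex pairs. The function names and sample values are hypothetical.

```python
import re

# Assumption: healthy identity strings are 1-32 characters of plain ASCII
# letters, digits, spaces and a few punctuation marks.
_PLAUSIBLE_ID = re.compile(r"[A-Za-z0-9 ._\-]{1,32}")
# Assumption: MAC addresses are written like 00:1B:44:11:3A:B7.
_MAC = re.compile(r"(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}")


def id_looks_corrupted(value: str) -> bool:
    """True if a serial/model/tag string does not look like plain ASCII."""
    return _PLAUSIBLE_ID.fullmatch(value.strip()) is None


def mac_looks_corrupted(value: str) -> bool:
    """True if the string is not a well-formed MAC address."""
    return _MAC.fullmatch(value.strip()) is None


if __name__ == "__main__":
    # Made-up sample values for illustration only.
    samples = {"serial": "CZ1234ABCD", "model": "\u00fc\u00f6\u00e4\u00ff\u00ff"}
    for field, value in samples.items():
        state = "corrupted?" if id_looks_corrupted(value) else "looks fine"
        print(f"{field}: {value!r} -> {state}")
```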
   Initially, when we encountered this situation, we thought the motherboard was fried, specifically because we had a UPS failure during the power surge that led to this condition. What we had was a red health light and an orange CPU #1 light, and nothing we tried, including removing and reinserting the same CMOS battery, helped. Upon applying AC power, the server just sat there with these LEDs lit, making no attempt to power up. To add more suspense, for about a year before this happened the server had been powering up problematically after AC power loss, with random delays. So we thought it was the mobo, but we were wrong.
   We replaced the server and let it lie in our storage for a few months. Then, during a routine check, we discovered a sudden change in its behavior: when power was applied, it actually attempted to power up, which we hadn't seen before. That is where our quest started.
   With all the CPUs and memory installed, the server surprisingly powered up and let us roll out Linux and switch it on and off without a single glitch. That is, until I decided to remove and reinsert the power cord: then I saw the same red health LED and CPU #1 failure.
   Next, after some deep and gruesome considerations, I decided to check the battery and was surprised to find it reading 1.9 volts. Left on the table, that battery quickly degraded to 0.35 volts within a couple of days. We replaced the battery, reset the CMOS and haven't seen this problem since.
    To further test these results we embarked on a rough and crazy testing routine: we loaded an OS, set the server to auto power-up with no delay, let it run, and then, every day, whenever someone was near the server, we abruptly pulled the power cord and plugged it back in. We did this with two servers in the office for about a month; I think that's a fair trial. No signs of any problem.
    Next, we put the bad battery back in and did the same. On the third or fourth attempt we saw "Fan solution not sufficient" (on a single-CPU system, mind you), continuous reboots and finally a red light with a CPU error. Nice thing, right? The same problem was fully reproduced on the second server as well.
    Now, if you've read this far, here's one final issue we were unable to solve: if you remove AC power and then reapply it within a second or two, the red health light will blink, the orange PSU failure LED will light up, and there is no way to recover other than removing AC power completely for at least 10 seconds. We tested this with four different PSUs on two G7 servers and were able to reproduce it almost 100% of the time. However, if you're on any decent UPS, this should not be a problem.
