Monday, August 29, 2011

Power outages: A server's worst nightmare

So you have a bunch of servers humming away, and suddenly the power goes out. No problem: the UPS takes over, and when the battery starts getting low, the VMs are sent an ACPI shutdown. Once they are all down, the host is sent an ACPI shutdown of its own, and everything powers off gracefully.
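For the curious, the whole low-battery sequence boils down to something like this. This is only a sketch assuming the guests run under libvirt (virsh); the timeout and poll interval are illustrative, not my exact setup:

```shell
#!/bin/sh
# Sketch of the low-battery shutdown sequence, assuming libvirt guests.
# TIMEOUT and POLL are illustrative values, not from my actual config.
TIMEOUT=300   # max seconds to wait for the guests
POLL=5        # seconds between checks

if ! command -v virsh >/dev/null 2>&1; then
    echo "virsh not found; nothing to shut down"
else
    # 1. Send every running guest an ACPI shutdown request.
    for vm in $(virsh list --name); do
        virsh shutdown "$vm"
    done

    # 2. Wait until no guests remain running (or the timeout expires).
    elapsed=0
    while [ -n "$(virsh list --name)" ] && [ "$elapsed" -lt "$TIMEOUT" ]; do
        sleep "$POLL"
        elapsed=$((elapsed + POLL))
    done

    # 3. Gracefully shut down the host itself.
    shutdown -h now
fi
```

Most UPS daemons (apcupsd, NUT) can run a script like this from their low-battery event.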

Later, the power comes back. I personally like setting my machines to stay off so I can turn them on manually. This prevents rapid power cycling if the power keeps going on and off while the UPS battery is depleted, which is common when the cause of the outage is a loose fuse on a pole: it may take the linemen a couple of tries to get it to click in properly... that long pole can't be easy to maneuver! Anyway, once the power is fully restored, I come home and turn everything on.

This is where the problems begin. When electronic equipment is always on, at some point it's almost as if it gets "used to it" and does not like being shut down and turned on again. I'm guessing the cooling of the components causes issues with the solder or whatnot. So the server boots up fine, and I start all the VMs. Not 5 minutes in, BANG, one drive falls out of the RAID 5 array. I check the serial number against my documentation to see which bay that drive is in so I can go swap it with a replacement (I don't have a fancy setup where a light goes on when a drive fails). BANG! Another drive drops out. Game over, the RAID array is toast.
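Without failure LEDs, matching a failed drive's serial number to a bay means listing what each port sees. Roughly like this, assuming smartmontools is installed; the device names are illustrative:

```shell
#!/bin/sh
# Print each disk's device name and serial number so a failed array member
# can be matched against a bay-to-serial list kept in documentation.
# Assumes smartmontools (smartctl) is installed; needs root to read drives.
for dev in /dev/sd[a-z]; do
    [ -e "$dev" ] || continue
    serial=$(smartctl -i "$dev" 2>/dev/null \
        | awk -F': *' '/Serial Number/ {print $2}')
    echo "$dev  ${serial:-unknown}"
done
echo "done listing drives"
```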

After hours of playing around, I actually manage to get the RAID array going again. Some data got corrupted and had to be restored from backups, but I did get it back without going 100% to backups. BANG, a drive drops again. This time, for fun, I decide to insert it into another slot and let it rebuild. So far so good, but I can't trust this. My suspicion is the backplane, the cable, or the SATA controller. I don't have time to do micro-troubleshooting like this... I just want it to work, period. So I order two new backplanes (one is still backordered...) and 4 new drives (2 of the ones in there were already less than a week old). I can get everything working with just one backplane and one loose drive, so I go with that for now. Then I slowly start removing the Hitachi drives, replacing them with the new WD Blacks and letting the array rebuild in between. No issues. I suspect maybe there is something up with those Hitachis, so this is just a shot in the dark. Ironically, I had just finished RMAing 5 of them, and they came in the mail the same day as the backplanes. But those particular drives were not showing any signs of issues, so that was precautionary.
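Each swap-and-rebuild cycle looks roughly like this, assuming a Linux software RAID array managed by mdadm; the array and partition names are illustrative, and a hardware controller would use its own tools instead:

```shell
#!/bin/sh
# Sketch of one swap-and-rebuild step, assuming Linux software RAID (mdadm).
# /dev/md0 and /dev/sdc1 are illustrative names, not my actual layout.
if [ -b /dev/md0 ]; then
    # Mark the old drive as failed and pull it from the array...
    mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1
    # ...physically swap the disk, then add the new one and let it rebuild:
    mdadm /dev/md0 --add /dev/sdc1
    cat /proc/mdstat    # shows rebuild progress
else
    echo "no /dev/md0 on this machine; steps shown for illustration only"
fi
```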

So that's done. Brand new backplane and drives. So far so good... then I start getting errors like this:

INFO: task pdflush:14763 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
pdflush D ffff8801dac8fa80 0 14763 2
ffff8800048c9da0 0000000000000046 ffff8800048c9d00 ffffffff8101686f
ffffffff8162a500 ffffffff8162a500 ffff88020f12adc0 ffff8801ff4496e0
ffff88020f12b108 000000020dc05dc6 ffff880028062570 ffff88020f12b108
Call Trace:
[] ? read_tsc+0xe/0x24
[] ? __dequeue_entity+0x61/0x6a
[] ? __switch_to+0x1b0/0x3e0
[] __down_read+0xa3/0xbd
[] down_read+0x2a/0x2e
[] sync_supers+0x4a/0xc4
[] wb_kupdate+0x35/0x119
[] pdflush+0x16e/0x231
[] ? wb_kupdate+0x0/0x119
[] ? pdflush+0x0/0x231
[] ? pdflush+0x0/0x231
[] kthread+0x49/0x76
[] child_rip+0xa/0x11
[] ? restore_args+0x0/0x30
[] ? kthread+0x0/0x76
[] ? child_rip+0x0/0x11

VERY bad. At least, it looks pretty bad. The task that hangs is random each time, and these errors can go on for pages. When it happens, all the VMs crash with I/O errors.
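For context, that message comes from the kernel's hung-task watchdog: it fires when a task sits in uninterruptible (D) sleep, here stuck waiting on disk I/O, for longer than 120 seconds. The knob the message mentions is a plain sysctl; a sketch (run as root):

```shell
# Read the current hung-task timeout (120 seconds by default):
cat /proc/sys/kernel/hung_task_timeout_secs

# Writing 0 disables the warning, as the message itself says. Note that
# this only silences the report; it does nothing about the stalled I/O.
echo 0 > /proc/sys/kernel/hung_task_timeout_secs
```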

Today I ordered a new controller, as that's the only thing that has not been changed short of the motherboard... and with sockets changing every couple of weeks, it seems, if I get a new mobo I'll need a new CPU as well, and I may as well go with DDR3 and go from 8GB to 16GB. This server is really getting expensive.

Reproducing this error is nearly impossible as well. I can give the hard drives a run for their money and they still perform fine, and it won't happen; then I let the machine idle, and it may or may not happen for days. What a pain! It seems to always happen overnight, though there have been a few instances where it happened right in front of my eyes. Basically, when it does, everything just locks up for a good minute or so.
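By "a run for their money" I mean sustained load along these lines. This is only a sketch with an illustrative path, size, and pass count; a serious stress run would use something like fio or bonnie++ instead:

```shell
#!/bin/sh
# Sketch of a sequential-write hammering pass; path, size, and pass count
# are illustrative. conv=fsync forces the data out to the device each pass.
TARGET=/tmp/stress.bin
for pass in 1 2 3; do
    dd if=/dev/zero of="$TARGET" bs=1M count=64 conv=fsync 2>/dev/null
    echo "pass $pass complete"
done
rm -f "$TARGET"
```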

This is why my next server will be prebuilt. I'm looking at a Supermicro or even a Dell. Then again, even prebuilts can have their fair share of problems, and parts are even harder to get. So it's a battle between building a whole new server and getting a prebuilt, but either way, I think this thing is on its last legs and in dire need of replacement. I am hoping the new controller will do the trick, however.
