Tuesday, August 30, 2011

Gimme my new job!

I got accepted for a new position on April 14, 2011... it's now almost September 1st. The IT manager here is being more and more of an asshole with me lately. Get me out of this place! The beauty of the new job is it won't be customer facing. Sooooo looking forward to it.

Today he starts freaking out because a user could not log in to their mail, so I check into it. Turns out the account was on a list to be deleted, so I restored it. He starts flipping out that I did not contact him. Well... if he had used the proper process of putting in a ticket instead of just barging in here like a mental, then maybe I would have had a callback number. Then he starts flipping out at me for not putting the contact information in the account. Whoa, whoa there, asshole, I'm not even the one who created that account, and it's not my fault HR does not provide the information half the time. DIAF. I so wish I could tell him off sometimes. He can be an asshole towards us, and we just have to take it up the ass. Sick of this. It's sad how people like him can have enough power to get away with this verbal abuse.

More on Hedge Clearing

This was done a while back, but thought I'd post. The stumps have been removed:

A bigger job than it seems, to do without machinery. Used an axe and shovel to rip 'em right out. Grass will eventually be grown there, but that will wait till next year, after the weeping tiles are done.

Monday, August 29, 2011

Power outages: A server's worst nightmare

So you have a bunch of servers humming away, then suddenly, the power goes out. No problem, the UPS takes over, and when the battery starts getting low, the VMs are sent an ACPI shutdown. Once they are all shut down, the host is then sent an ACPI shutdown, and everything is gracefully shut down.
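Here's a rough sketch of that shutdown ordering, just to make the sequence concrete. It is not what my UPS software actually runs: the function name, the battery threshold, and the `shutdown_vm` / `shutdown_host` callbacks are all made up for illustration, and holding the host up when a VM fails to stop is my own assumption.

```python
from typing import Callable, Iterable, List


def graceful_shutdown(
    battery_percent: float,
    threshold_percent: float,
    running_vms: Iterable[str],
    shutdown_vm: Callable[[str], bool],
    shutdown_host: Callable[[], None],
) -> List[str]:
    """Shut down VMs first, then the host, once the battery is low.

    All names here are hypothetical. shutdown_vm should return True
    when the VM has actually stopped. Returns the ordered list of
    actions taken, which is handy for logging.
    """
    actions: List[str] = []

    # Battery still healthy: do nothing.
    if battery_percent > threshold_percent:
        return actions

    all_down = True
    for vm in running_vms:
        if shutdown_vm(vm):
            actions.append(f"vm:{vm}")
        else:
            all_down = False
            actions.append(f"vm-failed:{vm}")

    # Assumption: only take the host down once every VM is confirmed off.
    if all_down:
        shutdown_host()
        actions.append("host")
    return actions
```

In real life you would probably want a timeout so that one stuck VM can't keep the host up until the battery dies.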

Later, the power comes back up. I personally like setting my machines to stay off, and I will manually turn them on. This prevents fast on/off cycling if the power is repeatedly coming on and off and the battery is depleted. This is common when the cause of the outage is a loose fuse on a pole, as it may take a couple of tries for the linemen to get it to click in properly... that long pole can't be easy to maneuver! Anyway, I come home once the power is fully restored, and I turn everything on.

This is where the problems begin. When electronic equipment is always on, at some point it's almost like it gets "used to it" and does not like to be shut down then turned on again. I guess the cooling of the components causes issues with the solder or whatnot. So the server boots up fine, and I then start all the VMs. Not 5 minutes into booting up, BANG, one drive falls out of the RAID 5 array. I check the serial number against my documentation to see which bay that drive is in so I can go swap it with a replacement (I don't have a fancy setup where a light goes on when a drive fails). BANG! Another drive drops out. Game over, the RAID array is toast.
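For what it's worth, the serial-to-bay documentation can be as simple as a lookup table. This is only a sketch of the idea, not my actual notes: the serial numbers below are made-up placeholders, and the helper names are hypothetical.

```python
from typing import Dict, Optional


def normalize(serial: str) -> str:
    """Strip whitespace and uppercase so lookups are forgiving."""
    return serial.strip().upper()


def build_bay_map(entries: Dict[str, int]) -> Dict[str, int]:
    """Build a serial -> bay table with normalized keys."""
    return {normalize(serial): bay for serial, bay in entries.items()}


def bay_for_serial(bay_map: Dict[str, int], serial: str) -> Optional[int]:
    """Return the bay number for a drive serial, or None if undocumented."""
    return bay_map.get(normalize(serial))


# Placeholder serials, for illustration only.
BAYS = build_bay_map({"EXAMPLE-SERIAL-A": 0, "example-serial-b": 1})
```

With that, `bay_for_serial(BAYS, " example-serial-a ")` points at bay 0, and an undocumented serial comes back as `None` instead of a guess.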

After hours of playing around, I actually manage to get the RAID array going again. Some data got corrupted and had to be restored from backups, but I did get it back without going 100% to backups. BANG, a drive drops again. This time, for fun, I decide to insert it into another slot and let it rebuild. So far so good, but I can't trust this. My suspicion is the backplane, cable, or SATA controller. I don't have time to do micro troubleshooting like this... I just want it to work, period. So I order two new backplanes (one is still backordered...) and 4 new drives (2 in there were already less than a week old). I can manage to get everything to work with just one backplane and just one loose drive, so I go with that for now. Then I slowly start removing the Hitachi drives and replacing them with the new WD Blacks, letting it rebuild in between. No issues. I suspect maybe there is something up with those Hitachis, so this is just a shot in the dark. Ironically, I had just finished RMAing 5 of them, and the replacements came in the mail the same day as the backplanes. But these particular drives were not showing any signs of issues, so it was precautionary.

So that's done. Brand new backplane and drives. So far so good... then I start getting errors like this:

INFO: task pdflush:14763 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
pdflush D ffff8801dac8fa80 0 14763 2
ffff8800048c9da0 0000000000000046 ffff8800048c9d00 ffffffff8101686f
ffffffff8162a500 ffffffff8162a500 ffff88020f12adc0 ffff8801ff4496e0
ffff88020f12b108 000000020dc05dc6 ffff880028062570 ffff88020f12b108
Call Trace:
[] ? read_tsc+0xe/0x24
[] ? __dequeue_entity+0x61/0x6a
[] ? __switch_to+0x1b0/0x3e0
[] __down_read+0xa3/0xbd
[] down_read+0x2a/0x2e
[] sync_supers+0x4a/0xc4
[] wb_kupdate+0x35/0x119
[] pdflush+0x16e/0x231
[] ? wb_kupdate+0x0/0x119
[] ? pdflush+0x0/0x231
[] ? pdflush+0x0/0x231
[] kthread+0x49/0x76
[] child_rip+0xa/0x11
[] ? restore_args+0x0/0x30
[] ? kthread+0x0/0x76
[] ? child_rip+0x0/0x11

VERY bad. At least, it looks pretty bad. The task that hangs is random each time, and these errors can go on for pages. When it happens, all the VMs crash with IO errors.
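Since the hung task changes every time, one way to get a handle on it is to tally which tasks show up in the log. Below is a small sketch that only looks for the "blocked for more than N seconds" line shown above. The function and pattern names are my own, and it assumes you have already dumped the kernel log to a list of lines.

```python
import re
from collections import Counter
from typing import Iterable

# Matches lines like:
#   INFO: task pdflush:14763 blocked for more than 120 seconds.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def count_hung_tasks(lines: Iterable[str]) -> Counter:
    """Count hung-task reports per task name, ignoring everything else."""
    counts: Counter = Counter()
    for line in lines:
        match = HUNG_TASK_RE.search(line)
        if match:
            counts[match.group("name")] += 1
    return counts
```

If the counts are spread across unrelated tasks rather than piling up on one, that would fit the theory that the storage underneath is stalling and not any one process misbehaving.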

Today I ordered a new controller, as that's the only thing that has not been changed, short of the motherboard. With sockets changing every couple of weeks, it seems that if I get a new mobo I'll need a new CPU as well, and I may as well go with DDR3 and go from 8GB to 16GB. This server is really getting expensive.

Reproducing this error is near impossible as well. I can give the hard drives a run for their money and they still perform fine and it won't happen; then I let it idle, and it may or may not happen for days. What a pain! It seems to always happen overnight, though there have been a few instances where it happened right in front of my eyes. Basically, when it does, everything just locks up for a good minute or so.

This is why my next server will be prebuilt. I'm looking at a Supermicro or even a Dell. Then again, even prebuilts can have their fair share of problems, and parts are even harder to get. So it's a battle between building a whole new server or getting a prebuilt, but either way, I think this thing is on its last legs and in dire need of replacement. I am hoping the new controller will do the trick, however.