Saturday, December 13, 2008

Uptime: It's not rocket science, volume 2

Ok, another rant about uptime. More specifically, server/application maintenance planning.

I see this all the time. Someone decides to perform an upgrade to a system, be it a forum, game server, or any other service. Great, new stuff. Cool.

But WHY should any type of upgrade require the service to be down while the upgrade is being performed? I've seen sites, or parts of sites, go down for MONTHS for an upgrade. I can understand a data transition phase, but that should not take more than a few minutes in most cases, if done properly. Anything that requires no data transition phase should be instant, or if it requires a recompile, should only take a few minutes depending on how long it takes to compile.

How it should be done
I run several websites and a game server. NEVER have I taken any of those services down for an upgrade. I do the upgrade on a local test server; when it's ready to go, I upload it all in one shot, replacing the old site. Someone can be on the site, click on a link, BOOM, they see the new site and everything is functional. At worst, they may get a PHP or other error if it's not done uploading. Now sure, it happens that something won't work out as expected, but this should not mean the entire service is down. (ex: uploaded wrong config file so paths are wrong - just upload the right one and good to go)
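One way to make that "one shot" replacement truly seamless is a symlink swap. This is just a rough sketch of the idea, not exactly my setup: the paths and the rsync call are made up, and it assumes the webroot is a symlink rather than a real directory.

```python
#!/usr/bin/env python3
# Sketch only: upload a fully tested build beside the live site, then swap a
# symlink so visitors never see a half-uploaded mix. Paths are hypothetical
# and the webroot (/var/www/current) is assumed to be a symlink.
import os
import subprocess
import time

RELEASES = "/var/www/releases"   # each deploy gets its own folder
CURRENT = "/var/www/current"     # the webroot the web server actually serves

def deploy(local_build_dir: str) -> None:
    release_dir = os.path.join(RELEASES, time.strftime("%Y%m%d-%H%M%S"))

    # 1. Copy the tested build next to (not over) the live site.
    subprocess.run(["rsync", "-a", local_build_dir + "/", release_dir + "/"],
                   check=True)

    # 2. Point a temporary symlink at the new release, then rename it over the
    #    old link. rename() is atomic, so a request sees either the old site
    #    or the new one, never a partial upload.
    tmp_link = CURRENT + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(release_dir, tmp_link)
    os.replace(tmp_link, CURRENT)

if __name__ == "__main__":
    deploy("/home/me/sites/mysite/build")
```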

As for my game server, Age of Valor, it is scheduled to reboot every day at 8:00 and comes back up after a few minutes. This does two things: 1: it clears any possibly leaked memory, and 2: it performs any required upgrades. So the server gets a few minutes of downtime per day, and that's constant and expected by players; they know it's coming and know it will be back up very shortly.
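The restart-and-patch idea is simple enough to sketch. Something like this, where the service commands, paths, and the "pending patch" folder are all made up for illustration (the real server uses its own scheduler and scripts):

```python
#!/usr/bin/env python3
# Sketch of the daily restart: stop the service, apply anything waiting in a
# pending-patch folder, start it back up. All names and paths are hypothetical.
import shutil
import subprocess
from pathlib import Path

PATCH_DIR = Path("/srv/game/pending_patch")   # patches get uploaded here ahead of time
SERVER_DIR = Path("/srv/game/live")           # the running server's folder

def daily_restart() -> None:
    # Stop the game service (placeholder command).
    subprocess.run(["service", "gameserver", "stop"], check=True)

    # Apply any patch that was uploaded while the server was running.
    if PATCH_DIR.exists():
        shutil.copytree(PATCH_DIR, SERVER_DIR, dirs_exist_ok=True)
        shutil.rmtree(PATCH_DIR)

    # Start it back up; total downtime is just these few steps.
    subprocess.run(["service", "gameserver", "start"], check=True)

if __name__ == "__main__":
    daily_restart()
```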

So say I do a major change to the game server, such as totally revamping a core system, or something simple like adding a new monster. I do it all locally, test it, validate it, and test it some more. Once it's ready, I do a patch and upload it to the proper folder on the live game server (which is running normally during this whole time). When it reboots it applies the patch, et voila! A server upgrade performed with no irregular downtime. Well, scripts have to recompile, so add maybe a minute or two, but if compiles took longer, I could easily arrange things so I don't even have to have that happen.

What about data migration
Ah yes, glad you asked. Yes, that is a bit trickier.

Basically, say you have a HUGE SQL database that needs to be migrated to work with a brand new system. You can't really have the migration happen while it's live or it may cause problems. BUT, the system that USES the data CAN be built while the server is live. Use test data (such as a backup copy of the database from live) to do testing and coding locally.

Once it's ready to go, here's what you do. If it takes less than 5 minutes, I say just go ahead and do it. In most cases a forum site or whatnot should not take that long.

Now if it takes more than that, say a few hours, one option is to set the old system in a read-only mode so people can at least view the data, and put a notice up that it's being upgraded. Then perform the data migration.
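For a MySQL-backed site, "read-only mode" can be as simple as flipping the server's read_only variable around the migration. A rough sketch, where the connection details are placeholders and the account is assumed to have the privileges to change global variables:

```python
# Sketch: put MySQL in read-only mode for the duration of a migration using
# the standard read_only server variable. Credentials are placeholders.
import mysql.connector

def migrate_with_readonly(run_migration) -> None:
    conn = mysql.connector.connect(host="localhost", user="admin",
                                   password="secret", database="forum")
    cur = conn.cursor()
    try:
        # Normal users can still read, so the site stays up in "view only"
        # mode while the migration runs.
        cur.execute("SET GLOBAL read_only = 1")
        run_migration(conn)
    finally:
        # Always switch writes back on, even if the migration blew up.
        cur.execute("SET GLOBAL read_only = 0")
        cur.close()
        conn.close()
```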

For stuff where read-only does not apply (such as a game server), downtime may be the only way to go, but this is strictly for migrating the data over - the actual system should be done already. I really can't picture any data migration taking too long if it's properly done. MySQL is very fast; even flat files can be fast. A forum with millions of topics being converted to a different table format should take less than an hour, I would imagine.
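Done in chunks, even millions of rows move along at a steady, predictable pace. A sketch of the idea; the table and column names are invented, and a real forum's schema would obviously differ:

```python
# Sketch of a table-format conversion done in batches: copy rows from the old
# table into the new one a chunk at a time. Names are invented for illustration.
import mysql.connector

BATCH = 10_000

def convert_topics(conn) -> None:
    cur = conn.cursor()
    last_id = 0
    while True:
        # Grab the next chunk of old-format rows, ordered by primary key.
        cur.execute("SELECT id, title, body FROM topics_old "
                    "WHERE id > %s ORDER BY id LIMIT %s", (last_id, BATCH))
        rows = cur.fetchall()
        if not rows:
            break
        # Write them into the new-format table and commit the chunk.
        cur.executemany("INSERT INTO topics_new (id, title, body) "
                        "VALUES (%s, %s, %s)", rows)
        conn.commit()
        last_id = rows[-1][0]

if __name__ == "__main__":
    convert_topics(mysql.connector.connect(host="localhost", user="admin",
                                           password="secret", database="forum"))
```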

But out of all the instances where I've seen a service go down, from a forum to a game server, there was no data migration required; it's just a system change, and the data is not changed in any way. Those types of upgrades should be instant.

What about a physical server move?
Ok, you really got me now; maybe there IS a reason for a service to be down.... NOPE. There are ways to prevent downtime even in a server migration.

1: Get new server ready to go, with old data from the existing server's system.
2: Test it very well
3: Since we're talking servers the key here is DNS. Set TTL very low on server's zone (do this a few days in advance)
4: Put the old server's system in read-only mode, if it applies
5: Move the data to the new server (this really depends on the connection and amount of data, but should not take THAT long - in the case of MySQL: dump the db, compress to tar.gz, transfer over, uncompress, apply to the new db - see the sketch after this list)
6: Switch the DNS IP to the new server, remove the system from the old server, and leave a placeholder page saying that DNS will propagate shortly, possibly instructing users to put something in their hosts file to force the DNS change for them.
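Step 5 is worth scripting ahead of time so the window stays as small as possible. Here's a sketch of the dump/compress/transfer/load sequence; the host names and database name are placeholders, and it assumes SSH access between the machines plus MySQL credentials already configured on both ends:

```python
# Sketch of step 5: dump the database on the old box, compress it, ship it to
# the new box, and load it there. Host names and the db name are placeholders.
import subprocess

OLD = "old.example.com"
NEW = "new.example.com"

def move_database(db: str) -> None:
    # Dump and compress on the old server.
    subprocess.run(["ssh", OLD, f"mysqldump {db} | gzip > /tmp/{db}.sql.gz"],
                   check=True)

    # Copy the dump from the old server straight to the new one
    # (assumes the old server can SSH to the new server).
    subprocess.run(["ssh", OLD, f"scp /tmp/{db}.sql.gz {NEW}:/tmp/"],
                   check=True)

    # Load it into MySQL on the new server.
    subprocess.run(["ssh", NEW, f"gunzip -c /tmp/{db}.sql.gz | mysql {db}"],
                   check=True)

if __name__ == "__main__":
    move_database("forum")
```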

This applies more or less to a website move, but really could apply to anything, such as an email server. I'm also assuming there is no failover/cluster setup. IF there is, well, this is 10 times easier... An AD domain controller, for example, is very easy to migrate to a new server if you have two servers. Just take it offline, as easy as that.

Some specialized applications may also have something that makes it easier to transfer to a new server. It may not be meant for that but it's just a design "feature" that will help. Use such things to your advantage. Use your imagination.

True story

When I moved Age of Valor from a P4 to a quad Xeon, a lot of preparation was involved, and most if not all of it happened while players were playing normally, as if it was a normal day.

These are, in brief, the steps I had to take, all doable while the server is fully operational:

1: Get server provisioned
2: Log on to the server, configure security stuff like the firewall, root password, user accounts, etc... In most cases this is just copy and paste from the old server, but here I was actually migrating from a pure Windows server to Linux, using a VM for Windows (the game server only runs on Windows).
3: Install VMware on the server, create a fresh Windows VM.
4: Configure secure remote access to the Windows VM. This involves setting the VM to NAT network mode, forwarding the ports on the virtual router, then establishing an SSH tunnel to that private port. RDP can only be accessed if I'm SSHed into the server (see the sketch after this list).
5: With Windows up and me RDPed into it, start configuring Windows-specific aspects of security and other "new install" tasks.
6: To avoid possible latency from running game packets through the virtual VMware NAT, create a new interface that is bridged.
7: Get a second static IP provisioned from the provider and use it for this new NIC.
8: Forward the proper ports through the Windows firewall so the game server is accessible through this IP.
9: Take the latest copy of the game server system and data and begin uploading. This takes a fair while, as the maps alone are almost 1 GB. I did it from the old server to the new one, within the same data center, so I got decent speeds.
10: Run the new game server and fully test it to ensure everything is working.
11: Configure tasks such as backups and test them while the game server is running.
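Here's roughly what step 4 looks like. The host name, VM address, and ports are made up, but the idea is that RDP is only reachable through an SSH tunnel into the Linux host:

```python
# Sketch of step 4: RDP to the NATed Windows VM only works through an SSH
# tunnel into the Linux host. Host name, VM address, and ports are made up.
import subprocess

LINUX_HOST = "gamebox.example.com"   # the physical Linux server
VM_RDP = "192.168.11.5:3389"         # the VM's RDP port, inside VMware's NAT

# Forward local port 3390 through SSH to the VM's RDP port. While this runs,
# an RDP client pointed at localhost:3390 reaches the Windows VM, and nothing
# RDP-related is exposed to the public internet.
subprocess.run(["ssh", "-N", "-L", f"3390:{VM_RDP}", LINUX_HOST], check=True)
```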

Now this is where it got VERY interesting. This is a good example of a very well planned service/data move. Given that I don't have the latest data, I obviously can't just point DNS to the new server right now, right? No problem.

12: I shut off the game server on the new machine for now and waited for the scheduled server reboot on the old one.
13: The old server rebooted (by reboot, I mean just the game server engine itself, not the OS). Stop the game service from starting back up.
14: With both game servers stopped, transfer the data from the old server to the new one. While it transfers, switch over DNS, AND modify the old server's scripts to point to the new server, should someone be connecting by IP instead of hostname.
15: An extra minute of downtime later, start up the server on the NEW machine.
16: Watch in amazement as everyone starts logging on, as if nothing ever happened.

This is a good example of using a specific application design to full advantage. The way Ultima Online works is that when you log in, you are presented with a server list. That list is simply a list of name, IP, and port. Yep, you guessed it. That script change I made on the old server while the files were copying to the new one was to make the list point to the NEW IP.
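To illustrate the idea (this is not the real server list format, just the shape of it): the list entry is nothing more than a name, an IP, and a port, so redirecting players is a one-line change.

```python
# Illustration only: the server list is basically (name, IP, port) entries.
# The real list lives in the game server's own scripts/config; these values
# are made up.
shard_list = [
    ("Age of Valor", "203.0.113.10", 2593),   # old machine's address
]

# The change made on the OLD server while files were still copying: point the
# entry at the NEW machine, so anyone who connects to the old IP still ends up
# logging in to the new server.
shard_list[0] = ("Age of Valor", "203.0.113.20", 2593)
```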

I have to admit, that was by far the smoothest server migration I ever did. It took no more than 5 minutes, as far as users are concerned. I worked for a few days behind the scenes on this, but all while the server was running live.

I also don't want to sound like I'm bragging because I'm not. This is just how it SHOULD be. There is little to no reason to take a service offline for so long while performing an upgrade or migration.

And this is how the cookie crumbles.

Mmmm, cookies.

Speaking of food, If I need to transfer all the food to a new fridge, should I make all the food inaccessible during this time?
