Move post mortem - a manifesto!


Posted by Case (Ranked on Ladder) on September 08, 2005 at 00:06:36:

Hi all,

Although we still have one major issue left on our list of things to fix (User Homepages), and I'm sure some minor ones as well, I am happy to report that we have resolved the major problems we have been having for the past week.

We replaced the troubled CGI13 server this afternoon with a brand new dual Xeon 3 GHz server, bringing back online those leagues that had been offline for over a week. A special thanks to those leagues; they have had the worst of the downtime.

We also replaced our primary database server today, known to many of you as CGI. This is our core machine for running the whole site. Because of its importance to everything, it isn't something we fuss with often. In the replacement today, we tripled the CPU power, moved to gigabit Ethernet (ten times faster than the old one), and increased the storage sevenfold. Additionally, we are now running Red Hat Enterprise Linux, which is a licensed high-end Linux distribution designed for servers.

Our new data center has many advantages over the one we moved from. First, we have complete remote power management capabilities. This means if a server locks up we can turn the power off and back on from anywhere (well, almost anywhere). Next, the whole infrastructure runs at gigabit speeds for much faster data transfer. Finally, we also have remote KVM ability. This lets us get into the servers just as if we were standing there (so we can adjust things like BIOS settings remotely!).

We have also installed five additional brand new servers that are waiting for use. We will be replacing some more equipment over the next couple of weeks to rotate out old servers. Believe it or not, some servers have been running constantly since 1999! They only last so long, and the old ones are not as easy to care for as the new ones.

As most of you know, we encountered some issues during the move. This caused some extended downtime periods the likes of which we have not seen for years. No one likes downtime, especially us. I wanted to give people an update on why we moved, what went wrong, and why.

As I mentioned, the new data center has increased capabilities. It's also cheaper for us to operate there, because with all the automation we will not have to pay for technicians to go into our servers when there are problems. Believe it or not, at our old place just having a server rebooted might cost us $200. This is a big deal for us. While we are owned by a larger company, our budget is ours alone, so we are still essentially a small business. Every dollar we save on silly things is money we can spend elsewhere on better equipment and more features.

Plus, we had reached the limits on what we could do with some of the old infrastructure. We have some great ideas that we want to do in the near future, and it was too difficult to do them on the old setups. The CGI server worked okay for the most part, but it would creak along during the high stress parts of the day. The notion of doing anything new and cool that would cause more load was out of the question -- now we have more power and more freedom to do things.

Speaking of servers, it is important for people to understand what we are talking about here. This isn't a couple of boxes hooked up to a DSL line. We have over $225,000 worth of equipment. It's hosted in a secure facility in Los Angeles and tied into some of the most advanced connections anywhere. My point is -- it costs a lot of money to maintain this infrastructure, so when we do something major like move the servers we have to be careful and take our time. Can you imagine hitting a pothole when transporting that much gear? Believe me, it's Driving Miss Daisy.

There's no doubt we had some screw-ups in this move. Because of our fancy new networking equipment, the servers were not able to talk to each other at full speed. What happened was that some of our oldest servers are not fully compatible with the new network. The other servers were trying to talk to them faster than they could handle, causing them to overload, slow down, and in some cases crash completely.

Because we had just moved everything, it wasn't immediately clear to us that this was the problem. It only happened during high load periods, so after a certain point in the day had passed things started working okay for the most part. It was easy to think something else we had changed had made a difference, until we would find out the next afternoon that we still had problems.

Our team was working their hearts out. Even I, the old man Case himself, was up for 38 hours straight from Monday through Wednesday morning when we did the move. We pushed ourselves to the physical limit trying to fix things, but when you get to a certain point you become too tired and make mistakes. At that point I have to send people home and make them rest, or we break other things.

Now -- I know from the calls, e-mails, and forum posts that people are asking why we don't have more people helping and more servers. The simple (and most honest) answer is money. We did have a couple of extra "hired guns" helping us with the physical move and installation, but we don't have a bunch of extra guys running around who know how all our servers and applications work. Just like any business, we would so rarely need, say, five guys that we only have three. Having the extra two guys all the time would be too expensive and, as they say, that cost would end up getting passed on to you.

Now, folks, most of you know we are a pretty small operation, and I already told you how much equipment we have. We just can't afford at this time to have an extra quarter million dollars of hardware lying around and the guys to manage it. I wish we could, but I'd rather spend our money on better support and new features.

Enough with the complaining. The good news is that once this move is fully completed, we should not have to move for a long time (we were at our last place for over three years). We'll also have the backbone in place we need to do some great new things, and everyone will benefit from the extra speed of the servers.

I'm not going to promise that we won't have problems, because I am sure we will! But we are going to be better equipped to deal with them when they happen, and we are looking towards the future. Without the ongoing support of our members, we wouldn't be making a great move like this. Despite all the challenges of the move, I am very excited to "be here" and thank you all again for your patience and support.

I hope this answers some of your questions, and I apologize for it being so long before we were able to get to the bottom of this one. I hope you enjoy the new faster site!

Thanks,

Jeremy "Case" Rusnak
President
Case's Ladder, Inc.


