Monday, June 25, 2007

More Uptime Isn't Better

On Friday night, my company's building had power maintenance. BC Hydro needed to test something (I'm not exactly sure what), and our building has to shut down all the power for it to happen. They do this testing at least once a year, and sometimes twice.

We have lots of old hardware that is slowly being rotated out. Thanks to UPSes, some of these systems have years of uptime, and the hardware is getting old too. The replacement of the old machines is clearly not happening fast enough: whenever these tests are performed, we tend to have a string of hardware issues afterwards. These are almost always hard drive failures.

We have the building supply us with generators to power our server room so we don't need to shut things off, but not all the servers are in there. Some of the development and test machines are scattered around the office.

Today, we've been dealing with a string of dead or dying hard disks. One of our development database servers is having sector errors and we had to replace a drive on it. Some other machines are in the same boat.

Due to a UPS boo-boo, we had a bunch of servers in a rack power off too. Our Scalix machine also decided that one of the drives shouldn't work after the sudden loss of power.

In our 40- to 50-year-old building, these things are inevitable. They just reinforce the fact that every server machine we buy needs to:

  • have hot swap hard drives
  • use some level of RAID (other than 0, of course)
  • have a list somewhere of what needs to be running on the server and what should be tested after an unexpected reboot (on top of a monitoring system)
  • be replaced before it gets too old

We're getting close on the first two, but one of the machines isn't hot swappable and needs to be powered down and taken offline to replace the drive. Another system doesn't have redundant drives at all.

The third point is really helpful for the new guys. I've been here a long time and know everything really well. New people don't, and having a page that lists all the key services on a box allows people to help themselves. I definitely need to push to get more of our systems clearly documented.
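
A page like that can also double as a quick post-reboot check. Here's a rough sketch in Python of what I mean, not something we actually run: the service names and ports in EXPECTED_SERVICES are made-up examples, and a real list would come from whatever documentation exists for the box. It only checks that something is listening on each TCP port, which is a much weaker test than a proper monitoring system.

```python
import socket

# Hypothetical checklist for one server: service name -> TCP port that
# should be listening after a reboot. These entries are illustrative only.
EXPECTED_SERVICES = {
    "ssh": 22,
    "smtp": 25,
    "postgres": 5432,
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or name lookup failed.
        return False


def check_services(host, services, probe=port_is_open):
    """Map each service name to True (reachable) or False (not reachable)."""
    return {name: probe(host, port) for name, port in services.items()}


def missing_services(results):
    """Return the sorted names of services that did not respond."""
    return sorted(name for name, up in results.items() if not up)
```

After a power event, someone new could run missing_services(check_services("some-host", EXPECTED_SERVICES)) against each machine and get a short list of what still needs attention, without having to ask whoever has been around the longest.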

In my experience, hardware should be replaced at the three-year mark of service. At that point, it's typically next to impossible to get replacement parts, and it's probably cheaper to replace the machine with a new low-end one anyway. This also fits nicely with support and lease deals.
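
If the in-service date of each machine is written down somewhere, flagging the ones past the three-year mark is simple arithmetic. This is only a sketch under that assumption: the inventory format and the hostnames in the usage note are invented for illustration, and the three-year threshold is just my rule of thumb from above.

```python
from datetime import date

# Rule-of-thumb threshold from the post; adjust to match lease/support terms.
REPLACE_AFTER_YEARS = 3


def years_in_service(in_service, today):
    """Whole years elapsed between the in-service date and today."""
    years = today.year - in_service.year
    # Subtract one if this year's anniversary has not been reached yet.
    if (today.month, today.day) < (in_service.month, in_service.day):
        years -= 1
    return years


def due_for_replacement(inventory, today):
    """Return sorted hostnames whose service age meets the threshold.

    inventory maps a hostname to the date the machine went into service.
    """
    return sorted(
        name
        for name, in_service in inventory.items()
        if years_in_service(in_service, today) >= REPLACE_AFTER_YEARS
    )
```

For example, with a hypothetical inventory of {"devdb1": date(2003, 5, 1), "mail1": date(2005, 9, 15)} and today's date, only devdb1 would be flagged.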
