Geekularity

Sean O’Steen’s attempt at a well-balanced geek lifestyle.

Know Your Single Points of Failure

Blown 3 Phase transformer connector

I spent the better part of the day and night yesterday recovering from a power failure at one of my client’s data centers, a failure that we did not have a contingency plan for. A transformer inside the building, one that we thought was solid-state and not a concern, decided to self-destruct in the middle of a busy day, bringing down half of the office and all of the IT infrastructure.

We had battery backups and a diesel generator in the event of power loss, but the diesel generator connects to the building on the other side of the dead transformer, so once the batteries died, we were down. The only thing that could have saved our bacon would have been keeping about 1,000 feet of industrial extension cords on hand to run between the generator’s auxiliary ports and our most critical systems. I think the IT director is putting in for a purchase order this morning.
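Here’s a rough sketch of how you might hunt for this kind of surprise on paper before it finds you. It models the power path as a simple graph and reports every component whose failure alone cuts the servers off from all power sources. The node names and wiring below are hypothetical stand-ins for illustration, not the actual layout of the site, and UPS batteries are left out because they only buy time.

```python
from collections import deque

# Hypothetical power topology: each entry says "power flows from key to values".
# These names are illustrative assumptions, not the real site wiring.
FEEDS = {
    "utility": ["transformer"],
    "generator": ["transformer"],  # generator ties in upstream of the transformer
    "transformer": ["panel"],
    "panel": ["ups_a", "ups_b"],
    "ups_a": ["servers"],
    "ups_b": ["servers"],
}
SOURCES = ["utility", "generator"]
LOAD = "servers"


def reachable(feeds, sources, load, failed):
    """Return True if the load can still draw power from a source while `failed` is down."""
    queue = deque(s for s in sources if s != failed)
    seen = set(queue)
    while queue:
        node = queue.popleft()
        if node == load:
            return True
        for nxt in feeds.get(node, []):
            if nxt != failed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(feeds, sources, load):
    """List every component whose failure alone cuts the load off from all sources."""
    nodes = set(feeds) | {n for targets in feeds.values() for n in targets}
    nodes.discard(load)
    return sorted(n for n in nodes if not reachable(feeds, sources, load, n))


print(single_points_of_failure(FEEDS, SOURCES, LOAD))  # ['panel', 'transformer']
```

In this toy layout the two UPS units cover for each other, but the transformer and the panel each take everything down on their own. Adding a second path from the generator that bypasses the transformer (the extension cords, in our case) would remove the transformer from that list.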

So, with this little nightmare behind me, I thought I’d try to open a thread of top lessons learned while implementing and supporting server infrastructure. These can be hard-learned lessons, or near misses that you’d like to see IT professionals think about and revisit periodically. I’ll start it off with my top 10, which I’ve picked up in my decade of IT industry experience:

  1. BACK IT UP! AND THEN CHECK YOUR BACKUP. I’ve sent way too many hard drives to a clean-room laboratory to try to resurrect the data from the failed media, all because the victim either didn’t back up their data, or never bothered to check that their backups were working. This is a sophomoric mistake, but it may take a $20,000 invoice from the data restoration company for a systems administrator to finally get religious about backups.
  2. Get servers with redundant power supplies. Plug each power supply into a separate UPS, and each UPS into its own breaker, preferably on different legs of the building’s 2- or 3-phase circuit. This will allow you to swap out UPS batteries without bringing down the system and minimizes your exposure to building infrastructure problems (like I experienced last night).
  3. Label every outlet in your data center. At minimum, each outlet should have the breaker number and panel location clearly labeled. You don’t want to be searching frantically for the correct breaker when your UPS units are beeping at you.
  4. All electrical circuits used by your core IT infrastructure should be dedicated circuits. Do not use shared circuits, especially in locations where office tenants could plug in appliances like coffee makers and space heaters, or housekeeping could plug in a vacuum cleaner.
  5. If you have more than one server, put a label with each server’s name on the server itself. If the servers share the keyboard, monitor, and mouse through a KVM switch, make sure that the switch ports are also clearly labeled, and that the labels are correct.
  6. Have a documented procedure for shutting all of the systems down and bringing them back up again. Your data center is an organism, and there are critical services on some machines that need to be up and running before other systems can function. Make sure you know which servers or appliances are hosting DHCP, DNS, syslog, Active Directory, etc., and make sure those devices are early in the boot order.
  7. For the love of all that is good in the world, use Velcro or zip ties to clean up the cabling around your servers. Keep the wires as short as possible and try to prevent any sort of rat’s nest wherever you can. It will pay huge dividends later when you are trying to isolate essential from non-essential power cords. Plus, clean wiring will promote air flow and will prolong the life of your equipment. If a cord hangs down below where it’s plugged in, it’s too long. If it touches the floor, it’s too long. Any loose cord that’s near the floor, or at about hip level where most geeks keep their BlackBerry holstered, will undoubtedly get pulled out accidentally if not properly secured.
  8. Disks will fail. It’s not a question of IF, it’s a matter of WHEN. So, for every RAID array you maintain, keep one or more spare drives on hand and readily available. If you use your spare, it is imperative that you order a new spare on the same day. Do not put this off.
  9. Have an emergency resource guide inside your data center with phone numbers and reference information. Check and update this information regularly. Phone numbers should include electricians, HVAC, plumbing, fire sprinkler contractors and your building’s facility services hotline. Also include the cell and/or home phone numbers of any company executives that you may need to get emergency purchase approval from. Reference materials must include, at the very least, all telco and ISP account and circuit IDs. If the resource guide is securable, you may want to include root and administrator login information for your critical systems.
  10. I don’t drink coffee any more unless it’s in a traveller mug with a close-able lid. My son calls it “Daddy’s sippy-cup.” Even still, coffee, soda, water, or whatever should never enter the server room. I can still recall, in vivid detail, the day the CEO of the company I was working with dropped his mug three feet in front of an open server cabinet. In slow motion, I watched as a splash of coffee arced gracefully from the shattered mug and into the front of a server’s hot swap drive bay. The result… refer to the invoice mentioned in tip #1.
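The startup and shutdown procedure from tip 6 is the one item on this list that lends itself to a bit of code. Below is a minimal sketch, assuming you can write down what each system needs before it can start. The dependency map is hypothetical (your DNS, DHCP, syslog, and Active Directory hosts will have their own relationships), and `startup_order` and `shutdown_order` are names I made up for the example. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each key lists what must be up before it starts.
# Replace these with the real relationships in your own data center.
DEPENDS_ON = {
    "dns": [],
    "dhcp": [],
    "syslog": ["dns"],
    "active_directory": ["dns"],
    "file_server": ["active_directory", "syslog"],
    "mail_server": ["active_directory", "dns"],
}


def startup_order(depends_on):
    """Return a startup sequence in which every dependency precedes its dependents."""
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on):
    """Shut down in reverse so nothing loses a service it still relies on."""
    return list(reversed(startup_order(depends_on)))


if __name__ == "__main__":
    # Print a numbered checklist suitable for taping inside the cabinet door.
    for step, name in enumerate(startup_order(DEPENDS_ON), start=1):
        print(f"{step}. {name}")
```

More than one valid order usually exists, so treat the output as one acceptable sequence, not the only one. If two systems each claim to need the other first, `TopologicalSorter` raises a `CycleError`, which is a useful thing to discover on a quiet afternoon instead of during an outage.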

So this list is just a starter. Please add your tips and tricks to the comment section below!

by seanosteen Wednesday October 10, 2007 11:31 am
