Geekularity

Sean O’Steen’s attempt at a well-balanced geek lifestyle.

Know Your Single Points of Failure

Blown 3 Phase transformer connector

I spent the better part of the day and night yesterday recovering from a power failure at one of my client’s data centers; a failure that we did not have a contingency plan for. A transformer inside of the building, one that we thought was solid-state and not a concern, decided to self-destruct in the middle of a busy day, bringing down half of the office and all of the IT infrastructure.

We had battery backups and a diesel generator in the event of power loss, but the diesel generator connects to the building on the other side of the dead transformer, so once the batteries died, we were down. The only thing that could have saved our bacon would have been keeping about 1,000 feet of industrial extension cords on hand to run between the generator’s auxiliary ports and our most critical systems. I think the IT director is putting in for a purchase order this morning.

So, with this little nightmare behind me, I thought I’d try to open a thread of top lessons learned while implementing and supporting server infrastructure. These can be hard-learned lessons, or near misses that you’d like to see IT professionals think about and revisit periodically. I’ll start it off with my top 10, which I’ve picked up in my decade of IT industry experience:

  1. BACK IT UP!.. AND THEN CHECK YOUR BACKUP. I’ve sent way too many hard drives to a clean-room laboratory to try to resurrect the data off of the failed media, all because the victim either didn’t back up their data, or never bothered to check that their backups were working. This is a sophomoric mistake, but it may take a $20,000 invoice from the data restoration company for a systems administrator to finally get religious about backups.
  2. Get servers with redundant power supplies. Plug each power supply into a separate UPS, and each UPS into its own breaker, preferably on different legs of the building’s 2- or 3-phase circuit. This will allow you to swap out UPS batteries without bringing down the system, and it minimizes your exposure to building infrastructure problems (like the one I experienced last night).
  3. Label every outlet in your data center. At minimum, each outlet should have the breaker number and panel location clearly labeled. You don’t want to be searching frantically for the correct breaker when your UPS units are beeping at you.
  4. All electrical circuits used by your core IT infrastructure should be dedicated circuits. Do not use shared circuits, especially in locations where office tenants could plug in appliances like coffee makers and space heaters, or housekeeping could plug in a vacuum cleaner.
  5. If you have more than one server, label each one on the server itself. If the servers share a keyboard, monitor, and mouse through a KVM switch, make sure that the switch ports are also clearly labeled, and that the labels are correct.
  6. Have a startup and shutdown procedure for bringing all of the systems down and bringing them back up again. Your data center is an organism, and there are critical services on some machines that need to be up and running before other systems can function. Make sure you know which servers or appliances are hosting DHCP, DNS, SysLog, Active Directory, etc., and make sure those devices are high on the boot order.
  7. For the love of all that is good in the world, use velcro or zip ties to clean up the cabling around your servers. Keep the wires as short as possible and try to prevent any sort of rat’s nest wherever you can. It will pay huge dividends later when you are trying to isolate essential from non-essential power cords. Plus, clean wiring will promote air flow and will prolong the life of your equipment. If a cord hangs down below where it’s plugged in, it’s too long. If it touches the floor, it’s too long. Any loose cord that’s near the floor, or at about hip level where most geeks keep their BlackBerry holstered, will undoubtedly get pulled out accidentally if not properly secured.
  8. Disks will fail. It’s not a question of IF, it’s a matter of WHEN. So, for every RAID array you maintain, keep one or more spare drives on hand and readily available. If you use your spare, it is imperative that you order a new spare on the same day. Do not put this off.
  9. Have an emergency resource guide inside your data center with phone numbers and reference information. Check and update this information regularly. Phone numbers should include electricians, HVAC, plumbing, fire sprinkler contractors, and your building’s facility services hotline. Also include the cell and/or home phone numbers of any company executives from whom you may need emergency purchase approval. Reference materials must include, at the very least, all telco and ISP account and circuit IDs. If the resource guide can be secured, you may want to include root and administrator login information for your critical systems.
  10. I don’t drink coffee any more unless it’s in a traveller mug with a closable lid. My son calls it “Daddy’s sippy-cup.” Even so, coffee, soda, water, or whatever should never enter the server room. I can still recall, in vivid detail, the day the CEO of the company I was working with dropped his mug three feet in front of an open server cabinet. In slow motion, I watched as a splash of coffee arced gracefully from the shattered mug and into the front of a server’s hot-swap drive bay. The result… refer to the invoice mentioned in tip #1.
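
Tip #6 lends itself to a quick sketch. Assuming you can write down which services each system needs before it can start, a topological sort gives you a workable startup order, and reversing it gives a shutdown order. The service names and dependencies below are made up for illustration; substitute your own.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each service lists what must be up before it.
DEPENDS_ON = {
    "dns": [],
    "dhcp": ["dns"],
    "syslog": ["dns"],
    "active_directory": ["dns"],
    "file_server": ["active_directory", "syslog"],
}


def startup_order(depends_on):
    """Return services ordered so that every dependency comes first."""
    # TopologicalSorter raises graphlib.CycleError if the map is circular.
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on):
    """Shut down in the reverse of the startup order."""
    return list(reversed(startup_order(depends_on)))


if __name__ == "__main__":
    print("Startup: ", " -> ".join(startup_order(DEPENDS_ON)))
    print("Shutdown:", " -> ".join(shutdown_order(DEPENDS_ON)))
```

Printing both lists and keeping them with the emergency resource guide from tip #9 is a reasonable first version of the procedure.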

So this list is just a starter. Please add your tips and tricks to the comment section below!
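
To get things started, here is a rough sketch for the “check your backup” half of tip #1. It assumes the backup is a plain file copy in a directory you can read, and it compares SHA-256 checksums between the live files and the copy. The function names and the two directory arguments are my own invention for illustration; tape or image-based backups need a real test restore instead.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir, backup_dir):
    """Return relative paths that are missing from the backup or differ."""
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        copy = backup_dir / rel
        # A file counts as a problem if the copy is absent or its hash differs.
        if not copy.is_file() or sha256_of(src) != sha256_of(copy):
            problems.append(str(rel))
    return problems
```

An empty list means every file under the source has an identical copy in the backup. Files edited since the last backup run will show up as differences, so run the check right after the backup finishes.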

by seanosteen, Wednesday October 10, 2007, 11:31 am

1 Comment

  1. Well, when I just need to check a few details of the domain structure, I usually use Microsoft Active Directory Topology Diagrammer http://www.microsoft.com/downloads/details.aspx?familyid=cb42fc06-50c7-47ed-a65c-862661742764. But when I need to dig deeper, uncover security breaches, and find out what is happening with my domain controller performance or with security on certain computers within a site, I prefer a tool from Scriptlogic called Enterprise Security Reporter http://www.scriptlogic.com/products/enterprisesecurityreporter/ . The tool not only performs a thorough scan, much like Microsoft Baseline Security Analyzer, but can also output a detailed report on the information I asked it to gather. The fantastic thing about the tool is that it stores data in a SQL database, which provides scalability and lets me filter the data, so I can quickly sample collected data even days after the actual scan was performed. I can even set it to run periodic scans automatically. Then I can extract the data, export it as a PDF file, and mail it to the person responsible for storing security-related data on our network. It also reveals inconsistencies between different computer states: for example, I can run a consistent scan of the values in a registry hive on a remote computer and then, with a record of what happened to that hive over the specified period, quickly check whether anything has changed. This is extremely helpful when monitoring the state of Active Directory attributes.

    Said by David Wendell December 5, 2007 at about 3:57 am
