Know Your Single Points of Failure

I spent most of the day and night yesterday recovering from a power outage at one of my client’s data centers, a failure for which we had no contingency plan. A transformer inside the building, which we thought was solid-state and not a concern, decided to self-destruct in the middle of a busy day, bringing down half the office and all of the IT infrastructure.

We had battery backup and a diesel generator in case of power failure, but the generator is connected to the building on the other side of the dead transformer, so when the batteries died, we were down. The only thing that could have saved our bacon would have been about 1,000 feet of industrial extension cord kept on hand to run between the generator’s auxiliary power ports and our most critical systems. I think the CIO will be putting in a purchase order in the morning.

So with this little nightmare behind me, I thought I would try to start a thread of top lessons learned in implementing and supporting server infrastructure. These can be hard-won experiences, or near misses, that you would like to see IT professionals think about and revisit regularly. I’ll start it off with my top 10, which I have picked up over my decade of experience in the IT industry:

Back it up! ... And then check your backups. I have sent too many hard drives to a clean-room laboratory in an attempt to recover data from failed media, all because the victim either did not back up their data or never bothered to check that their backups worked. It’s a sophomoric mistake, but it may take a $20,000 invoice from the data recovery company for someone to finally get religious about backups.

Get servers with redundant power supplies. Plug each power supply into a separate UPS, and each UPS into its own breaker, preferably on different legs of the building’s 2- or 3-phase circuits. This will allow you to replace UPS batteries without bringing the system down, and it will minimize your exposure to building infrastructure problems (like the one I saw last night).
Label each outlet in your data center. At a minimum, each outlet should be clearly marked with its breaker number and panel location. You do not want to be searching frantically for the right breaker while your UPS units are beeping at you.
All electrical circuits used by your core IT infrastructure should be dedicated circuits. Do not use shared circuits, especially in locations where office tenants could plug in appliances like coffee makers and space heaters, or where the cleaning crew could plug in a vacuum cleaner.
If you have more than one server, label each one. If the servers share a keyboard, monitor, and mouse through a KVM switch, make sure each port on the switch is also clearly labeled, and that the labels are correct.
Have a startup and shutdown procedure for bringing all the systems down and bringing them back up again. Your data center is an organism, and critical services on some machines must be up and running before other systems can operate. Make sure you know which servers or appliances host DHCP, DNS, syslog, Active Directory, etc., and make sure those devices come up early in the boot sequence.
For the love of everything good in the world, use Velcro or zip ties to tidy up the cables around your servers. Keep cables as short as possible and try to prevent rats’ nests wherever you can. It will pay huge dividends later when you are trying to tell essential power cables from non-essential ones. Plus, tidy cabling promotes airflow and will extend the life of your equipment. If a cable hangs down below where it is plugged in, it is too long. If it touches the floor, it is too long. Any loose cable near the floor or around waist level, where most geeks keep their BlackBerry holstered, will without doubt get pulled out by accident if it is not properly secured.
Disks will fail. It’s not a question of if, it is a question of when. So for each RAID array you maintain, keep one or more spare drives on hand and readily available. If you use your spare, it is imperative that you order a replacement the same day. Do not put this off.
Keep an emergency resource guide inside your data center with phone numbers and reference information. Check and update it periodically. Phone numbers should include electricians, plumbers, fire sprinkler contractors, and your building’s Facility Services hotline. Also include the cell and/or home phone numbers of any company managers from whom you might need emergency purchase approval. Reference materials should include, at a minimum, all telco and ISP account and circuit IDs. If the resource guide can be secured, you may want to include root and administrator login information for your critical systems.
I do not drink coffee anymore unless it is in a travel mug with a closable lid. My son calls it “Daddy’s sippy-cup.” Even so, no coffee, soda, water, or anything else goes into the server room. I can still remember in vivid detail the day the CEO of the company I worked for dropped his mug three feet in front of an open server chassis. In slow motion, I watched a splash of coffee arc gracefully from the broken mug and into the front of a server’s hot-swap drives. The result … refer to the invoice mentioned in tip #1.
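On the startup and shutdown tip above: if you would rather work out the sequence from a written dependency list than from memory, here is a minimal Python sketch of one way to do it. The service names and the dependencies between them are made up for illustration, so substitute your own inventory. It uses the standard library's graphlib module (Python 3.9 or later) to sort the services so that each one starts only after the things it relies on.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical inventory: each service maps to the services that must be
# running before it. These names and dependencies are illustrative only.
DEPENDS_ON = {
    "dns": set(),
    "dhcp": set(),
    "syslog": {"dns"},
    "active-directory": {"dns"},
    "file-server": {"active-directory", "syslog"},
    "mail-server": {"active-directory", "dns"},
}


def startup_order(depends_on):
    """Return services in an order where every dependency comes first."""
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on):
    """Return the reverse of the startup order, so dependents go down first."""
    return list(reversed(startup_order(depends_on)))


if __name__ == "__main__":
    print("Startup: ", " -> ".join(startup_order(DEPENDS_ON)))
    print("Shutdown:", " -> ".join(shutdown_order(DEPENDS_ON)))
```

Print the result and tape it inside the rack door next to the emergency resource guide. A circular dependency makes the script fail with an error, which is something you want to find out on a quiet afternoon and not during an outage.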
So this list is just a starter. Please add your own tips and tricks in the comments below!

 


