Friday, October 13, 2006

Problems with New HP Proliant Servers and Debian Sarge

Recently, HP announced that it will be fully supporting Debian Sarge on its servers (http://news.zdnet.com/2100-3513_22-6104891.html). This is great news for the Debian community, of course, because having a huge vendor standing behind it leads to some wonderful perks. For those of us who would like to advocate Debian solutions, support here really goes a long way.

Well, that's all good in theory, but I have to relate my recent real world experience here. We planned on rolling out a new website on 4 new HP Proliant DL320 servers with SATA hardware RAID. This seemed like an excellent platform, until we actually started installing the latest and greatest copy of Debian 3.1 (Sarge) on it. We ran into several problems.

1. The RAID containers were not recognized.

We could not get the system to install properly at all using the stock kernel. Once we managed to grab the latest kernel, we were able to install successfully. Once we had a running system, however, we realized that the RAID containers were not recognized at all, and the driver in use was actually treating the on-board RAID card as if it were just a SCSI host bus adapter. Writes to one disk were not mirrored... not good!

A quick call to HP Linux support got us a very friendly and knowledgeable technician. Unfortunately, he told us flat out that the hardware was not supported yet! They were working furiously to get drivers ready for download, and we'd probably be looking at December 2006 for official support from HP. He told us that the announcement took the support folks by surprise, and they were not really in a good place to support all of the Proliant hardware as promised.

2. USB bus would not initialize properly.

We were left with a hung system during reboot as it attempted to load the USB modules. Having purchased the HP remote management cards, and having discovered that we were unable to use USB under Linux, we were left with no choice but to disable the hardware and USB support for these boxes. This effectively made our management cards useless. Our only option was to basically wait out HP support or go with a 3rd party fix that the HP Linux support tech quietly told us about.

So, HP left us in a tough spot here with these machines. I have to say that their support team was extremely helpful and knowledgeable throughout the process. I could definitely sense their frustration, though, at the premature announcement. Hopefully, their software folks can get some stable drivers cranked out soon so they can make good on the promise of Linux support for the Proliant series.

Tuesday, October 10, 2006

Strategies for Improving Network Uptime

Improving your network uptime is one of the top goals of every network administrator and should, like everything else, be approached in a systematic way. It doesn't matter how good you are or how much experience you've got. If you're not paying attention to the weaknesses in your network, your reliability will quickly erode.

You should have 3 basic goals in mind when looking at improving your uptime:

Goal 1. Prevention. Prevent problems from happening in the first place.
Goal 2. Fast resolution. When a problem does come up, work towards a fast resolution.
Goal 3. Accurate planning. Plan your changes. Test them as needed.

With those goals in mind, here are some good practices that can help improve network uptime:

1. Look for single points of failure

Reducing or eliminating single points of failure on your network goes a long way toward increasing reliability. Good network design aside, daisy-chaining switches is a good example of introducing an unnecessary point of failure.

2. Pay attention to failure rates and implement redundancy where it makes sense

All too often, network administrators beef up redundancy in the wrong places. What good is adding a redundant firewall to your network if your critical app is sitting on a 10 year old desktop with an IDE hard drive? The failure rate of the firewall is a tiny fraction of that old server's.

Within servers, disk drives, power supplies and other devices with moving parts are the best place to start looking for high failure rates. Servers themselves have an overall failure rate, and should also be considered candidates for redundancy if the app is critical enough.
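To make this concrete, a little back-of-the-envelope availability math shows where redundancy actually pays off. The failure probabilities below are made-up illustrative numbers, not vendor specs:

```python
# Rough availability math for deciding where redundancy pays off.
# The annual failure probabilities here are illustrative assumptions.

def availability(failure_prob: float) -> float:
    """Probability a single component is working (1 - failure probability)."""
    return 1.0 - failure_prob

def redundant_pair(failure_prob: float) -> float:
    """Availability of two independent redundant units: both must fail."""
    return 1.0 - failure_prob ** 2

firewall_fail = 0.02     # modern appliance, low failure rate (assumed)
old_server_fail = 0.30   # 10-year-old desktop "server" (assumed)

# Doubling the firewall barely moves the needle...
print(f"firewall alone:   {availability(firewall_fail):.4f}")
print(f"firewall pair:    {redundant_pair(firewall_fail):.4f}")

# ...while the old server is the real weak link.
print(f"old server alone: {availability(old_server_fail):.4f}")
print(f"old server pair:  {redundant_pair(old_server_fail):.4f}")
```

Even a redundant pair of those old servers (0.91) is worse than a single firewall (0.98), which is the whole point: spend on the weak link first.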

3. Monitor your network

Each component of the network should be actively monitored. Monitoring is a subject in itself, but for now, the key things to be looking at are a combination of ping and real-time log processing/alerting. This allows you to respond to disk failures, fried switch ports and most anything else that can fail. Traditionally, I have also monitored device statistics such as CPU, memory, disk space, etc., setting key thresholds to alert me if anything is out of the ordinary.

  • Logwatch
  • Nagios
  • Augur
  • Netcool
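To give a flavor of the log-processing/alerting side, here's a minimal sketch. The keywords and thresholds are my own illustrative choices; any real deployment would tune them (or just use one of the tools above):

```python
# Minimal log-alerting sketch: flag syslog-style lines containing failure
# keywords, and flag device statistics that cross a threshold.
# Keywords and limits below are illustrative assumptions.

ALERT_KEYWORDS = ("error", "failed", "raid", "link down")

def scan_line(line: str) -> bool:
    """Return True if a log line looks alert-worthy."""
    lower = line.lower()
    return any(word in lower for word in ALERT_KEYWORDS)

def over_threshold(value: float, limit: float) -> bool:
    """Flag a device statistic (CPU %, disk usage %, etc.) at or past its limit."""
    return value >= limit

log_lines = [
    "Oct 13 02:11:05 web1 kernel: eth0: link down",
    "Oct 13 02:11:09 web1 sshd[512]: session opened for user admin",
]
alerts = [line for line in log_lines if scan_line(line)]
print(alerts)  # only the "link down" line is flagged
```

In practice you'd feed this from a tail of /var/log/messages (or let Logwatch/Nagios do the heavy lifting) and wire the alerts to pager or email.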

4. Perform regular walkthroughs

While you can catch a lot of issues through a good monitoring implementation, there's no substitute for physically observing your network. Error lights, overloaded UPS beeps, overtemp alarms and who knows what else are sometimes only observable in person.

5. Keep your network neat

This is very often overlooked. The catch here is that it is extremely easy to yank the wrong cable, power off the wrong server or even trip and fall, possibly taking down your whole network! Even if you're exceedingly careful, a disorganized network takes significantly longer to troubleshoot and repair when the pressure is on.
  • Label and tie all cables
  • Have an up-to-date network map posted
  • Label all servers front and back
  • Always mount hardware properly and use cable management features

6. Document it

There isn't much to say here. We all let documentation slide sometimes... that's human nature. But it's not hard to see how accurate docs lead to faster problem resolution.

7. Build contingency/disaster recovery plans

Having a solid plan in place to recover from the loss of each component of the network helps to speed up recovery time. More important, however, is the insight gained through the process. Often, it highlights the need for critical spares and replacement parts well before they are actually needed.

8. Use change control

This one has been batted around a lot, but it comes down to a few simple concepts.
1. Don't make changes to production systems without planning them.
2. Get others involved who may be impacted by those changes so their concerns can be addressed.
3. Plan your exit strategy each step of the way, in case of both success and failure.

Really, there are books out there on change control, but what's important here are those key concepts. Your plan can be a fully prepared form with multiple signoffs, or just a (carefully!) noted checklist on scrap paper... what's important is that you follow it.

9. Maintain warranties and service contracts

If HP will have a technician onsite in 4 hours to replace your failed CPU, make sure you've called them 10 minutes in. If you are using custom-built hardware, seek out onsite service options or plan on keeping a full set of spares around. Nothing beats knowing you have replacement parts available or on the way, so you can give 100% of your focus to migrating services, managing user expectations, etc.

10. Maintain backups

Backups are key to getting your data back quickly and effectively. A combination of online and archival backup techniques helps you attack those big disaster risks.
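For the archival side, even something as simple as a date-stamped tarball goes a long way. A minimal sketch (paths are placeholders; a real setup would also verify archives and rotate old ones):

```python
# Sketch of a simple dated archival backup using Python's tarfile module.
# Source and destination paths are placeholders for illustration.

import tarfile
import time
from pathlib import Path

def archive_dir(source: Path, dest_dir: Path) -> Path:
    """Create a gzip-compressed, date-stamped tarball of source in dest_dir."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d")
    archive = dest_dir / f"{source.name}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=source.name)
    return archive
```

Run from cron nightly and ship the tarballs offsite (or to tape) and you've covered both the "oops, deleted a file" case and the "server caught fire" case.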

11. Simplify

Complex networks lead to complex problems. Your network should be as simple as possible while still meeting the needs of future growth. For example, if you retire a bunch of VLANs on your network, clean up your switch configs.

12. Use maintenance windows

You should plan to do your maintenance in a maintenance window. It sounds like common sense, but it's one of the best ways to cut down on surprises. And, depending on who your customer is and what the app requirements are, you can often agree to exclude planned maintenance windows from your uptime guarantees.

A lot of the practices on this list sound like common sense, and well they should! Many common IT best practices come into play when talking about network availability and server uptime. If you are optimizing for uptime, however, treat them as key points rather than afterthoughts.