Tuesday, October 10, 2006

Strategies for Improving Network Uptime



Improving your network uptime is one of the top goals of every network administrator and, like everything else, it should be approached systematically. It doesn't matter how good you are or how much experience you've got. If you're not paying attention to the weaknesses in your network, your reliability will quickly erode.

You should have 3 basic goals in mind when looking at improving your uptime:

Goal 1. Prevention. Prevent problems from happening in the first place.
Goal 2. Fast resolution. When a problem does come up, work toward resolving it quickly.
Goal 3. Accurate planning. Plan your changes and test them as needed.

With those goals in mind, here are some good practices that can help improve network uptime:

1. Look for single points of failure

Reducing or eliminating single points of failure on your network goes a long way toward increasing reliability. Daisy-chaining switches is a classic example of an unnecessary point of failure: every switch and uplink in the chain can take down everything downstream of it.

2. Pay attention to failure rates and implement redundancy where it makes sense

All too often, network administrators beef up redundancy in the wrong places. What good is adding a redundant firewall to your network if your critical app is sitting on a 10-year-old desktop with an IDE hard drive? The firewall's failure rate is a tiny fraction of that old server's.

Within servers, disk drives, power supplies and other devices with moving parts are the best place to start looking for high failure rates. Servers themselves have an overall failure rate, and should also be considered candidates for redundancy if the app is critical enough.
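
For a rough sense of where redundancy pays off, here is a back-of-the-envelope availability calculation. The MTBF and MTTR figures below are made up purely for illustration:

    # Availability = MTBF / (MTBF + MTTR). All numbers here are hypothetical.
    def availability(mtbf_hours, mttr_hours):
        """Fraction of time a component is expected to be up."""
        return mtbf_hours / float(mtbf_hours + mttr_hours)

    old_server = availability(8760, 24)    # fails roughly once a year, a day to rebuild
    firewall = availability(87600, 4)      # fails roughly once a decade, four hours to swap

    print("old server: %.4f" % old_server)  # ~0.9973 -- about a day of downtime per year
    print("firewall:   %.6f" % firewall)    # ~0.999954 -- under half an hour per year

With numbers like those, the old server is where the redundancy budget belongs.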

3. Monitor your network

Each component of the network should be actively monitored. Monitoring is a subject in itself, but for now, the key things to look at are a combination of ping checks and real-time log processing/alerting. This lets you respond to disk failures, fried switch ports and most anything else that can fail. I have also traditionally monitored device statistics such as CPU, memory and disk space, setting key thresholds to alert me if anything is out of the ordinary.

A few tools worth a look:
  • Logwatch
  • Nagios
  • Augur
  • Netcool
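
To make the idea concrete, here is a minimal sketch of a ping-and-threshold check in Python. The host names and the 90% disk threshold are placeholders, and a real deployment would lean on one of the packages above rather than a hand-rolled script:

    # Bare-bones ping check plus a disk-space threshold, suitable for running from cron.
    # Host names and the threshold are placeholders, not recommendations.
    import os
    import subprocess

    HOSTS = ["core-switch", "fileserver", "firewall"]   # hypothetical device names
    DISK_ALERT_PCT = 90                                 # alert when the root filesystem is 90% full

    def ping(host):
        """Return True if the host answers a single ICMP echo (Linux ping syntax)."""
        devnull = open(os.devnull, "w")
        return subprocess.call(["ping", "-c", "1", "-W", "2", host],
                               stdout=devnull, stderr=devnull) == 0

    def disk_usage_pct(path="/"):
        """Rough percentage of blocks in use on the filesystem holding 'path'."""
        st = os.statvfs(path)
        return (st.f_blocks - st.f_bfree) * 100.0 / st.f_blocks

    for host in HOSTS:
        if not ping(host):
            print("ALERT: %s is not responding to ping" % host)

    if disk_usage_pct("/") > DISK_ALERT_PCT:
        print("ALERT: root filesystem is more than %d%% full" % DISK_ALERT_PCT)

In practice you would wire those alerts into email or a pager rather than just printing them.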

4. Perform regular walkthroughs

While you can catch a lot of issues with a good monitoring setup, there's no substitute for physically walking through your network. Error lights, overloaded-UPS beeps, overtemp alarms and who knows what else are sometimes only noticeable in person.


5. Keep your network neat

This is very often overlooked. In a messy rack it is extremely easy to yank the wrong cable, power off the wrong server or even trip and fall, possibly taking down your whole network! Even if you're exceedingly careful, a disorganized network takes significantly longer to troubleshoot and repair when the pressure is on.
  • Label and tie all cables
  • Have an up-to-date network map posted
  • Label all servers front and back
  • Always mount hardware properly and use cable management features

6. Document it

There isn't much to say here. We all let documentation slide sometimes... that's human nature. But it's not hard to see how accurate docs lead to faster problem resolution.

7. Build contingency/disaster recovery plans

Having a solid plan in place to recover from the loss of each component of the network helps speed up recovery time. More important, though, is the insight gained through the process. Often, the planning exercise highlights the need for critical spares and replacement parts well before they are actually needed.

8. Use change control

This one has been batted around a lot, but it comes down to a few simple concepts.
1. Don't make changes to production systems without planning them.
2. Get the people who may be impacted by those changes involved, so their concerns can be addressed.
3. Plan your exit strategy each step of the way, in case of both success and failure.

Really, there are books out there on change control, but what's important here are those key concepts. Your plan can be a fully prepared form with multiple signoffs, or just a (carefully!) noted checklist on scrap paper... what's important is that you follow it.
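
As an example, a bare-bones change plan for a switch firmware upgrade (the details here are hypothetical) might read:
  • What: upgrade the firmware on switch-2
  • When: Saturday 6:00-8:00 AM, during the agreed maintenance window
  • Who is affected: warehouse floor users, notified on Thursday
  • Backout: keep the old image handy; reload it and reboot if anything misbehaves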

9. Maintain warranties and service contracts

If HP will have a technician onsite within 4 hours to replace a failed CPU, that clock can start 10 minutes into the outage. If you are using custom-built hardware, seek out onsite service options or plan on keeping a full set of spares around. Nothing beats knowing you have replacement parts available or on the way, so you can give 100% of your focus to migrating services, managing user expectations and so on.

10. Maintain backups

Backups are key to getting your data back quickly and effectively. A combination of online and archival backup techniques helps you attack those big disaster risks.
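
As a rough sketch of that combination (the paths and backup host below are made up), a nightly job might mirror the data to a second box and cut a periodic archive:

    # Online copy plus archival copy. Paths and the backup host are placeholders.
    import subprocess
    import time

    SRC = "/srv/data/"
    MIRROR = "backup-host:/backups/data/"                      # hypothetical standby server
    ARCHIVE = "/mnt/archive/data-%s.tar.gz" % time.strftime("%Y%m%d")

    # Online copy: quick to restore from, but it also replicates deletions and mistakes.
    subprocess.call(["rsync", "-a", "--delete", SRC, MIRROR])

    # Archival copy: slower, but gives you a point-in-time snapshot to fall back on.
    subprocess.call(["tar", "-czf", ARCHIVE, SRC])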

11. Simplify

Complex networks lead to complex problems. Your network should be as simple as possible while still leaving room for future growth. For example, if you retire a bunch of VLANs on your network, clean up your switch configs.

12. Use maintenance windows


You should plan to do your maintenance in a maintenance window. It sounds like common sense, but it's one of the best ways to cut down on surprises. And, depending on who your customer is and what the app requirements are, you can often agree to exclude planned maintenance windows from your uptime guarantees.

A lot of the practices on this list sound like common sense, and so they should! Many common IT best practices come into play when talking about network availability and server uptime. If you are optimizing for uptime, however, these are the key points to focus on.
