How to Troubleshoot Networks

There are different reasons why things go wrong on our networks. Humans make errors in their configurations, hardware can fail, software updates may include bugs, and changing traffic patterns might cause congestion on our networks. To troubleshoot these errors there are different approaches and some are more effective than others.

Troubleshooting consists of 3 steps:

  1. Problem
  2. Diagnosis
  3. Solution

It all starts when someone or something reports a problem. Often this will be a user who calls the helpdesk because something is not working as expected, but it’s also possible that you find issues through network monitoring (you do monitor your network, right?). The next step is to diagnose the problem, and it’s important to find its root cause. Once you have found the cause, you implement a (temporary) solution.

Diagnosing the problem is one of the most important steps because we need to find the root cause. Here’s what we do to diagnose the problem:

  • Collect information: Most of the time a problem report doesn’t give us enough information. Users are very good at reporting “network is down” or “my computer doesn’t work” but this doesn’t tell us anything. We need to collect information by asking our users detailed questions or by using network tools.
  • Analyze information: Once we have gathered all information, we will analyze it to see what is wrong. We can compare our information to previously collected information or other devices with similar configurations.
  • Eliminate possible causes: We list the possible causes and rule out the ones that the collected information contradicts. This requires thorough knowledge of the network and all the protocols that are involved.
  • Hypothesize: After eliminating possible causes you will end up with a couple of candidates that could be the problem. We select the most likely one.
  • Verify hypothesis: We will test our hypothesis to see if we are right or wrong. If we are right, we have a victory…if we are wrong, we try our other possible causes.
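The last three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `diagnose`, the list of causes, the likelihood scores, and the `verify` callback are all hypothetical and stand in for whatever you would actually check on your network.

```python
from typing import Callable, Iterable, Optional


def diagnose(
    candidates: Iterable[str],
    ruled_out: set[str],
    likelihood: dict[str, float],
    verify: Callable[[str], bool],
) -> Optional[str]:
    """Return the first verified cause, testing the most likely cause first."""
    # Eliminate possible causes: drop what the collected information rules out.
    remaining = [cause for cause in candidates if cause not in ruled_out]
    # Hypothesize: order what is left from most to least likely.
    remaining.sort(key=lambda cause: likelihood.get(cause, 0.0), reverse=True)
    # Verify hypothesis: test each candidate in turn until one is confirmed.
    for cause in remaining:
        if verify(cause):
            return cause
    return None  # nothing confirmed: go back and collect more information


# Hypothetical example data for a "no connectivity" report.
causes = ["bad cable", "wrong VLAN", "missing route", "ACL blocks traffic"]
ruled_out = {"bad cable"}  # assumed: the interface was seen to be up
likelihood = {"wrong VLAN": 0.5, "missing route": 0.3, "ACL blocks traffic": 0.2}
tried = []


def verify(cause: str) -> bool:
    tried.append(cause)              # keep a record of what was already tested
    return cause == "missing route"  # simulated root cause


root_cause = diagnose(causes, ruled_out, likelihood, verify)
print(root_cause, tried)
```

Keeping the `tried` list is the point of the exercise: it is the record that stops you from testing the same thing twice.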

If you don’t use a structured approach for troubleshooting, you might “follow your gut feeling” and get confused because you forget what you already tried. A structured approach also makes it easier to work together with other network engineers because you can share the steps you already went through.

Structured Troubleshooting Approach

Here are the steps in a nice flowchart:

[Figure: Structured Troubleshooting Approach flowchart]

We call this the structured troubleshooting approach. However, if you have a lot of experience with the network you are working on and as you become better at troubleshooting, this approach might be too time-consuming.

Shoot from the hip

Instead of walking through all the different steps in the structured troubleshooting approach, we can also jump from the “collect information” step directly to the “hypothesize” step and skip the “analyze information” and “eliminate possible causes” steps. If you are inexperienced with troubleshooting it’s best to use the structured troubleshooting approach. As you become better at troubleshooting, you might want to skip some of the steps…we call this the shoot-from-the-hip approach:

[Figure: Shoot from the Hip Troubleshooting]

Here’s the shoot from the hip model. The steps that we skip are in blue. If your instincts are wrong, you won’t lose your life, but you will lose valuable time. If you are right, however, you’ll save a lot of time (or become the new sheriff in town).

Troubleshooting approaches

Eliminating possible causes is an important step in the troubleshooting process, and there are a couple of approaches to how you can do this. Here they are:

  • Top-down.
  • Bottom-up.
  • Divide and conquer.
  • Follow the traffic path.
  • Spot the difference.
  • Replace components.

Let’s walk through the different approaches one by one!

Top-Down

[Figure: Top Down Troubleshooting]

Top-down means we start at the top of the OSI model (application layer) and work our way down to the bottom. The idea is that we check the application first and assume that if a certain layer is working, all the layers below it are also working. For example, if a ping (ICMP) from one computer to another succeeds, you can assume that layers 1, 2, and 3 are operational. The downside of this approach is that you need access to the application you are troubleshooting.

Bottom-Up

[Figure: Bottom Up Troubleshooting]

Bottom-up means we start at the bottom of the OSI model and work our way up. We start with the physical layer, which means we check our cables and connectors. Then we move up to the data link layer to see if Ethernet is working, spanning tree is working OK, port security is not causing issues, and VLANs are configured properly. After that we move on to the network layer, where we check our IP addresses, access-lists, routing protocols, and so on. This approach is very thorough but also time-consuming. If you are new to troubleshooting, I would recommend using this method because you will eliminate all the possible causes for problems.
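As a rough sketch, the bottom-up walk is just an ordered list of checks that stops at the first failure. The check names below come from the paragraph above; the function `bottom_up` and the `lambda` results are hypothetical placeholders for real verification steps.

```python
from typing import Callable, Optional

Check = tuple[int, str, Callable[[], bool]]  # (OSI layer, description, test)


def bottom_up(checks: list[Check]) -> Optional[tuple[int, str]]:
    """Run checks from layer 1 upward and return the first (layer, name) that fails."""
    # sorted() is stable, so checks within the same layer keep their order.
    for layer, name, check in sorted(checks, key=lambda item: item[0]):
        if not check():
            return layer, name
    return None  # every check passed


# Simulated results: only the VLAN check fails.
checks = [
    (3, "IP addresses", lambda: True),
    (1, "cables and connectors", lambda: True),
    (2, "spanning tree", lambda: True),
    (2, "port security", lambda: True),
    (2, "VLAN configuration", lambda: False),
    (3, "access-lists", lambda: True),
]

first_failure = bottom_up(checks)
print(first_failure)  # (2, 'VLAN configuration')
```

Because the walk stops at layer 2, the layer 3 checks never run, which is where this approach saves you from chasing symptoms higher up.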

Divide and conquer

[Figure: Divide and Conquer Troubleshooting]

Divide and conquer means we start in the middle of the OSI model.

You can use this model if you are not sure if top-down or bottom-up is more effective. The idea is that you’ll try to send a ping from one device to another. If the ping works, you know that layers 1-3 are operational, and you can work your way up in the OSI model. If the ping fails, you know something is not right, and you’ll work your way to the bottom of the OSI model.
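The decision after the ping can be written down as a tiny function. This is a minimal sketch; the name `next_direction` is made up, and the ping result is passed in as a boolean so the example doesn’t depend on any real network.

```python
def next_direction(ping_ok: bool) -> tuple[str, list[int]]:
    """Pick the direction and the OSI layers still to check after a layer 3 ping test."""
    if ping_ok:
        # Layers 1-3 are operational: continue upward from layer 4.
        return "up", [4, 5, 6, 7]
    # Something at layer 3 or below is broken: work down toward layer 1.
    return "down", [3, 2, 1]


print(next_direction(True))   # ('up', [4, 5, 6, 7])
print(next_direction(False))  # ('down', [3, 2, 1])
```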

Follow the traffic path

[Figure: H1 - SW1 - SW2 - R1 - H2]

The follow-the-traffic-path approach is very useful. First, we’ll try to send a ping from H1 to H2. If it fails, we’ll check all the devices in its path. First, we’ll verify if SW1 is configured correctly. If it’s looking good, we’ll move on to SW2, verify it, and then move on to R1.
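In code, following the traffic path is a simple walk over the devices in the order the traffic passes through them. The device names match the example; the helper `first_faulty_device` and the simulated `device_ok` check are hypothetical.

```python
from typing import Callable, Optional


def first_faulty_device(
    path: list[str], device_ok: Callable[[str], bool]
) -> Optional[str]:
    """Walk the devices in path order and return the first one that fails verification."""
    for device in path:
        if not device_ok(device):
            return device
    return None  # every device on the path looks fine


path = ["SW1", "SW2", "R1"]  # devices between H1 and H2, in traffic order
checked = []


def device_ok(device: str) -> bool:
    checked.append(device)
    return device != "SW2"  # simulated: SW2 is misconfigured


faulty = first_faulty_device(path, device_ok)
print(faulty, checked)
```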

Spot the difference

[Figure: Spot the Difference]

You’ve probably done one of these before. Spotting the difference in configurations or the output of show commands can be useful, but it’s very easy to miss something. If you have a number of branch routers with a similar configuration and only one is not working, you can see if there’s a difference in the configuration. Network engineers who don’t have a lot of experience usually use this approach. You might be able to solve the problem, but there’s a risk that you don’t really know what you are doing.
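Since it’s so easy to miss something by eye, you can let a tool do the comparison. Here is a small sketch that uses `difflib` from Python’s standard library. The two configuration snippets are invented for illustration, as are the labels `branch1` and `branch2`.

```python
import difflib

# Hypothetical configuration of a working branch router.
working = """\
interface GigabitEthernet0/1
 ip address 192.168.1.1 255.255.255.0
 no shutdown
router ospf 1
 network 192.168.1.0 0.0.0.255 area 0
""".splitlines()

# Hypothetical configuration of the router that is not working.
broken = """\
interface GigabitEthernet0/1
 ip address 192.168.1.1 255.255.255.0
 no shutdown
router ospf 1
 network 192.168.1.0 0.0.0.255 area 1
""".splitlines()

diff = difflib.unified_diff(
    working, broken, fromfile="branch1", tofile="branch2", lineterm=""
)

# Keep only the changed lines and skip the "---" / "+++" file headers.
changed = [
    line
    for line in diff
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
print("\n".join(changed))
```

The diff only shows that the OSPF area differs. Whether area 0 or area 1 is correct is still something you need to understand yourself.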

Replace Components

[Figure: Replace Computer, Cable, Switch]

The last approach to solve our problem is to replace components. Let’s say we have a scenario where a computer is unable to access the network. In the example above, I could replace the computer to eliminate any chance of the computer being the problem. We could also replace the cable, and if we suspect the switch, we can replace it with a new one and copy the old configuration over to see if there are any hardware problems.
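The same idea as a short sketch: swap one component at a time for a known-good spare and retest after each swap. The function `find_faulty_component` and the simulated fault are hypothetical; in practice the callback is you, physically replacing the part and testing again.

```python
from typing import Callable, Optional


def find_faulty_component(
    components: list[str],
    works_with_replaced: Callable[[str], bool],
) -> Optional[str]:
    """Swap one component at a time for a known-good spare and retest."""
    for component in components:
        if works_with_replaced(component):
            return component  # the problem disappeared, so this part was at fault
    return None  # no single swap fixed it


simulated_fault = "cable"
result = find_faulty_component(
    ["computer", "cable", "switch"],
    lambda swapped: swapped == simulated_fault,
)
print(result)  # cable
```

Swapping only one component per test matters: if you replace the cable and the switch at the same time, you won’t know which one was broken.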

Conclusion

This is all the theory I have to share with you about network maintenance and troubleshooting.

We can talk all day long about different methods and such, but the key to becoming an expert in troubleshooting consists of two things:

  • Truly understand all the different networking protocols like OSPF, EIGRP, BGP, spanning tree, and everything else you have running on your network. You can’t fix something if you have no idea how it works.
  • Gain experience by doing labs and troubleshooting broken networks!

You can read all the books about driving a car, how an engine works, what a clutch is, or how to use your mirrors, but at the end of the day…you have to sit in a car and start driving to learn how to drive a car.

When you are new to troubleshooting, it might be a good idea to create a list for yourself with all the different layers and the things you should check, something like this:

[Figure: Troubleshooting Mindmap]

This mindmap created with Xmind is far from complete, but it should give you an idea of how to create one yourself. When you are troubleshooting and discovering new things, add them to your mindmap for the future. You’ll have a nice overview of things to look for per layer when you get stuck. Once you spend more time troubleshooting, the list of things to look for will be hardcoded in your brain, and you’ll troubleshoot on auto-pilot.

I hope this lesson has been useful to you. If you have any questions feel free to leave a comment!

Forum Replies

  1. great blog … thanks

  2. So good and very helpful for me. Thank you, Rene!!! You are very good for us.

  3. I wanted to add some details about the “Collect Information” step:

    • When did the problem first occur?
    • How often does the problem re-occur? (Every 4 hours? Check ARP!)
    • Who is it affecting? (One person, a floor, a building)
    • Were there any other changes happening around the same time?

  4. Those are some good pointers Tristan, thanks for sharing!

  5. I really enjoyed this lesson so much that I memorized the main topics. I like letter groupings, so I did CAE HV and TBD FSR. Don’t ask me why that is easy for me to remember, but something about those sticks in my memory, just like
    E ACE WNI D for the log messages. I sometimes forget a letter’s meaning, but a quick look puts it back in my memory for weeks, if not longer. The grouping is also important for me to remember… weird, huh, lol…

    Anyway, I really liked this lesson. It stood out for me as one of my favorites.
