Things go wrong on our networks for different reasons: humans make errors in their configurations, hardware can fail, software updates may include bugs, and changing traffic patterns might cause congestion. There are different approaches to troubleshooting these errors, and some are more effective than others.
Troubleshooting consists of 3 steps:
It all starts when someone or something reports a problem. Often this will be a user who calls the helpdesk because something is not working as expected, but it’s also possible that you find issues through network monitoring (you do monitor your network, right?). The next step is to diagnose the problem, and it’s important to find its root cause. Once you have found the problem you will implement a (temporary) solution.
Diagnosing the problem is one of the most important steps because we need to find the root cause. Here’s what we do to diagnose the problem:
- Collect information: Most of the time a problem report doesn’t give us enough information. Users are very good at reporting “network is down” or “my computer doesn’t work” but this doesn’t tell us anything. We need to collect information by asking our users detailed questions or we use network tools to gather information.
- Analyze information: Once we have gathered all information we will analyze it to see what is wrong. We can compare our information to previously collected information or to other devices with similar configurations.
- Eliminate possible causes: We need to think about the possible causes and rule out the ones that cannot explain the problem. This requires thorough knowledge of the network and all the protocols that are involved.
- Hypothesize: After eliminating possible causes you will end up with a couple of remaining causes that could explain the problem. We will select the most likely one.
- Verify hypothesis: We will test our hypothesis to see if we are right or wrong. If we are right we have a victory…if we are wrong we test our other possible causes.
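The diagnosis steps above can be sketched as a small Python loop. This is only an illustration, not a real tool: the `Hypothesis` class, the `likelihood` scores and the `diagnose` function are hypothetical names, and the `verify` callables stand in for whatever test you would actually run on the network.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Set, Tuple


@dataclass
class Hypothesis:
    cause: str                  # short description of a possible cause
    likelihood: float           # our own rough estimate between 0.0 and 1.0
    verify: Callable[[], bool]  # test that returns True if the cause is confirmed


def diagnose(candidates: Iterable[Hypothesis],
             ruled_out: Set[str]) -> Tuple[Optional[str], List[str]]:
    """Return the confirmed cause (or None) and the list of causes we tested."""
    # Eliminate possible causes: drop everything we could already rule out.
    remaining = [h for h in candidates if h.cause not in ruled_out]
    # Hypothesize: try the most likely remaining cause first.
    remaining.sort(key=lambda h: h.likelihood, reverse=True)

    tested: List[str] = []
    for hypothesis in remaining:
        tested.append(hypothesis.cause)  # keep track of what we already tried
        # Verify hypothesis: stop as soon as one cause is confirmed.
        if hypothesis.verify():
            return hypothesis.cause, tested
    return None, tested


# Made-up candidates; the lambdas stand in for real tests.
candidates = [
    Hypothesis("faulty cable", 0.2, lambda: False),
    Hypothesis("wrong VLAN on switchport", 0.6, lambda: False),
    Hypothesis("missing route", 0.5, lambda: True),
]
cause, tested = diagnose(candidates, ruled_out={"faulty cable"})
print(cause)   # missing route
print(tested)  # ['wrong VLAN on switchport', 'missing route']
```

Returning the list of tested causes is a deliberate choice here: it gives you a record of what you already tried, which you can share with other engineers.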
If you don’t use a structured approach for troubleshooting you might just “follow your gut feeling” and get confused because you forget what you have already tried. A structured approach also makes it easier to work together with other network engineers because you can share the steps you already went through.
Here are the steps in a nice flowchart:
We call this the structured troubleshooting approach. However, if you have a lot of experience with the network you are working on, and as you become better at troubleshooting, this approach might be too time-consuming.
Instead of walking through all the different steps in the structured troubleshooting approach we can also jump from the “collect information” step directly to the “hypothesize” step and skip the “analyze information” and “eliminate possible causes” steps. If you are inexperienced with troubleshooting it’s best to use the structured troubleshooting approach. As you become better at troubleshooting you might want to skip some of the steps…we call this the shoot from the hip approach:
Here’s the shoot from the hip model. The steps that we skip are in blue. If your instincts are wrong you won’t lose your life but you will lose valuable time. If you are right however you’ll save a lot of time (or become the new sheriff in town).
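The trade-off of shooting from the hip can be sketched in a few lines of Python. The function name and its parameters are hypothetical, and falling back to the structured approach when the hunch turns out wrong is an assumption of this sketch, not something the model itself prescribes.

```python
from typing import Callable, Optional


def shoot_from_the_hip(hunch: str,
                       verify: Callable[[str], bool],
                       structured_approach: Callable[[], Optional[str]]) -> Optional[str]:
    # Jump straight from "collect information" to "hypothesize": test the hunch first.
    if verify(hunch):
        return hunch  # instincts were right, time saved
    # Instincts were wrong: the time spent on the hunch is lost,
    # so we go back to the full structured process (assumed fallback).
    return structured_approach()


# Example with made-up stand-ins for real tests.
actual_cause = "duplex mismatch"
result = shoot_from_the_hip(
    hunch="spanning-tree blocking port",
    verify=lambda cause: cause == actual_cause,
    structured_approach=lambda: actual_cause,
)
print(result)  # duplex mismatch
```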
Eliminating possible causes is an important step in the troubleshooting process and there are a couple of approaches you can use to do this:
- Top-down.
- Bottom-up.
- Divide and conquer.
- Follow the traffic path.
- Spot the difference.
- Replace components.
Let’s walk through the different approaches one by one!
Top-down means we start at the top of the OSI model (application layer) and work our way down to the bottom. The idea is that we check the application to see if it’s working and assume that if a certain layer is working, all the layers below it are also working. If a ping (ICMP) from one computer to another succeeds, you can assume that layers 1, 2 and 3 are operational. The downside of this approach is that you need access to the application that you are troubleshooting.
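As a rough sketch of the top-down idea in Python: we walk from the highest layer down and stop at the first layer that works, because everything below it is assumed to be fine. The `top_down` function and the per-layer check functions are hypothetical placeholders for real tests.

```python
from typing import Callable, Dict, Optional


def top_down(checks: Dict[int, Callable[[], bool]]) -> Optional[int]:
    """Return the lowest OSI layer number whose check failed, or None if the top check passed."""
    suspect = None
    # Walk from the highest layer number down to the lowest.
    for layer in sorted(checks, reverse=True):
        if checks[layer]():
            # This layer works, so we assume all layers below it work too.
            break
        suspect = layer
    return suspect


checks = {
    7: lambda: False,  # application does not respond
    4: lambda: False,  # TCP session cannot be established
    3: lambda: True,   # ping succeeds
}
print(top_down(checks))  # 4
```

In this made-up example the ping works but the TCP session does not, so layer 4 is where we would start looking.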
Bottom-up means we start at the bottom of the OSI model and work our way up. We start with the physical layer, which means we check our cables and connectors. Then we move up to the data link layer to see if Ethernet is working, spanning-tree is working OK, port security is not causing issues and VLANs are configured properly. After that we move on to the network layer, where we check our IP addresses, access-lists, routing protocols and so on. This approach is very thorough but also time-consuming. If you are new to troubleshooting I would recommend using this method because you will eliminate all the possible causes for problems.
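The bottom-up approach could be sketched like this in Python. The `bottom_up` function and the `Check` tuple layout are assumptions made for illustration, and the lambdas stand in for real checks such as inspecting a cable or a VLAN configuration.

```python
from typing import Callable, List, Optional, Tuple

Check = Tuple[int, str, Callable[[], bool]]  # (OSI layer, description, test)


def bottom_up(checks: List[Check]) -> Optional[Tuple[int, str]]:
    """Run the checks from layer 1 upward and return the first one that fails."""
    # sorted() is stable, so checks within the same layer keep their listed order.
    for layer, description, test in sorted(checks, key=lambda c: c[0]):
        if not test():
            return layer, description
    return None  # nothing failed in the layers we checked


checks = [
    (3, "IP addresses", lambda: True),
    (1, "cables and connectors", lambda: True),
    (2, "spanning-tree", lambda: True),
    (2, "port security", lambda: True),
    (2, "VLAN configuration", lambda: False),
    (3, "access-lists", lambda: True),
]
print(bottom_up(checks))  # (2, 'VLAN configuration')
```

Unlike the top-down approach, nothing is assumed to work here: every lower layer is tested before we move up, which is why it is thorough but slow.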
Divide and conquer means we start in the middle of the OSI-model.
great blog … thanks
So good and very helpful for me. Thank you, Rene!!! You are very good for us.
I wanted to add some details about the “Collect Information” step:
- When did the problem first occur?
- How often does the problem re-occur? (Every 4 hours? Check ARP!)
- Who is it affecting? (One person, a floor, a building)
- Were there any other changes happening around the same time?
Those are some good pointers Tristan, thanks for sharing!
I enjoyed this lesson so much that I memorized the main topics. I like letter groupings, so I did CAE HV and TBD FSR. Don’t ask me why that is easy for me to remember, but something about those sticks in my memory, just like E ACE WNI D for the log messages. I sometimes forget a letter’s meaning, but a quick look puts it back in my memory for weeks, if not longer. The grouping is also important for me to remember… weird, huh, lol…
Anyway, I really liked this lesson; it stood out for me as one of my favorites.