AI and ML in Networking

Artificial Intelligence (AI) and Machine Learning (ML) have a major impact on almost every organization and individual. Things changed quickly with the rise of generative AI, especially after OpenAI released ChatGPT to the general public in November 2022. In the AI and ML lesson, we discussed the differences between AI and ML and the different types of AI.

AI and ML are also used in networking. Networking complexity has increased because we have larger networks, with multi-tenancy and more virtualization, such as overlay networks. Modern networks require more than traditional rule-based management systems. AI and ML allow us to analyze data in real time, identify patterns, predict issues, and even automate decision-making.

In this lesson, we’ll focus on AI and ML for networking. Some people might fear that AI will replace network engineers, but in reality, it will probably be another tool in your toolbox. Learn how to use it to make your life and job easier.

We’ll start with large language models (LLMs) because most of us are familiar with them in some way, and we’ll see how we can use them in networking. We’ll also examine how AI/ML is used in different networking operations, today and in the future, and discuss some vendor products that use AI/ML.

Large Language Models (LLM)

Let’s start with LLMs and how we can use them “manually.” Many of us have used ChatGPT at some point. LLMs fall under the generative AI category and are useful to network engineers. They use “prompts” as input. Let’s take a look at how we can use them.




I’m a fan of Msty, which lets you run multiple LLMs in parallel. You can send a single prompt to, say, OpenAI’s GPT and Anthropic’s Claude at the same time, compare their responses, and pick the best outcome.

Network Configurations

We can use an LLM to create network configurations. You can even let it generate the topology file if you use an emulator such as GNS3, Cisco CML, EVE-NG, or Containerlab. By creating a prompt with high-level requirements, we can generate topologies and configuration files. The outcome may not be perfect, but it gives you something to work with. Usually, it’s easier to edit a rough draft than to create something from scratch.

Code

Back in the day, all we used as network engineers was the CLI. Nowadays, we have more coding to do. Many network devices can also be configured through an API, so you’ll need to write code to interact with them, for example in Python. We can also use an API to communicate with other products, such as Cisco DNA Center or the Cisco Meraki dashboard.
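
For example, here’s a minimal Python sketch that uses the requests library to list the organizations an API key has access to in the Cisco Meraki dashboard. The endpoint and header are based on the Meraki Dashboard API v1, but treat this as a sketch: double-check them against the current API documentation and replace the placeholder API key with your own.

import requests

# Minimal sketch: list the organizations a Meraki dashboard API key can access.
# Assumes the Meraki Dashboard API v1; verify the endpoint and header name
# against the current documentation before relying on this.
API_KEY = "your-api-key-here"  # placeholder, generate a real key in the Meraki dashboard
BASE_URL = "https://api.meraki.com/api/v1"

response = requests.get(
    f"{BASE_URL}/organizations",
    headers={"X-Cisco-Meraki-API-Key": API_KEY},
    timeout=10,
)
response.raise_for_status()

for org in response.json():
    print(org["id"], org["name"])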

There’s also Infrastructure as Code (IaC) and CI/CD, which are often defined in YAML or another format. Terraform, for example, uses the HashiCorp Configuration Language (HCL).

Example

Let’s look at an example of how Claude 3.5 can help with an Ansible playbook.

Prompt

Create Ansible playbook to create loopback0 interface with IP address 1.1.1.1/32 on Cisco IOS router.

Output

Here’s an Ansible playbook that creates a loopback0 interface with IP address 1.1.1.1/32 on a Cisco IOS router:

---
- name: Configure Loopback0 interface on Cisco IOS router
  hosts: cisco_routers
  gather_facts: no
  
  tasks:
    - name: Create Loopback0 interface
      cisco.ios.ios_interface:
        name: Loopback0
        description: Loopback Interface
        enabled: yes
        state: present

    - name: Configure IP address on Loopback0
      cisco.ios.ios_l3_interface:
        name: Loopback0
        ipv4: 1.1.1.1/32

    - name: Save the configuration
      cisco.ios.ios_config:
        save_when: always

This playbook does the following:

  • The playbook targets hosts in the cisco_routers group. Make sure you have defined this group in your Ansible inventory file.
  • gather_facts is set to no to speed up the playbook execution, as we don’t need to gather facts for this task.
  • The first task uses the cisco.ios.ios_interface module to create the Loopback0 interface, set a description, and ensure it’s enabled.
  • The second task uses the cisco.ios.ios_l3_interface module to configure the IP address 1.1.1.1/32 on the Loopback0 interface.
  • The final task uses the cisco.ios.ios_config module to save the configuration to the startup-config.

Be careful not to use an LLM as a “do my homework” solution. Make sure you understand what it generates and how it works, and run any code it generates in a lab environment first.

Network Documentation

Network documentation is another good example. Creating it is a time-consuming and tedious task. If you supply information about your network, such as configurations, logs, and other relevant data, an LLM can generate network documentation for you.

As discussed in the AI/ML lesson, be aware of the privacy issues of sending this data to a public LLM. This is an example where a local LLM might make more sense.

Show Commands

Network engineers frequently rely on “show” commands to get detailed information about the status and configuration of network devices. LLMs can interpret the output of these commands and explain what is going on. They might even come up with suggestions of what to check next or how to fix the issue. It can also be useful to ask an LLM for a list of show commands you can use to check certain things.

Example

Let’s see if GPT can come up with a list of show commands.

Prompt

For example, here is a prompt I used:

Give me a list of Cisco IOS show commands to check OSPF neighbor adjacencies.

Output

Certainly! Here is a list of Cisco IOS show commands you can use to check OSPF (Open Shortest Path First) neighbor adjacencies:

  • show ip ospf neighbor
    • This command provides a summary of OSPF neighbors, including their states, interface they are connected through, and more.
  • show ip ospf neighbor detail
    • This command gives detailed information about each OSPF neighbor, including full state details, uptime, and additional statistics.
  • show ip ospf interface brief
    • This command displays a summary of all interfaces running OSPF, which can help you quickly identify interfaces that should have OSPF neighbors.
  • show ip ospf interface [interface]
    • Replace [interface] with the specific interface name (e.g., GigabitEthernet0/1). This provides detailed information about the OSPF status on that particular interface, including neighbor relationships.
  • show ip ospf database
    • While not specifically focused on neighbors, this command shows the contents of the OSPF link-state database and can help verify if LSAs (Link-State Advertisements) from neighbors are being received correctly.
  • show ip ospf process [process-id] adjacency
    • Replace [process-id] with the specific OSPF process ID if you have multiple processes running. This command provides adjacency information for a specific OSPF process.
  • show ipv6 ospf neighbor (for IPv6)
    • If you’re working with OSPFv3 for IPv6, this command will provide similar information to show ip ospf neighbor, but for IPv6 neighbors.

This produces a useful list of show commands. It’s possible that some of these don’t work because LLMs can hallucinate, but it gives you an excellent list to start with.

Debug Outputs

Besides show commands, we also use debug commands to see what happens in real time. Some debug commands generate a lot of output and can be difficult to interpret. We can ask an LLM to explain what is going on.

Example

Let’s see if Claude can explain what is going on with a debug.

Prompt

Explain this debug to me:

R1#debug ip ospf hello
OSPF hello events debugging is on
R1#
OSPF: Send hello to 224.0.0.5 area 0 on FastEthernet0/0 from 192.168.12.1
OSPF: Rcv hello from 192.168.12.2 area 0 from FastEthernet0/0 192.168.12.2
OSPF: Mismatched hello parameters from 192.168.12.2
OSPF: Dead R 11 C 24, Hello R 10 C 6 Mask R 255.255.255.0 C 255.255.255.0

Output

Here’s the output from Claude:

This debug output shows OSPF (Open Shortest Path First) hello packet events on router R1. Let’s break it down line by line:

  1. R1#debug ip ospf hello: This command enables debugging for OSPF hello events.
  2. OSPF: Send hello to 224.0.0.5 area 0 on FastEthernet0/0 from 192.168.12.1:
    • R1 is sending a hello packet to the multicast address 224.0.0.5 (all OSPF routers).
    • It’s for area 0.
    • Sent from interface FastEthernet0/0 with IP 192.168.12.1.
  3. OSPF: Rcv hello from 192.168.12.2 area 0 from FastEthernet0/0 192.168.12.2:
    • R1 received a hello packet from IP 192.168.12.2.
    • It’s for area 0.
    • Received on interface FastEthernet0/0.
  4. OSPF: Mismatched hello parameters from 192.168.12.2:
    • There’s a mismatch in hello parameters between R1 and the router at 192.168.12.2.
  5. OSPF: Dead R 11 C 24, Hello R 10 C 6 Mask R 255.255.255.0 C 255.255.255.0:
    This line shows the mismatched parameters:

    • Dead interval: R (Received) is 11 seconds, C (Configured on R1) is 24 seconds.
    • Hello interval: R (Received) is 10 seconds, C (Configured on R1) is 6 seconds.
    • Subnet Mask: Both R (Received) and C (Configured) are 255.255.255.0, so this is not mismatched.

The main issue here is that the Dead and Hello intervals don’t match between the two routers. For OSPF to form an adjacency, these parameters must match. The router at 192.168.12.2 needs to be reconfigured to match the settings on R1, or vice versa, for OSPF to work correctly between these routers.

Limitations

Be careful when working with LLMs. Although they can produce amazing results at first sight, they can sometimes be flat-out wrong. LLMs are basically advanced word predictors; they don’t understand what they say and have no inner understanding of networking and protocols. You are still the networking expert.

LLMs are great when you already have some knowledge of a subject. When one produces output, you can quickly tell whether it looks OK and fact-check it. Having the LLM generate something for you can be really helpful because it’s usually easier to check, edit, and revise a draft than to start from scratch.

Let’s say you are studying a new topic such as BGP. It can be useful to have some general discussions with an LLM and let it explain what BGP is, in what scenarios we use it, and so on. Asking it to generate configurations, show, and debug commands is also useful.

Once you ask questions about when BGP chooses a specific path, what happens when routers one and two fail simultaneously, or how failover occurs, you are on a slippery slope. These questions require a deeper understanding, and LLMs will usually fall short and tend to hallucinate.

Sometimes, it can help to use Retrieval-Augmented Generation (RAG). In plain English, that means the LLM can use an external resource. For example, you could upload a PDF containing the information you want to work with, or you could let the LLM browse the Internet for the latest information.
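
As a rough illustration, here’s a Python sketch of the RAG idea: retrieve the most relevant snippets first, then add them to the prompt. The keyword-overlap retriever and the documents are made up for this example, and ask_llm() is a placeholder for whatever LLM API or local model you use; real RAG pipelines typically use vector embeddings instead.

# Rough sketch of RAG: retrieve relevant text first, then add it to the prompt.
# The retriever is a naive keyword-overlap score; real pipelines typically use
# vector embeddings. ask_llm() is a placeholder for your LLM API or local model.
documents = {
    "ospf-design.txt": "Area 0 is the backbone. All other areas must connect to area 0.",
    "bgp-policy.txt": "We prefer the path via ISP-A by setting local preference to 200.",
    "ip-plan.txt": "Loopbacks use 10.255.0.0/24, point-to-point links use 10.0.0.0/16.",
}

def retrieve(question: str, top_n: int = 2) -> list[str]:
    """Return the top_n snippets that share the most words with the question."""
    words = set(question.lower().split())
    scored = sorted(
        documents.items(),
        key=lambda item: len(words & set(item[1].lower().split())),
        reverse=True,
    )
    return [text for _, text in scored[:top_n]]

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("Call your LLM API or local model here")

question = "Which local preference do we use for ISP-A?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# answer = ask_llm(prompt)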

AIOps and AI Networking

When you dive into AI and networking, you’ll run into two terms:

  • AIOps
  • AI Networking




AIOps, a term introduced by Gartner, refers to the use of AI and ML systems across all IT operations, including networking, servers, storage, and applications. These systems automate IT operations through event correlation, anomaly detection, and root cause analysis across the entire IT stack.

AI Networking, also coined by Gartner, is a subset of AIOps that specifically focuses on applying AI/ML to networking performance and operations.

Discussions of AIOps and AI Networking usually mention three phases:

  • Day 0: Planning and configuration:
    • This is the initial design of an IT system. This phase involves capacity planning, hardware/software selection, network design (such as IP addressing schemes), security policy setup, etc.
  • Day 1: Onboarding and implementation:
    • This includes activities such as installing hardware components like servers or switches, configuring software, setting up virtual machines, integrating various systems, and implementing initial security protocols.
  • Day 2: Daily operation and optimization:
    • Monitoring for performance issues or security threats, routine upgrades, proactive maintenance, etc.

AIOps primarily concentrates on Day 0 and Day 1 activities, while AI Networking focuses more on Day 2 operations.

Application in Network Operations

Besides LLMs, let’s see how we can implement AI/ML in network operations. This should help make networks more efficient and increase the productivity of network engineers.

We’ll look at a general overview of what AI and ML could do for our networks. Some of these capabilities are available today but limited; other technology is still in development, or we might see it in the future.

Network Configuration

Initially, we configured networks using the CLI. Around the 2010s, SDN (Software-Defined Networking) was introduced, and we moved toward programmable network management by separating the control and data planes.

In the late 2010s, SDN evolved into intent-based networking, which aims to automate network configurations based on a desired outcome.

The next phase is about integrating AI and ML into networking. AI and ML can automate routine network management tasks such as configuration changes, software updates, and monitoring, freeing up network engineers to focus on more complex and strategic activities.

Network Monitoring

The first networks were simple. We configured network devices manually through the CLI, and we could monitor them with show and debug commands.

SNMP (Simple Network Management Protocol) was developed in the 1980s to manage networks. We use SNMP to collect information from network devices and configure them. Syslog was also created in the 1980s and offers a standard for message logging to a central server.

In the 1990s, Cisco introduced NetFlow, which allows us to collect IP traffic information. In the late 1990s, NMSes (Network Management Systems) were developed to integrate multiple protocols, such as SNMP, Syslog, and NetFlow, under one roof.

In the mid-2010s, we saw the rise of telemetry as a more advanced method than SNMP for collecting real-time data from network devices.

AI and ML are the next steps in this evolution. Algorithms can analyze historical data and current network conditions from network devices such as routers and switches to predict possible network failures. This can be useful for scheduling preventive maintenance and reducing downtime.
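
As a toy illustration of the idea, the Python sketch below learns a baseline from historical interface error counters and flags a deviation with a simple z-score. The numbers are made up, and real AI/ML monitoring uses far more data and richer models, but the principle of learning what “normal” looks like and flagging deviations is the same.

import statistics

# Toy example: learn a baseline from historical interface error counts and flag
# outliers with a z-score. Real systems use far more data and richer models.
history = [3, 5, 2, 4, 6, 3, 5, 4, 2, 5]  # made-up errors per 5-minute interval
current = 42

mean = statistics.mean(history)
stdev = statistics.stdev(history)
z_score = (current - mean) / stdev

if z_score > 3:
    print(f"Anomaly: {current} errors is {z_score:.1f} standard deviations above the baseline")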

Network Optimization

Besides monitoring, AI and ML algorithms can analyze traffic patterns in real time, create congestion forecasts, and use this information to optimize routing decisions. This can help reduce latency, increase bandwidth, avoid bottlenecks, and prevent outages.

We can also use this for Quality of Service (QoS) management. QoS settings can be managed dynamically based on the current or forecasted load.

Network Security

AI and ML can also help with security. By analyzing and learning network traffic, AI/ML can identify and react to threats such as malware, DDoS, unauthorized access attempts and more. An AI/ML system can then immediately block malicious traffic, isolate infected devices, etc.

Device Tracking

AI enables precise device tracking, ensuring that all devices connected to the network are accounted for and secure. This is especially important in environments with a large number of IoT devices.

Network Troubleshooting

Network troubleshooting can be time-consuming for network engineers. You have to use show and debug commands, collect logging information, and then go through everything to see what is going on. AI/ML models can help with root cause analysis. By analyzing data from multiple network devices, it’s possible to pinpoint issues quickly. An AI could share its findings with a network engineer, suggest possible fixes, or even attempt to fix the issue on its own. This makes troubleshooting quicker and reduces network downtime.
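
As a simple illustration of one building block, event correlation, here’s a Python sketch that finds syslog messages from other devices within a few seconds of a trigger event. The log lines are made up, and real root cause analysis engines correlate far more signals (topology, telemetry, configuration changes), but grouping events in time is one of the basic techniques.

from datetime import datetime, timedelta

# Toy event correlation: given a trigger event, find messages from other devices
# that occurred within a few seconds of it. The log entries below are made up.
logs = [
    ("R1", "2024-06-01 10:00:01", "%LINK-3-UPDOWN: Interface Gi0/1, changed state to down"),
    ("R2", "2024-06-01 10:00:03", "%OSPF-5-ADJCHG: Nbr 10.255.0.1 on Gi0/2 from FULL to DOWN"),
    ("R3", "2024-06-01 10:07:15", "%SYS-5-CONFIG_I: Configured from console"),
]

def correlate(trigger_time: str, window_seconds: int = 10):
    """Yield log entries that fall within window_seconds of the trigger."""
    t0 = datetime.strptime(trigger_time, "%Y-%m-%d %H:%M:%S")
    for device, timestamp, message in logs:
        t = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        if abs(t - t0) <= timedelta(seconds=window_seconds):
            yield device, timestamp, message

for event in correlate("2024-06-01 10:00:01"):
    print(event)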

Generative AI

Besides using LLMs such as OpenAI’s GPT or Anthropic’s Claude through a GUI, we can also use generative AI in other scenarios.

For example, you could create a pipeline where a user defines a goal such as “Deploy multi-area OSPF on these six routers.” You could use generative AI to create the initial device configurations, tools such as Ansible or Nornir to deploy everything, and pyATS to run validation tests. By combining the inputs and outputs of LLMs, you can get very creative.
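
Here’s a minimal Python sketch of how such a pipeline could be glued together. The generate_configs() function is a placeholder for your LLM API call, and the Ansible playbook and pyATS job it references are assumptions standing in for files in your own project; treat this as an outline, not a working implementation.

import os
import subprocess

# Pipeline sketch: goal -> LLM-generated configs -> deploy -> validate.
# generate_configs() is a placeholder for an LLM API call; the playbook and
# pyATS job referenced below are assumed to exist in your own project.
def generate_configs(goal: str) -> dict[str, str]:
    """Ask an LLM to turn a high-level goal into per-device configurations."""
    raise NotImplementedError("Call your LLM API here and parse its response")

goal = "Deploy multi-area OSPF on these six routers"
configs = generate_configs(goal)

os.makedirs("configs", exist_ok=True)
for hostname, config in configs.items():
    with open(f"configs/{hostname}.cfg", "w") as f:
        f.write(config)

# Deploy with Ansible, then run validation tests with pyATS.
subprocess.run(["ansible-playbook", "deploy_configs.yml"], check=True)
subprocess.run(["pyats", "run", "job", "validate_ospf.py"], check=True)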

Products

Let’s also look at some vendor products that use AI and ML. Keep in mind that AI is a buzzword, and many vendors like to put an “AI sticker” on everything they own.

Cisco

Here’s an overview of some Cisco products that focus on AIOps and use AI and ML:

  • Cisco DNA Center (now Cisco Catalyst Center): Uses AI for detailed network visibility, enhancing security and management.
  • Cisco Nexus Dashboard: Incorporates Nexus Dashboard Insights (NDI) with AI and ML for event analytics. It helps in identifying issues and optimizing network performance.
  • Cisco Meraki: Utilizes ML algorithms for advanced WAN analytics. Its wireless solutions use AI to optimize Radio Resource Management (RRM), improving connectivity and simplifying management by leveraging historical RF data.
  • Meraki Video Analytics: Employs AI for video analytics and object detection.
  • Cisco AppDynamics: Features generative AI with an AI assistant to enhance application performance monitoring.
  • Cisco ThousandEyes: Offers cloud and SaaS visibility, using AI to analyze network measurements, provide recommendations, and automate actions to predict and resolve internet outages through digital experience assurance (DXA).
  • Cisco Secure Network Analytics (formerly Stealthwatch): Uses generative AI to analyze network data, identify anomalies, predict potential issues, and suggest remediation steps for efficient troubleshooting.

Others

We also have some examples from other vendors:

  • Juniper Mist AI and HPE Aruba AIOps: Leverage AI for enhanced network management and operations.
  • Check Point Infinity SOC: Uses AI and ML to manage and protect networks, cloud environments, and endpoints by identifying and mitigating security threats in real-time.
  • Palo Alto Networks Cortex XDR: Employs AI/ML for advanced threat detection and response.
  • Fortinet FortiAI: Utilizes AI to analyze threats and automate mitigation processes.
  • F5 BIG-IP DNS: Applies AI/ML to route DNS traffic efficiently, protect against DNS security threats, and ensure scalability.
  • Arista CloudVision AIOps: Uses AI/ML for root cause analysis by correlating network events and suggesting potential fixes.
  • Juniper Marvis Actions (powered by Mist AI): Automatically resolves issues or recommends actions requiring user intervention.

Conclusion

That’s all we have on AI and ML for networking. You have now learned:

  • How to use an LLM as an assistant to write code, analyze show and debug commands, create configurations, and more.
  • What AIOps and AI Networking are about.
  • How we can use AI and ML in networking operations.

I hope you enjoyed this lesson. If you have any questions, please leave a comment.