Detecting and visualizing common errors and misconfigurations across your network
Introduction
All too often, technology managers believe that if you use best-of-breed tools, nothing can go wrong. It makes sense, right? Pick the best, buy the best, implement the best, and you should have a fairly easy job managing your infrastructure. On the surface, that logic holds up. In practice, it is far from reality. So, what do we do? To mitigate failures, it is important to approach the full lifecycle of any system with the mindset that anything that can go wrong will go wrong (Murphy’s Law). Werner Vogels, CTO of Amazon.com, said it best back in good ole 2016:
“Failures are a given and everything will eventually fail over time: from routers to hard disks, from operating systems to memory units corrupting TCP packets, from transient errors to permanent failures. This is a given, whether you are using the highest-quality hardware or lowest cost components. This becomes an even more important lesson at scale.”
Ok, great. So we plan for failure! But what does that practically mean? Thankfully Mr. Vogels provided guidance on this point to at least set us on the right track:
“build systems that embrace failure as a natural occurrence even if we did not know what the failure might be. Systems need to keep running even if the “house is on fire.” It is important to be able to manage pieces that are impacted without the need to take the overall system down.”
At this point, some of you are probably thinking: these are great ideas, but honestly, I don’t know where or how to start. Ok, fair. I have espoused a few theories and some basic ideas about how to approach failure, but I have not addressed where to start. So, with that in mind, let’s get going with some tech talk.
The first thing you need when planning for failure is visibility! You need to be able to see what is going on before you can design a solution. You cannot fix what you cannot see.
Let’s take a look at a recent client problem, and the solution that we designed and implemented to improve visibility.
The Problem:
One of our clients approached us to see if we could create some automation to help fix a slew of network issues. For years the client had experienced slow network performance and frequent outages, and every time they fixed one issue, another cropped up. The client was in the unenviable position of fighting fires 24/7 and never locating and remediating the root cause of the network issues.
Solution Tech Stack:
Azure DevOps - The Orchestration Engine.
Azure Key Vault - The Secret Keeper.
Azure Kubernetes - The Work Horse.
Elasticsearch - The Storage Engine.
Kibana - The Visualizer.
Ansible - The Configuration Code.
Terraform - The Builder.
The Plan:
Before we could hope to create network automation solutions, we needed visibility; we needed telemetry! We needed data insights before writing code. Without eyes on the data, and without tools that enable trend analysis and machine learning, we would end up in the same position as the client: forever writing automation code to fix random issues.
First, we chose Azure DevOps (ADO) as the orchestration engine because it is natively integrated with Azure Active Directory (AAD), and the client was a heavy Azure user, so all of their domain accounts and groups were already synced to AAD. With ADO we could create pipelines with approval gates to process work and schedule automated runs.
Next, we needed a place to store all secret and semi-secret data. We quickly settled on Azure Key Vault because it is Azure based, and there are simple tasks in ADO pipelines that let you quickly and safely retrieve secrets and pass them to a tool that authenticates to network devices (a minimal example of that pattern is shown below).
We also needed a reliable, secure, and scalable engine to process work on a daily/hourly basis, so we selected Azure Kubernetes Service (AKS) to run pipeline jobs.
With the orchestration engine, secrets storage and retrieval, and job processing service settled, we selected Ansible as the tool to perform the actual work of connecting to network devices and collecting data. Ansible was a simple choice: it is relatively easy to use, and there were already a series of GitHub repos with fully baked code capable of connecting to Cisco devices and collecting and formatting the data pulled off those devices.
Finally, we selected Elasticsearch and Kibana as the tool chain where we could reliably land data and create fast, simple visualizations to get our network telemetry.
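To make the secrets flow concrete, here is a minimal sketch of how a pipeline can pull credentials out of Key Vault and hand them to a collection step as environment variables. The service connection, vault, secret, and playbook names are illustrative assumptions, not the client’s actual values.

steps:
  - task: AzureKeyVault@2
    inputs:
      azureSubscription: 'azure-core-infrastructure'    # ADO service connection (assumed name)
      KeyVaultName: 'network-telemetry-kv'              # hypothetical vault name
      SecretsFilter: 'elastic-username,elastic-password'
      RunAsPreJob: false

  # Fetched secrets surface as pipeline variables named after the Key Vault secrets,
  # so they can be mapped explicitly into the environment of a later step.
  - script: ansible-playbook get_network_data.yml       # hypothetical playbook name
    displayName: Run the network collection playbook
    env:
      ELASTICSEARCH_USERNAME: $(elastic-username)
      ELASTICSEARCH_PASSWORD: $(elastic-password)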
Getting Things Wired Up:
We used ADO pipelines with our pipeline templates running Terraform to deploy the base Azure services, since we maintain a large number of internal Terraform repositories. Terraform built out the Azure Key Vault and the blue/green AKS clusters, all integrated with the client’s private spoke VNET. One day I am sure we will create a post about the ins and outs of how we deploy all of the base infrastructure with pipeline templates and Terraform in minutes, but that is a little outside the scope of this post. :( That said, here is a little taste of what a pipeline looks like.
---
# Terraform release pipeline
variables:
  - ${{ if or(eq(parameters.environment, 'dev'), eq(parameters.environment, 'tst')) }}:
    - group: terraform-variables-dev
  - ${{ if eq(parameters.environment, 'stg') }}:
    - group: terraform-variables-stg
  - ${{ if eq(parameters.environment, 'prd') }}:
    - group: terraform-variables-prod
  - name: download_tf_vars
    value: 'true'
  - name: tf_directory
    value: $(System.DefaultWorkingDirectory)/$(agency)/$(provider)/${{ parameters.environment }}/$(tf_module)
  - name: TF_VAR_TF_STORAGE_ACCOUNT_NAME
    value: terraform${{ parameters.environment }}
  - name: TF_VAR_TF_STORAGE_KEY_SUFFIX
    value: -$(tag)-$(environment)-${{ parameters.statefile_suffix }}
  - name: provider
    value: azure
  - name: environment
    value: ${{ parameters.environment }}
  - name: tf_module
    value: map-base

parameters:
  - name: subscription
    type: string
    default: azure-core-infrastructure
    values:
      - azure-core-infrastructure
      - azure-services
  - name: environment
    displayName: environment
    type: string
    default: dev
    values:
      - dev
      - tst
      - stg
      - prd
  - name: statefile_suffix
    displayName: Identifier of the statefile.
    type: string
    default: '1'
  - name: destroy_validation
    displayName: Type DestroyMe (case sensitive) if intending to run the destroy stage on a PRD environment.
    type: string
    default: N/A

resources:
  repositories:
    - repository: templates
      type: git
      name: platform/pipeline-templates
      ref: master

pool: $(pool)

# Release Stages
stages:
  - stage: Variables
    jobs:
      - job:
        steps:
          - template: ../common-pipeline-yamls/terraform-tfvars-azure.yaml
          - task: PublishPipelineArtifact@1
            inputs:
              targetPath: $(tf_directory)/terraform.tfvars
              artifactName: terraform_parameter_vars_$(tf_module)
  - template: terragrunt/terragrunt-preflight-check.yaml@templates
  - template: terragrunt/terragrunt-plan.yaml@templates
  - template: terragrunt/terragrunt-cost-estimate.yaml@templates
  - template: terragrunt/terragrunt-apply.yaml@templates
The pipeline above is separated into a few different parts. First there are variables, which are not visible to end users running the pipeline; they are a combination of what you see in the pipeline itself and the key/value pairs pulled in from the variable groups. Then come parameters, which are the values end users can see and set when running the pipeline. After that is a resources section that essentially clones our pipeline templates repo; you can think of pipeline templates like functions in a traditional programming language such as Python. Once the resources are pulled in, we specify the AKS pool where the work will be executed. The last piece is the release stages, which contain both our dynamic tfvars file creation and the pipeline templates.
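To make the “templates as functions” analogy concrete, here is a hypothetical steps template and the call that invokes it. The file name, parameter, and step are illustrative only, not taken from our actual templates repo.

# templates repo: steps/echo-environment.yaml (hypothetical)
parameters:
  - name: environment
    type: string
    default: dev

steps:
  - script: echo "Deploying to ${{ parameters.environment }}"
    displayName: Example templated step

# calling pipeline: arguments are passed much like a function call
steps:
  - template: steps/echo-environment.yaml@templates
    parameters:
      environment: prd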
Once the orchestration engine components were in place and functional, we deployed Elasticsearch into an on-premises OpenShift cluster (OpenShift is Red Hat’s distribution of Kubernetes). Anyone who has managed Elastic at scale knows the system gets unwieldy quickly once you start pushing a lot of data. Because we had a small team with limited Elastic knowledge, we deployed into Kubernetes to get the benefits of rapid deployment, scaling, and infrastructure as code. Sort of like Ron Popeil’s “set it and forget it,” but for Elasticsearch. However, to enable a truly late-night-Ron experience, we had to put everything in code. So we set up the index schema, Kibana dashboards, backup configurations, ILM policies, users and groups, and much, much more, all in code!
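As one small example of that “everything in code” approach, an index lifecycle management (ILM) policy can be pushed through the Elasticsearch API with the same uri-module pattern used in the collection playbook further below. The policy name, rollover age, and retention period here are illustrative assumptions, not the client’s actual settings.

- name: Ensure ILM policy for network telemetry indices
  uri:
    url: "{{ elastic_uri }}/_ilm/policy/network-telemetry"   # hypothetical policy name
    method: PUT
    user: "{{ elastic_user }}"
    password: "{{ elastic_pass }}"
    body_format: json
    body:
      policy:
        phases:
          hot:
            actions:
              rollover:
                max_age: 1d        # roll indices daily (example value)
          delete:
            min_age: 90d           # keep 90 days of telemetry (example value)
            actions:
              delete: {}
    status_code: [200]
    validate_certs: "{{ validate_certs }}"
  delegate_to: localhost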
And, just like everything else we code up, we used pipelines and our orchestration engine to run rapid deployments of Elastic into OpenShift. If you are interested in deploying Elasticsearch into Kubernetes, check out the Helm charts here: https://github.com/elastic/helm-charts.
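If you go that route, a minimal sketch of a values.yaml for the chart’s elasticsearch release might look something like the following; the cluster name and sizing are assumptions for illustration, not the configuration we ran for the client.

clusterName: network-telemetry     # hypothetical cluster name
replicas: 3
minimumMasterNodes: 2
esJavaOpts: "-Xms2g -Xmx2g"
resources:
  requests:
    cpu: "1"
    memory: 4Gi
  limits:
    cpu: "2"
    memory: 4Gi
volumeClaimTemplate:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi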
Hrm, maybe we will do a blog post just on this topic this year.
Anyway, back to the solution!
Now that our automation services were deployed and we had a data landing zone in Elasticsearch, we moved on to Ansible! We used existing community repos (the pyATS and Genie parser roles referenced in the playbook below) as a base, then forked the code to meet the specific needs of the client.
Our main playbook file ended up looking like this:
---
- name: Get Network Structured Data
  hosts: "{{ target_hosts | default(omit) }}"
  gather_facts: false
  connection: network_cli
  vars:
    elastic_user: "{{ lookup('env', 'ELASTICSEARCH_USERNAME') }}"
    elastic_pass: "{{ lookup('env', 'ELASTICSEARCH_PASSWORD') }}"
  tasks:
    - name: Read in Pyats Role
      include_role:
        name: ansible-pyats

    - name: Read in Parse Genie Role
      include_role:
        name: parse_genie

    - name: Execute Show Interfaces - IOS
      ios_command:
        commands: show interfaces
      register: cli_output_ios
      when: ansible_network_os != 'nxos'

    - name: Read Interfaces - IOS
      set_fact:
        parse_output_ios: "{{ cli_output_ios['stdout'][0] | parse_genie(command='show interfaces', os=ansible_network_os) }}"
      when: ansible_network_os != 'nxos'

    - name: Push to Elasticsearch - IOS
      uri:
        url: "{{ elastic_uri }}/{{ el_index }}/_doc"
        body: "{{ {'@timestamp': '%Y-%m-%dT%H:%M:%S' | strftime, 'DEVICE': inventory_hostname, 'PORT': item.key, 'PORT_DESCRIPTION': item.value.description | default(''), 'CRCS': item.value.counters.in_crc_errors, 'COUNTERS_CLEARED': item.value.counters.last_clear, 'DATE': '%m/%d/%Y' | strftime} | to_json }}"
        body_format: json
        method: POST
        user: "{{ elastic_user }}"
        password: "{{ elastic_pass }}"
        status_code:
          - 201
        validate_certs: "{{ validate_certs }}"
      when:
        - ansible_network_os != 'nxos'
        - item.value.counters is defined
        - item.value.counters.in_crc_errors is defined
        - item.value.counters.in_crc_errors != 0
      loop: "{{ parse_output_ios | dict2items }}"
      loop_control:
        label: "{{ item.key }}"
      delegate_to: localhost
      become: false
Now, this is not the entire file; we can’t give everything away! :) The basics here are:
We call in the roles we forked
We pass in the Elasticsearch username and password secretly via environment variables
We run "show interfaces" on each device and parse the output into structured data
We push interfaces reporting CRC errors to Elasticsearch
We popped this into a pipeline with a few of our Mentat pipeline templates and let it rip. The playbook now connects to thousands of devices multiple times a day, reads CRC counters and other data, and pushes that data into Elasticsearch, where users see it visualized in Kibana dashboards.
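For readers who want to poke at the data outside of Kibana, here is a hedged sketch of the kind of question the dashboards answer, expressed as a query against the same index: which devices reported the most CRC errors in the last 24 hours? It assumes Elasticsearch’s default dynamic mapping, which exposes the DEVICE text field through a DEVICE.keyword subfield.

- name: Top 10 devices by CRC errors over the last 24 hours
  uri:
    url: "{{ elastic_uri }}/{{ el_index }}/_search"
    method: POST
    user: "{{ elastic_user }}"
    password: "{{ elastic_pass }}"
    body_format: json
    body:
      size: 0
      query:
        range:
          "@timestamp":
            gte: now-24h
      aggs:
        by_device:
          terms:
            field: DEVICE.keyword
            size: 10
          aggs:
            total_crcs:
              sum:
                field: CRCS
    status_code: [200]
    validate_certs: "{{ validate_certs }}"
  register: crc_summary
  delegate_to: localhost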
Outcomes
Significantly reduced network outages, bordering on elimination. This was possible because the network teams could finally see and track issues within the network, both across and within device types. The telemetry let the client identify a number of underlying issues that would have been impossible to detect otherwise. One example: they found that dozens of a particular device type, all delivered within a set window of time, were physically defective and needed to be replaced! They found loads of issues like this in the network and used the dashboards to identify and fix them.
Greater visibility of network issues in near real time. This enabled the client to become proactive about network issues instead of reactive.
Increased performance. Performance naturally increased as more and more issues were identified and remediated.
Faster identification of misconfigurations and hardware defects in the network
Numerous misconfigurations were identified and fixed.
New misconfigurations are now identified in near real time and resolved with automation.
Improved quality of life for all teams at the client
No more all day or multi-day conference calls to fix a network issue
More sleep and less stress