
From Chaos to Clarity: Build an Efficient Data Lake Through Smart Automation and Tags

Welcome to another edition of AwesomeOps!! In this blog of automation bliss we will cover a solution we designed years ago to help clients create simple, secure, and scalable data lakes capable of visualizing all types of technical environments spread across public and private clouds. Now, we know what you are thinking: "Don't we need dozens of tools, teams of engineers, and a large cash outlay just to get an effective and efficient data lake up and running?" The answer is no! You do not need any of that as long as you have a great design. In this post we will begin the discussion of how to create a simple, secure, scalable data lake. All you need is a sprinkle of tags, Elasticsearch, and a little bit of Kafka to spark a compute optimization metamorphosis in your organization. See what we did there? :) And now, to the tech!!

What The Data Lake?

We know you are itching to learn about the design, but like most things some context is needed before we jump in. So, what is a data lake? What is it used for? Why is it needed? These are all great questions, so let's address them one at a time.


A data lake is a system that ingests data from as many sources as there are grains of sand on Jones Beach. Or, at least, from as many internal and external systems as run within your environment. More specifically, a data lake is intended to store both structured and unstructured data without the need for predefined schemas. Think raw machine logs stored alongside images, application telemetry, vendor management data, and IoT device data. Storing all of this data in one place may not initially make much sense; however, the magic of the data lake is the ability to create insights out of data from disparate systems that have seemingly zero connections. This is very different from a data mart or a data warehouse, which use predefined schemas and relationships between tables of structured data. We are generalizing a bit here, but you get the point.


So, given that data lakes do not have predefined relationships in their data, how do we create proactive and valuable business insights? We are glad you asked!! We need a little bit of smart automation combined with tags. Tags are automation glue. Plain and simple. Instead of creating predefined schemas that link structured data together, we will show some clever ways of using tags to create linkages between structured and unstructured data across a disparate set of systems.


Tags are Automation Magic

Around 2013 AWS released tags as a way to manage cloud assets, and since that time tags have become a powerful way to classify and categorize assets in and outside of the cloud. Tags, and related metadata, have become so popular that all modern systems worth their salt have some concept of tags. Tags come in all sorts of different and wonderful forms. Kubernetes, for example, implements tags as Labels and Annotations. Within Kubernetes, labels are a fundamental part of how resources are organized and selected, while annotations are intended for additional metadata that might be relevant to tools, automation, or human operators. And finally, one of the foundational aspects of most successful AI models on the market today is the concept of data tagging, which is a perfect transition into why everyone needs to be tagging data, systems, and services.
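To make the labels-versus-annotations distinction concrete, here is a minimal sketch of a Pod manifest expressed as a Python dictionary. All names and values are our own illustrative choices, not taken from any real cluster:

```python
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "billing-api-0",
        # Labels are selectors: Services, Deployments, and queries such as
        # `kubectl get pods -l environment=dev` match on them.
        "labels": {
            "environment": "dev",
            "app": "billing-api",
        },
        # Annotations carry extra context for tools and humans; Kubernetes
        # never uses them to select resources.
        "annotations": {
            "owner": "platform-team@example.com",
            "built-by": "awesomeops-pipeline",
        },
    },
    "spec": {"containers": [{"name": "app", "image": "billing-api:1.0"}]},
}
```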


Since the data within a data lake is essentially schemaless until read, and the nature of a data lake is to hold data from hundreds or thousands of disparate and unrelated data sources, tags act as a sort of tech glue that can bind all of your data together to create relationships you never knew existed. Relationships that will help your organization visualize and proactively detect issues up and down the Open Systems Interconnection (OSI) stack. To implement this practically, the approach we have taken is to create a centralized tagging module that is then called by ALL automation. This means that every single piece of automation calls a single module as it builds or configures a system, as in the sketch below. To understand the power of this simple design decision, let's look at a practical example: the creation of a Virtual Machine.
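Here is a minimal sketch of what such a centralized tagging module might look like in Python. The tag keys, the allowed environment values, and the `build_tags` function name are all our own illustrative assumptions, not a prescribed standard:

```python
from datetime import datetime, timezone

# Illustrative centralized tagging module: every piece of automation imports
# build_tags() so that no tag set is ever hand-assembled.
ALLOWED_ENVIRONMENTS = {"dev", "test", "prod"}

def build_tags(environment: str, app: str, owner: str,
               extra: dict | None = None) -> dict:
    """Return the canonical tag set for a resource: validated once, sent everywhere."""
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    return {
        "environment": environment,
        "app": app,
        "owner": owner,
        # Stamp when the tag set was generated so drift is easy to spot later.
        "tagged_at": datetime.now(timezone.utc).isoformat(),
        **(extra or {}),
    }
```

Because every workflow funnels through this one function, adding a required tag or renaming a key is a one-line change that immediately applies to every system your automation touches.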


Virtual Machines (VMs) do NOT exist in isolation:

  1. Each VM is running on physical hardware. Think large servers with loads of CPU, RAM, and Disk.

  2. Each VM transmits data across virtual and physical networks. Think routers and switches, as well as software defined networks like Cisco ACI or NSX-T.

  3. Each VM has records added to 1 or many external systems. Think ServiceNow CMDB.

  4. Each VM needs to be registered with 1 or many external systems. Think CrowdStrike, an antivirus system, or Red Hat Insights.

  5. Each VM will need to talk to internal shared services that are used to centrally manage VMs. Think Active Directory for Windows hosts, or DHCP systems responsible for providing IPs to VMs as they come online.

  6. Each VM is classified into a particular environment. Think development vs. testing vs. production environment classification.

To keep things simple in this example, let's just say that in our made-up world each VM has 2 parts for each item identified above. That means each VM will be related to 12 systems. In most cases, each system has the ability to hold and maintain metadata (tags). What this means is that during the automated build and configuration process, where your automation touches all 12 systems, you will be able to send the same set of tags describing the VM to all 12 systems! Once all 12 of these systems contain the same set of tags, you can create data associations between everything generated or collected by those 12 disparate and disconnected systems simply by querying tags in your data lake. Forget correlating data by machine name or IP address. Now you can ask much simpler questions within your data lake and get back significantly more valuable information about your entire IT ecosystem.
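As a rough sketch of that fan-out, imagine each system wrapped in a small adapter that exposes one tiny method. The adapter classes below are hypothetical stand-ins for real clients (ServiceNow, CrowdStrike, and so on), and in real automation the tag dictionary would come from the centralized tagging module sketched earlier:

```python
class CmdbAdapter:
    """Hypothetical stand-in for a ServiceNow CMDB client."""
    def apply_tags(self, vm_name: str, tags: dict) -> None:
        print(f"CMDB: tagging {vm_name} with {tags}")

class EdrAdapter:
    """Hypothetical stand-in for an endpoint security client."""
    def apply_tags(self, vm_name: str, tags: dict) -> None:
        print(f"EDR: tagging {vm_name} with {tags}")

def tag_everything(vm_name: str, tags: dict, adapters: list) -> None:
    """Send the same canonical tag set to every system that knows about this VM."""
    for adapter in adapters:
        adapter.apply_tags(vm_name, tags)

# In practice this dict comes from the centralized tagging module.
tags = {"environment": "dev", "app": "billing-api", "owner": "platform-team"}
tag_everything("vm-billing-01", tags, [CmdbAdapter(), EdrAdapter()])
```

Now, a simple comparative example: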

  1. You do NOT use tags and instead associate data by IP or hostname. In this case you are fairly limited, in that you only have a single key for correlating data from all 12 systems about your single VM. Many of those systems are likely to have different fields holding the hostname, and perhaps the hostnames are not even identical because one system uses the short name while another uses the fully qualified domain name (FQDN).

  2. You DO use tags. Now you can ask questions like: get me my 'dev' environment. WHOA!! That is significantly more powerful, because you can now see any piece of data across all 12 systems that is tagged 'dev'. This is where the magic of tags happens! We just created tons of relationships between completely unrelated sets of data, as the query sketch below shows.
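Here is what that tag query might look like against Elasticsearch's `_search` API, using plain HTTP to stay client-version agnostic. The endpoint, the index patterns, and the field names are assumptions about how your ingest pipelines are set up, and the `term` query assumes the tag fields are mapped as keywords:

```python
import requests

ES_URL = "http://localhost:9200"  # assumption: your Elasticsearch endpoint

# One term query on a shared tag field pulls back documents from every
# system that stamped the VM with the same tags -- no hostname gymnastics.
query = {
    "query": {"term": {"tags.environment": "dev"}},
    "size": 100,
}

# Index patterns are illustrative; use whatever your ingest pipelines create.
resp = requests.post(f"{ES_URL}/logs-*,metrics-*,cmdb-*/_search",
                     json=query, timeout=30)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_index"], hit["_source"].get("tags"))
```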


The Big Picture and Architecture Design

Below is a use case architecture we have been using for a few years to provide clients value by visualizing their environment in meaningful ways.


In this modern era of Artificial Intelligence (AI) and Machine Learning (ML), the most important ideas are not the tools and methods used to create the awesome new tech! The most important aspect of those tools is the underlying foundation of how data gets from the bottom of the OSI stack to the top of the stack, where users interface with a cool web page. The idea here is very basic. All organizations need insights into every aspect of how their business operates. And since technology drives business, observability in information technology is critical to business success. To get full visibility into your organization, you will need to be able to collect, track, and visualize every frame of every packet as it moves from the physical layer up to the application layer. Observing data as it moves up the stack and through your environment is incredibly important to creating stable, scalable, and secure enterprise services. To put it in simple terms, picture throwing a stone into a pond. The initial ring is small; however, the rings grow in size as they propagate outward, creating larger and larger waves.

It is no different within IT systems. Issues that occur within the smallest footprint of the physical layer of the OSI stack translate into massive waves at the top of the stack, which ultimately results in a poor user experience in internal line of business (LOB) and external public-facing applications. By implementing tags within a data lake, you and your team will be able to trace and track issues at each layer of your tech stack, as sketched below. This will ultimately help reduce outages, improve customer experience, lower time to resolution, and save time and money.
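As a sketch of what that cross-layer tracing can look like, the aggregation below counts error events per tagged application over the last hour and breaks them down by source index (one index pattern per layer or system). The index names and the `level` severity field are illustrative assumptions:

```python
import requests

ES_URL = "http://localhost:9200"  # assumption: your Elasticsearch endpoint

# Count error events per tagged application, split by the index they came
# from, so a ripple at the bottom of the stack and its matching wave at the
# top show up side by side.
agg_query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"term": {"tags.environment": "prod"}},
                {"term": {"level": "error"}},  # assumption: normalized severity field
                {"range": {"@timestamp": {"gte": "now-1h"}}},
            ]
        }
    },
    "aggs": {
        "per_app": {
            "terms": {"field": "tags.app"},
            "aggs": {"per_source": {"terms": {"field": "_index"}}},
        }
    },
}

resp = requests.post(f"{ES_URL}/hardware-*,network-*,logs-*/_search",
                     json=agg_query, timeout=30)
for app in resp.json()["aggregations"]["per_app"]["buckets"]:
    print(app["key"], app["doc_count"])
    for src in app["per_source"]["buckets"]:
        print("  ", src["key"], src["doc_count"])
```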


Next Time On AwesomeOps

We will release part 2 of this multipart series on how to create an efficient data lake that scales and is secure. Part 2 will focus on how you can maintain a data lake backed by Elasticsearch with only a few engineers, some Kafka, Elastic index lifecycle management (ILM) policies, external storage, and tags.


Hopefully this blog series has been helpful to everyone who reads it and is on the journey of automation. As Kafka wrote in The Metamorphosis, "But instead of the five short steps of his former life, there were now an infinite number of long ones to be taken." Change is inevitable and painful. However, once you start down a new path, a whole new infinite world opens up to you. ;) Good luck with your data lake!



