Alpaca AI Security Scanning

If you are an information technology (IT) executive like me, you have, hopefully, spent far too many nights thinking about the hundreds or thousands of code repositories updated and maintained by dozens of internal teams and external consultants, and whether any of those repositories contain secrets that could compromise, and eventually sink, your business. At Mentat we did something about it so that our clients and we can sleep better at night. About two years ago we scaled out gitleaks to iterate over all repositories and identify secrets based on a TOML file of common regex secret patterns. We pushed the pipeline results to Kafka and then to Elasticsearch, and used Kibana to visualize a breakdown of possible repository leaks by team and codebase. It looked like this:

The initial results of this solution helped us to do the following:

- Clean up a good number of secrets in all of the repositories across dozens of teams using BFG

- Visualize which teams had the highest number of secrets and which types of secrets they were leaking. This in turn enabled us to teach teams how to manage secrets appropriately

- Introduce an automated process to stop users from committing potential secrets to mainline branches

- Increase security by iterating on our secrets management solution of Azure Key Vault and Azure Pipelines to make it easier and safer for developers and admins to manage and use secrets
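
To make the regex-driven approach concrete, here is a minimal Python sketch of pattern-based scanning with a simple pass/fail gate of the kind that could block a commit. It is not gitleaks itself and not our pipeline code: the rule names, the three patterns, and the `scan_text` and `gate` helpers are all assumptions made for illustration, and a real rule file would be far larger.

```python
import re

# Illustrative rules only. The names and patterns below are assumptions,
# standing in for the much larger rule set a real TOML file would hold.
RULES = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private-key-header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "password-assignment": re.compile(
        r"password\s*[:=]\s*['\"][^'\"]{8,}['\"]", re.IGNORECASE
    ),
}


def scan_text(text):
    """Return a list of (line_number, rule_name) for every rule match."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in RULES.items():
            if pattern.search(line):
                findings.append((line_number, rule_name))
    return findings


def gate(text):
    """Return a process-style exit code: 1 if anything matched, else 0."""
    return 1 if scan_text(text) else 0


sample = "\n".join([
    'region = "us-east-1"',
    'aws_key = "AKIAABCDEFGHIJKLMNOP"',
    'db_password = "correct-horse-battery"',
    'internal_token = "tok_9f8e7d6c5b4a"',  # no rule covers this format
])

for line_number, rule_name in scan_text(sample):
    print(f"line {line_number}: {rule_name}")
```

The last line of the sample is the interesting one: it looks like a credential, but because no rule describes its format, a static rule set passes over it until someone writes a new pattern.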

This was all great! However, we are continuously improving, and we found a lot of potential issues with leveraging a tool tied to static regex patterns. We were constantly searching repositories by hand for new types of secret and semi-secret data so that we could create new regex patterns. This was a massive time sink.

So, what to do? Well, we decided to create Alpaca AI Security. Alpaca AI is an artificial intelligence tool that dynamically adapts to the ever-changing landscape of secret and semi-secret data within your organization. Side note: we started with Git but expanded to all sorts of documents. Check out more here. Anyway, back to business. Alpaca AI combines the power of natural language processing (NLP) and machine learning to adapt to environments without the need for static regex patterns. Once we had trained the model on millions of git commits, we put it to work by batching all the repos we manage and running dozens of concurrent scans across hundreds of repos on a nightly basis. Sample pipeline output looks like this:

Each pipeline run has a batch of repos to scan. Each scan runs across hundreds of repos, thousands of branches, and millions of commits. As seen above, this one pipeline scanned 217,000 commits in about 5 minutes and found 65 potential secret leaks. That is a huge time saver. The other really nice thing about Alpaca AI is that we simply swapped it in for gitleaks, so all data continues to flow from the pipeline through to Elasticsearch, where users are able to see the hashed-out values of semi-secret data! Some of the benefits are:

- No need to search repos for possible secrets in order to create new regex patterns to search. This saves us hundreds of hours of work a year.

- Faster repo scans

- Dynamic model that learns and gets better over time

- Ability to find new types of secrets and semi-secret data

- Ability to scan different document types outside of code

- Integrated into our automation platform, yet flexible enough to rapidly deploy to other systems in days, not months

- Automatically runs on PRs/MRs so we can protect mainline branches

- Coding/scripting language agnostic so any codebase can be scanned
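
For readers who want a feel for the batching and the hashed-out values mentioned earlier, the following Python sketch shows one way such a nightly run could be arranged. It is a sketch under stated assumptions, not the Alpaca AI implementation: every function name, the batch size, the worker count, and the document fields are hypothetical, `fake_scan_repo` is a stand-in for the real scanner, and "hashed out" is assumed here to mean that the raw value is replaced with a SHA-256 digest before it leaves the pipeline.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def make_batches(repos, batch_size):
    """Split a list of repository names into consecutive batches."""
    return [repos[i:i + batch_size] for i in range(0, len(repos), batch_size)]


def mask_finding(repo, commit, value):
    """Build a document that carries a SHA-256 digest instead of the raw value."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return {"repo": repo, "commit": commit, "value_sha256": digest}


def scan_batch(batch, scan_repo):
    """Scan every repository in one batch and return masked findings."""
    documents = []
    for repo in batch:
        for commit, value in scan_repo(repo):
            documents.append(mask_finding(repo, commit, value))
    return documents


def run_nightly(repos, scan_repo, batch_size=50, max_workers=12):
    """Scan all batches concurrently and flatten the results in batch order."""
    batches = make_batches(repos, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda batch: scan_batch(batch, scan_repo), batches)
        return [doc for batch_docs in results for doc in batch_docs]


def fake_scan_repo(repo):
    """Stand-in scanner (assumption): one repository holds one leaked value."""
    if repo == "billing-api":
        return [("abc123", "hunter2-example")]
    return []


repos = ["web", "billing-api", "docs", "infra", "mobile"]
for document in run_nightly(repos, fake_scan_repo, batch_size=2, max_workers=2):
    print(document["repo"], document["commit"], document["value_sha256"][:12])
```

Masking at the point where a finding is created means the raw secret never reaches the message queue or the search index, so the dashboard can still group and count leaks without becoming a second place where secrets live.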

If you are interested in our automation services and/or Alpaca AI Security, reach out to our team to learn more and get a demo.