Have you ever woken up in the morning with a sinking feeling in the pit of your stomach? Then, suddenly, your phone and watch start vibrating and ringing off the hook because all of your Terraform pipelines are failing?!
We have definitely experienced this, which is why a couple of years ago we developed the Mentat Early Warning System (MEWS). The idea is really simple. Every Terraform and Ansible module we build, operate, and maintain gets a CI pipeline that runs automatically each night. When a module pipeline fails, it posts a message to a dedicated Slack channel with a link back to the pipeline error. It looks like this:
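For anyone who wants to build something similar, the notification step described above could be sketched roughly as follows. This is not the MEWS implementation, just a minimal illustration that assumes a Slack incoming webhook; the function names, module name, and pipeline URL are all hypothetical.

```python
import json
import urllib.request


def build_failure_message(module, pipeline_url):
    """Build a Slack incoming-webhook payload for a failed nightly run."""
    # Slack renders <url|label> as a clickable link in message text.
    return {
        "text": (
            f":rotating_light: Nightly pipeline failed for `{module}`: "
            f"<{pipeline_url}|view pipeline error>"
        )
    }


def post_to_slack(webhook_url, payload):
    """POST the payload to a Slack incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical module name and pipeline URL, for illustration only.
    payload = build_failure_message(
        "terraform-azurerm-example", "https://ci.example.com/pipelines/123"
    )
    print(json.dumps(payload, indent=2))
```

In a setup like this, the nightly CI job would call `post_to_slack` only from its failure handler, with the webhook URL supplied as a CI secret rather than hard-coded.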
This system has caught many one-off errors before, but last night MEWS uncovered a doozy of an issue. All of our Azure Gov Terraform pipelines started failing and posting to our Slack channel around 11:45 PM. Our engineers immediately dove in to track down the problem. After a few minutes, we found that this update caused the issue:
A small typo in the word "management" caused a failure in all Azure Gov Terraform modules that were not pinned to a specific version. The awesome team over at HashiCorp quickly resolved the issue and merged the fix into the main branch here: https://github.com/hashicorp/terraform-provider-azurerm/blob/main/vendor/github.com/hashicorp/go-azure-sdk/sdk/environments/azure_gov.go#L15
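Since only unpinned modules were affected, it can be handy to know which of your version constraints are exact pins. The sketch below is one rough way to check, assuming the pin in question is a Terraform version constraint string (where a bare version or `= x.y.z` is exact, and operators such as `~>` or `>=` allow newer releases). The helper names and the version numbers in the comments are illustrative only.

```python
import re

# Matches an exact version such as "3.75.0" or "= 3.75.0",
# optionally with a pre-release suffix like "-beta1".
_EXACT_PIN = re.compile(r"^\s*=?\s*\d+(\.\d+){0,2}(-[0-9A-Za-z.]+)?\s*$")


def is_exact_pin(constraint):
    """Return True if the constraint allows exactly one version."""
    return bool(_EXACT_PIN.match(constraint))


def find_unpinned(constraints):
    """Given {provider_name: constraint}, return the names that are not exactly pinned."""
    return sorted(
        name for name, constraint in constraints.items()
        if not is_exact_pin(constraint)
    )
```

Exact pins trade automatic fixes for stability, so a nightly pipeline is still useful either way: it tells you when an unpinned module breaks and when a pinned one is safe to bump.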
What was really cool is that we woke up this morning to this nice message:
Shortly after this message, we found our Terraform pipelines had started working again.
A big shout-out to our monster engineer MSnook for identifying and commenting on the breaking change very early this morning.
Wrapping Up:
The moral of the story: if you do not have an early warning system for ALL of your Terraform modules, you should definitely invest the time to set up an automated testing harness. Also, shameless plug, if you need help getting a system like MEWS running at your organization, hit us up at info@mentat.cloud or any of our social media handles.