
AwesomeOps Part 3: Windows Automation With Ansible And Terraform

Back by popular demand, it's AwesomeOps! Apparently our last installment of this series, "AwesomeOps Part 2: Demystifying Windows Automation with Ansible," went a little viral, so we decided to crank out another blog about automating Windows workloads. In this edition of AwesomeOps we will cover how to use Terraform to target your new Windows images, how we use Terraform with Ansible to configure hosts, and how to use Ansible to prevent configuration drift across all of your Windows hosts. We have a lot to cover here, so this will likely be continued in another installment!


Terraforming Compute Platforms

If you have been following along with our new AwesomeOps series, this installment will make a lot of sense. Now that you have become accustomed to working with WinRM, and hopefully you have baked a few delicious Windows images - yes, it is not glamorous, but you have to find the joy in the boring - you should now have some scrumptious baseline images to start terraforming with.


Now, look, there are hundreds of GitHub repositories and blogs out there that will show you exactly how to build virtual machines on all available compute platforms. The internet can be a great place 🙂! So, instead of reviewing the construction of a well-formed Terraform module, we are going to focus on a couple of key points to get you rolling: referencing your new images, and working through one method to integrate your modules with Ansible. Why do we need Ansible if we are using Terraform? Excellent question! Tool selection depends on a few things that will be unique to your organization:

  1. Does your organization have the budget to buy licenses?

  2. What are the skill sets of you and your team? Is your team comprised of infrastructure engineers, developers, or a mix of both?

  3. Do you have existing tooling that you should leverage, or may be required to leverage?

  4. What problems has your team been tasked with solving?

There are many more decision points when selecting tooling; these are just a few. Anyway, back to why you would use Ansible with Terraform.


While Terraform is excellent at building things, it was not designed to configure operating systems. Don't get us wrong, you can use Terraform to configure operating systems, but it is not advisable. This is where Ansible comes in. Ansible was designed to configure operating systems and is great at setting and ensuring system configuration post-build. Now that we understand the why, let's dive into some practical engineering with diagrams and code.

[Diagram: our automation layer cake — Packer images at the base, with Terraform and Ansible layered on top]

The diagram above is intended to show how we think about automation. Interestingly enough, we think about automation the same way we approach security, and honestly most issues: it is all about layers. The bottom layer of our automation cake is Packer. In our previous post we went into how and why you want to use Packer to deliver identical Windows images across all of your compute platforms. But you will eventually need to draw the line between when and how much to bake vs. fry (great article on bake vs. fry here). Fortunately and unfortunately for you, that is unique to each organization. Once you have figured that out, you can design your cake. So how do we reference our new baked image in Terraform? Let's take a look at a couple of examples in Azure and VMware.


Azure has a service called Shared Image Gallery (check out more here). The Shared Image Gallery is a simple and excellent service that allows you to version control your images! Version control is not limited to code; you should version control your images as well. By doing this you will be able to roll back images, pin your code to specific versions for legacy apps that NEED a particular version of Windows Server, and have one image to rule them all across all of your Azure subscriptions! Yes, that is right, the Shared Image Gallery allows you to centralize your images in one management subscription and then share that image out to all enterprise subscriptions. Below is some Terraform that describes how to reference your images:

resource "azurerm_windows_virtual_machine" "vm" {
  count                      = var.vm_count
  name                       = var.is_name_overridden ? var.vm_hostname : "${lower(local.vm_name)}${lower(random_string.vm_suffix[count.index].result)}"
  location                   = var.compute_location
  resource_group_name        = local.rg_name
  size                       = local.instances[var.instance_type]["cloud_size"]
  admin_password             = random_string.password.result
  network_interface_ids      = [element(azurerm_network_interface.vm.*.id, count.index)]
  allow_extension_operations = var.allow_extension_operations
  zone                       = var.enable_zones ? element(["1", "2", "3"], count.index) : null
  admin_username             = "******"
  license_type               = var.license_type
  tags                       = module.tags.map
  timezone                   = var.timezone
  availability_set_id        = var.enable_zones ? null : azurerm_availability_set.vm[0].id
  source_image_id            = data.azurerm_shared_image_version.this.id

  os_disk {
    caching                   = var.storage_caching
    name                      = var.is_name_overridden ? "osdisk-${var.vm_hostname}-${random_string.vm_suffix[count.index].result}" : "osdisk-${lower(local.vm_name)}-${random_string.vm_suffix[count.index].result}"
    storage_account_type      = var.storage_account_type
    write_accelerator_enabled = var.storage_write_accelerator_enabled
  }

  lifecycle {
    ignore_changes = [
      name,
      os_disk[0].name,
      source_image_id
    ]
  }
}

There is a lot of configuration in the code above; however, the only line you need to be concerned about is this one:

source_image_id = data.azurerm_shared_image_version.this.id

What this is doing is using a Terraform data source to look up the ID of a specific image version in the shared image gallery. The code for the data lookup is below:

data "azurerm_shared_image_version" "this" {
  provider = azurerm.shared_image_gallery_subscription
  name = var.mentat_shared_image_gallery_image_version_name
  image_name = local.mentat_shared_image_gallery_definition_name
  gallery_name = var.mentat_shared_image_gallery_name
  resource_group_name = var.mentat_shared_image_gallery_rg_name
  sort_versions_by_semver = true
}

As you can see in the code block above, there are a few attributes that you need to specify when performing this lookup. The most important for us was adding sort_versions_by_semver = true, which defaults to false if you do not specify it. The reason this was important is that we follow semantic versioning in our Packerization process, and when versions are sorted as plain strings, 9.0.0 is lexicographically later than 11.0.0. With sort_versions_by_semver set to true, 11.0.0 correctly sorts later than 9.0.0. Very annoying, we know. The other big part here is this line of code in the data source:

provider = azurerm.shared_image_gallery_subscription

This is a cool way of switching between subscriptions in Terraform. This is done because when working with multiple Azure subscriptions in an enterprise you will need to specify your management or hub subscription where your singular image gallery is located. The simple code for that is here:

provider "azurerm" {
  features {}
}

provider "azurerm" {
  features {}
  alias           = "shared_image_gallery_subscription"
  subscription_id = "your-subscription-here"
}

terraform {
  backend "azurerm" {}
}
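
If your virtual machine resources live inside a module, you can hand the aliased provider down to it. Here is a minimal sketch (the module name and source path are hypothetical):

# Root configuration: pass the aliased provider into the VM module.
module "windows_vm" {
  source = "./modules/windows-vm" # hypothetical module wrapping the resources above

  providers = {
    azurerm                                   = azurerm
    azurerm.shared_image_gallery_subscription = azurerm.shared_image_gallery_subscription
  }
}

# Inside the module, declare the alias so Terraform knows to expect it:
terraform {
  required_providers {
    azurerm = {
      source                = "hashicorp/azurerm"
      configuration_aliases = [azurerm.shared_image_gallery_subscription]
    }
  }
}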

With our process of Packer to Terraform and Azure Shared Image Gallery we can now easily build our Windows images into any Azure subscription within an organization, and we can target any version we want! Very cool stuff.
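
One nice side effect of driving the lookup with variables is that version targeting becomes a one-line tfvars change. A minimal sketch, reusing the variable name from the data source above (the azurerm provider also supports the special "latest" value for the version name):

# terraform.tfvars

# Pin to a specific Packer-built image version...
mentat_shared_image_gallery_image_version_name = "11.0.0"

# ...or track the newest published version. With sort_versions_by_semver
# set to true, "latest" resolves by semantic version rather than by
# string order.
# mentat_shared_image_gallery_image_version_name = "latest"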

So what about VMware? Right on! Let's get into it. VMware actually has a similar process to Azure for sharing images across multiple vCenters. What you first need to do is create a content library. If you have never heard of this before, you can check out all the details here. Once you have your content library set up, you will be able to subscribe to that content library from multiple vCenters. This is really cool stuff. Now you can target one place in VMware to put your Packerized Windows images, giving you a single distribution point for your on-premises images.

[Screenshot: creating a content library in vSphere]

Here is what a content library looks like. Just a note: you will want to check the Enable authentication box. To enable Packer to deploy to a content library you will need to add a block of code like this:

  content_library_destination {
    library = var.vcenter_content_library
    name    = "mentat-cis-windows-server-2022-${local.buildtime}"
    ovf     = false
    destroy = true
  }

Now to reference your new image in Terraform you will need to add this data block:

data "vsphere_virtual_machine" "template" {
  name          = var.vmtemp
  datacenter_id = data.vsphere_datacenter.dc.id
}
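
The lookup above references data.vsphere_datacenter.dc, which is defined elsewhere in the module. For completeness, a minimal sketch of that data source (the variable name is hypothetical):

data "vsphere_datacenter" "dc" {
  name = var.vcenter_datacenter # hypothetical variable holding the datacenter name
}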

Within the virtual machine Terraform resource you will need to do something similar to the below:

resource "vsphere_virtual_machine" "this" {
  name = var.vm_name
  guest_id = data.vsphere_virtual_machine.template.guest_id
  scsi_type = data.vsphere_virtual_machine.template.scsi_type
  ...
network_interface {
  ...
}

disk {
  ...
  }
clone {
   template_uuid = data.vsphere_virtual_machine.template.id
  customize {
    ...
    }
  }
}

There is waaaayyyy more code to this module, so we opted to highlight just the one section that targets where you will want to get your Packer images. And that is it for VMware targeting Packerized Windows images. One of the many cool things here is that the cloud and on-premises environments have a similar process for placing images into one centralized distribution point, which Terraform can then target to pick up the version you specify.
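
As a side note, the vsphere provider can also reference the template directly from the content library item, skipping the vsphere_virtual_machine template lookup. A minimal sketch, where the item name is hypothetical and must match what Packer published (recall the Packer block above appends a build timestamp):

# Look up the content library itself...
data "vsphere_content_library" "library" {
  name = var.vcenter_content_library
}

# ...then the Packer-built template inside it.
data "vsphere_content_library_item" "template" {
  name       = "mentat-cis-windows-server-2022" # hypothetical item name
  library_id = data.vsphere_content_library.library.id
  type       = "vm-template"
}

# In the clone block, template_uuid also accepts the library item ID:
#   template_uuid = data.vsphere_content_library_item.template.id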


But what about Ansible? That is a great segue into our next section on integrating Terraform with Ansible to ensure images are fully ready when users first log in.


Ansible & Eliminating Configuration Drift

Ahhhh, the timeless Information Technology (IT) management problem of preventing configuration drift. Drift happens when organizations start to make the transition from manual effort to a DevOps mindset and operating environment. Administrators who were comfortable manually logging into servers and updating settings in the operating system to install an application or debug an issue may have a hard time adapting to automation. Most of this manual effort can be completely replaced with Ansible. Now, do not get us wrong, it is not simple to make this transition. The technology is always the easiest part! Sure, there are technical headaches, but all of those headaches are predictable and solvable. The harder part of the transition is always the mindset shift. Unfortunately, we do not have days to write about how to successfully make the mindset shift to DevOps. Fortunately, we can give you a leg up on the technology side to help accelerate your automation and/or platform team.

Now that we have some Packerized Windows images, and we are able to target both our new on-premises and cloud images with Terraform, we can get into how to integrate Ansible with Terraform. On that note, onward to automation bliss.


Generally speaking, there are two paths (yes, there are infinite possibilities, but for the sake of time we cover two) you can take when integrating Terraform and Ansible. The first path is to have the same agents that run your Terraform also execute your Ansible code. The second path is more advanced and only recommended if you have the budget and the DevOps chops to hack it 😉. In this post we will show diagrams of both paths, but only cover the first path in code. DM us if you have a question as to why.


Path 1 looks like this:

[Diagram: Path 1 — the same pipeline agents that run Terraform also execute the Ansible playbooks]

Path 2 looks like this:

[Diagram: Path 2 — a dedicated, licensed automation platform executes the Ansible playbooks after Terraform provisions the hosts]

Both paths eventually get you to automation bliss, but path 1 is simpler and less expensive. So let's get into path 1.


One of the really cool things about Terraform is its ability to execute arbitrary code in parallel. We leverage this functionality to run Ansible post-install playbooks on our nodes. Why would we need to run more Ansible if we already baked our Windows image with Packer and Ansible? Remember the cake 🍰 from earlier in this post? Well, you cannot bake everything into your image. Now that the virtual machines are built, the environment has changed, so this post-install step runs Ansible code that layers on top of the image. Let's review what is happening here. To the code!

resource "null_resource" "windows_customization" {
  depends_on = [
    module.vmware_vms
  ]
  triggers = {
    pw         = random_string.random_creds[0].result
    ip         = module.vmware_vms.ip[count.index]
    vm         = module.vmware_vms.VM[count.index]
    always_run = "${timestamp()}"
  }
  count = var.is_windows_image ? var.instances : 0
  connection {
    type     = "winrm"
    host     = module.vmware_vms.ip[count.index]
    user     = local.base_vm_user
    password = local.base_vm_password
    use_ntlm = true
    port     = 5986
    https    = true
  }

  provisioner "file" {
    source      = "user_data/win_default.ps1"
    destination = "C:\\win_default.ps1"
  }

  provisioner "remote-exec" {
    inline = [
      "set USER=${local.base_vm_user}",
      "set PASSWORD=${random_string.random_creds[0].result}",
      "set HOSTNAME=${module.vmware_vms.VM[count.index]}",
      "powershell.exe -ExecutionPolicy Unrestricted -File C:\\win_default.ps1 *> C:\\user-data.log",
    ]
  }

  provisioner "local-exec" {
    interpreter = ["/bin/bash", "-c"]
    command     = "./scripts/windows-post-install.sh"
    environment = {
      REPO            = var.git_repo_windows
      BRANCH          = var.git_repo_branch_windows
      IP              = module.vmware_vms.ip[count.index]
      USER            = local.base_vm_user
      PW              = "${random_string.random_creds[0].result}"
    }
  }
}

There is a lot going on here, so we will cut to the chase. The last bit of the code uses a provisioner called local-exec (more on that here). All this does is execute our script named windows-post-install.sh, with the environment section passing variables down to the executed script. The executed script is where the Ansible magic happens. Most of the script is below:

#!/bin/bash
set -e

echo "Executing post-installation script..."

# Clone the Ansible repo into a per-host working directory.
git clone "yourrepohere/$REPO" "$REPO-$IP"
cd "$REPO-$IP"
mkdir -p .ansible/roles
git checkout "$BRANCH"

# Point Ansible at the repo-local collections and roles, and skip
# host key checking for freshly built hosts.
export ANSIBLE_COLLECTIONS_PATHS=./collections
export ANSIBLE_ROLES_PATH=./.ansible/roles
export ANSIBLE_HOST_KEY_CHECKING=False

# This venv that is being cloned is defined in the agent
virtualenv-clone /azp/ansible-azure-venv venv
source venv/bin/activate
pip3 install jmespath
pip3 install -r requirements.txt
ansible-galaxy collection install --force -r ./collections/requirements.yml
ansible-galaxy role install --force --role-file ./roles/requirements.yml
ansible-galaxy list

# Run the base playbook against this host; set -x echoes the full
# command into the pipeline logs.
(set -x; ansible-playbook -i "$IP," ansible_win_base.yml --extra-vars "ansible_python_interpreter=python3 windows=true ansible_password=$PW ansible_user=$USER" -vv)

deactivate
cd ..
rm -fr "$REPO-$IP"

What this script does is first prepare everything needed to run Ansible. The exports at the top set up our collections and roles so Ansible knows where to go for certain things. Next, rather than building a virtual environment from scratch, we clone a pre-built venv that lives on the agent and activate it. Once the venv is in place, we use ansible-galaxy to install collections and roles. Lastly, we run the Ansible playbook titled ansible_win_base. At this point you may be asking yourself, "hey, what is going on with that funky syntax and set -x;?" Great question! set -x is used so that we can see all of the Ansible task execution within the context of the pipeline. This way we see the output of Terraform to validate that the infrastructure was built correctly, and we can see all parallel Ansible task execution running across all of our provisioned hosts.


Now we have fully provisioned virtual machines with another layered Ansible step that sets foundational configurations, and those configurations will be ensured by a scheduled pipeline running the same exact playbook! The other huge upside here is that we have identical images and identically built machines across all major compute platforms!


Next Time On AwesomeOps

We will pick up where we left off and discuss how we create an Ansible golden inventory in git that is then used by our scheduled Ansible ensure role to prevent configuration drift!

