A DevOps Workflow, Part 3: Deployment

This series is a longform version of an internal talk I gave at a former company. It wasn't recorded. It has been mirrored here for posterity.

Congratulations, your code looks good! Now all you need to do is put your application in front of your users to discover all the creative ways they'll break it. In order to do this, we'll have to create our instances, configure them, and deploy our code.

CloudFormation: Infrastructure Definition

Amazon Web Services (AWS) is a common target for HumanGeo deployments. Traditionally, when one creates resources on AWS, one uses the management console interface. While this is a good way to experiment with an environment, it cannot be automated, nor can it be managed under version control. Amazon, recognizing that the web console is insufficient for serious provisioning and scaling purposes, provides a series of tools for application deployment. The one that best fits our needs is CloudFormation.

CloudFormation allows you to define your infrastructure as a collection of JSON objects. For example, an EC2 instance can be declared with the following block:

"ElasticSearchInstance": {
    "Properties": {
        "EbsOptimized": "true",
        "ImageId": { "Ref": "ImageId" },
        "InstanceType": { "Ref": "EsInstanceType" },
        "KeyName": { "Ref": "KeyName" },
        "NetworkInterfaces": [{
            "DeviceIndex": "0",
            "GroupSet": [
                { "Ref": "ElasticsearchSecurityGroup" },
                { "Ref": "SSHSecurityGroup" }
            ],
            "PrivateIpAddress": "10.0.0.11",
            "SubnetId": { "Ref": "Subnet" }
        }],
        "Tags": [{
            "Key": "Application",
            "Value": { "Ref": "AWS::StackId" }
        }, {
            "Key": "Class",
            "Value": "project-es"
        }, {
            "Key": "Name",
            "Value": "project-es01"
        }]
    },
    "Type": "AWS::EC2::Instance"
}

If you're familiar with EC2, much of the above should make sense to you. Fields with Ref objects are cross-references to other resources in the CloudFormation stack - both siblings and parameters. Once written, the JSON document can be uploaded to AWS and then run. What's really cool here is that we can do this with an Ansible task!

Since we prefer to maintain a separation between our instance provisioning and cloud provisioning scripts, our CloudFormation tasks usually reside in a standalone playbook named amazon.yml.

- name: Apply the CloudFormation template
  cloudformation:
    stack_name: proj_name
    state: present
    region: "us-east-1"
    template: "files/project-cfn.json"
    template_parameters:
      KeyName: project-key
      EsInstanceType: "r3.large"
      ImageId: "ami-d05e75b8"
    tags:
      Stack: "project-core"

This not only uploads the stack template to AWS, but also instantiates the stack with the provided parameters, which can be either constants or Ansible variables. The world is your oyster! Unlike other AWS wrappers, CloudFormation is stateful, storing stack identifiers and only updating what needs to be updated.

After working with CloudFormation at scale, we started running into some warts - many of which stem from the fact that the templating language is JSON. Updating a template is painful, the use of strings instead of variables makes validation difficult, and there can be a significant amount of repetition if you have several similar resources. Thankfully, there exists a solution in the form of the awesome Python library troposphere. It provides a way to write a CloudFormation template in Python, with all the benefits of a full programming language. The tropospheric equivalent of our Elasticsearch stack is:

from troposphere import Ref, Tags, Template
from troposphere.ec2 import Instance, NetworkInterfaceProperty

# Parameters (ami_id, es_instance_type, key_name, and friends) are defined earlier in the script
t = Template()
t.add_resource(Instance(
    'ElasticSearchInstance',
    IamInstanceProfile=Ref(es_iam_instance_profile),
    ImageId=Ref(ami_id),
    InstanceType=Ref(es_instance_type),
    KeyName=Ref(key_name),
    Tags=Tags(Application=Ref(stack_id), Name='project-es01', Class='project-es'),
    NetworkInterfaces=[NetworkInterfaceProperty(
        GroupSet=[Ref(ssh_sg), Ref(es_sg)],
        AssociatePublicIpAddress=True,
        DeviceIndex='0',
        DeleteOnTermination=True,
        SubnetId=Ref(subnet),
        PrivateIpAddress='10.0.0.11',
    )],
    EbsOptimized=True,
))
print(t.to_json())

Since we're using bare variable names, we can use static analysis tools like Pylint to validate the template. Additionally, now everything can be scripted! Want to make multiple instances with the same configuration? With JSON, you were stuck copy-pasting the same chunks of text multiple times. With troposphere, it's just a matter of wrapping the instance definition in a function and invoking it multiple times.

When you're ready to apply your template, simply execute it to get CloudFormation-compatible JSON, and you're good to go.
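
For convenience, this compile step can live right in amazon.yml, ahead of the cloudformation task. A minimal sketch, assuming a plain python invocation (ours is wrapped in the Makefile shown in the directory layout at the end of this post):

# Regenerate the CloudFormation JSON on the machine running Ansible
- name: Compile the troposphere script into a CloudFormation template
  shell: python files/project-cfn.py > files/project-stack.json
  delegate_to: localhost

- name: Apply the compiled template
  cloudformation:
    stack_name: proj_name
    state: present
    region: "us-east-1"
    template: "files/project-stack.json"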

Provisioning: Ansible

Ansible was discussed in the local development post, but here it is again! Assuming you were principled in writing your local development playbook, aiming at the AWS cloud is pretty straightforward.

First, you'll need to make Ansible aware of your cloud instances. Sure, you could manually define the host IPs in your inventory, but that means you'll have to manually update the mapping of hosts to IP addresses any time you need to recreate an instance. If only there was some way to dynamically target these machines...

There is! Ansible provides a means to use a dynamic inventory backed by AWS. Once you have your credentials configured, you can use any set of EC2 attributes to target your resources. Since we tend to provision clusters of machines in addition to standalone instances, it'd be nice to have a more general attribute selector than tag_Name_project_es01. This can be accomplished by applying our own ontology to our EC2 instances using tags. Notice the Class tag in the CloudFormation examples above. While every Elasticsearch instance we deploy will have a different Name tag, they'll all share a Class tag of project-es, which the dynamic inventory exposes as the group tag_Class_project_es. Get in the habit of using the project name as a prefix everywhere, since inventory groups built from tags span your entire account.

When using the dynamic inventory, plays look like this:

- name: Build Elasticsearch instances
  hosts: tag_Class_project_es
  gather_facts: yes
  remote_user: ubuntu
  become: yes
  become_method: sudo
  roles:
    - common
    - es

With that, ansible-playbook -i inventory/ production.yml --private-key /path/to/project.pem will target all EC2 instances with a Class tag of project-es and apply the common and es roles.

One other aspect of your cloud deployment is that it may require secrets. You may need to store passwords for an emailer or private keys for encrypted RabbitMQ channels. Under normal circumstances, these wouldn't (and shouldn't) be stored in version control. However, once again our good pal Ansible swings by to help us out. Enter Ansible Vault.

For this example, we want to manage SMTP credentials using Ansible. First, let's create a secrets file: ansible-vault create vars/secrets.yml

You'll be prompted for a password. Remember this, as it's the only way you can decrypt the file. Now, let's add the variables to our file:

smtp_username: smtp_user@mydomain.biz
smtp_password: s3cur3!

Save and exit. Ansible Vault will automatically encrypt the contents. Now, you can reference the secrets in your playbook:

- name: Build Elasticsearch instances
  hosts: tag_Class_project_es
  gather_facts: yes
  remote_user: ubuntu
  become: yes
  become_method: sudo
  roles:
    - common
    - es
  vars_files:
    - vars/secrets.yml

Your roles don't need to know that the variables are encrypted; you can reference them just as you would any other variable. Decryption happens at runtime and requires an additional argument: ansible-playbook -i inventory/ production.yml --private-key /path/to/project.pem --ask-vault-pass. When run, Ansible will prompt you for the vault password. If it's correct, the file will be decrypted and the variables will be available to your tasks. If the password is incorrect, you're dropped back to the console and prompted to try again.
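
For example, a role task can render those credentials into an application config file. A minimal sketch, assuming a hypothetical emailer.conf.j2 template that references smtp_username and smtp_password:

# emailer.conf.j2 is illustrative - any template or task can use the vaulted variables
- name: Deploy the emailer configuration
  template:
    src: emailer.conf.j2
    dest: /etc/project/emailer.conf
    owner: root
    group: root
    mode: "0600"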

Continuous Delivery: Jenkins

At this point, we're now able to fully provision our cloud deployments with Ansible. Most teams stop here, but we take it even further. It's immensely useful to developers and other stakeholders to see just how things are shaping up, and catch any undetected integration bugs. To achieve this, we turn once more to our friends, Ansible and Jenkins, to create a test environment for us.

First, in order to prevent bugs in test from affecting the production environment, we must instantiate a standalone test environment. It's up to you to determine just how substantial this separation will be. For our purposes, we'll be creating separate EC2 instances, but nothing else. This is where troposphere once more proves its worth. We can wrap the generation of the relevant stack components in a function that returns either a set of "prod" resources or "test" ones. For example:

def get_instance(*, is_production):
    suffix = '' if is_production else '-test'
    # CloudFormation logical IDs must be alphanumeric, so the hyphen stays out of the title
    logical_id = 'MyInstance' if is_production else 'MyInstanceTest'
    return Instance(
        logical_id,
        # edited for brevity
        Tags=Tags(Application=Ref(stack_id),
                  Name=f'project-instance{suffix}',
                  Class=f'project-instance{suffix}'),
        NetworkInterfaces=[NetworkInterfaceProperty(
            PrivateIpAddress='10.0.0.11' if is_production else '10.0.0.101',
        )],
    )

t.add_resource(get_instance(is_production=True))
t.add_resource(get_instance(is_production=False))

Once those resources are instantiated, it's a simple matter of tweaking your production playbook to support test-environment-specific features - e.g., enabling verbose logging and debug mode - as sketched below.
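
A minimal sketch of such a play, assuming the test instances carry a Class tag of project-es-test (which the dynamic inventory exposes as tag_Class_project_es_test); the tag and variable names are illustrative:

- name: Build Elasticsearch test instances
  hosts: tag_Class_project_es_test
  gather_facts: yes
  remote_user: ubuntu
  become: yes
  become_method: sudo
  vars:
    env_name: test      # roles can key off this to enable debug mode, verbose logging, etc.
    enable_debug: yes
  roles:
    - common
    - es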

This gets you most of the way to a fully automated test environment. However, there is the small matter of what actually triggers the Ansible deployment. Jenkins comes to the rescue… again. Since we're deploying from our main develop branch, we can set up a downstream project that provisions the test instance as follows:

  1. Install the Ansible plugin for Jenkins
  2. Create a new project.
    Project creation prompt
  3. Have it run only when the project that periodically validates our central branch succeeds.
    Project trigger prompt
  4. When this project runs, it should invoke the Ansible playbook.
    Ansible playbook prompt

You will have to make your private key available to Jenkins. If this concerns you, you can also provision an additional set of private keys solely for Jenkins, so, if you need to revoke access in the future, you don't have to go through the hassle of creating new private keys for the main account.

Now, when someone pushes to your development branch and the code is satisfactory (i.e., passing unit tests and lint checks), Jenkins will update the contents of your test server. Pure developer-driven architecture!

You can take this one step further and have Jenkins do something similar for your production environment, just triggered differently. This could utilize a GitFlow-like branching strategy where the master branch contains production-quality code, so updates there trigger a production deployment. Jenkins is pretty flexible, so, more often than not, you can put together a combination of triggers and preconditions that narrows things down to exactly the events you want to cause a deployment.

Monitoring

Congrats! Now that you've got a fully automated and tested workflow, what are you going to do?

Go to Disney World?

Nope. You've got this masterpiece, yes, but how do you know it's actually running? When one manually deploys code, one usually follows up by clicking around on the site, checking out system load, etc. Jenkins does nothing of the sort, and neither does your application. Without manually checking, how do you know that the last commit didn't have an overly verbose method that's spammed your system logs to the point you don't have any free space? Or, how do you know that Supervisord wasn't misconfigured and is actually stopped? Much like with Schrödinger's cat, you don't. Time to start monitoring our stack.

Thankfully, commercial and free monitoring solutions can do this for you (except for Nagios - please stop). If your disk is full, send an email to the ops team. If you're starting to see some slow queries manifest themselves, nag the nerds in your #developers Slack channel. The tool we've moved to from Nagios is Sensu.

Sensu utilizes a client-server model, where the central server coordinates checks across a variety of nodes, and nodes run a client service that executes checks and collects metrics. Each client-side heartbeat sends data to the central server, which manages alerting states, etc. We also run Uchiwa, which provides a great dashboard for managing your monitoring environment.

Internally, we run a single Sensu server, to which all of our clients connect via an encrypted RabbitMQ channel. The encryption keys (and configuration) are deployed to the clients via a common sensu role, with the secrets protected by Ansible Vault. What varies from deployment to deployment is the set of checks that are carried out. Within our playbooks, we define various subscriptions for a set of machines…

- name: Monitor the collector instance
  hosts: tag_Class_project_collector
  remote_user: ubuntu
  become: yes
  become_method: sudo
  roles:
    - role: sensu
      subscriptions: [ disk, collector, elasticsearch ]

We then customize the block of tasks that provision our checks, using when clauses that test the subscriptions collection:

- name: check elasticsearch cluster health
  sensu_check:
    name: elastic-health
    command: "{{ plugins_base_path }}/check-es-cluster-status.rb -h {{ hostvars[groups['tag_Name_project_es01'][0]]['ec2_private_ip_address'] }}"
    handlers: project-mailer
    standalone: yes
    interval: 300
  notify: restart sensu-client service
  when: "'elasticsearch' in subscriptions"

There are two awesome things going on here. The first is that Ansible ships with a Sensu check module, which is far nicer than maintaining our own templated JSON. The second is that hostvars statement, which taps into the power of the AWS dynamic inventory to allow for attribute lookups - here, resolving the private IP address of the es01 instance.
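
For completeness, the restart sensu-client service handler that the check task notifies lives in the role's handlers file; roles/sensu/handlers/main.yml boils down to something like this (a sketch, not the full file):

- name: restart sensu-client service
  service:
    name: sensu-client
    state: restarted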

That's It

Whew! After all this, what does our final provisioning setup look like?

.
├── amazon.yml                 # Provision our AWS infrastructure
├── files                      # Top-level provisioning resources
│   ├── Makefile               # Compiles the troposphere script
│   ├── project-cfn.py         # troposphere script
│   └── project-stack.json     # Compiled troposphere script
├── group_vars
│   └── all
│       └── vars.yml           # Global Ansible vars
├── inventory
│   ├── base
│   ├── ec2.ini
│   └── ec2.py                 # Ansible AWS discovery script
├── production-monitoring.yml  # Provision infrastructure with Sensu software
├── production.yml             # Provision infrastructure with software
└── roles
    ├── cloudformation         # Applies the CloudFormation template with our parameters
    │   └── tasks
    │       └── main.yml
    ├── es
    │   ├── defaults
    │   │   └── main.yml       # Default ES variables, can be overridden by plays
    │   ├── tasks
    │   │   └── main.yml       # ES provisioning tasks
    │   └── templates
    │       ├── elasticsearch.yml.j2 # Jinja2 ES config template
    │       ├── kibana.yml.j2  # Kibana config template
    │       └── nginx.conf.j2  # NGINX config template
    └── sensu
        ├── defaults
        │   └── main.yml       # Ansible Vault encrypted Sensu certificates and passwords
        ├── files
        │   └── sudoers-sensu  # A config file to be applied on some systems. Not templated.
        ├── handlers
        │   └── main.yml       # Tasks that are triggered by changes within Sensu
        ├── tasks
        │   ├── checks.yml     # Tasks that are activated on each target based on subscriptions
        │   └── main.yml       # Baseline Sensu client installation
        ├── templates
        │   ├── client.json.j2      # Local Sensu client configuration template
        │   └── rabbitmq.json.j2    # RabbitMQ configuration for communicating with the Sensu server
        └── vars
            ├── main.yml            # Base Sensu client configuration
            └── sensu_rabbitmq.yml  # RabbitMQ configuration

While what's been presented in this series may seem imposing, it really isn't if you take it one step at a time. Everything builds off the previous work, and even if you only implement a subset of the solution presented, you still get to reap the rewards of a better-managed, more consistent stack.

A DevOps Workflow, Part 2: Continuous Integration

This series is a longform version of an internal talk I gave at a former company. It wasn't recorded. It has been mirrored here for posterity.

Look at you – all fancy with your consistent and easily-managed development environment. However, that's only half of the local development puzzle. Sure, now developers can no longer use "it works on my machine" as an excuse, but all that means is they know that something runs. Without validation, your artisanal ramen may be indistinguishable from burned spaghetti. This is where unit testing and continuous integration really prove their worth.

Unit Testing

You can't swing a dead cat without hitting a billion different Medium posts and Hacker News articles about the One True Way to do testing. Protip: there isn't one. I prefer Test Driven Development (TDD), as it helps me design for failure as I build features. Others prefer to write tests after the fact, because it forces them to take a second pass over a chunk of functionality. All that matters is that you have and maintain tests. If you're feeling really professional, you should make test coverage a requirement for any and all code that is intended for production. Regardless, code verification through linting and tests is a vital part of a good DevOps culture.

Getting Started

Writing a test is easy. For Python, a preferred language at HumanGeo, there exist many different test frameworks and tools. One great option is pytest. It lets you write plain test functions for your code without boilerplate. For example:

# my_code.py

COUNTRIES = {'DE': 'Germany'}  # trimmed for the example

def get_country(country_code):
    return COUNTRIES.get(country_code)


# test_my_code.py

import my_code

def test_get_country():  # All tests start with 'test_'
    assert my_code.get_country('DE') == 'Germany'

When executed, the output will indicate success:

=============================== test session starts ===============================
platform darwin -- Python 3.6.0, pytest-3.0.5, py-1.4.32, pluggy-0.4.0
rootdir: /private/tmp, inifile:
collected 1 items

test_code.py .

========================== 1 passed in 0.01 seconds ===============================

or failure:

=============================== test session starts ===============================
platform darwin -- Python 3.6.0, pytest-3.0.5, py-1.4.32, pluggy-0.4.0
rootdir: /private/tmp, inifile:
collected 1 items

test_code.py F

==================================== FAILURES =====================================
________________________________ test_get_country _________________________________

    def test_get_country():
>       assert my_code.get_country('DE') == 'Germany'
E       assert 'Denmark' == 'Germany'
E         - Denmark
E         + Germany

test_code.py:6: AssertionError
============================ 1 failed in 0.03 seconds =============================

The inlining of failing code frames makes it easy to pinpoint the failing assertion, thus reducing unit testing headaches and boilerplate. For more on pytest, check out Jacob Kaplan-Moss's great introduction to the library.

Mocking

Mocking is a vital part of the testing equation. I don't mean making fun of your tests (that would be downright rude), but instead substituting fake (mock) objects in place of ones that serve as touchpoints to external code. This is nice because a good test shouldn't care about certain implementation details - it should just ensure that all cases are correctly handled. This especially holds true when relying on components outside of the purview of your application, such as web services, datastores, or the filesystem.

unittest.mock is my library of choice. To see how it's used, let's dive into an example:

# my_code.py

import os

def country_data_exists():
    return os.path.exists('/tmp/countries.json')


# test_my_code.py

from unittest.mock import patch
import my_code

@patch('os.path.exists')
def test_country_data_exists_success(path_exists_mock):
    path_exists_mock.return_value = True
    data_exists = my_code.country_data_exists()
    assert data_exists == True
    path_exists_mock.assert_called_once_with('/tmp/countries.json')

@patch('os.path.exists')
def test_country_data_exists_failure(path_exists_mock):
    path_exists_mock.return_value = False
    data_exists = my_code.country_data_exists()
    assert data_exists == False
    path_exists_mock.assert_called_once_with('/tmp/countries.json')

The patch function replaces the object at the provided path with a Mock object. These objects use Python magic to accept arbitrary calls and return defined values. Once the function that uses the mocked object has been invoked, we can inspect the mock and make various assertions about how it was called.

If you're using the Requests library (which you should always do), responses allows you to intercept specific requests and return custom data:

# my_code.py

import requests

class MediaDownloadError(Exception):
    """Raised when a flag image cannot be fetched."""

def get_flag_image(country_code):
    response = requests.get(f'http://example.com/flags/{country_code}.gif')
    if not response.ok:
        raise MediaDownloadError(f'Error downloading the image: HTTP {response.status_code}:\n{response.text}')
    return response.content


# test_my_code.py

import pytest
import responses

import my_code

@responses.activate # Tell responses to intercept this function's requests
def test_get_flag_image_404():
    responses.add(responses.GET, # The HTTP method to intercept
                  'http://example.com/flags/de.gif', # The URL to intercept
                  body="These aren't the gifs you're looking for", # The mocked response body
                  status=404) # The mocked response status
    with pytest.raises(my_code.MediaDownloadError) as download_error:
        my_code.get_flag_image('de')
    assert '404' in str(download_error.value)

More information on mocking in Python can be found in the unittest.mock documentation.

Continuous Integration: Jenkins

Throughout this process, we've been trusting our developers when they say their code works locally without issue. The better approach here is to trust, but verify. From bad merges to broad refactors, a host of issues can manifest themselves during the last few phases of task development. A good DevOps culture accepts that these are inevitable and must be addressed through automation. The practice of validating the most recent version of your codebase is called Continuous Integration (CI).

For this, we will use Jenkins, a popular open source tool designed for flexible CI workflows. It has a large community that provides plugins for integration with common tools, such as GitLab, Python Virtual Environments, and various test runners.

Once Jenkins has access to your GitLab instance, it can:

  1. Poll for merge requests targeting the main development branch;
    Jenkins polling for MR screenshot
  2. Attempt a merge of the feature branch into the trunk;
    Jenkins MR detection screenshot
  3. Run your linter;
    Jenkins lint command screenshot
    • Define your acceptable lint severity thresholds
      Jenkins polling for MR screenshot
  4. Run unit tests; and
    Jenkins unit test screenshot
  5. If any of the above steps result in a failure state, Jenkins will comment on the MR. Otherwise, the build is good, and the MR is given the green light.

By integrating Jenkins CI with GitLab merge requests, low-quality code can be detected and addressed before it enters your main branch. Newer versions of Jenkins even let you define your CI workflow as a file hosted within your repository, so your pipeline always corresponds to your codebase. GitLab has also launched a CI capability that may fit your needs.
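
To give a flavor of the repository-hosted approach, here is a minimal .gitlab-ci.yml sketch mirroring the lint-and-test steps above (the image, package names, and commands are illustrative, not our actual pipeline):

image: python:3.6

stages:
  - lint
  - test

lint:
  stage: lint
  script:
    - pip install -r requirements.txt pylint
    - pylint my_code.py

test:
  stage: test
  script:
    - pip install -r requirements.txt pytest
    - pytest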

This concludes the continuous integration portion of our dive into DevOps. In the next installment, we'll cover deployment!

A DevOps Workflow, Part 1: Local Development

This series is a longform version of an internal talk I gave at a former company. It wasn't recorded. It has been mirrored here for posterity.

How many times have you heard: "That's weird - it works on my machine?"

How often has a new employee's first task turned into a days-long effort, roping in several developers and revealing a surprising number of undocumented requirements, broken links and nondeterministic operations?

How often has a release gone south due to stovepiped knowledge, missing dependencies, and poor documentation?

In my experience, if you put a dollar into a swear jar whenever one of the above happened, plenty of people would be retiring early to spend time on their private islands. The fact that this situation exists is a huge problem.

What would an ideal solution look like? It should ensure consistency of environments, capture external dependencies, manage configuration, be self-documenting, allow for rapid iteration, and be as automated as possible. These features - the intersection of development and operations - make up the practice of DevOps. The solution shouldn't suck for your team - you need to maximize buy-in, and that can't be done when people need to fight container daemons and provisioning scripts every time they rebase to master.

In this series, I'll be walking through how we do DevOps at HumanGeo. Our strategy consists of three phases - local development, continuous integration, and deployment.

Please note that, while I mention specific technologies, I'm not stating that this is The One True Way™. We encourage our teams to experiment with new tools and methods, so this series presents a model that several teams have implemented with success, not official developer guidelines.

Development Environment: Vagrant

In order to best capture external dependencies, one should start with a blank slate. Thankfully, this doesn't mean a developer has to format her computer each time she takes on a new project. Depending on the project, it may be as simple as putting code into a new directory or creating a new virtual environment. However, given the scale of the problems we tackle at HumanGeo, we need to push even further and assemble specific combinations of databases, Elasticsearch nodes, Hadoop clusters, and other bespoke installations. To do so, we need to create sandboxed instances of the aforementioned tools; it's the only sane way to juggle multiple versions of a product when developing locally. There are plenty of fine solutions to this problem, Docker and Vagrant being two of the major players. There's not a perfect overlap between the two, but as they fit in our stack, they're near-equivalent. Since it provides a gentler learning curve, this series will cover Vagrant.

Vagrant provides a means for creating and managing portable development environments. Typically, these reside in VirtualBox virtual machines, although Vagrant supports many different backend providers. What's neat is that, with a single Vagrantfile, you can provision and connect multiple VMs, while automatically syncing code changes made on the host machine (i.e., your computer) to the guest instance (i.e., the Vagrant box).

To get started with Vagrant, you must define your configuration in a Vagrantfile. Here's a sample:

Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"
  config.vm.hostname = "webserver"
  config.vm.network :private_network, ip: "192.168.0.42"

  config.vm.provider :virtualbox do |vb|
    vb.customize [
      "modifyvm", :id,
      "--memory", "256",
    ]
  end

  config.vm.provision :shell, path: "bootstrap.sh"
end

This defines an Ubuntu 14.04 (Trusty Tahr) machine with a fixed private IP, 256 MB of RAM, and a bootstrap shell script, which will install needed dependencies and apply software-level configuration. The Vagrantfile can be committed to version control alongside the bootstrap script and your application code, so the entire environment is captured in a single snapshot.

Launching the machine is done with a single command: vagrant up. Vagrant will download the ubuntu/trusty64 base image from a central repository, launch a new instance of it with the hardware and networking settings we've defined, and then run the bootstrap script. The image download only occurs once per image, so future machine initializations will use the cached version. Machines can be stopped with vagrant halt and re-launched later with vagrant up. If you decide that you need to nuke your entire environment from orbit and start over (an immensely useful option), you can do so with vagrant destroy.

To manage these machines, one can connect via SSH just as one would a remote server. The vagrant ssh command will automatically log the user in using public key authentication. From there, a developer can experiment with configuration and other aspects of application development. Because the VM has its own private IP, its ports are reachable from the host machine; if a webserver is bound to port 5000, it can be reached from your browser at http://192.168.0.42:5000 (the IP address we assigned to our instance in the Vagrantfile).

Unlike when working with a remote server, you don't need to run a terminal-based editor via SSH, or use rsync every time you save a file in order to make changes to the code on the virtual machine. Instead, the directory that contains your Vagrantfile is automatically mounted as /vagrant/ on the guest, with changes automatically synced back and forth. So, you can use whatever editor you want on the host, while executing code on the VM. Easy.

Provisioning: Ansible

Vagrant itself is only really focused on the orchestration of virtual machines; the configuration of the machines is outside of its purview. As such, it relies on a provisioner - an external tool or script that runs against newly created virtual machines in order to build upon the base image. For example, a provisioner would be responsible for taking a blank Ubuntu installation and installing PostgreSQL, initializing a database, and seeding the database with data.

The example Vagrantfile uses a simple shell script (bootstrap.sh) to handle provisioning. For simple cases, this may well be sufficient. However, if you're doing any serious development, you'll want to move to a more robust configuration management tool. Vagrant ships with support for several different ones, including our preferred tool - Ansible.

Ansible is great in many ways: its YAML-based configuration language is clean and logical, it operates over SSH, has a great community, emphasizes modularity, and doesn't require any custom software be present on your target computers (other than Python 2, with Python 3 support in the technical preview phase). With a little elbow grease, you can even make it idempotent, so there's nothing to fear if you reprovision an instance. Since these provisioning scripts live alongside your code, they can be included in your merge review process, and improve validation of your infrastructure.

Swapping out Vagrant's shell provisioner is extremely straightforward. Just change your provisioner to "ansible", point it at the Ansible configuration script (called a playbook), and you're set! The final provisioning block should now look like this:

config.vm.provision "ansible" do |ansible|
  ansible.playbook = "playbook.yml"
end

Tasks

Ansible's basic building block is a task. Conceptually, a task is an atomic operation. These operations run the gamut from the basic (e.g., set the permissions on a file) to the complex (e.g., create a database table). Here's a sample task:

- name: Install database
  apt: name=mysql-server state=present

The equivalent shell command would be sudo apt-get install mysql-server. Nothing fancy, right?

- name: Deploy DB config
  copy: src=mysql.{{env_name}}.conf dest=/etc/mysql.conf mode=644

There are several things going on here. First, surprise! Ansible is awesome and speaks Jinja2. As such, it will interpolate the variable env_name into the string value for src, resulting in mysql.dev.conf if we were targeting a dev environment (env_name is a convention we use internally for this very purpose). Next, we're invoking the copy module. This doesn't copy a file from one remote location to another; instead, it copies a local file to a remote destination. This saves you from having to scp the file to your target machine and then remote in to set permissions. It's also far easier to understand at a glance.

- name: Start mysqld
  service: name=mysql state=started enabled=yes

Finally, we ensure that the MySQL service is not only running, but also set to start automatically when the system does. This highlights one of the benefits of Ansible's module system - it masks (and handles) underlying implementation complexities. Whether the target machine is using a SysV-style init, Upstart, or systemd, the service module takes care of it for you.

Roles

Tasks can either reside in your playbook, or they can be organized into functional units called roles. Roles not only allow you to group tasks, but also bundle files, templates and other resources, providing for a clean separation of concerns. The tasks above can be placed in a file called tasks/main.yml, resulting in the following directory structure:

roles
└── mysql                   # Tasks to be carried out on DB machines
    ├── files
    │   ├── mysql.dev.conf
    │   └── mysql.prod.conf
    └── tasks
        └── main.yml

Then, all you need to do is reference the role from within your playbook.

Playbooks

These are the entry points for Ansible. A playbook is made up of one or more plays, each of which has several parameters: one or more instances to target, variables to bundle, and a series of tasks (or roles) to execute.

- name: Configure the test environment app server
  hosts: 192.168.1.1
  vars:
    env_name: dev
    es_version: 2.1.0
  roles:
    - common
    - elasticsearch
    - mysql

What is evident in the above example is how Ansible roles help improve modularity and reusability. If I have to install MySQL on several different hosts (e.g., the test app server and the production app server), all I need to do is include the role. Ansible also maintains a central repository of community roles, Ansible Galaxy, that developers can pull from and customize; most of the time you don't need to write any novel provisioning code.
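
For instance, third-party roles can be declared in a requirements.yml and pulled down with ansible-galaxy install -r requirements.yml. A minimal sketch using a couple of well-known community roles (the choices are illustrative):

# requirements.yml
- src: geerlingguy.mysql
- src: geerlingguy.nginx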

To invoke the playbook, run ansible-playbook name-of-playbook.yml. If you're using Ansible with Vagrant, you should instead use vagrant provision, as Vagrant will handle the mapping of hosts and authentication. And, no matter how many times you provision, the machine state should remain the same.
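
As a quick illustration of that idempotence, a task that would otherwise re-run on every provision can be guarded so it only fires once. A minimal sketch with illustrative paths - the marker file is created only after the seed succeeds, and the creates argument tells Ansible to skip the task on subsequent runs:

- name: Seed the database
  shell: mysql project < /vagrant/seed.sql && touch /var/lib/project/.seeded
  args:
    creates: /var/lib/project/.seeded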

This concludes the local development portion of our dive into DevOps. In the next installment, we'll cover continuous integration!