Elasticsearch Frustration: The Curious Query

Last year I was poking at an Elasticsearch cluster to review the indexed data and verify that things were healthy. It was all good until I stumbled upon this weird document:

{
  "_version": 1,
  "_index": "events",
  "_type": "event",
  "_id": "_query",
  "_score": 1,
  "_source": {
    "query": {
      "bool": {
        "must": [
          {
            "range": {
              "date_created": {
                "gte": "2016-01-01"
              }
            }
          }
        ]
      }
    }
  }
}

It may not be immediately obvious what's going on in the above snippet. Instead of a valid event document, there's a document with a query as the contents. Additionally, the document ID appears to be _query instead of the expected GUID. The combination of these two irregularities makes it seem as if someone accidentally posted a query to the wrong endpoint. No problem, just delete the document, right?

DELETE /events/event/_query
ActionRequestValidationException[Validation Failed: 1: source is missing;]

Wat.

I reached out to some of my coworkers to see if they could point me in the right direction, but all that I received was an (unhelpful) "I've seen this error before, and we solved it, but no one seems to remember how it was done." Great.

After much head-scratching, it turns out that, since the ID is _query, Elasticsearch's URL router thinks that I'm trying to issue a query and validates the HTTP action as such. Part of that validation is the requirement that queries have a body. Oops.

While passing an empty object should conceivably have worked, I wanted to play things extra safe in case ES was executing the query (this was production, after all (why are you looking at me like that?)), so I passed in a query object that constrained the results to only the problematic document.

DELETE /events/event/_query
{
    "query": {
        "match": {
            "_id": "_query"
        }
    }
}

... and the document was deleted successfully! Hopefully putting this to blog form will help others who encounter it in the future (including me).
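If you need to script the same cleanup rather than issuing the request by hand, a rough Python equivalent of that DELETE (a sketch only - the localhost address and the events/event index and type are taken from the example above) would be:

import requests

# Old-style delete-by-query: the body constrains the delete to the one bad document.
response = requests.delete(
    'http://localhost:9200/events/event/_query',
    json={'query': {'match': {'_id': '_query'}}},
)
print(response.status_code, response.text)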

A DevOps Workflow, Part 3: Deployment

This series is a longform version of an internal talk I gave at a former company. It wasn't recorded. It has been mirrored here for posterity.

Congratulations, your code looks good! Now all you need to do is put your application in front of your users to discover all the creative ways they'll break it. In order to do this, we'll have to create our instances, configure them, and deploy our code.

CloudFormation: Infrastructure Definition

Amazon Web Services (AWS) is a common target for HumanGeo deployments. Traditionally, when one creates resources on AWS, one uses the management console interface. While this is a good way to experiment with an environment, it cannot be automated, nor can it be managed under version control. Amazon, recognizing that the web console is insufficient for serious provisioning and scaling purposes, provides a series of tools for application deployment. The one that best fits our needs is CloudFormation.

CloudFormation allows you to define your infrastructure as a collection of JSON objects. For example, an EC2 instance can be declared with the following block:

"ElasticSearchInstance": {
    "Properties": {
        "EbsOptimized": "true",
        "ImageId": { "Ref": "ImageId" },
        "InstanceType": { "Ref": "EsInstanceType" },
        "KeyName": { "Ref": "KeyName" },
        "NetworkInterfaces": [{
            "DeviceIndex": "0",
            "GroupSet": [
                { "Ref": "ElasticsearchSecurityGroup" },
                { "Ref": "SSHSecurityGroup" }
            ],
            "PrivateIpAddress": "10.0.0.11",
            "SubnetId": { "Ref": "Subnet" }
        }],
        "Tags": [{
            "Key": "Application",
            "Value": { "Ref": "AWS::StackId" }
        }, {
            "Key": "Class",
            "Value": "project-es"
        }, {
            "Key": "Name",
            "Value": "project-es01"
        }]
    },
    "Type": "AWS::EC2::Instance"
}

If you're familiar with EC2, much of the above should make sense to you. Fields with Ref objects are cross-references to other resources in the CloudFormation stack - both siblings and parameters. Once written, the JSON document can be uploaded to AWS and then run. What's really cool here is that we can do this with an Ansible task!

Since we prefer to maintain a separation between our instance provisioning and cloud provisioning scripts, our CloudFormation tasks usually reside in a standalone playbook named amazon.yml.

- name: Apply the CloudFormation template
  cloudformation:
    stack_name: proj_name
    state: present
    region: "us-east-1"
    template: "files/project-cfn.json"
    template_parameters:
      KeyName: project-key
      EsInstanceType: "r3.large"
      ImageId: "ami-d05e75b8"
    tags:
      Stack: "project-core"

This will not only upload the stack template to AWS, but it also will instantiate the stack with the provided parameters, which can be either constants or Ansible variables. The world is your oyster! Unlike with other AWS wrappers, CloudFormation is stateful, storing stack identifiers and only updating what needs to be updated.

After working with CloudFormation at scale, some warts really started getting to us - many of which stemmed from the fact that the templating language is JSON. Updating a template is painful. Its usage of strings instead of variables makes validation difficult, and there can be a significant amount of repetition if you have several similar resources. Thankfully, there exists a solution in the form of the awesome Python library troposphere. It provides a way to write a CloudFormation template in Python, with all the benefits of a full programming language. The tropospheric equivalent of our Elasticsearch stack is:

from troposphere import Ref, Tags, Template
from troposphere.ec2 import Instance, NetworkInterfaceProperty

# ami_id, es_instance_type, key_name, and friends are Parameter and resource
# objects defined earlier in the script.
t = Template()
t.add_resource(Instance(
    'ElasticSearchInstance',
    IamInstanceProfile=Ref(es_iam_instance_profile),
    ImageId=Ref(ami_id),
    InstanceType=Ref(es_instance_type),
    KeyName=Ref(key_name),
    Tags=Tags(Application=Ref(stack_id), Name='project-es01', Class='project-es'),
    NetworkInterfaces=[NetworkInterfaceProperty(
        GroupSet=[Ref(ssh_sg), Ref(es_sg)],
        AssociatePublicIpAddress=True,
        DeviceIndex='0',
        DeleteOnTermination=True,
        SubnetId=Ref(subnet),
        PrivateIpAddress='10.0.0.11',
    )],
    EbsOptimized=True,
))
print(t.to_json())

Since we're using bare variable names, we can use static analysis tools like Pylint to validate the template. Additionally, now everything can be scripted! Want to make multiple instances with the same configuration? With JSON, you were stuck copy-pasting the same chunks of text multiple times. With troposphere, it's just a matter of wrapping the instance definition in a function and invoking it multiple times.
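For instance, a small sketch of what that can look like (the es_instance helper, the two-node loop, and the 10.0.0.1x addressing are purely illustrative; the Ref'd parameters are the same ones used in the block above):

def es_instance(index):
    """Build one Elasticsearch node; the index drives the name and address."""
    return Instance(
        f'ElasticSearchInstance{index}',
        ImageId=Ref(ami_id),
        InstanceType=Ref(es_instance_type),
        KeyName=Ref(key_name),
        Tags=Tags(Application=Ref(stack_id),
                  Name=f'project-es{index:02d}',
                  Class='project-es'),
        NetworkInterfaces=[NetworkInterfaceProperty(
            GroupSet=[Ref(ssh_sg), Ref(es_sg)],
            DeviceIndex='0',
            SubnetId=Ref(subnet),
            PrivateIpAddress=f'10.0.0.{10 + index}',
        )],
    )

for i in range(1, 3):
    t.add_resource(es_instance(i))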

When you're ready to apply your template, simply execute it to get CloudFormation-compatible JSON, and you're good to go.
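In our layout (shown at the end of this post) a small Makefile handles that step, but it amounts to nothing more than running the script and capturing its output. A hypothetical variant that writes the file directly instead of printing would be:

# Hypothetical tail of project-cfn.py: write the rendered template to the
# path that the Ansible cloudformation task expects.
with open('files/project-cfn.json', 'w') as template_file:
    template_file.write(t.to_json())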

Provisioning: Ansible

Ansible was discussed in the local development post, but here it is again! Assuming you were principled in writing your local development playbook, aiming at the AWS cloud is pretty straightforward.

First, you'll need to make Ansible aware of your cloud instances. Sure, you could manually define the host IPs in your inventory, but that means you'll have to manually update the mapping of hosts to IP addresses any time you need to recreate an instance. If only there were some way to dynamically target these machines...

Thinking

There is! Ansible provides a means to use a dynamic inventory backed by AWS. Once you have your credentials configured, you can use any set of EC2 attributes to target your resources. Since we tend to provision clusters of machines in addition to standalone instances, it'd be nice to have a more general attribute selector than tag_Name_project_es01. This can be accomplished by applying our own ontology to our EC2 instances using tags. Notice the Class tag in the CloudFormation examples above. While every Elasticsearch instance we deploy will have a different Name tag, they'll all share a Class tag of project-es, which the dynamic inventory exposes as the group tag_Class_project_es. Get in the habit of using the project name as a prefix everywhere, since tags are global for your account.

When using the dynamic inventory, plays look like this:

- name: Build Elasticsearch instances
  hosts: tag_Class_project_es
  gather_facts: yes
  remote_user: ubuntu
  become: yes
  become_method: sudo
  roles:
    - common
    - es

With that, ansible-playbook -i inventory/ production.yml --private-key /path/to/project.pem will target all EC2 instances with a Class tag of project-es and apply the common and es roles.

One other aspect of your cloud deployment is that it may require secrets. You may need to store passwords for an emailer or private keys for encrypted RabbitMQ channels. Under normal circumstances, these wouldn't (and shouldn't) be stored in version control. However, once again our good pal Ansible swings by to help us out. Enter Ansible Vault.

For this example, we want to manage SMTP credentials using Ansible. First, let's create a secrets file: ansible-vault create vars/secrets.yml

You'll be prompted for a password. Remember this, as it's the only way you can decrypt the file. Now, let's add the variables to our file:

smtp_username: smtp_user@mydomain.biz
smtp_password: s3cur3!

Save and exit. Ansible Vault will automatically encrypt the contents. Now, you can reference the secrets in your playbook:

- name: Build Elasticsearch instances
  hosts: tag_Class_project_es
  gather_facts: yes
  remote_user: ubuntu
  become: yes
  become_method: sudo
  roles:
    - common
    - es
  vars_files:
    - vars/secrets.yml

Your roles don't need to know that the variables are encrypted; you can reference them just as you would any other variable. Decryption happens at runtime and requires an additional argument: ansible-playbook -i inventory/ production.yml --private-key /path/to/project.pem --ask-vault-pass. When run, Ansible will prompt you for the vault password. If it's correct, the file will be decrypted and the variables will be available to your tasks. If the password is incorrect, you're dropped back to the console and prompted to try again.

Continuous Delivery: Jenkins

At this point, we're now able to fully provision our cloud deployments with Ansible. Most teams stop here, but we take it even further. It's immensely useful to developers and other stakeholders to see just how things are shaping up, and catch any undetected integration bugs. To achieve this, we turn once more to our friends, Ansible and Jenkins, to create a test environment for us.

First, in order to prevent bugs in test from affecting the production environment, we must instantiate a standalone test environment. It's up to you to determine just how substantial this abstraction will be. For our purposes, we'll be creating separate EC2 instances, but nothing else. This is where troposphere once more proves its worth. We can wrap the generation of the relevant stack components behind a function that returns either a set of "prod" resources, or "test" ones. For example:

def get_instance(*, is_production):
    suffix = '' if is_production else '-test'
    title_suffix = '' if is_production else 'Test'  # logical IDs must be alphanumeric
    return Instance(
        f'MyInstance{title_suffix}',
        # edited for brevity
        Tags=Tags(Application=Ref(stack_id),
                  Name=f'project-instance{suffix}',
                  Class=f'project-instance{suffix}'),
        NetworkInterfaces=[NetworkInterfaceProperty(
            PrivateIpAddress=f'10.0.0.{1 + (10 if is_production else 100)}',
        )],
    )

t.add_resource(get_instance(is_production=True))
t.add_resource(get_instance(is_production=False))

Once those resources are instantiated, it's a simple matter of tweaking your production playbook to support test-environment-specific features - e.g., enabling verbose logging and debug mode.

This gets you most of the way to a fully automated test environment. However, there is the small matter of what actually triggers the Ansible deployment. Jenkins comes to the rescue… again. Since we're deploying from our main develop branch, we can set up a downstream project that provisions the test instance as follows:

  1. Install the Ansible plugin for Jenkins
  2. Create a new project.
    Project creation prompt
  3. Have it run only when the project that periodically validates our central branch succeeds.
    Project trigger prompt
  4. When this project runs, it should invoke the Ansible playbook.
    Ansible playbook prompt

You will have to make your private key available to Jenkins. If this concerns you, you can also provision an additional set of private keys solely for Jenkins, so, if you need to revoke access in the future, you don't have to go through the hassle of creating new private keys for the main account.

Now, when someone pushes to your development branch and the code is satisfactory (i.e., passing unit tests and lint checks), Jenkins will update the contents of your test server. Pure developer-driven architecture!

You can take this one step further and have Jenkins do something similar for your production environment, just triggered differently. This could utilize a GitFlow-like branching strategy where the master branch contains production-quality code, so updates there trigger a production deployment. Jenkins is pretty flexible, so, more often than not, you can assemble a combination of triggers and preconditions that narrows things down to exactly the conditions you want to cause a deployment.

Monitoring

Congrats! Now that you've got a fully automated and tested workflow, what are you going to do?

Disneyworld

Nope. You've got this masterpiece, yes, but how do you know it's actually running? When one manually deploys code, one usually follows up by clicking around on the site, checking out system load, etc. Jenkins does nothing of the sort, and neither does your application. Without manually checking, how do you know that the last commit didn't have an overly verbose method that's spammed your system logs to the point you don't have any free space? Or, how do you know that Supervisord wasn't misconfigured and is actually stopped? Much like with Schrödinger's cat, you don't. Time to start monitoring our stack.

Thankfully, commercial and free monitoring solutions can do this for you (except for Nagios - please stop). If your disk is full, send an email to the ops team. If you're starting to see some slow queries manifest themselves, nag the nerds in your #developers Slack channel. The tool we've moved to from Nagios is Sensu.

Sensu utilizes a client-server model, where the central server coordinates checks across a variety of nodes, and nodes run a client service that executes checks and collects metrics. Each client-side heartbeat sends data to the central server, which manages alerting states, etc. We also run Uchiwa, which provides a great dashboard for managing your monitoring environment.

Internally, we run a single Sensu server, to which all of our clients connect via an encrypted RabbitMQ channel. The encryption keys (and configuration) are deployed to the clients via a common, Ansible Vault-protected sensu role. What varies from deployment to deployment are the checks that are carried out. Within our playbooks, we define various subscriptions for a set of machines…

- name: Monitor the collector instance
  hosts: tag_Class_project_collector
  remote_user: ubuntu
  become: yes
  become_method: sudo
  roles:
    - role: sensu
      subscriptions: [ disk, collector, elasticsearch ]

We then customize the block of tasks that provision our checks, using when clauses that test the subscriptions collection:

- name: check elasticsearch cluster health
  sensu_check:
    name: elastic-health
    command: "{{ plugins_base_path }}/check-es-cluster-status.rb -h {{ hostvars[groups['tag_Name_project_es01'][0]]['ec2_private_ip_address'] }}"
    handlers: project-mailer
    standalone: yes
    interval: 300
  notify: restart sensu-client service
  when: "'elasticsearch' in subscriptions"

There are two awesome things going on here. The first is that Ansible ships with a Sensu check module. This is far nicer than maintaining our own templated JSON. Additionally, check out that hostvars statement! This taps into the power of the AWS dynamic inventory to allow for attribute lookups.

That's It

Whew! After all this, what does our final provisioning setup look like?

.
├── amazon.yml                 # Provision our AWS infrastructure
├── files                      # Top-level provisioning resources
│   ├── Makefile               # Runs the troposphere script to generate the template
│   ├── project-cfn.py         # troposphere script
│   └── project-cfn.json       # Generated CloudFormation template
├── group_vars
│   └── all
│       └── vars.yml           # Global Ansible vars
├── inventory
│   ├── base
│   ├── ec2.ini
│   └── ec2.py                 # Ansible AWS discovery script
├── production-monitoring.yml  # Provision infrastructure with Sensu software
├── production.yml             # Provision infrastructure with software
└── roles
    ├── cloudformation         # Applies the CloudFormation template with our parameters
    │   └── tasks
    │       └── main.yml
    ├── es
    │   ├── defaults
    │   │   └── main.yml       # Default ES variables, can be overridden by plays
    │   ├── tasks
    │   │   └── main.yml       # ES provisioning tasks
    │   └── templates
    │       ├── elasticsearch.yml.j2 # Jinja2 ES config template
    │       ├── kibana.yml.j2  # Kibana config template
    │       └── nginx.conf.j2  # NGINX config template
    └── sensu
        ├── defaults
        │   └── main.yml       # Ansible Vault encrypted Sensu certificates and passwords
        ├── files
        │   └── sudoers-sensu  # A config file to be applied on some systems. Not templated.
        ├── handlers
        │   └── main.yml       # Tasks that are triggered by changes within Sensu
        ├── tasks
        │   ├── checks.yml     # Tasks that are activated on each target based on subscriptions
        │   └── main.yml       # Baseline Sensu client installation
        ├── templates
        │   ├── client.json.j2      # Local Sensu client configuration template
        │   └── rabbitmq.json.j2    # RabbitMQ configuration for talking to the Sensu server
        └── vars
            ├── main.yml            # Base Sensu client configuration
            └── sensu_rabbitmq.yml  # RabbitMQ configuration

While what's been presented in this series may seem imposing, it really isn't if you take it one step at a time. Everything builds off the previous work, and even if you only implement a subset of the solution presented, you still get to reap the rewards of a better-managed, more consistent stack.

A DevOps Workflow, Part 2: Continuous Integration

This series is a longform version of an internal talk I gave at a former company. It wasn't recorded. It has been mirrored here for posterity.

Look at you – all fancy with your consistent and easily-managed development environment. However, that's only half of the local development puzzle. Sure, now developers can no longer use "it works on my machine" as an excuse, but all that means is they know that something runs. Without validation, your artisanal ramen may be indistinguishable from burned spaghetti. This is where unit testing and continuous integration really prove their worth.

Unit Testing

You can't swing a dead cat without hitting a billion different Medium posts and Hacker News articles about the One True Way to do testing. Protip: there isn't one. I prefer Test Driven Development (TDD), as it helps me design for failure as I build features. Others prefer to write tests after the fact, because it forces them to take a second pass over a chunk of functionality. All that matters is that you have and maintain tests. If you're feeling really professional, you should make test coverage a requirement for any and all code that is intended for production. Regardless, code verification through linting and tests is a vital part of a good DevOps culture.

Getting Started

Writing a test is easy. For Python, a preferred language at HumanGeo, there exist many different test frameworks and tools. One great option is pytest. It allows you to write tests as plain functions and assertions, without class-based boilerplate. For example:

# my_code.py

COUNTRIES = {'DE': 'Germany'}  # abridged country-code lookup table for this example

def get_country(country_code):
    return COUNTRIES.get(country_code)

# test_my_code.py

import my_code

def test_get_country(): # All tests start with 'test_'
    assert my_code.get_country('DE') == 'Germany'

When executed, the output will indicate success:

=============================== test session starts ===============================
platform darwin -- Python 3.6.0, pytest-3.0.5, py-1.4.32, pluggy-0.4.0
rootdir: /private/tmp, inifile:
collected 1 items

test_my_code.py .

========================== 1 passed in 0.01 seconds ===============================

or failure:

=============================== test session starts ===============================
platform darwin -- Python 3.6.0, pytest-3.0.5, py-1.4.32, pluggy-0.4.0
rootdir: /private/tmp, inifile:
collected 1 items

test_my_code.py F

==================================== FAILURES =====================================
________________________________ test_get_country _________________________________

    def test_get_country():
>       assert my_code.get_country('DE') == 'Germany'
E       assert 'Denmark' == 'Germany'
E         - Denmark
E         + Germany

test_my_code.py:6: AssertionError
============================ 1 failed in 0.03 seconds =============================

The inlining of failing code frames makes it easy to pinpoint the failing assertion, thus reducing unit testing headaches and boilerplate. For more on pytest, check out Jacob Kaplan-Moss's great introduction to the library.

Mocking

Mocking is a vital part of the testing equation. I don't mean making fun of your tests (that would be downright rude), but instead substituting fake (mock) objects in place of ones that serve as touchpoints to external code. This is nice because a good test shouldn't care about certain implementation details - it should just ensure that all cases are correctly handled. This especially holds true when relying on components outside of the purview of your application, such as web services, datastores, or the filesystem.

unittest.mock is my library of choice. To see how it's used, let's dive into an example:

# my_code.py

import os

def country_data_exists():
    return os.path.exists('/tmp/countries.json')

# test_my_code.py

from unittest.mock import patch
import my_code

@patch('os.path.exists')
def test_country_data_exists_success(path_exists_mock):
    path_exists_mock.return_value = True
    data_exists = my_code.country_data_exists()
    assert data_exists == True
    path_exists_mock.assert_called_once_with('/tmp/countries.json')

@patch('os.path.exists')
def test_country_data_exists_failure(path_exists_mock):
    path_exists_mock.return_value = False
    data_exists = my_code.country_data_exists()
    assert data_exists == False
    path_exists_mock.assert_called_once_with('/tmp/countries.json')

The patch function replaces the object at the provided path with a Mock object. These objects use Python magic to accept arbitrary calls and return defined values. Once the function that uses the mocked object has been invoked, we can inspect the mock and make various assertions about how it was called.
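Mocks can simulate failures, too. This test isn't part of the original example, but as a quick sketch, assigning side_effect makes the patched call raise instead of returning a value:

from unittest.mock import patch

import pytest

import my_code

@patch('os.path.exists')
def test_country_data_exists_oserror(path_exists_mock):
    # The mocked os.path.exists now raises whenever it's called.
    path_exists_mock.side_effect = OSError('permission denied')
    with pytest.raises(OSError):
        my_code.country_data_exists()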

If you're using the Requests library (which you should always do), responses allows you to intercept specific requests and return custom data:

# my_code.py

import requests

class MediaDownloadError(Exception):
    """Raised when a flag image can't be downloaded."""

def get_flag_image(country_code):
    response = requests.get(f'http://example.com/flags/{country_code}.png')
    if not response.ok:
        raise MediaDownloadError(f'Error downloading the image: HTTP {response.status_code}:\n{response.text}')
    return response.content

# test_my_code.py

import pytest
import responses

import my_code

@responses.activate # Tell responses to intercept this function's requests
def test_get_flag_image_404():
    responses.add(responses.GET, # The HTTP method to intercept
                  'http://example.com/flags/de.png', # The URL to intercept
                  body="These aren't the flags you're looking for", # The mocked response body
                  status=404) # The mocked response status
    with pytest.raises(my_code.MediaDownloadError) as download_error:
        my_code.get_flag_image('de')
    assert '404' in str(download_error.value)

More information on mocking in Python can be found here.

Continuous Integration: Jenkins

Throughout this process, we've been trusting our developers when they say their code works locally without issue. The better approach here is to trust, but verify. From bad merges to broad refactors, a host of issues can manifest themselves during the last few phases of task development. A good DevOps culture accepts that these are inevitable and must be addressed through automation. The practice of validating the most recent version of your codebase is called Continuous Integration (CI).

For this, we will use Jenkins, a popular open source tool designed for flexible CI workflows. It has a large community that provides plugins for integration with common tools, such as GitLab, Python Virtual Environments, and various test runners.

Once Jenkins has access to your GitLab instance, it can:

  1. Poll for merge requests targeting the main development branch;
    Jenkins polling for MR screenshot
  2. Attempt a merge of the feature branch into the trunk;
    Jenkins MR detection screenshot
  3. Run your linter;
    Jenkins lint command screenshot
    • Define your acceptable lint severity thresholds
      Jenkins polling for MR screenshot
  4. Run unit tests; and
    Jenkins unit test screenshot
  5. If any of the above steps result in a failure state, Jenkins will comment on the MR. Otherwise, the build is good, and the MR is given the green light.

By integrating Jenkins CI with GitLab merge requests, low-quality code can be detected and addressed before it enters your main branch. Newer versions of Jenkins even provide for defining your CI workflow as a file hosted within your repository. This way, your pipeline always corresponds to your codebase. GitLab has also launched a CI capability that may fit your needs.

This concludes the continuous integration portion of our dive into DevOps. In the next installment, we'll cover deployment!

A DevOps Workflow, Part 1: Local Development

This series is a longform version of an internal talk I gave at a former company. It wasn't recorded. It has been mirrored here for posterity.

How many times have you heard: "That's weird - it works on my machine?"

How often has a new employee's first task turned into a days-long effort, roping in several developers and revealing a surprising number of undocumented requirements, broken links and nondeterministic operations?

How often has a release gone south due to stovepiped knowledge, missing dependencies, and poor documentation?

In my experience, if you put a dollar into a swear jar whenever one of the above happened, plenty of people would be retiring early to spend time on their private islands. The fact that this situation exists is a huge problem.

What would an ideal solution look like? It should ensure consistency of environments, capture external dependencies, manage configuration, be self-documenting, allow for rapid iteration, and be as automated as possible. These features - the intersection of development and operations - make up the practice of DevOps. The solution shouldn't suck for your team - you need to maximize buy-in, and that can't be done when people need to fight container daemons and provisioning scripts every time they rebase to master.

In this series, I'll be walking through how we do DevOps at HumanGeo. Our strategy consists of three phases - local development, continuous integration, and deployment.

Please note that, while I mention specific technologies, I'm not stating that this is The One True Way™. We encourage our teams to experiment with new tools and methods, so this series presents a model that several teams have implemented with success, not official developer guidelines.

Development Environment: Vagrant

In order to best capture external dependencies, one should start with a blank slate. Thankfully, this doesn't mean a developer has to format her computer each time she takes on a new project. Depending on the project, it may be as simple as putting code into a new directory or creating a new virtual environment. However, given the scale of the problems we tackle at HumanGeo, we need to push even further and assemble specific combinations of databases, Elasticsearch nodes, Hadoop clusters, and other bespoke installations. To do so, we need to create sandboxed instances of the aforementioned tools; it's the only sane way to juggle multiple versions of a product when developing locally. There are plenty of fine solutions to this problem, Docker and Vagrant being two of the major players. There's not a perfect overlap between the two, but as they fit in our stack, they're near-equivalent. Since it provides a gentler learning curve, this series will cover Vagrant.

Vagrant provides a means for creating and managing portable development environments. Typically, these reside in VirtualBox virtual machines, although Vagrant supports many different backend providers. What's neat is that, with a single Vagrantfile, you can provision and connect multiple VMs, while automatically syncing code changes made on the host machine (i.e., your computer) to the guest instance (i.e., the Vagrant box).

To get started with Vagrant, you must define your configuration in a Vagrantfile. Here's a sample:

Vagrant.configure("2") do |config|
  config.vm.box = "trusty64"
  config.vm.hostname = "webserver"
  config.vm.network :private_network, ip: "192.168.0.42"

  config.vm.provider :virtualbox do |vb|
    vb.customize [
      "modifyvm", :id,
      "--memory", "256",
    ]
  end

  config.vm.provision :shell, path: "bootstrap.sh"
end

This defines an Ubuntu 14.04 (Trusty Tahr) machine with a fixed private IP, 256 MB of RAM, and a bootstrap shell script, which will install needed dependencies and apply software-level configuration. The Vagrantfile can be committed to version control alongside the bootstrap script and your application code so the entire environment can be captured in a single snapshot.

Launching the machine is done with a single command: vagrant up. Vagrant will download the trusty64 base image from a central repository, launch a new instance of it with the hardware and networking states we've defined, and then run the bootstrap file. The image download only occurs once per image, so future machine initializations will use the cached version. Machines can be stopped with vagrant halt and re-launched later with vagrant up. If you decide that you need to nuke your entire environment from orbit and start over (an immensely useful option), you can do so with vagrant destroy.

To manage these machines, one can connect via SSH just as one would a remote server. The vagrant ssh command will automatically log the user in using public key authentication. From there, a developer can experiment with configuration and other aspects of application development. All ports are exposed to the host machine, so, if a webserver is bound to port 5000, it can be reached from your browser at http://192.168.0.42:5000 (the IP address we assigned to our instance in the Vagrantfile).

Unlike when working with a remote server, you don't need to run a terminal-based editor via SSH, or use rsync every time you save a file in order to make changes to the code on the virtual machine. Instead, the directory that contains your Vagrantfile is automatically mounted as /vagrant/ on the guest, with changes automatically synced back and forth. So, you can use whatever editor you want on the host, while executing code on the VM. Easy.

Provisioning: Ansible

Vagrant itself is only really focused on the orchestration of virtual machines; the configuration of the machines is outside of its purview. As such, it relies on a provisioner - an external tool or script that runs against newly created virtual machines in order to build upon the base image. For example, a provisioner would be responsible for taking a blank Ubuntu installation and installing PostgreSQL, initializing a database, and seeding the database with data.

The example Vagrantfile uses a simple shell script (bootstrap.sh) to handle provisioning. For simple cases, this may well be sufficient. However, if you're doing any serious development, you'll want to move to a more robust configuration management tool. Vagrant ships with support for several different ones, including our preferred tool - Ansible.

Ansible is great in many ways: its YAML-based configuration language is clean and logical, it operates over SSH, has a great community, emphasizes modularity, and doesn't require any custom software be present on your target computers (other than Python 2, with Python 3 support in the technical preview phase). With a little elbow grease, you can even make it idempotent, so there's nothing to fear if you reprovision an instance. Since these provisioning scripts live alongside your code, they can be included in your merge review process, and improve validation of your infrastructure.

Swapping out Vagrant's shell provisioner is extremely straightforward. Just change your provisioner to "ansible", point it at the Ansible configuration script (called a playbook), and you're set! The final provisioning block should now look like this:

config.vm.provision "ansible" do |ansible|
  ansible.playbook = "playbook.yml"
end

Tasks

Ansible's basic building block is a task. Conceptually, a task is an atomic operation. These operations run the gamut from the basic (e.g., set the permissions on a file) to the complex (e.g., create a database table). Here's a sample task:

- name: Install database
  apt: name=mysql-server state=present

The equivalent shell command would be sudo apt-get install mysql-server. Nothing fancy, right?

- name: Deploy DB config
  copy: src=mysql.{{env_name}}.conf dest=/etc/mysql.conf mode=0644

There are several things going on here. First, surprise! Ansible is awesome and speaks Jinja2. As such, it will interpolate the variable env_name into the string value for src, resulting in mysql.dev.conf if we were targeting a dev environment (env_name is a convention we use internally for this very purpose). Next, we're invoking the copy module. This doesn't actually copy a file from one remote location to another, it instead copies a local file to a remote destination. This saves you from having to scp the file to your target machine, then remote in to set a permission. It's also far easier to understand at a glance.

- name: Start mysqld
  service: name=mysql state=started enabled=yes

Finally, we ensure that the MySQL service is not only running, but is set to automatically start when the system does. This highlights one of the benefits of Ansible's module system - it masks (and handles) underlying implementation complexities. Whether or not the target machine is using SysV-style inits, Upstart, or systemd, the service module takes care of it for you.

Roles

Tasks can either reside in your playbook, or they can be organized into functional units called roles. Roles not only allow you to group tasks, but also bundle files, templates and other resources, providing for a clean separation of concerns. The tasks above can be placed in a file called tasks/main.yml, resulting in the following directory structure:

roles
└── mysql                   # Tasks to be carried out on DB machines
    ├── files
    │   ├── mysql.dev.conf
    │   └── mysql.prod.conf
    └── tasks
        └── main.yml

Then, all you need to do is reference the role from within your playbook.

Playbooks

These are the entry points for Ansible. A playbook comprises one or more plays, each of which possesses several parameters: one or more instances to target, variables to bundle, and a series of tasks (or roles) to execute.

- name: Configure the test environment app server
  hosts: 192.168.1.1
  vars:
    env_name: dev
    es_version: 2.1.0
  roles:
    - common
    - elasticsearch
    - mysql

What is evident in the above example is how Ansible roles help improve modularity and reusability. If I have to install MySQL on several different hosts (e.g., the test app server and the production app server), all I need to do is include the role. Ansible also maintains a central repository of community roles (Ansible Galaxy) for developers to customize; most of the time you don't need to write any novel provisioning code.

To invoke the playbook, run ansible-playbook name-of-playbook.yml. If you're using Ansible with Vagrant, you should instead use vagrant provision, as Vagrant will handle the mapping of hosts and authentication. And, no matter how many times you provision, the machine state should remain the same.

This concludes the local development portion of our dive into DevOps. In the next installment, we'll cover continuous integration!

Switching to NeoVim (Part 2)

2016-11-03 Update: Now using the XDG-compliant configuration location.

Now that my initial NeoVim configuration is in place, I'm ready to get to work, right? Well, almost. In my excitement to make the leap from one editor to another, I neglected a portion of my attempt to keep Vim and NeoVim isolated - the local runtimepath (usually ~/.vim).

"But Aru, if NeoVim is basically Vim, shouldn't they be able to share the directory?" Usually, yes. But I anticipate, as I start to experiment with some of the new features and functionality of NeoVim, I might add plugins that I want to keep isolated from my Vim instance.

I'd like Vim to use ~/.vim and NeoVim to use ~/.config/nvim. Accomplishing this is simple - I must first detect whether or not I'm running NeoVim and base the root path on the outcome of that test:

if has('nvim')
    let s:editor_root=expand("~/.config/nvim")
else
    let s:editor_root=expand("~/.vim")
endif

With the root directory in a variable named editor_root, all that's left is a straightforward find and replace to convert all rtp references to the new syntax.

e.g. let &rtp = &rtp . ',.vim/bundle/vundle/' becomes let &rtp = &rtp . ',' . s:editor_root . '/bundle/vundle/'

With those replacements out of the way, things almost worked. Almost.

I use Vundle. I think it's pretty rad. My vimrc file is configured to automatically install it and download all of the defined plugins in the event of a fresh installation. However, the first time I launched NeoVim with the above changes, it didn't result in a fresh install - it was still reading the ~/.vim directory's plugins.

Perplexed, I dove into the Vundle code. Sure enough, it appears to default to installing plugins to $HOME/.vim if a directory isn't passed in to the script initialization function. It appears that I was reliant on this default behavior. Thankfully, this was easily solved by passing in my own bundle path:

call vundle#rc(s:editor_root . '/bundle')

And with that, my Vim and NeoVim instances were fully isolated.

Switching to NeoVim (Part 1)

2016-11-03 Update: Now using the XDG-compliant configuration location.

NeoVim is all the rage these days, and I can't help but be similarly enthused. Unlike other editors, which have varying degrees of crappiness with their Vim emulation, NeoVim is Vim.

If it's Vim, why bother switching? Much like all squares are rectangles, but not all rectangles are squares, NeoVim has a different set of aspirations and features. While vanilla Vim has the (noble and important) goal of supporting all possible platforms, that legacy has seemingly held it back from eliminating warts and adding new features. That's both a good thing and a bad thing. Good because it's stable, bad because it can lead to stagnation. A Vim contributor, annoyed with how the project was seemingly hamstrung by this legacy (with its accompanying byzantine code structure, project structure, and conventions), decided to take matters into his own hands and fork the editor.

The name of the fork? NeoVim.

It brings quite a lot to the table, and deserves a blog post or two in its own right. I'll leave the diffing as an exercise to the reader. I plan on writing about some of those differences as I do more with the fork's unique features.

So, what did I need to do to switch to NeoVim? I installed it. On Kubuntu, all I needed to do was add a PPA and install the neovim package (and its Python bindings for full plugin support).

$ sudo add-apt-repository -y ppa:neovim-ppa/unstable
$ sudo apt-get update && sudo apt-get install -y neovim
$ pip install --user neovim

Next up, configuration - one of Vim's great strengths. I dutifully keep a copy of my vimrc file on GitHub, and deploy it to any workstation I use for prolonged periods of time. It'd be nice if I could carry it over to NeoVim.

Surprise! It Just Works™. Remember, NeoVim is Vim; as such, it shares the same configuration syntax. Since I don't think I'm doing anything too crazy in my vimrc, it should be a drop-in operation.

$ mkdir -p ~/.config/nvim
$ ln -s ~/.vimrc ~/.config/nvim/init.vim

After that, it's a simple matter of invoking nvim from the command line. Everything loaded and worked for me from the first run!

Almost.

This pleasant detour over, I went to resume lolologist development. However, when I activated my virtual environment and fired up nvim, I got a message stating:

No neovim module found for Python 2.7.8. Try installing it with 'pip install neovim' or see ':help nvim-python'.

Hm. That's strange. The relevant help docs, however, tell us all we need to know - the Python plugin needs to be discoverable in our path, and, since I'm using a virtual environment, a different Python instance is being used. This is easily addressed, as detailed in that help doc. However, since I use this vimrc file on two platforms (Linux & OS X), I need to be a little smarter about hardcoding paths to Python executables. I added this to my vimrc (it shouldn't negatively impact my Vim use, so it's fine to be in a shared configuration).

if has("unix")
  let s:uname = system("uname")
  let g:python_host_prog='/usr/bin/python'
  if s:uname == "Darwin\n"
    let g:python_host_prog='/usr/local/bin/python' " found via `which python`
  endif
endif

Restarting NeoVim with that configuration block in place let it find my system Python and all associated plugins.

I'll keep this site updated with any new discoveries and NeoVim experiments! I'm quite eager to see how the client-server integrations flesh out.

I've written more on this! Part 2.

Supercharging your Reddit API Access

This was originally posted to a former employer's blog. It has been mirrored here for posterity.

Here at HumanGeo we do all sorts of interesting things with sentiment analysis and entity resolution. Before you get to have fun with that, though, you need to bring data into the system. One data source we've recently started working with is reddit.

Compared to the walled gardens of Facebook and LinkedIn, reddit's API is about as open as it gets: everything is nice and RESTful, rate limits are sane, the developers are open to enhancement requests, and one can do quite a bit without needing to authenticate. The most common objects we collect from reddit are submissions (posts) and comments. A submission can either be a link or a self post with a text body, and can have an arbitrary number of comments. Comments contain text, as well as references to parent nodes (if they're not root nodes in the comment tree). Pulling this data is as simple as GET http://www.reddit.com/r/washingtondc/new.json. (Protip: pretty much any view in reddit has a corresponding API endpoint that can be generated by appending .json to the URL.)
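For example, grabbing the newest submissions without any wrapper takes only a few lines of Python with requests (the user agent string below is just a placeholder - reddit asks that you set a descriptive one):

import requests

headers = {'User-Agent': 'demo scraper 0.1'}
listing = requests.get('http://www.reddit.com/r/washingtondc/new.json',
                       headers=headers).json()
for child in listing['data']['children']:
    print(child['data']['title'])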

With little effort a developer could hack together a quick 'n dirty reddit scraper. However, as additional features appear and collection breadth grows, the quick 'n dirty scraper becomes more dirty than quick, and you discover bugs (read: features) that others utilizing the API have already encountered and possibly addressed. API wrappers help consolidate communal knowledge and best practices for the good of all. We considered several, and, being a Python shop, settled on PRAW (Python Reddit API Wrapper).

With PRAW, getting a list of posts is pretty easy:

import praw
r = praw.Reddit(user_agent='Hello world application.')
for post in r.get_subreddit('WashingtonDC') \
             .get_hot(limit=5):
    print(str(post))

$ python parse_bot_2000.py
209 :: /r/WashingtonDC's Official Guide to the City!
29 :: What are some good/active meetups in DC that are easy to join?
17 :: So no more tailgating at the Nationals park anymore...
3 :: Anyone know of a juggling club in DC
2 :: The American Beer Classic: Yay or Nay?

The Problem

Now, let's try something a little more complicated. Our mission, if we choose to accept it, is to capture all incoming comments to a subreddit. For each comment we should collect the author's username, the URL for the submission, a permalink to the comment, as well as its body.

Here's what this looks like:

import praw
from datetime import datetime
r = praw.Reddit(user_agent='Subreddit Parse Bot 2000')

def save_comment(*args):
    print(datetime.now().time(), args)

for comment in r.get_subreddit('Python') \
                .get_comments(limit=200):
    save_comment(comment.permalink,
        comment.author.name,
        comment.body_html,
        comment.submission.url)

That was pretty easy. For the sake of this demo the save_comment method has been stubbed out, but anything can go there.

If you run the snippet, you'll observe the following pattern:

... comment ...
... comment ...
[WAIT FOR A FEW SECONDS]
... comment ...
... comment ...
[WAIT FOR A FEW SECONDS]
... comment ...
... comment ...
[WAIT FOR A FEW SECONDS]
(repeating...)

This process also seems to be taking longer than a normal HTTP request. As anyone working with large amounts of data should do, let's quantify this.

Using the wonderful, indispensable IPython:

In [1]: %%timeit
requests.get('http://www.reddit.com/r/python/comments.json?limit=200')
   ....:
1 loops, best of 3: 136 ms per loop

In [2]: %%timeit
import praw
r = praw.Reddit(user_agent='Subreddit Parse Bot 2000',
                cache_timeout=0)
for comment in r.get_subreddit('Python') \
                .get_comments(limit=200):
    print(comment.permalink,
          comment.author.name,
          comment.body_html,
          comment.submission.url)
1 loops, best of 3: 6min 43s per loop

Ouch. While this difference in run-times is fine for a one-off, contrived example, such inefficiency is disastrous when dealing with large volumes of data. What could be causing this behavior?

Digging

According to the PRAW documentation,

Each API request to Reddit must be separated by a 2 second delay, as per the API rules. So to get the highest performance, the number of API calls must be kept as low as possible. PRAW uses lazy objects to only make API calls when/if the information is needed.

Perhaps we're doing something that is triggering additional HTTP requests. Such behavior would explain the intermittent printing of comments to the output stream. Let's verify this hypothesis.

To see the underlying requests, we can override PRAW's default log level:

from datetime import datetime
import logging
import praw

logging.basicConfig(level=logging.DEBUG)

r = praw.Reddit(user_agent='Subreddit Parse Bot 2000')

def save_comment(*args):
    print(datetime.now().time(), args)

for comment in r.get_subreddit('Python') \
                .get_comments(limit=200):
    save_comment(comment.author.name,
                 comment.submission.url,
                 comment.permalink,
                 comment.body_html)

And what does the output look like?

DEBUG:requests.packages.urllib3.connectionpool:"PUT /check HTTP/1.1" 200 106
DEBUG:requests.packages.urllib3.connectionpool:"GET /comments/2ak14j.json HTTP/1.1" 200 888
.. comment ..
DEBUG:requests.packages.urllib3.connectionpool:"GET /comments/2aies0.json HTTP/1.1" 200 2889
.. comment ..
DEBUG:requests.packages.urllib3.connectionpool:"GET /comments/2aiier.json HTTP/1.1" 200 14809
.. comment ..
DEBUG:requests.packages.urllib3.connectionpool:"GET /comments/2ajam1.json HTTP/1.1" 200 1091
.. comment ..
.. comment ..
.. comment ..

Those intermittent requests for individual comments back up our claim. Now, let's see what's causing this.

Prettifying the response JSON yields the following schema (edited for brevity):

{
   'kind':'Listing',
   'data':{
      'children':[
         {
            'kind':'t3',
            'data':{
               'id':'2alblh',
               'media':null,
               'score':0,
               'permalink':'/r/Python/comments/2alblh/django_middleware_that_prints_query_stats_to/',
               'name':'t3_2alblh',
               'created':1405227385.0,
               'url':'http://imgur.com/QBSOAZB',
               'title':'Should I? why?',
            }
         }
      ]
   }
}

Let's compare that to what we get when listing comments from the /r/python/comments endpoint:

{
   'kind':'Listing',
   'data':{
      'children':[
         {
            'kind':'t1',
            'data':{
               'link_title':'Django middleware that prints query stats to stderr after each request. pip install django-querycount',
               'link_author':'mrrrgn',
               'author':'mrrrgn',
               'parent_id':'t3_2alblh',
               'body':'Try django-devserver for query counts, displaying the full queries, profiling, reporting memory usage, etc. \n\nhttps://pypi.python.org/pypi/django-devserver',
               'link_id':'t3_2alblh',
               'name':'t1_ciwbo37',
               'link_url':'https://github.com/bradmontgomery/django-querycount',
            }
         }
      ]
   }
}

Now we're getting somewhere - there are fields in the per-comment's response that aren't in the subreddit listing's. Of the four fields we're collecting, the submission URL and permalink properties are not returned by the subreddit comments endpoint. Accessing those causes a lazy evaluation to fire off additional requests. If we can infer these values from the data we already have, we can avoid having to waste time querying for each comment.

Doing Work

Submission URLs

Submission URLs are a combination of the subreddit name, the post ID, and title. We can easily get the post ID fragment:

post_id = comment.link_id.split('_')[1]

However, there is no title returned! Luckily, it turns out that it's not needed.

subreddit = 'python'
post_id = comment.link_id.split('_')[1]
url = 'http://reddit.com/r/{}/comments/{}/' \
          .format(subreddit, post_id)
print(url) # A valid submission permalink!
# OUTPUT: http://reddit.com/r/python/comments/2alblh/

Great! This also gets us most of the way to constructing the second URL we need - a permalink to the comment.

Comment Permalinks

Maybe we can append the comment's ID to the end of the submission URL?

sub_comments_url = 'http://reddit.com/r/python/comments/2alblh/'
comment_id = comment.name.split('_')[1]
url = sub_comments_url + comment_id
print(url)
# OUTPUT: http://reddit.com/r/python/comments/2alblh/ciwbo37

Sadly, that URL doesn't work because reddit expects the submission's title to precede the ID. Referring to the subreddit comment's JSON object, we can see that the title is not returned. This is curious: why is the title important? They already have a globally unique ID for the post, and can display the post just fine without (as demonstrated by the code sample immediately preceding this). Perhaps reddit wanted to make it easier for users to identify a link and are just parsing a forward-slash delimited series of parameters. If we put the comment ID in the appropriate position, the URL should be valid. Let's give it a shot:

sub_comments_url = 'http://reddit.com/r/python/comments/2alblh/'
comment_id = comment.name.split('_')[1]
url = '{}-/{}'.format(sub_comments_url, comment_id)
print(url)
# OUTPUT: http://reddit.com/r/python/comments/2alblh/-/ciwbo37

Following that URL takes us to the comment!
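To tie both derivations together, here's a small helper (a sketch, not part of the original post) that builds the submission and comment permalinks using only the fields the subreddit listing already returns:

def build_urls(comment, subreddit):
    """Derive submission and comment permalinks without extra API calls."""
    post_id = comment.link_id.split('_')[1]
    comment_id = comment.name.split('_')[1]
    submission_url = 'http://reddit.com/r/{}/comments/{}/'.format(subreddit, post_id)
    comment_url = '{}-/{}'.format(submission_url, comment_id)
    return submission_url, comment_url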

Victory Lap

Let's see how much we've improved our execution time:

%%timeit
import praw
r = praw.Reddit(user_agent='Subreddit Parse Bot 2000',
                cache_timeout=0)
for comment in r.get_subreddit('Python') \
                .get_comments(limit=200):
    print(comment.author.name,
          comment.body_html,
          comment.id,
          comment.submission.id)
1 loops, best of 3: 3.57 s per loop

Wow! 403 seconds to 3.6 seconds - a factor of 111. Deploying this improvement to production not only increased the volume of data we were able to process, but also provided the side benefit of reducing the number of 504 errors we encountered during reddit's peak hours. Remember, always be on the lookout for ways to improve your stack. A bunch of small wins can add up to something significant.

Accessing Webcams with Python

So, I've been working on a tool that turns your commit messages into image macros, named Lolologist. This was a great learning exercise because it gave me insight into things I haven't encountered before - namely:

  1. Packaging python modules
  2. Hooking into Git events
  3. Using PIL (through Pillow) to manipulate images and text
  4. Accessing a webcam through Python on *nix-like platforms

I might talk to the first three at a later point, but the latter was the most interesting to me as someone who enjoys finding weird solutions to nontrivial problems.

Perusing the internet turns up two third-party tools for "python webcam linux": Pygame & OpenCV. Great! The only problem is that these come in at 10 MB and 92 MB, respectively. Wanting to keep the package light and free of unnecessary dependencies, I set out to find a simpler solution...

Read more…

Disabling caching in Flask

It was your run of the mill work day: I was trying to close out stories from a jam-packed sprint, while QA was doing their best to break things. Our test server was temporarily out of commission, so we had to run a development server in multithreaded mode.

One bug came across my inbox, flagging a feature that I had developed. Specifically, a dropdown that displayed a list of saved user content wasn't updating. I attempted to replicate the issue on my machine - no luck. Fired up my IE VM to test and, yet again, no luck. Weird.

I walked over to the tester's desk and asked her to walk me through the bug. Sure enough, upon saving an item, the contents of the dropdown did not update. Stumped, I returned to my machine and spoofed her account, wondering if there was an issue affecting her account within our mocked database backend (read: flat file). I was able to see the "missing" saved content. Now I was getting somewhere.

I walked back to her machine, opened the F12 Developer Tools, and watched the network traffic. The GET for that dynamically-populated list was resulting in a 304 status, and IE was using a cached version of the endpoint. Of course. I facepalmed; this was something I hadn't normally had to deal with, since I usually use a cachebuster. However, for this new application, my team has been trying to do things as cleanly as possible. Wanting to keep our hack count down, I went back to my machine to see what our web framework could do.

Coming from a four-year stint in ASP.NET land, I was used to being able to set caching per endpoint via the [OutputCache(NoStore = true, Duration = 0, VaryByParam = "None")] ActionAttribute. Sadly, there didn't appear to be something similar in Flask. Thankfully, it turned out to be fairly trivial to write one.

I found this post from Flask creator Armin Ronacher that set me on the right track, but the snippet provided didn't work for all the Internet Explorer versions we were targeting. However, I was able to whip together a decorator that did:
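The decorator was originally embedded as a gist, which hasn't survived the mirroring. A minimal sketch of what it does (wrap the view's response and set the headers that tell IE and friends not to cache), assuming it lives in a nocache.py module, looks something like this:

# nocache.py - sketch of a no-cache decorator for Flask view functions
from functools import wraps

from flask import make_response

def nocache(view):
    @wraps(view)
    def no_cache_view(*args, **kwargs):
        response = make_response(view(*args, **kwargs))
        # Belt-and-suspenders headers so IE won't serve a stale 304.
        response.headers['Cache-Control'] = 'no-store, no-cache, must-revalidate, max-age=0'
        response.headers['Pragma'] = 'no-cache'
        response.headers['Expires'] = '-1'
        return response
    return no_cache_view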

To invoke it, all you need to do is import the function and apply it to your endpoint.

from nocache import nocache

@app.route('/my_endpoint')
@nocache
def my_endpoint():
    return render_template(...)

I took this back to QA and was met with success. The downside of this manual implementation, however, is one needs to be religious in applying it. To this day we still stumble upon cases where a developer forgot to add the decorator to a JSON endpoint. Thankfully, code review processes are perfect for catching that sort of omission.

Multithreading your Flask dev server to safety

Flask is awesome. It's lightweight enough to disappear, but extensible enough to be able to get some niceties such as auth and ACLs without much effort. On top of that, the Werkzeug debugger is pretty handy (not as nice as wdb's, though).

Things were going swimmingly until our QA server went down. While the server may have stopped, development didn't, and we needed a way to get testable builds up and running for QA. One of my fellow developers quickly stood up an instance of our application on a lightly-used box that was nearing the end of its usefulness. To get around the fact that the machine wasn't outfitted for httpd action, the developer just backgrounded an instance of the Flask development server and established an SSH tunnel for QA to use. This was deemed acceptable by our QA team of one. We rejoiced and went back to work.

Days passed and our system engineer's backlog remained dangerously full, with the QA server remaining a low priority. Thankfully, aside from having to restart the process a few times, the Little Server That Could kept on chugging. But then disaster struck - we hired another tester! Technically her joining was a great thing; when you bring in fresh blood you get not only another person working the trenches but also an influx of fresh ideas. However, her arrival brought some unforeseen trouble - the QA server would get more than one user! This wouldn't normally be an issue, but at the moment our QA server was nothing more than a single threaded developer script running as an unmanaged process. Oops.

Sure enough, the complaints from QA started bubbling forth: "Test is down!", "Test isn't loading!", "Test is unusable!" To preserve harmony in the office, and to hold to our end of the developer-tester bargain, we had to do something (the system engineer was busy rebuilding a dead Solr cluster). We needed to buy some time, so I started investigating what we could do with what we already had - maybe there was a way to handle the load with the development server.

Flask uses the Werkzeug WSGI library to manage its server, so that would be a good place to start. Sure enough, the docs state that the WSGIref wrapper accepts a parameter specifying whether or not it should use a thread per request. Now, how to invoke this from within Flask?

This, much like with everything else in the framework, was surprisingly easy. Flask's app.run function accepts an option collection which it passes on to Werkzeug's run_simple. I updated our dev server startup script...

app.run(host="0.0.0.0", port=8080, threaded=True)

... and then threw it back over the wall to QA. I moseyed on over to test-land shortly thereafter to discover them chipping away at a backlog of tasks, with the webserver serving away as if nothing happened.

A few weeks later, the new test server is almost ready for prime time, and our multithreaded Flask server (which should never be used for any sort of production purposes) is still holding down the fort.