Migrating from Gmail to Fastmail

In 2004, Google launched Gmail. This service changed everything. You didn't have to worry about running over your few megabytes of quota - your storage space was "unlimited" (with a ticker and everything)! Deletion was a thing of the past - now you archive! Folders were so nineteen-ninety-late - there were labels! I got an invite within the first two weeks of it launching, and it was good.

A decade or so later, Google launched Inbox, which brought innovations like bundles, snoozing, highlights, pinning, sweeping, and smart filtering to the deluge of email that flooded your account each day. I switched from Gmail to Inbox, and it was better.

Then, in true Google fashion, it was sacrificed at the altar of project mismanagement (or whatever the lack of product strategy is called). And it was bad.

Since being forced back to Gmail, I've constantly lamented the death of a service that made dealing with email less painful for me. Gmail is not only stagnant, it's also slow: it regularly fails to load new messages, or seemingly loses track of what it should be showing, necessitating a hard refresh. Sure, Google has thrown it a few bones, like smart replies, but I respond to so few emails that spending a few extra seconds to formulate a response has never been an issue.

Given these concerns, I realized that the "stickiness" of Gmail was gone. I have the means to pay for service, and nothing is keeping me on Gmail (other than the fact that everyone's been using my Gmail address for 16 years). Leaving sounded doable, and I owned a personal domain that I'd love to use for email. The only question was: where do I go for hosting?

Assessing the Field

At the time of my research, there were a handful of buzzy, well-regarded options, both free and paid.

  • G Suite would do nothing for me other than allow better custom domain integration. It doesn't help that G Suite accounts are second-class citizens in the Google ecosystem.
  • Outlook Premium is Microsoft's offering. However, I'm wary of Microsoft pushing Exchange over more open protocols, so it was an instant no-go.
  • Self hosting is always an option, but it's risky. Unless one is willing to keep up with security patches, harden their networks, manage backups, and provide reliable connectivity, this is a nonstarter. My time is worth more than what I'd save dealing with these potential headaches, so I wrote it off.
  • Protonmail acquired a decent amount of hype because of their security and privacy-focused approach to hosting. They went as far as making it so that you need special tooling to decrypt your messages. This is an attractive offering, but it also means you need to run a compatibility layer to use their service with standard email clients. Additionally, they don't offer a calendaring service (which is nice to have, as most calendar applications use one's email address as a form of identity). Additional storage is also expensive, costing $1/mo/GB over the included 5GB.
  • Hey is a service by DHH (of Rails/Basecamp fame) that claims to solve many of the annoyances that taint one's relationship with their inbox. They're currently in a public beta, and I have an invite. Unfortunately, they're a nonstarter: no custom domain support. This is on their roadmap, but until it happens, no deal. They're also the most expensive provider on this list at $99/year.
  • Fastmail is another well-regarded provider. They've been around for a while, have a competitive feature set, and present an air of competence. Of particular interest to me were their Gmail import/sync tool and robust custom domain support.

While Protonmail was tempting with its privacy-first approach, email is fundamentally insecure, and given my threat model, Fastmail's sheer amount of functionality won me over.

Switching to Fastmail

Making the jump was pretty straightforward. I created an account and opted to start with my custom domain. After following their detailed instructions for configuring my DNS settings, I was off to the races!

Gmail migration

Next, I had to address my Gmail inbox - I'd like to have all my messages in one place. I once tried downloading my entire Gmail inbox via IMAP. It took ages. Thankfully, Fastmail's utility was quicker than that, and everything synced over in a few hours.

Once enabled, Fastmail's Gmail sync continues to run. This means that I can gradually migrate everything to my new address while continuing to receive email for both in my new inbox.

The clients

I used Gmail on the web and on Android. While the Gmail app can speak IMAP, I opted to jump into the deep end and switched to the Fastmail web and mobile clients.


The web client is noticeably snappier for me than the bloated mess that Gmail has become, so switching was generally an upgrade. It also had the added benefit of consuming far less RAM. They have a comparable set of keyboard shortcuts, and a well-supported native dark mode. The only thing I find myself missing is having my unread message count in the favicon.


The mobile client, on the other hand, is pretty obviously not a native app. My guess is Fastmail went with a cross-platform framework, like React Native, to optimize for development speed over user experience. As a result, the platform's interaction patterns feel "off".

Performance is okay. I'll occasionally hit stutters or overly long loading indicators, and the frame rate doesn't match what native apps achieve on my 90 Hz display. Offline caching is hit or miss, and there's no ability to adjust caching per label.

A positive, however, is that one can do almost everything on mobile that they can do on desktop, including setting filters (unlike with Gmail). And notifications have been prompt.

At some point, I'll probably kick the tires on a new app.


  • Calendaring - Like Google, Fastmail offers an integrated calendaring feature. They don't offer automatic event detection, but the other features work. I'm not quite ready to switch calendaring services, so I was really happy to discover that, once connected, Fastmail lets me use my Google Calendar as my default. This is wonderfully considerate.
  • Hangouts - Hi, it's me: one of the 3 people who still use Hangouts. Naturally, it's not integrated with the Fastmail webapp, but there's a standalone web client.
  • Smart Categorization - I didn't realize just how reliant I was on Google's smart email categorization. Unfortunately, I'm now back to having to set filter rules to label emails. While tedious for the first few weeks, the work naturally lessened. Additionally, it forced me to confront just how many unnecessary mailing lists I was on, at which point I KonMari'd them from my life.

5 months later...

I wrote most of the above sections 5 months ago. I, a procrastinator, didn't get around to finishing this until just now. In that time, I've had no issues with Fastmail, and plan on continuing to use their service. If this interests you, save on your first year with my referral link!

Optimizing Rust Binary Size

I develop and maintain a git extension called git-req. It enables developers to check out pull requests from GitHub and GitLab by their number instead of branch name. It initially started out as a bash script that invoked Python for harder tasks (e.g., JSON parsing). It worked well enough, but I wanted to add functionality that would have been painful to implement in bash. Additionally, one of my goals was to make it as portable as possible, and requiring a Python distribution be available flew against that. That meant that I needed to distribute this as a binary instead of a script, so I set about finding a programming language to use. After surveying what was available, and determining what would be the best addition to my toolbox, I selected Rust.

Rust has a steep learning curve, but it has been fun to learn and immerse myself in. The community is great, and I'm excited to find more opportunities to use Rust in the future.

The rewrite took a while to accomplish, but when all was said and done, everything worked, and worked well. I was able to implement some snazzy new features as well as polish some rough edges. However, for how "simple" I felt the underlying program to be, it clocked in at 13 megabytes. That felt like a lot. So, I decided to see what could be done.

For those playing along at home, the starting binary size was: 13535712 bytes (12.9MB).

Phase 1: Building

The first thread I pulled was ensuring that the compiler would output code in such a way that it prioritized disk space over speed. I'm fine with the build taking slightly more time, as well as with the program being slightly slower - most commands incur network traffic, so a few extra milliseconds of CPU time are nothing in comparison. I found two simple additions to my Cargo.toml got me all I needed:

1. Optimization Level

The optimization level instructs the compiler as to what trade-offs it should make at compile time. One can opt for longer compile times and larger file size in exchange for faster run times, or instead request a smaller file size for longer compile times and slightly slower run times. To turn this knob, add the following to Cargo.toml:

[profile.release]
opt-level = "s"

Possible opt-level values include:

  • 0: no optimizations
  • 1: basic optimizations
  • 2: some optimizations
  • 3: all optimizations
  • "s": optimize for binary size
  • "z": optimize for binary size, but also turn off loop vectorization.

The docs encourage experimentation - I strongly suggest heeding that guidance. My initial guess was "z", which seemed like the most extreme option. After testing all possible values, it turned out "s" resulted in the smallest binary size.
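To compare the options, I rebuilt and measured after each change (paths assume cargo's default layout):

$ cargo build --release
$ wc -c target/release/git-req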

New binary size: 12832464 bytes (12.2MB).

2. Link Time Optimization (LTO)

Link Time Optimization is an optimization phase that the compiler carries out where it assesses the entire program (instead of an individual file) to determine if there are optimizations to be made (e.g., removing dead code). To enable it, add the following to Cargo.toml:

[profile.release]
lto = true
codegen-units = 1

This instructs the Rust compiler to apply a "full" set of link time optimizations when building for release. Possible lto values include:

  • false: LTO only across the codegen units of the local crate.
  • true or "fat": LTO across all crates in the dependency graph.
  • "thin": similar to "fat", but faster to run while offering similar gains to "fat".
  • "off": No LTO

The codegen-units setting limits how many pieces the compiler may split the crate into in order to parallelize the build. One of the great things Rust's borrow checker enables is fearless parallelization, which the compiler and its tooling exploit. By setting this value to 1, I ensured that code generation would not be split up, so the optimizer considers the crate as a whole, producing better-optimized code (at the expense of longer compilation times).
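With both changes in place, the release profile section of my Cargo.toml ended up looking like this:

[profile.release]
opt-level = "s"   # prefer smaller code over faster code
lto = true        # "fat" LTO across all crates in the dependency graph
codegen-units = 1 # a single codegen unit, so optimization sees the whole crate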

New binary size: 8338640 bytes (8.0MB).

62% of the original binary size. Nice, but why stop there?

Phase 2: Trimming the Fat

Now that we've made the compiler play nicely, where else can we get some gains? Since Rust ships with a fairly minimal standard library, developers rely on its robust package ecosystem for things like JSON serialization and HTTP requests. One issue with this is that external dependencies are the primary vector for bloat in any application. If only there were a way to measure such bloat in a Rust application...

... oh wait, there is: cargo-bloat.
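It installs and runs like any other cargo subcommand:

$ cargo install cargo-bloat
$ cargo bloat --release --crates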

Running it against git-req with the --release --crates flags outputs:

 File  .text     Size Crate
 4.4%  11.3% 360.4KiB reqwest
 3.8%   9.6% 306.0KiB std
 3.3%   8.4% 267.5KiB clap
 3.2%   8.1% 259.5KiB regex
 2.9%   7.5% 237.9KiB regex_syntax
 2.6%   6.5% 208.3KiB [Unknown]
 2.4%   6.1% 193.3KiB rustls
 1.3%   3.2% 103.1KiB goblin
 1.2%   3.2% 101.3KiB backtrace
 1.2%   3.2% 100.4KiB libgit2_sys
 1.1%   2.9%  93.3KiB yaml_rust
 1.1%   2.7%  86.6KiB git_req
 1.1%   2.7%  85.8KiB ring
 1.0%   2.7%  84.6KiB unicode_normalization
 0.9%   2.3%  74.0KiB object
 0.9%   2.2%  70.0KiB h2
 0.7%   1.7%  55.0KiB hyper
 0.5%   1.3%  41.6KiB http
 0.5%   1.3%  41.5KiB duct
 0.5%   1.3%  40.1KiB term
 4.4%  11.3% 361.4KiB And 79 more crates. Use -n N to show more.
39.1% 100.0%   3.1MiB .text section size, the file size is 8.0MiB

Wow - git-req (git_req) only accounts for 1.1% of the file, with a long tail of crates bringing up the rear. More interestingly, there are a few crates that dominate the file size. Let's tackle the big one: reqwest.

As someone who does a lot of Python development, when I first started with Rust I wanted something that mimicked the ergonomics of the popular Requests library. The phonetically-similar reqwest offers just that. Unfortunately, it was pretty big, and there appeared to be a lot of the library that I wasn't using, nor was I planning on using. With those two points, I recognized that this library was a prime candidate for replacement.

I started with this post that discussed the merits of various HTTP crates available to developers. Of importance to me were those that had: rustls support, serde support, minimal use of unsafe, and a sane API. Based on those criteria, ureq checked all the boxes.
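For a sense of ureq's ergonomics, here's roughly what a GET looks like (a sketch against the 1.x-era API; the URL, header, and token are placeholders, not git-req's actual code):

fn main() {
    // Build and issue the request; ureq is synchronous, so no async runtime is needed
    let resp = ureq::get("https://gitlab.example.com/api/v4/merge_requests")
        .set("PRIVATE-TOKEN", "placeholder-token")
        .call();
    if resp.ok() {
        // Pull the body out as a string and hand it off to serde for parsing
        let body = resp.into_string().unwrap();
        println!("{}", body);
    }
}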

Replacing reqwest with ureq was fairly straightforward. Let's see how this manifests in file size...

File  .text     Size Crate
 3.9%  10.5% 267.5KiB clap
 3.8%  10.1% 258.2KiB regex
 3.5%   9.3% 237.2KiB regex_syntax
 3.4%   9.1% 231.3KiB std
 3.1%   8.2% 208.2KiB [Unknown]
 2.7%   7.2% 182.4KiB rustls
 1.5%   4.0% 103.1KiB goblin
 1.5%   4.0% 101.4KiB backtrace
 1.5%   3.9% 100.4KiB libgit2_sys
 1.4%   3.8%  95.9KiB yaml_rust
 1.3%   3.6%  91.6KiB git_req
 1.2%   3.3%  84.6KiB unicode_normalization
 1.2%   3.2%  82.3KiB ureq
 1.1%   2.9%  74.0KiB object
 1.1%   2.8%  71.8KiB ring
 0.6%   1.6%  41.6KiB duct
 0.6%   1.6%  40.0KiB term
 0.6%   1.6%  39.9KiB url
 0.6%   1.5%  37.7KiB time
 0.4%   1.0%  26.7KiB rustc_demangle
 2.3%   6.2% 158.7KiB And 43 more crates. Use -n N to show more.
37.4% 100.0%   2.5MiB .text section size, the file size is 6.6MiB

Wow, the piece of functionality that was the biggest offender is now not even in the top 10.

Binary size: 6967472 bytes (6.7MB).

Phase 3: Things I'm Not Comfortable Doing Yet

In my research, I stumbled upon some optimizations that I wasn't comfortable applying yet - mostly because I want to spend some time to ensure there won't be any runtime implications for git-req.

1. Stripping

Most compiled binaries ship with symbols and other content needed to support all possible use-cases of the application. If you know how a binary will be used, you can apply post-processing to remove the unnecessary portions. The strip tool is one of the more popular utilities for doing this. Applying it to git-req yields substantial savings: 4718240 bytes (4.5MB)!
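The invocation itself is a one-liner against the release artifact (the path assumes cargo's default layout):

$ strip target/release/git-req

So why wouldn't I want to ship this immediately? One word: backtraces.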

When a release-grade Rust application panics, if the RUST_BACKTRACE environment variable is set to 1, the application will print out a backtrace before it dies. This is immensely useful for debugging, and, given the amount of variance in the environments this application is running, I feel that playing file size golf at the expense of supportability is out of the question... for now.

2. Features

Rust has the concept of "features", which allow developers to explicitly include or exclude parts of an application at compile time. In the case of git-req, the color-backtrace library is especially useful to me, the program's author, because scrutinizing backtraces is a regular part of my workflow. This should be much less of a concern for end-users, so whatever benefit they get from it is minimal, at best. I could hide the library behind a feature flag, enabling me to not ship it in release builds. However, given that it isn't even in the top 100 contributors to bloat in git-req, I consider it not worth the effort to implement and maintain.

In Closing

Question everything - gains are to be had. Check out git-req!

Elasticsearch Frustration: The Curious Query

Last year I was poking at an Elasticsearch cluster to review the indexed data and verify that things were healthy. It was all good until I stumbled upon this weird document:

  "_version": 1,
  "_index": "events",
  "_type": "event",
  "_id": "_query",
  "_score": 1,
  "_source": {
    "query": {
      "bool": {
        "must": [
            "range": {
              "date_created": {
                "gte": "2016-01-01"

It may not be immediately obvious what's going on in the above snippet. Instead of a valid event document, there's a document with a query as the contents. Additionally, the document ID appears to be _query instead of the expected GUID. The combination of these two irregularities makes it seem as if someone accidentally posted a query to the wrong endpoint. No problem, just delete the document, right?

DELETE /events/event/_query
ActionRequestValidationException[Validation Failed: 1: source is missing;]


I reached out to some of my coworkers to see if they could point me in the right direction, but all that I received was an (unhelpful) "I've seen this error before, and we solved it, but no one seems to remember how it was done." Great.

After much head-scratching, it turns out that, since the ID is _query, Elasticsearch's URL router thinks that I'm trying to issue a query and validates the HTTP action as such. Part of that validation is the requirement that queries have a body. Oops.
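In other words, the router would likely have accepted any body at all, even an empty one (untested on my part):

DELETE /events/event/_query
{}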

While passing an empty object should conceivably have worked, I wanted to play things extra safe in case ES was executing the query (this was production, after all (why are you looking at me like that?)), so I passed in a query object that constrained the results to only the problematic document.

DELETE /events/event/_query
{
    "query": {
        "match": {
            "_id": "_query"
        }
    }
}

... and the document was deleted successfully! Hopefully putting this to blog form will help others who encounter it in the future (including me).

A DevOps Workflow, Part 3: Deployment

This series is a longform version of an internal talk I gave at a former company. It wasn't recorded. It has been mirrored here for posterity.

Congratulations, your code looks good! Now all you need to do is put your application in front of your users to discover all the creative ways they'll break it. In order to do this, we'll have to create our instances, configure them, and deploy our code.

CloudFormation: Infrastructure Definition

Amazon Web Services (AWS) is a common target for HumanGeo deployments. Traditionally, when one creates resources on AWS, one uses the management console interface. While this is a good way to experiment with an environment, it cannot be automated, nor can it be managed under version control. Amazon, recognizing that the web console is insufficient for serious provisioning and scaling purposes, provides a series of tools for application deployment. The one that best fits our needs is CloudFormation.

CloudFormation allows you to define your infrastructure as a collection of JSON objects. For example, an EC2 instance can be declared with the following block:

"ElasticSearchInstance": {
    "Properties": {
        "EbsOptimized": "true",
        "ImageId": { "Ref": "ImageId" },
        "InstanceType": { "Ref": "EsInstanceType" },
        "KeyName": { "Ref": "KeyName" },
        "NetworkInterfaces": [{
            "DeviceIndex": "0",
            "GroupSet": [
                { "Ref": "ElasticsearchSecurityGroup" },
                { "Ref": "SSHSecurityGroup" }
            "PrivateIpAddress": "",
            "SubnetId": { "Ref": "Subnet" }
        "Tags": [{
            "Key": "Application",
            "Value": { "Ref": "AWS::StackId" }
        }, {
            "Key": "Class",
            "Value": "project-es"
        }, {
            "Key": "Name",
            "Value": "project-es01"
    "Type": "AWS::EC2::Instance"

If you're familiar with EC2, much of the above should make sense to you. Fields with Ref objects are cross-references to other resources in the CloudFormation stack - both siblings and parameters. Once written, the JSON document can be uploaded to AWS and then run. What's really cool here is that we can do this with an Ansible task!

Since we prefer to maintain a separation between our instance provisioning and cloud provisioning scripts, our CloudFormation tasks usually reside in a standalone playbook named amazon.yml.

- name: Apply the CloudFormation template
  cloudformation:
    stack_name: proj_name
    state: present
    region: "us-east-1"
    template: "files/project-cfn.json"
    template_parameters:
      KeyName: project-key
      EsInstanceType: "r3.large"
      ImageId: "ami-d05e75b8"
      Stack: "project-core"

This will not only upload the stack template to AWS, but it also will instantiate the stack with the provided parameters, which can be either constants or Ansible variables. The world is your oyster! Unlike with other AWS wrappers, CloudFormation is stateful, storing stack identifiers and only updating what needs to be updated.

After working with CloudFormation at scale, some warts really started getting to us - many of which stemmed from the fact that the templating language is JSON. Updating a template is painful. Its usage of strings instead of variables makes validation difficult, and there can be a significant amount of repetition if you have several similar resources. Thankfully, there exists a solution in the form of the awesome Python library troposphere. It provides a way to write a CloudFormation template in Python, with all the benefits of a full programming language. The tropospheric equivalent of our Elasticsearch stack is:

from troposphere import Ref, Tags, Template
from troposphere.ec2 import Instance, NetworkInterfaceProperty

t = Template()
# image_id, es_instance_type, key_name, subnet, the security groups, and
# stack_id are Parameter/pseudo-parameter objects defined earlier (elided)
t.add_resource(Instance(
    'ElasticSearchInstance',
    EbsOptimized=True,
    ImageId=Ref(image_id),
    InstanceType=Ref(es_instance_type),
    KeyName=Ref(key_name),
    NetworkInterfaces=[NetworkInterfaceProperty(
        DeviceIndex='0',
        GroupSet=[Ref(ssh_sg), Ref(es_sg)],
        PrivateIpAddress='',
        SubnetId=Ref(subnet),
    )],
    Tags=Tags(Application=Ref(stack_id), Name='project-es01', Class='project-es'),
))

Since we're using bare variable names, we can use static analysis tools like Pylint to validate the template. Additionally, now everything can be scripted! Want to make multiple instances with the same configuration? With JSON, you were stuck copy-pasting the same chunks of text multiple times. With troposphere, it's just a matter of wrapping the instance definition in a function and invoking it multiple times.

When you're ready to apply your template, simply execute the script to get CloudFormation-compatible JSON, and you're good to go.
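Concretely, "executing the script" is just printing the rendered template - to_json() is part of troposphere's Template API:

print(t.to_json())

Redirect the output to a file (e.g., python files/project-cfn.py > files/project-stack.json - the step our Makefile wraps) and point the cloudformation task from earlier at the result.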

Provisioning: Ansible

Ansible was discussed in the local development post, but here it is again! Assuming you were principled in writing your local development playbook, aiming at the AWS cloud is pretty straightforward.

First, you'll need to make Ansible aware of your cloud instances. Sure, you could manually define the host IPs in your inventory, but that means you'll have to manually update the mapping of hosts to IP addresses any time you need to recreate an instance. If only there were some way to dynamically target these machines...


There is! Ansible provides a means to use a dynamic inventory backed by AWS. Once you have your credentials configured, you can use any set of EC2 attributes to target your resources. Since we tend to provision clusters of machines in addition to standalone instances, it'd be nice to have a more general attribute selector than tag_Name_project_es01. This can be accomplished by applying our own ontology to our EC2 instances using tags. Notice the Class tag in the CloudFormation examples above. While every Elasticsearch instance we deploy will have a different Name tag, they'll all share a Class tag of project-es, which the dynamic inventory exposes as the group tag_Class_project_es. Get in the habit of using the project name as a prefix everywhere, since tags are global to your account.
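You can sanity-check what the dynamic inventory sees by invoking the discovery script directly (the ec2.py/ec2.ini pair lives in the inventory/ directory, as shown in the final layout at the end of this post):

$ ./inventory/ec2.py --list

This dumps the discovered hosts, grouped by their tags, as JSON.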

When using the dynamic inventory, plays look like this:

- name: Build Elasticsearch instances
  hosts: tag_Class_project_es
  gather_facts: yes
  remote_user: ubuntu
  become: yes
  become_method: sudo
  roles:
    - common
    - es

With that, ansible-playbook -i inventory/ production.yml --private-key /path/to/project.pem will target all EC2 instances with a Class tag of project-es and apply the common and es roles.

One other aspect of your cloud deployment is that it may require secrets. You may need to store passwords for an emailer or private keys for encrypted RabbitMQ channels. Under normal circumstances, these wouldn't (and shouldn't) be stored in version control. However, once again our good pal Ansible swings by to help us out. Enter: Ansible Vault.

For this example, we want to manage SMTP credentials using Ansible. First, let's create a secrets file: ansible-vault create vars/secrets.yml

You'll be prompted for a password. Remember this, as it's the only way you can decrypt the file. Now, let's add the variables to our file:

smtp_username: smtp_user@mydomain.biz
smtp_password: s3cur3!

Save and exit. Ansible Vault will automatically encrypt the contents. Now, you can reference the secrets in your playbook:

- name: Build Elasticsearch instances
  hosts: tag_Class_project_es
  gather_facts: yes
  remote_user: ubuntu
  become: yes
  become_method: sudo
  roles:
    - common
    - es
  vars_files:
    - vars/secrets.yml

Your roles don't need to know that the variables are encrypted; you can reference them just as you would any other variable. Decryption happens at runtime, and requires an additional argument be provided: ansible-playbook -i inventory/ production.yml --private-key /path/to/project.pem --ask-vault-pass. When run, Ansible will prompt you for the vault password. If it's correct, the file will be decrypted and the variables will be available for your tasks. If the password is incorrect, you're dropped back to the console and prompted to try again.
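For example, a task in a role can template those SMTP credentials straight into a config file (role and file names here are illustrative):

- name: Deploy SMTP configuration
  template:
    src: smtp.conf.j2  # the template references {{ smtp_username }} and {{ smtp_password }}
    dest: /etc/app/smtp.conf
    mode: "0600"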

Continuous Delivery: Jenkins

At this point, we're now able to fully provision our cloud deployments with Ansible. Most teams stop here, but we take it even further. It's immensely useful to developers and other stakeholders to see just how things are shaping up, and catch any undetected integration bugs. To achieve this, we turn once more to our friends, Ansible and Jenkins, to create a test environment for us.

First, in order to prevent bugs in test from affecting the production environment, we must instantiate a standalone test environment. It's up to you to determine just how substantial this abstraction will be. For our purposes, we'll be creating separate EC2 instances, but nothing else. This is where troposphere once more proves its worth. We can wrap the generation of the relevant stack components behind a function that returns either a set of "prod" resources, or "test" ones. For example:

def get_instance(*, is_production):
    suffix = '-test' if not is_production else ''
    return Instance(
        'ElasticSearchInstance',
        # edited for brevity
        PrivateIpAddress=f'10.0.0.{1 + (10 if is_production else 100)}',
        Tags=Tags(Name=f'project-es01{suffix}'),
    )


Once those resources are instantiated, it's a simple matter of tweaking your production playbook to support test-environment specific features - e.g., enable verbose logging and debug mode.

This gets you most of the way to a fully automated test environment. However, there is the small matter of what actually triggers the Ansible deployment. Jenkins comes to the rescue… again. Since we're deploying from our main develop branch, we can set up a downstream project that provisions the test instance as follows:

  1. Install the Ansible plugin for Jenkins
  2. Create a new project.
    Project creation prompt
  3. Have it run only when the project that periodically validates our central branch succeeds.
    Project trigger prompt
  4. When this project runs, it should invoke the Ansible playbook.
    Ansible playbook prompt

You will have to make your private key available to Jenkins. If this concerns you, you can provision an additional set of private keys solely for Jenkins so that, if you need to revoke its access in the future, you don't have to go through the hassle of creating new keys for the main account.

Now, when someone pushes to your development branch and the code is satisfactory (i.e., passing unit tests and lint checks), Jenkins will update the contents of your test server. Pure developer-driven architecture!

You can take this one step further and have Jenkins do something similar for your production environment, just triggered differently. This could utilize a GitFlow-like branching strategy where the master branch contains production-quality code, so updates there trigger a production deployment. Jenkins is pretty flexible, so, more often than not, you can combine triggers and preconditions to match exactly the conditions that should cause a deployment.


Congrats! Now that you've got a fully automated and tested workflow, you're done, right?

Nope. You've got this masterpiece, yes, but how do you know it's actually running? When one manually deploys code, one usually follows up by clicking around on the site, checking out system load, etc. Jenkins does nothing of the sort, and neither does your application. Without manually checking, how do you know that the last commit didn't have an overly verbose method that spammed your system logs to the point that you don't have any free space? Or how do you know that Supervisord wasn't misconfigured and is actually stopped? Much like with Schrödinger's cat, you don't. Time to start monitoring our stack.

Thankfully, commercial and free monitoring solutions can do this for you (except for Nagios - please stop). If your disk is full, send an email to the ops team. If you're starting to see some slow queries manifest themselves, nag the nerds in your #developers Slack channel. The tool we've moved to from Nagios is Sensu.

Sensu utilizes a client-server model, where the central server coordinates checks across a variety of nodes, and nodes run a client service that executes checks and collects metrics. Each client-side heartbeat sends data to the central server, which manages alerting states, etc. We also run Uchiwa, which provides a great dashboard for managing your monitoring environment.
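On the client side, that relationship boils down to a small JSON document - the kind our client.json.j2 template renders. A minimal sketch (the name, address, and subscriptions here are illustrative):

{
  "client": {
    "name": "project-collector01",
    "address": "10.0.0.50",
    "subscriptions": ["disk", "collector", "elasticsearch"]
  }
}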

Internally, we run a single Sensu server, to which all of our clients connect via an encrypted RabbitMQ channel. The encryption keys (and configuration) are deployed to the clients via a common sensu Ansible Vault-protected role. What varies from deployment to deployment are the checks that are carried out. Within our playbooks, we define various subscriptions for a set of machines…

- name: Monitor the collector instance
  hosts: tag_Class_project_collector
  remote_user: ubuntu
  become: yes
  become_method: sudo
    - role: sensu
      subscriptions: [ disk, collector, elasticsearch ]

We then customize the block of tasks that provision our checks, using when clauses that test the subscriptions collection:

- name: check elasticsearch cluster health
  sensu_check:
    name: elastic-health
    command: "{{ plugins_base_path }}/check-es-cluster-status.rb -h {{ hostvars[groups['tag_Name_project_es01'][0]]['ec2_private_ip_address'] }}"
    handlers: project-mailer
    standalone: yes
    interval: 300
  notify: restart sensu-client service
  when: "'elasticsearch' in subscriptions"

There are two awesome things going on here. The first is that Ansible ships with a Sensu check module. This is far nicer than maintaining our own templated JSON. Additionally, check out that hostvars statement! This taps into the power of the AWS dynamic inventory to allow for attribute lookups.

That's It

Whew! After all this, what does our final provisioning setup look like?

├── amazon.yml                 # Provision our AWS infrastructure
├── files                      # Top-level provisioning resources
│   ├── Makefile               # Compiles the troposphere script
│   ├── project-cfn.py         # troposphere script
│   ├── project-stack.json     # Compiled troposphere script
├── group_vars
│   └── all
│       └── vars.yml           # Global Ansible vars
├── inventory
│   ├── base
│   ├── ec2.ini
│   └── ec2.py                 # Ansible AWS discovery script
├── production-monitoring.yml  # Provision infrastructure with Sensu software
├── production.yml             # Provision infrastructure with software
└── roles
    ├── cloudformation         # Applies the CloudFormation template with our parameters
    │   └── tasks
    │       └── main.yml
    ├── es
    │   ├── defaults
    │   │   └── main.yml       # Default ES variables, can be overridden by plays
    │   ├── tasks
    │   │   ├── main.yml       # ES provisioning tasks
    │   └── templates
    │       ├── elasticsearch.yml.j2 # Jinja2 ES config template
    │       ├── kibana.yml.j2  # Kibana config template
    │       └── nginx.conf.j2  # NGINX config template
    └── sensu
        ├── defaults
        │   └── main.yml       # Ansible Vault encrypted Sensu certificates and passwords
        ├── files
        │   └── sudoers-sensu  # A config file to be applied on some systems. Not templated.
        ├── handlers
        │   └── main.yml       # Tasks that are triggered by changes within Sensu
        ├── tasks
        │   ├── checks.yml     # Tasks that are activated on each target based on subscriptions
        │   └── main.yml       # Baseline Sensu client installation
        ├── templates
        │   ├── client.json.j2      # Local Sensu client configuration template
        │   └── rabbitmq.json.j2    # RabbitMQ configuration for communicating with the Sensu server
        └── vars
            ├── main.yml            # Base Sensu client configuration
            └── sensu_rabbitmq.yml  # RabbitMQ configuration

While what's been presented in this series may seem imposing, it really isn't if you take it one step at a time. Everything builds off the previous work, and even if you only implement a subset of the solution presented, you still get to reap the rewards of a better-managed, more consistent stack.

A DevOps Workflow, Part 2: Continuous Integration

This series is a longform version of an internal talk I gave at a former company. It wasn't recorded. It has been mirrored here for posterity.

Look at you – all fancy with your consistent and easily-managed development environment. However, that's only half of the local development puzzle. Sure, now developers can no longer use "it works on my machine" as an excuse, but all that means is they know that something runs. Without validation, your artisanal ramen may be indistinguishable from burned spaghetti. This is where unit testing and continuous integration really prove their worth.

Unit Testing

You can't swing a dead cat without hitting a billion different Medium posts and Hacker News articles about the One True Way to do testing. Protip: there isn't one. I prefer Test Driven Development (TDD), as it helps me design for failure as I build features. Others prefer to write tests after the fact, because it forces them to take a second pass over a chunk of functionality. All that matters is that you have and maintain tests. If you're feeling really professional, you should make test coverage a requirement for any and all code that is intended for production. Regardless, code verification through linting and tests is a vital part of a good DevOps culture.

Getting Started

Writing a test is easy. For Python, a preferred language at HumanGeo, there exist many different test frameworks and tools. One great option is pytest. It allows you to associate test classes with your code without boilerplate. For example:

# my_code.py

COUNTRIES = {
    'DE': 'Germany',
    'DK': 'Denmark',
}

def get_country(country_code):
    return COUNTRIES.get(country_code)

# test_my_code.py

import my_code

def test_get_country(): # All tests start with 'test_'
    assert my_code.get_country('DE') == 'Germany'

When executed, the output will indicate success:

=============================== test session starts ===============================
platform darwin -- Python 3.6.0, pytest-3.0.5, py-1.4.32, pluggy-0.4.0
rootdir: /private/tmp, inifile:
collected 1 items

test_code.py .

========================== 1 passed in 0.01 seconds ===============================

or failure:

=============================== test session starts ===============================
platform darwin -- Python 3.6.0, pytest-3.0.5, py-1.4.32, pluggy-0.4.0
rootdir: /private/tmp, inifile:
collected 1 items

test_code.py F

==================================== FAILURES =====================================
________________________________ test_get_country _________________________________

    def test_get_country():
>       assert my_code.get_country('DE') == 'Germany'
E       assert 'Denmark' == 'Germany'
E         - Denmark
E         + Germany

test_code.py:6: AssertionError
============================ 1 failed in 0.03 seconds =============================

The inlining of failing code frames makes it easy to pinpoint the failing assertion, thus reducing unit testing headaches and boilerplate. For more on pytest, check out Jacob Kaplan-Moss's great introduction to the library.


Mocking

Mocking is a vital part of the testing equation. I don't mean making fun of your tests (that would be downright rude), but instead substituting fake (mock) objects in place of ones that serve as touchpoints to external code. This is nice because a good test shouldn't care about certain implementation details - it should just ensure that all cases are correctly handled. This especially holds true when relying on components outside of the purview of your application, such as web services, datastores, or the filesystem.

unittest.mock is my library of choice. To see how it's used, let's dive into an example:

# my_code.py

import os

def country_data_exists():
    return os.path.exists('/tmp/countries.json')

# test_my_code.py

from unittest.mock import patch

import my_code

@patch('my_code.os.path.exists')
def test_country_data_exists_success(path_exists_mock):
    path_exists_mock.return_value = True
    data_exists = my_code.country_data_exists()
    assert data_exists == True

@patch('my_code.os.path.exists')
def test_country_data_exists_failure(path_exists_mock):
    path_exists_mock.return_value = False
    data_exists = my_code.country_data_exists()
    assert data_exists == False

The patch function replaces the object at the provided path with a Mock object. These objects use Python magic to accept arbitrary calls and return defined values. Once the function that uses the mocked object has been invoked, we can inspect the mock and make various assertions about how it was called.
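For instance, building on the example above, you can assert that the mock was invoked with the expected path (assert_called_once_with is part of unittest.mock):

@patch('my_code.os.path.exists')
def test_country_data_exists_checks_path(path_exists_mock):
    path_exists_mock.return_value = True
    my_code.country_data_exists()
    path_exists_mock.assert_called_once_with('/tmp/countries.json')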

If you're using the Requests library (which you should always do), responses allows you to intercept specific requests and return custom data:

# my_code.py

import requests

class MediaDownloadError(Exception):
    pass

def get_flag_image(country_code):
    response = requests.get(f'http://example.com/flags/{country_code}.png')
    if not response.ok:
        raise MediaDownloadError(f'Error downloading the image: HTTP {response.status_code}:\n{response.text}')
    return response.content

# test_my_code.py

import pytest
import responses

import my_code

@responses.activate # Tell responses to intercept this function's requests
def test_get_flag_image_404():
    responses.add(responses.GET, # The HTTP method to intercept
                  'http://example.com/flags/de.png', # The URL to intercept
                  body="These aren't the flags you're looking for", # The mocked response body
                  status=404) # The mocked response status
    with pytest.raises(my_code.MediaDownloadError) as download_error:
        my_code.get_flag_image('de')
    assert '404' in str(download_error.value)

More information on mocking in Python can be found here.

Continuous Integration: Jenkins

Throughout this process, we've been trusting our developers when they say their code works locally without issue. The better approach here is to trust, but verify. From bad merges to broad refactors, a host of issues can manifest themselves during the last few phases of task development. A good DevOps culture accepts that these are inevitable and must be addressed through automation. The practice of validating the most recent version of your codebase is called Continuous Integration (CI).

For this, we will use Jenkins, a popular open source tool designed for flexible CI workflows. It has a large community that provides plugins for integration with common tools, such as GitLab, Python Virtual Environments, and various test runners.

Once Jenkins has access to your GitLab instance, it can:

  1. Poll for merge requests targeting the main development branch;
    Jenkins polling for MR screenshot
  2. Attempt a merge of the feature branch into the trunk;
    Jenkins MR detection screenshot
  3. Run your linter;
    Jenkins lint command screenshot
    • Define your acceptable lint severity thresholds
      Jenkins polling for MR screenshot
  4. Run unit tests; and
    Jenkins unit test screenshot
  5. If any of the above steps result in a failure state, Jenkins will comment on the MR. Otherwise, the build is good, and the MR is given the green light.

By integrating Jenkins CI with GitLab merge requests, low quality code can be detected and addressed before it enters your main branch. Newer versions of Jenkins even provide for defining your CI workflow as a file hosted within your repository. This way, your pipeline always corresponds to your codebase. GitLab has also launched a CI capability that may fit your needs.
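As a sketch of what such a repository-hosted definition can look like, here's a minimal declarative Jenkinsfile (stage names and shell commands are illustrative, not our exact setup):

pipeline {
    agent any
    stages {
        stage('Lint') {
            steps {
                sh 'pylint my_package'
            }
        }
        stage('Test') {
            steps {
                sh 'pytest'
            }
        }
    }
}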

This concludes the continuous integration portion of our dive into DevOps. In the next installment, we'll cover deployment!

A DevOps Workflow, Part 1: Local Development

This series is a longform version of an internal talk I gave at a former company. It wasn't recorded. It has been mirrored here for posterity.

How many times have you heard: "That's weird - it works on my machine?"

How often has a new employee's first task turned into a days-long effort, roping in several developers and revealing a surprising number of undocumented requirements, broken links and nondeterministic operations?

How often has a release gone south due to stovepiped knowledge, missing dependencies, and poor documentation?

In my experience, if you put a dollar into a swear jar whenever one of the above happened, plenty of people would be retiring early to spend time on their private islands. The fact that this situation exists is a huge problem.

What would an ideal solution look like? It should ensure consistency of environments, capture external dependencies, manage configuration, be self-documenting, allow for rapid iteration, and be as automated as possible. These features - the intersection of development and operations - make up the practice of DevOps. The solution shouldn't suck for your team - you need to maximize buy-in, and that can't be done when people need to fight container daemons and provisioning scripts every time they rebase to master.

In this series, I'll be walking through how we do DevOps at HumanGeo. Our strategy consists of three phases - local development, continuous integration, and deployment.

Please note that, while I mention specific technologies, I'm not stating that this is The One True Way™. We encourage our teams to experiment with new tools and methods, so this series presents a model that several teams have implemented with success, not official developer guidelines.

Development Environment: Vagrant

In order to best capture external dependencies, one should start with a blank slate. Thankfully, this doesn't mean a developer has to format her computer each time she takes on a new project. Depending on the project, it may be as simple as putting code into a new directory or creating a new virtual environment. However, given the scale of the problems we tackle at HumanGeo, we need to push even further and assemble specific combinations of databases, Elasticsearch nodes, Hadoop clusters, and other bespoke installations. To do so, we need to create sandboxed instances of the aforementioned tools; it's the only sane way to juggle multiple versions of a product when developing locally. There are plenty of fine solutions to this problem, Docker and Vagrant being two of the major players. There's not a perfect overlap between the two, but as they fit in our stack, they're near-equivalent. Since it provides a gentler learning curve, this series will cover Vagrant.

Vagrant provides a means for creating and managing portable development environments. Typically, these reside in VirtualBox virtual machines, although they have support for many different backend providers. What's neat is that, with a single Vagrantfile, you can provision and connect multiple VMs, while automatically syncing code changes made on the host machine (i.e., your computer) to the guest instance (i.e., the Vagrant box).

To get started with Vagrant, you must define your configuration in a Vagrantfile. Here's a sample:

Vagrant.configure("2") do |config|
  config.vm.box = "trusty64"
  config.vm.hostname = "webserver"
  config.vm.network :private_network, ip: ""

  config.vm.provider :virtualbox do |vb|
    vb.customize [
      "modifyvm", :id,
      "--memory", "256",

  config.vm.provision :shell, path: "bootstrap.sh"

This defines an Ubuntu 14.04 (Trusty Tahr) machine with a fixed private IP, 256 MB of RAM, and a bootstrap shell script that installs needed dependencies and applies software-level configuration. The Vagrantfile can be committed to version control alongside the bootstrap script and your application code, so the entire environment is captured in a single snapshot.

Launching the machine is done with a single command: vagrant up. Vagrant will download the trusty64 base image from a central repository, launch a new instance of it with the hardware and networking configuration we've defined, and then run the bootstrap file. The image download only occurs once per image, so future machine initializations will use the cached version. Machines can be stopped with vagrant halt and later re-launched with vagrant up. If you decide that you need to nuke your entire environment from orbit and start over (an immensely useful option), you can do so with vagrant destroy.
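To recap the lifecycle (run from the directory containing the Vagrantfile):

$ vagrant up       # create and boot the VM; the box download happens on the first run
$ vagrant halt     # shut the machine down
$ vagrant destroy  # nuke it from orbit and start over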

To manage these machines, one can connect via SSH just as one would a remote server. The vagrant ssh command will automatically log the user in using public key authentication. From there, a developer can experiment with configuration and other aspects of application development. All ports are exposed to the host machine, so, if a webserver is bound to port 5000 on the guest, it can be reached from your browser via the IP address we assigned to the instance in the Vagrantfile.

Unlike when working with a remote server, you don't need to run a terminal-based editor via SSH, or use rsync every time you save a file in order to make changes to the code on the virtual machine. Instead, the directory that contains your Vagrantfile is automatically mounted as /vagrant/ on the guest, with changes automatically synced back and forth. So, you can use whatever editor you want on the host, while executing code on the VM. Easy.

Provisioning: Ansible

Vagrant itself is only really focused on the orchestration of virtual machines; the configuration of the machines is outside of its purview. As such, it relies on a provisioner - an external tool or script that runs against newly created virtual machines in order to build upon the base image. For example, a provisioner would be responsible for taking a blank Ubuntu installation and installing PostgreSQL, initializing a database, and seeding the database with data.

The example Vagrantfile uses a simple shell script (bootstrap.sh) to handle provisioning. For simple cases, this may well be sufficient. However, if you're doing any serious development, you'll want to move to a more robust configuration management tool. Vagrant ships with support for several different ones, including our preferred tool - Ansible.

Ansible is great in many ways: its YAML-based configuration language is clean and logical, it operates over SSH, has a great community, emphasizes modularity, and doesn't require any custom software be present on your target computers (other than Python 2, with Python 3 support in the technical preview phase). With a little elbow grease, you can even make it idempotent, so there's nothing to fear if you reprovision an instance. Since these provisioning scripts live alongside your code, they can be included in your merge review process, and improve validation of your infrastructure.

Swapping out Vagrant's shell provisioner is extremely straightforward. Just change your provisioner to "ansible", point it at the Ansible configuration script (called a playbook), and you're set! The final provisioning block should now look like this:

config.vm.provision "ansible" do |ansible|
  ansible.playbook = "playbook.yml"
end


Ansible's basic building block is a task. Conceptually, a task is an atomic operation. These operations run the gamut from the basic (e.g., set the permissions on a file) to the complex (e.g., create a database table). Here's a sample task:

- name: Install database
  apt: name=mysql-server state=present

The equivalent shell command would be sudo apt-get install mysql-server. Nothing fancy, right?

- name: Deploy DB config
  copy: src=mysql.{{env_name}}.conf dest=/etc/mysql.conf mode=644

There are several things going on here. First, surprise! Ansible is awesome and speaks Jinja2. As such, it will interpolate the variable env_name into the string value for src, resulting in mysql.dev.conf if we were targeting a dev environment (env_name is a convention we use internally for this very purpose). Next, we're invoking the copy module. This doesn't actually copy a file from one remote location to another, it instead copies a local file to a remote destination. This saves you from having to scp the file to your target machine, then remote in to set a permission. It's also far easier to understand at a glance.

- name: Start mysqld
  service: name=mysql state=started enabled=yes

Finally, we ensure that the MySQL service is not only running, but is set to automatically start when the system does. This highlights one of the benefits of Ansible's module system - it masks (and handles) underlying implementation complexities. Whether or not the target machine is using SysV-style inits, Upstart, or systemd, the service module takes care of it for you.


Tasks can either reside in your playbook, or they can be organized into functional units called roles. Roles not only allow you to group tasks, but also bundle files, templates and other resources, providing for a clean separation of concerns. The tasks above can be placed in a file called tasks/main.yml, resulting in the following directory structure:

└── mysql                   # Tasks to be carried out on DB machines
    ├── files
    │   ├── mysql.dev.conf
    │   └── mysql.prod.conf
    └── tasks
        └── main.yml

Then, all you need to do is reference the role from within your playbook.


Playbooks are the entry points for Ansible. A playbook is composed of one or more plays, each of which possesses several parameters: one or more instances to target, variables to bundle, and a series of tasks (or roles) to execute.

- name: Configure the test environment app server
  hosts: app-test  # target group assumed for illustration; the original was elided
  vars:
    env_name: dev
    es_version: 2.1.0
  roles:
    - common
    - elasticsearch
    - mysql

What is evident in the above example is how Ansible roles help improve modularity and reusability. If I have to install MySQL on several different hosts (e.g., the test app server and the production app server), all I need to do is include the role. Ansible also maintains a central repository of roles (Ansible Galaxy) for developers to customize; most of the time you don't need to write any novel provisioning code.

To invoke the playbook, run ansible-playbook name-of-playbook.yml. If you're using Ansible with Vagrant, you should instead use vagrant provision, as Vagrant will handle the mapping of hosts and authentication. And, no matter how many times you provision, the machine state should remain the same.

This concludes the local development portion of our dive into DevOps. In the next installment, we'll cover continuous integration!

Switching to NeoVim (Part 2)

2016-11-03 Update: Now using the XDG-compliant configuration location.

Now that my initial NeoVim configuration is in place, I'm ready to get to work, right? Well, almost. In my excitement to make the leap from one editor to another, I neglected a portion of my attempt to keep Vim and NeoVim isolated - the local runtimepath (usually ~/.vim).

"But Aru, if NeoVim is basically Vim, shouldn't they be able to share the directory?" Usually, yes. But I anticipate, as I start to experiment with some of the new features and functionality of NeoVim, I might add plugins that I want to keep isolated from my Vim instance.

I'd like Vim to use ~/.vim and NeoVim to use ~/.config/nvim. Accomplishing this is simple - I must first detect whether or not I'm running NeoVim and base the root path on the outcome of that test:

if has('nvim')
    let s:editor_root=expand("~/.config/nvim")
else
    let s:editor_root=expand("~/.vim")
endif

With the root directory in a variable named editor_root, all that's left is a straightforward find and replace to convert all rtp references to the new syntax.

e.g., let &rtp = &rtp . ',.vim/bundle/vundle/' becomes let &rtp = &rtp . ',' . s:editor_root . '/bundle/vundle/'

With those replacements out of the way, things almost worked. Almost.

I use Vundle. I think it's pretty rad. My vimrc file is configured to automatically install it and download all of the defined plugins in the event of a fresh installation. The first time I launched NeoVim with the above changes didn't result in a fresh install - it was still reading the ~/.vim directory's plugins.

Perplexed, I dove into the Vundle code. Sure enough, it appears to default to installing plugins to $HOME/.vim if a directory isn't passed in to the script initialization function. It appears that I was reliant on this default behavior. Thankfully, this was easily solved by passing in my own bundle path:

call vundle#rc(s:editor_root . '/bundle')

And with that, my Vim and NeoVim instances were fully isolated.

Switching to NeoVim (Part 1)

2016-11-03 Update: Now using the XDG-compliant configuration location.

NeoVim is all the rage these days, and I can't help but be similarly enthused. Unlike other editors, which have varying degrees of crappiness with their Vim emulation, NeoVim is Vim.

If it's Vim, why bother switching? Much like all squares are rectangles, but not all rectangles are squares, NeoVim has a different set of aspirations and features. While vanilla Vim has the (noble and important) goal of supporting all possible platforms, that legacy has seemingly held it back from eliminating warts and adding new features. That's both a good thing and a bad thing. Good because it's stable, bad because it can lead to stagnation. A Vim contributor, annoyed with how the project was seemingly hamstrung by this legacy (with its accompanying byzantine code structure, project structure, and conventions), decided to take matters into his own hands and fork the editor.

The name of the fork? NeoVim.

It brings quite a lot to the table, and deserves a blog post or two in its own right. I'll leave the diffing as an exercise to the reader. I plan on writing about some of those differences as I do more with the fork's unique features.

So, what did I need to do to switch to NeoVim? I installed it. On Kubuntu, all I needed to do was add a PPA and install the neovim package (and its Python bindings for full plugin support).

$ sudo add-apt-repository -y ppa:neovim-ppa/unstable
$ sudo apt-get update && sudo apt-get install -y neovim
$ pip install --user neovim

Next up, configuration - one of Vim's great strengths. I dutifully keep a copy of my vimrc file on GitHub, and deploy it to any workstation I use for prolonged periods of time. It'd be nice if I could carry it over to NeoVim.

Surprise! It Just Works™. Remember, NeoVim is Vim, and as such it shares the same configuration syntax. Since I don't think I'm doing anything too crazy in my vimrc, it should be a drop-in operation.

$ mkdir -p ~/.config/nvim
$ ln -s ~/.vimrc ~/.config/nvim/init.vim

After that, it's a simple matter of invoking nvim from the command line. Everything loaded and worked for me from the first run!


This pleasant detour over, I went to resume lolologist development. However, when I activated my virtual environment and fired up nvim, I got a message stating:

No neovim module found for Python 2.7.8. Try installing it with 'pip install neovim' or see ':help nvim-python'.

Hm. That's strange. The relevant help docs, however, tell us all we need to know - the Python plugin needs to be discoverable in our path, and, since I'm using a virtual environment, a different Python instance is being used. This is easily addressed, as detailed in that help doc. However, since I use this vimrc file on two platforms (Linux & OS X), I need to be a little smarter about hardcoding paths to Python executables. I added this to my vimrc (it shouldn't negatively impact my Vim use, so it's fine to be in a shared configuration).

if has("unix")
  let s:uname = system("uname")
  let g:python_host_prog='/usr/bin/python'
  if s:uname == "Darwin\n"
    " OS X path, found via `which python`
    let g:python_host_prog='/usr/local/bin/python'
  endif
endif

Restarting NeoVim with that configuration block in place let it find my system Python and all associated plugins.

I'll keep this site updated with any new discoveries and NeoVim experiments! I'm quite eager to see how the client-server integrations pan out.

I've written more on this! Part 2.

Supercharging your Reddit API Access

This was originally posted to a former employer's blog. It has been mirrored here for posterity.

Here at HumanGeo we do all sorts of interesting things with sentiment analysis and entity resolution. Before you get to have fun with that, though, you need to bring data into the system. One data source we've recently started working with is reddit.

Compared to the walled gardens of Facebook and LinkedIn, reddit's API is as open as open can be; everything is nice and RESTful, rate limits are sane, the developers are open to enhancement requests, and one can do quite a bit without needing to authenticate. The most common objects we collect from reddit are submissions (posts) and comments. A submission can be either a link or a self post with a text body, and can have an arbitrary number of comments. Comments contain text, as well as references to parent nodes (if they're not root nodes in the comment tree). Pulling this data is as simple as GET http://www.reddit.com/r/washingtondc/new.json. (Protip: pretty much any view in reddit has a corresponding API endpoint that can be generated by appending .json to the URL.)
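To make that concrete, here's a minimal sketch (mine, not from the original post) that pulls that listing with nothing more than the requests library; the User-Agent string here is arbitrary, but reddit's API rules ask for a descriptive one:

import requests

# Any reddit view + '.json' is an API endpoint.
response = requests.get(
    'http://www.reddit.com/r/washingtondc/new.json',
    headers={'User-Agent': 'json endpoint demo'},
)

# Listings nest their items under data -> children -> data.
for child in response.json()['data']['children']:
    print(child['data']['title'])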

With little effort a developer could hack together a quick 'n dirty reddit scraper. However, as additional features appear and collection-breadth grows, the quick 'n dirty scraper becomes more dirty than quick, and you discover bugs (er, "features") that others utilizing the API have already encountered and possibly addressed. API wrappers help consolidate communal knowledge and best practices for the good of all. We considered several, and, being a Python shop, settled on PRAW (Python Reddit API Wrapper).

With PRAW, getting a list of posts is pretty easy:

import praw
r = praw.Reddit(user_agent='Hello world application.')
for post in r.get_subreddit('WashingtonDC') \
            .get_hot(limit=5):
    print(post)  # a Submission prints as '<score> :: <title>'

$ python parse_bot_2000.py
209 :: /r/WashingtonDC's Official Guide to the City!
29 :: What are some good/active meetups in DC that are easy to join?
17 :: So no more tailgating at the Nationals park anymore...
3 :: Anyone know of a juggling club in DC
2 :: The American Beer Classic: Yay or Nay?

The Problem

Now, let's try something a little more complicated. Our mission, if we choose to accept it, is to capture all incoming comments to a subreddit. For each comment we should collect the author's username, the URL for the submission, a permalink to the comment, as well as its body.

Here's what this looks like:

import praw
from datetime import datetime
r = praw.Reddit(user_agent='Subreddit Parse Bot 2000')

def save_comment(*args):
    print(datetime.now().time(), args)

for comment in r.get_subreddit('Python') \
               .get_comments():
    save_comment(comment.author.name, comment.submission.url,
                 comment.permalink, comment.body)

That was pretty easy. For the sake of this demo the save_comment function has been stubbed out to timestamp and print its arguments, but anything could go there.

If you run the snippet, you'll notice that the comments print out in fits and starts - a burst here, a pause there:

... comment ...
... comment ...
... comment ...
... comment ...
... comment ...
... comment ...

This process also seems to be taking longer than a normal HTTP request. As anyone working with large amounts of data should do, let's quantify this.

Using the wonderful, indispensable IPython:

In [1]: %%timeit
   ...: # baseline: one plain HTTP request to the comments listing
   ...: import requests
   ...: requests.get('http://www.reddit.com/r/python/comments.json')
1 loops, best of 3: 136 ms per loop

In [2]: %%timeit
   ...: import praw
   ...: r = praw.Reddit(user_agent='Subreddit Parse Bot 2000')
   ...: # save_comment as defined in the snippet above
   ...: for comment in r.get_subreddit('Python') \
   ...:                .get_comments():
   ...:     save_comment(comment.author.name, comment.submission.url,
   ...:                  comment.permalink, comment.body)
1 loops, best of 3: 6min 43s per loop

Ouch. While this difference in run-times is fine for a one-off, contrived example, such inefficiency is disastrous when dealing with large volumes of data. What could be causing this behavior?


According to the PRAW documentation,

Each API request to Reddit must be separated by a 2 second delay, as per the API rules. So to get the highest performance, the number of API calls must be kept as low as possible. PRAW uses lazy objects to only make API calls when/if the information is needed.

Perhaps we're doing something that is triggering additional HTTP requests. Such behavior would explain the intermittent printing of comments to the output stream. Let's verify this hypothesis.

To see the underlying requests, we can override PRAW's default log level:

from datetime import datetime
import logging
import praw

# Crank the root logger up to DEBUG to surface the HTTP requests
# that PRAW makes under the hood (via the requests library).
logging.basicConfig(level=logging.DEBUG)

r = praw.Reddit(user_agent='Subreddit Parse Bot 2000')

def save_comment(*args):
    print(datetime.now().time(), args)

for comment in r.get_subreddit('Python') \
               .get_comments():
    save_comment(comment.author.name, comment.submission.url,
                 comment.permalink, comment.body)

And what does the output look like?

DEBUG:requests.packages.urllib3.connectionpool:"PUT /check HTTP/1.1" 200 106
DEBUG:requests.packages.urllib3.connectionpool:"GET /comments/2ak14j.json HTTP/1.1" 200 888
.. comment ..
DEBUG:requests.packages.urllib3.connectionpool:"GET /comments/2aies0.json HTTP/1.1" 200 2889
.. comment ..
DEBUG:requests.packages.urllib3.connectionpool:"GET /comments/2aiier.json HTTP/1.1" 200 14809
.. comment ..
DEBUG:requests.packages.urllib3.connectionpool:"GET /comments/2ajam1.json HTTP/1.1" 200 1091
.. comment ..
.. comment ..
.. comment ..

Those intermittent requests for individual comments back up our claim. Now, let's see what's causing this.

Prettifying the response JSON yields the following schema (edited for brevity):

{
    ...
    'title':'Should I? why?',
    ...
}

Let's compare that to what we get when listing comments from the /python/comments endpoint:

{
    ...
    'link_title':'Django middleware that prints query stats to stderr after each request. pip install django-querycount',
    ...
    'body':'Try django-devserver for query counts, displaying the full queries, profiling, reporting memory usage, etc. \n\nhttps://pypi.python.org/pypi/django-devserver',
    ...
}

Now we're getting somewhere - there are fields in the per-comment response that aren't in the subreddit listing's. Of the four fields we're collecting, the submission URL and the comment permalink are not returned by the subreddit comments endpoint, so accessing those properties causes PRAW's lazy evaluation to fire off additional requests. If we can infer these values from the data we already have, we can avoid wasting time querying for each comment.
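Concretely, the cost model looks like this (my sketch, reusing the objects from the snippets above - not code from the original post):

comment = next(r.get_subreddit('Python').get_comments())

comment.author.name     # present in the listing response: free
comment.body            # also free
comment.submission.url  # lazy - fires off a GET /comments/<id>.json
comment.permalink       # lazy - also requires the full submission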

Doing Work

Submission URLs

Submission URLs are a combination of the subreddit name, the post ID, and title. We can easily get the post ID fragment:

# link_id is of the form 't3_2alblh' - the fragment after the
# underscore is the post's ID
post_id = comment.link_id.split('_')[1]

However, there is no title returned! Luckily, it turns out that it's not needed.

subreddit = 'python'
post_id = comment.link_id.split('_')[1]
url = 'http://reddit.com/r/{}/{}/' \
          .format(subreddit, post_id)
print(url) # A valid submission permalink!
# OUTPUT: http://reddit.com/r/python/2alblh/

Great! This also gets us most of the way to constructing the second URL we need - a permalink to the comment.

Comment Permalinks

Maybe we can append the comment's ID to the end of the submission URL?

sub_comments_url = 'http://reddit.com/r/python/comments/2alblh/'
# name is of the form 't1_ciwbo37' - the fragment after the
# underscore is the comment's ID
comment_id = comment.name.split('_')[1]
url = sub_comments_url + comment_id
# OUTPUT: http://reddit.com/r/python/comments/2alblh/ciwbo37

Sadly, that URL doesn't work, because reddit expects the submission's title to precede the comment ID. Referring to the subreddit comment's JSON object, we can see that the title is not returned. This is curious: why is the title important? reddit already has a globally unique ID for the post, and can display it just fine without the title (as demonstrated by the code sample immediately preceding this). Perhaps reddit wanted to make it easier for users to identify a link, and is simply parsing a forward-slash-delimited series of parameters. If that's the case, any placeholder should work in the title's position, and putting the comment ID in the slot after it should yield a valid URL. Let's give it a shot:

sub_comments_url = 'http://reddit.com/r/python/comments/2alblh/'
comment_id = comment.name.split('_')[1]
url = '{}-/{}'.format(sub_comments_url, comment_id)
# OUTPUT: http://reddit.com/r/python/comments/2alblh/-/ciwbo37

Following that URL takes us to the comment!
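Rolled together, the two tricks make a small helper - the name cheap_permalink is mine, not from the original post - that builds a comment permalink purely from fields the listing endpoint already returned:

def cheap_permalink(subreddit, comment):
    # link_id looks like 't3_2alblh'; name looks like 't1_ciwbo37'.
    post_id = comment.link_id.split('_')[1]
    comment_id = comment.name.split('_')[1]
    # '-' stands in for the title reddit would otherwise expect.
    return 'http://reddit.com/r/{}/comments/{}/-/{}'.format(
        subreddit, post_id, comment_id)

No extra network traffic required - every input comes straight from the listing response.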

Victory Lap

Let's see how much we've improved our execution time:

In [3]: %%timeit
   ...: import praw
   ...: r = praw.Reddit(user_agent='Subreddit Parse Bot 2000')
   ...: # save_comment as before; both URLs are now built locally
   ...: for comment in r.get_subreddit('Python') \
   ...:                .get_comments():
   ...:     post_id = comment.link_id.split('_')[1]
   ...:     comment_id = comment.name.split('_')[1]
   ...:     save_comment(comment.author.name,
   ...:                  'http://reddit.com/r/python/{}/'.format(post_id),
   ...:                  'http://reddit.com/r/python/comments/{}/-/{}'
   ...:                  .format(post_id, comment_id),
   ...:                  comment.body)
1 loops, best of 3: 3.57 s per loop

Wow! 403 seconds down to 3.57 - a roughly 113x speedup. Deploying this improvement to production not only increased the volume of data we were able to process, but also reduced the number of 504 errors we encountered during reddit's peak hours. Remember: always be on the lookout for ways to improve your stack. A bunch of small wins can add up to something significant.

Accessing Webcams with Python

So, I've been working on a tool named Lolologist that turns your commit messages into image macros. This was a great learning exercise, as it gave me insight into things I hadn't encountered before - namely:

  1. Packaging Python modules
  2. Hooking into Git events
  3. Using PIL (through Pillow) to manipulate images and text
  4. Accessing a webcam through Python on *nix-like platforms

I might speak to the first three at a later point, but the last item was the most interesting to me as someone who enjoys finding weird solutions to nontrivial problems.

Searching the internet for "python webcam linux" turns up two third-party tools: Pygame and OpenCV. Great! The only problem is that they come in at 10MB and 92MB, respectively. Wanting to keep the package light and free of unnecessary dependencies, I set out to find a simpler solution...
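For context, this is roughly what the Pygame route would have looked like - a sketch of mine, not Lolologist's code, assuming a V4L2 webcam such as /dev/video0 on Linux:

import pygame
import pygame.camera

# pygame.camera wraps V4L2 on Linux and must be initialized explicitly.
pygame.camera.init()
cameras = pygame.camera.list_cameras()  # e.g. ['/dev/video0']

cam = pygame.camera.Camera(cameras[0], (640, 480))
cam.start()
snapshot = cam.get_image()  # a pygame.Surface
pygame.image.save(snapshot, 'snapshot.png')
cam.stop()

Workable, but hauling in all of Pygame for a single snapshot is exactly the kind of dependency weight I wanted to avoid.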

Read more…