This was originally posted to a former employer's blog. It has been mirrored here for posterity.
Here at HumanGeo we do all sorts of interesting things with sentiment analysis and entity resolution. Before you get to have fun with that, though, you need to bring data into the system. One data source we've recently started working with is reddit.
Compared to the walled gardens of Facebook and LinkedIn, reddit's API is as
open as open can be; Everything is
nice and RESTful, rate limits are sane, the developers are open to enhancement
requests, and one can do quite a bit without needing to authenticate. The most
common objects we collect from reddit are submissions (posts) and comments. A
submission can either be a link, or a self post with a text body, and can have
an arbitrary number of comments. Comments contain text, as well as references
to parent nodes (if they're not root nodes in the comment tree). Pulling this
data is as simple as
(Protip: pretty much any view in reddit has a corresponding API endpoint that
can be generated by appending
.json to the URL.)
With little effort a developer could hack together a quick 'n dirty reddit
scraper. However, as additional features appear and collection-breadth grows,
the quick 'n dirty scraper becomes more dirty than quick, and you discover
bugsfeatures that others utilizing the API have already
encountered and possibly addressed. API wrappers help consolidate communal
knowledge and best
practices for the good of all. We considered
several, and, being a
Python shop, settled on PRAW (Python Reddit API
So, I've been working on a tool that turns your commit messages into image macros, named Lolologist. This was a great learning exercise because it gave me insight into things I haven't encountered before - namely:
- Packaging python modules
- Hooking into Git events
- Using PIL (through Pillow) to manipulate images and text
- Accessing a webcam through Python on *nix-like platforms
I might talk to the first three at a later point, but the latter was the most interesting to me as someone who enjoys finding weird solutions to nontrivial problems.
Perusing the internet results in two third-party tools for "python webcam linux": Pygame & OpenCV. Great! Only problem is these come in at 10MB and 92MB respectively. Wanting to keep the package light and free of unnecessary dependencies, I set out to find a simpler solution...
It was your run of the mill work day: I was trying to close out stories from a jam-packed sprint, while QA was doing their best to break things. Our test server was temporarily out of commission, so we had to run a development server in multithreaded mode.
One bug came across my inbox, flagging a feature that I had developed. Specifically a dropdown that displayed a list of saved user content wasn't updating. I attempted to replicate the issue on my machine, no-luck. Fired up my IE VM to test and, yet again, no luck. Weird.
I walked over to the tester's desk and asked her to walk me through the bug. Sure enough, upon saving an item, the contents of the dropdown did not update. Stumped, I returned to my machine and spoofed her account, wondering if there was an issue affecting her account within our mocked database backend (read: flat file). I was able to see the "missing" saved content. Now I was getting somewhere.
I walked back to her machine, opened the F12 Developer Tools, and watched the network traffic. The GET for that dynamically-populated list was resulting in a 304 status, and IE was using a cached version of the endpoint. Of course, I facepalmed as this was something I've not normally had to deal with as I usually use a cachebuster. However, for this new application, my team has been trying to do things as cleanly as possible. Wanting to keep our hack count down, I went back to my machine to see what our web framework could do.
Coming from a four-year stint in ASP.NET land, I was used to being able to set caching per endpoint via the
[OutputCache(NoStore = true, Duration = 0, VaryByParam = "None")] ActionAttribute. Sadly, there didn't appear to be something similar in Flask. Thankfully, it turned out to be fairly trivial to write one.
I found this post from Flask creator Armin Ronacher that set me on the right track, but the snippet provided didn't work for all the Internet Explorer versions we were targeting. However, I was able to whip together a decorator that did:
To invoke it, all you need to do is import the function and apply it to your endpoint.
from nocache import nocache @app.route('/my_endpoint') @nocache def my_endpoint(): return render_template(...)
I took this back to QA and was met with success. The downside of this manual implementation, however, is one needs to be religious in applying it. To this day we still stumble upon cases where a developer forgot to add the decorator to a JSON endpoint. Thankfully, code review processes are perfect for catching that sort of omission.
Flask is awesome. It's lightweight enough to disappear, but extensible enough to be able to get some niceties such as auth and ACLs without much effort. On top of that, the Werkzeug debugger is pretty handy (not as nice as wdb's, though).
Things were going swimmingly until our QA server went down. While the server may have stopped, development didn't, and we needed a way to get testable builds up and running for QA. One of my fellow developers quickly stood up an instance of our application on a lightly-used box that was nearing the end of its usefulness. To get around the fact that the machine wasn't outfitted for httpd action, the developer just backgrounded an instance of the Flask development server and established an SSH tunnel for QA to use. This was deemed acceptable by our QA team of one. We rejoiced and went back to work.
Days passed and our system engineer's backlog remained dangerously full, with the QA server remaining a low priority. Thankfully, aside from having to restart the process a few times, the Little Server That Could kept on chugging. But then disaster struck - we hired another tester! Technically her joining was a great thing; when you bring in fresh blood you get not only another person working the trenches but also an influx of fresh ideas. However, her arrival brought some unforeseen trouble - the QA server would get more than one user! This wouldn't normally be an issue, but at the moment our QA server was nothing more than a single threaded developer script running as an unmanaged process. Oops.
Sure enough the complaints from QA started bubbling forth: "Test is down!", "Test isn't loading!", "Test is unusable!" To preserve harmony in the office, and to hold to our end of the developer-tester bargain, we had to do something (the sysengineer was busy rebuilding a dead Solr cluster). We needed to buy some time, so I started investigating what we could do with what we already had - maybe there was a way to handle the load with the development server.
Flask uses the Werkzeug WSGI library to manage its server, so that would be a good place to start. Sure enough, the docs state that the
WSGIref wrapper accepts a parameter specifying whether or not the it should use a thread per request. Now, how to invoke this from within Flask?
This, much like with everything else in the framework, was surprisingly easy. Flask's app.run function accepts an
option collection which it passes on to Werkzeug's
run_simple. I updated our dev server startup script...
... and then threw it back over the wall to QA. I moseyed on over to test-land shortly thereafter to discover them chipping away at a backlog of tasks, with the webserver serving away as if nothing happened.
A few weeks weeks later, the new test server is almost ready for prime time, and our multithreaded Flask server (which should never be used for any sort of production purposes) is still holding down the fort.
We've been using the wonderful wdb WSGI middleware tool at work to aid in debugging our Flask app. The features it brings to the table have helped immensely with development, and it has quickly established itself as a integral part of my toolbox.
One minor issue I had with it, though, was the fact that one needs to manually start the
wdb.server.py script before launching a development server. If only I could start and stop wdb whenever Flask did. Some Google-fu introduced me to bash's signal trapping features. After some experimentation, I devised the following wrapper script.
This has been working quite reliably for us. Here's hoping it helps you!