My actual blog is hosted on Dreamwidth, but the last few public posts are syndicated here.


Command-line tools are fast -- is Python faster?

A little while ago I encountered a blog post that has stuck in my head ever since -- Adam Drake's experiment with command-line tools for data processing. I've wanted to do some experimenting of my own ever since, and this evening I finally got around to it.

Instead of pulling out my shell-scripting skills, I turned to my preferred tool for doing any kind of data processing work -- Python. I started out by replicating Drake's original dataset, looking up the chess dataset he used on GitHub, pulling it down, and flattening it into a single directory. (For anyone else who wants to mess with it, I've uploaded it to AWS). Then the (highly unscientific) experimenting could begin!
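In case it's useful, here's roughly what that flattening step looks like -- a minimal sketch, assuming the repository has been cloned into a chess-data directory alongside the working directory; the paths are illustrative rather than the exact commands I ran:

import glob
import os
import shutil

# Copy every .pgn file from the cloned repository (assumed to be in
# ./chess-data) into a single flat data/ directory, prefixing each file
# with its parent directory name to avoid name collisions.
os.makedirs("data", exist_ok=True)
for path in glob.glob("chess-data/**/*.pgn", recursive=True):
    parent = os.path.basename(os.path.dirname(path))
    flat_name = "{}_{}".format(parent, os.path.basename(path))
    shutil.copy(path, os.path.join("data", flat_name))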

To establish an upper bound on how quickly we could process this data, I dumped the data to /dev/null:

$ time cat data/*.pgn > /dev/null

real    0m12.445s
user    0m0.066s
sys     0m2.426s

OK, that's faster than in Drake's two-year-old article, which is what we might expect given that I'm running this on a pretty beefy recent MacBook Pro. Now let's see how fast Drake's awk-based approach handles it:

$ time find . -type f -name '*.pgn' -print0 | xargs -0 -n4 -P4 awk '/Result/ { split($0, a, "-"); res = substr(a[1], length(a[1]), 1); if (res == 1) white++; if (res == 0) black++; if (res == 2) draw++ } END { print white+black+draw, white, black, draw }' | awk '{games += $1; white += $2; black += $3; draw += $4; } END { print games, white, black, draw }'
9877256 3762840 2853647 3260769

real    2m0.802s
user    7m45.198s
sys     0m9.259s

Whoa, that's a lot slower than I expected! Just in case something's changed here, I tried one of the earlier versions in Drake's post:

$ time find . -type f -name '*.pgn' -print0 | xargs -0 -n1 -P4 grep -F "Result" | awk '{ split($0, a, "-"); res = substr(a[1], length(a[1]), 1); if (res == 1) white++; if (res == 0) black++; if (res == 2) draw++;} END { print NR, white, black, draw }'
9878269 3762709 2853512 3260562

real    0m44.245s
user    2m53.862s
sys     0m7.786s

Well, that's more like it!

Now we can see where a little playing with Python gets us. I've heard a lot about the great dask library, which provides a convenient API for doing large-scale data processing, so I spent a few minutes looking through the documentation for dask.bag. I implemented the same algorithm as in Drake's post -- keeping only the result lines from all the .pgn files, then snipping the result value out of each line to decide whether the match was a win for White, a win for Black, or a draw. Here's the short script that resulted (also in this gist):

import dask.bag

# The result of the match is found in the line formatted '[Result "W-B"]',
# where W and B are each 1, 0, or 1/2: "1-0" means White won, "0-1" means
# Black won, and "1/2-1/2" is a draw.

def only_result_lines(l):
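    # PGN result lines look like '[Result "1-0"]', so characters 1-6 spell 'Result'.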
    return l[1:7] == 'Result'

def extract_result_value(l):
    '''For each result value, return a simple key that tells us whether White
    won, Black won, or it was a draw -- W, B, or D. We can also return the value
    '-' which means we couldn't figure out that information.'''
    value = l[9:-4]
    results = value.split('-')
    if len(results) != 2:
        return '-'

    w, b = results
    if w == '1':
        return 'W'
    elif b == '1':
        return 'B'
    elif w == b:
        return 'D'
    else:
        return '-'


b = dask.bag.from_filenames("data/*.pgn", encoding='iso8859-1', linesep='\r\n')
result_lines = b.filter(only_result_lines)
result_values = result_lines.map(extract_result_value)
win_loss = result_values.frequencies()
result = win_loss.compute()

print("win-loss ratio:")
total = 0
for line in result:
    print("{}: {}".format(*line))
    total += line[1]
print("total games:", total)

Now let's run it over our dataset:

$ time python win-loss.py
win-loss ratio:
W: 2336058
B: 1758120
D: 2060063
-: 845
total games: 6155086

real    0m27.046s
user    2m12.004s
sys     0m11.080s

More than a third faster than the shell-script approach -- and that's using dask's general-purpose Bag class, its slowest interface, in a language often scorned for being awfully slow. I'm impressed and pleased -- Python is my preferred toolkit, and dask will scale well beyond a single laptop. Step aside, showy shell scripts -- turns out plain old Python is the winner this time!

 

P.S. Full disclosure: although this blog post is nothing to do with my day job, I do work at Continuum Analytics, and heard about dask at work -- after all, Python-powered big data is our thing.




interludes in d.c.

A lot has happened since my last post. I've settled into my work at Social Tables, where we've been making substantial strides on a weekly basis. I've moved into my own apartment here in the District, and am slowly building up the furniture and so on to make it a home. I've met some interesting people, from a variety of walks of life. And I've reconnected with old friends who I hadn't seen in a long, long time. So on the whole things are good.


another day, another dollar

In my last post I speculated about whether it was worth sticking around in the UK, against the current of immigration laws and startup stress. And, as these things tend to do, the situation came to a head -- and I left. Ultimately, my staying in the UK and continuing on the startup wasn't going to work.

About a month ago (when my UK visa expired) I returned to the US, and while continuing to work on Esplorio remotely I started looking for a regular job. As of Monday I'll be joining the team at Social Tables in Washington, D.C. — a startup that's bringing 21st-century technology to the event-planning industry.

I'm ridiculously fortunate to be in the industry I'm in -- my specific skill set is in very high demand right now[1]. So the fact that I've been able to find employment so quickly isn't as much of a surprise as it might otherwise be.

What is surprising is that I've managed to find a great company so quickly. Social Tables has a lot of fascinating challenges, both technical and in disrupting an entrenched industry, that I'm keen to get started on. Most importantly, it has a compelling set of core values and a great team that I'll be working with.

Anyway, look forward to more posts from my new life at a startup on this side of the pond!

[1]: I'm not nearly as sought-after as the developer in the article, though.


a place to call my own

“You have a personal website?” I hear you ask. “Isn’t that so 1999?”

Sure, it's no longer the first place most people would find me on the Internet (not with Twitter, LinkedIn, GitHub, Facebook, About.me...), but I like having a corner of the Web that's mine. I've had it since I was at university, and when I was checking over my online presence in advance of applying to the startup accelerator, I was reminded that it was still the slightly rag-tag pile of PHP scripts and JavaScript that I wrote when I still had my web development training wheels on.

I've also been hankering to try doing something in Go (probably the most difficult-to-search-for project Google's ever done) -- I took the tour, but hadn't found anything to use it for. After a bout of particularly frustrating work on Esplorio's nascent iPhone app I decided to take a break and sit down to some Go code after all. So I rewrote my website using Revel, a Go web framework.

My personal site is pretty simple -- mostly templated content, like my about page or a display of images I composed years ago. But it does regularly pull the RSS feed of my blog from Dreamwidth, to display my latest public blog post on the home page and provide a blog page on the site itself. I was using LastRSS to do this in PHP, but couldn't find a Go library that would give me the same transparent caching ability. So I took the excuse to learn a bit more Go and wrote one.

I'm quite impressed with my first forays -- Go is easy to get started with, well-documented, and has comprehensive built-in tools. It's also got some very interesting concurrency primitives and is fairly fast. I'm definitely going to look for more things to do with Go.




deadline passed. application in. luck needed.

Yesterday was the deadline for applying to the world's pre-eminent startup accelerator: Y Combinator. We (Esplorio) have applied, and... well, my fingers are crossed. I think we've got a great vision and product, but I don't have a meaningful way of judging our chances. So I won't jinx it.

It was actually a very useful process to go through just to fill out and refine the application. Its constraints -- direct questions with a very small suggested word limit -- made us articulate our message and our vision concisely. Considering that my co-founder and I both have tendencies to ramble, this was very helpful.

Thanks of course to all our friends who read our constantly-evolving drafts and gave us feedback, and thanks also to Y Combinator's guidance and Dropbox's sample application for examples of what (and what not) to do. Just as importantly, thanks to fairly recent YC alumni Lollipuff, Seeing Interactive, and Standard Treasury for sharing their experiences and demystifying things a little bit.

In the interests of transparency and of helping other startups find their way, I'll be posting about how we do -- even if it's an analysis of what we did wrong. But let's hope it doesn't come to that ;-) Wish us luck!




on abandoned projects and updates

Many, many years ago (OK, just 7 or so, but that's a long time in software, OK?) I wrote a set of CGI scripts for a friend of mine as the basis for her idea for a funky e-zine.

Needless to say, those early scripts were messy and difficult to maintain (they were CGI scripts!), and so I spent a few months a year or two later upgrading the whole business to something slightly less appalling. (In the process, I wrote my own standards-compliant web application framework. It's terrible -- please don't use it.)

I haven't really updated the code in years -- I've done bits and pieces, fixed bugs here and there, added a feature or two, but that's about it. I even started a full-scale upgrade a year and a half ago, but that ran out of steam when I realized I didn't have time to shave enough yaks (well, mostly one big and hairy one).

As you can see from the sporadic activity on the project, I'd love to get back into it, finish the upgrades, implement a bunch of ideas, and make it a better and more modern site. But it's running pretty well as it is, apart from the occasional need for a server restart (thank you, WebFaction), and there are always so many shiny things to try out.

Is it worth finishing this particular job? How do I stay motivated to do it? Anything you'd like to see in the new version, when I get round to it?


"Porn Blocking" in the UK

Recently, here in the UK, Prime Minister David Cameron embarked on a deliberately well-publicized crusade against the evil that is online pornography. He'd like to do this by requiring UK ISPs to filter out a government-defined set of porn from the Internet their customers can access, unless those customers have deliberately opted out of filtering.

Of course, Cameron is appealing to the voters to think of the children. Think of the innocent children that will be spared the trauma of stumbling across porn on their family network connection. Think, more importantly, of the child pornography that will be kept at bay.

Mic Wright at the Telegraph explains very neatly how facile this is. A point he makes, but perhaps doesn't emphasize enough, is how this trivializes and devalues the serious work done by law enforcement officials and engineers around the world -- people who do understand the Internet -- to actually find and remove child abuse imagery and catch its creators. Another point he doesn't emphasize is that any filtering system designed so that the average (read: lowest-common-denominator) UK adult can opt out of it will be trivial for children with even a modicum of intelligence to get around.

(Update: Paul Bernal asks the right questions about this initiative)

So why am I actually bothering to post about it at all? I know this is political hand-waving, and there's a good chance Cameron won't actually follow through with more than a token effort at implementing and enforcing this. I know it's not much more than publicity designed to attract the "Daily Mail vote". And heck, there are clever responses to the problem from people actually on the front lines of this, who run ISPs.

No, what angers me is not specific to this latest round of posturing. What makes me angry and sad in equal measure is that these episodes are all too common: people attempting to regulate things they don't understand (and/or outright refuse to attempt to understand).

I've worked for years to get to a position where I know a little -- enough to know how little I really know -- about everything involved, and it's incredibly frustrating to see leaders reveling in their ignorance like this.

Any tips for how I can help get people (ideally, the right people) to listen?


CouchDB to Couchbase -- Ultimately, a Tale of JSON

At Esplorio we've worked with a number of databases (we're currently using four different ones, for various things), so we know a bit about moving data around. When we realized we needed to upgrade from vanilla CouchDB to Couchbase we had psyched ourselves up to deal with migration tools, multi-gigabyte binary dumps, and hours of painstakingly rebuilding indexes.

We were spoiled, you see, by CouchDB's fantastic replication technology. Zero-downtime upgrades of virtual hardware, keeping a live backup of our production data on a staging server -- these were all a couple of clicks' worth of effort in a CouchDB system. And though Couchbase is based on CouchDB technology, and has a replication system too, the two systems' replication protocols don't quite match up.

Well, OK, but we're still talking about two open-source JSON document stores. With a background of experience with CouchDB, I was confident I could get it to replicate to a custom endpoint that would do a little bit of data massaging to get our data into Couchbase without any interruptions.

This was the first draft of that -- a simple custom endpoint, written as a Python (2.x) web application that depends only on the standard library and the Couchbase Python driver. Because CouchDB's replication protocol is slightly more complex than I expected, and what little documentation there is is well out of date, this took much longer than expected. And it doesn't work properly -- even on the mere 200,000 documents in my local dev database it took over an hour to replicate all the documents, and it was still trying to transfer missing revisions after all that.
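To give a flavour of what's involved, here's a very rough sketch (not the actual draft) of the endpoints CouchDB's replicator expects a target to answer, as a bare-bones standard-library WSGI app. It glosses over status codes, checkpoint bookkeeping, and revision handling -- which is exactly where the real complexity lives -- and massage_into_couchbase is a placeholder:

import json
from wsgiref.simple_server import make_server

def massage_into_couchbase(doc):
    # Placeholder: reshape the CouchDB document and write it to Couchbase.
    pass

def application(environ, start_response):
    path = environ['PATH_INFO']
    length = int(environ.get('CONTENT_LENGTH') or 0)
    body = environ['wsgi.input'].read(length)
    status = '200 OK'

    if path.endswith('/_revs_diff'):
        # The replicator asks which revisions we're missing; claim we're
        # missing everything so that it sends every document over.
        wanted = json.loads(body.decode('utf-8'))
        result = {doc_id: {'missing': revs} for doc_id, revs in wanted.items()}
    elif path.endswith('/_bulk_docs'):
        # Batches of documents arrive here (with new_edits=false).
        for doc in json.loads(body.decode('utf-8')).get('docs', []):
            massage_into_couchbase(doc)
        result, status = [], '201 Created'
    elif '/_local/' in path:
        # Replication checkpoints. Accepting and then forgetting them is
        # enough for a one-shot run (a real target would 404 for unknown
        # checkpoints); the replicator just starts over each time.
        result = {'ok': True, 'id': path.rsplit('/', 1)[-1], 'rev': '0-1'}
        if environ['REQUEST_METHOD'] == 'PUT':
            status = '201 Created'
    elif path.endswith('/_ensure_full_commit'):
        result = {'ok': True}
    else:
        # Bare GET of the "database" itself: return minimal database info.
        result = {'db_name': 'couchbase-shim', 'update_seq': 0}

    payload = json.dumps(result).encode('utf-8')
    start_response(status, [('Content-Type', 'application/json'),
                            ('Content-Length', str(len(payload)))])
    return [payload]

if __name__ == '__main__':
    make_server('', 8000, application).serve_forever()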

Now, a proper zero-downtime data migration strategy should look like that first draft. Taking advantage of CouchDB's killer replication features is clearly the way to go: it makes the solution robust against failures on either end, and ensures complete, battle-tested transfer of database state. So if anyone finds my first draft useful, please feel free to improve it -- it's available under the BSD license.

Getting Esplorio's data migrated, however, was a more pressing priority. And as the Zen of Python points out, practicality beats purity. With a bit of confirmation from the helpful folks in #couchbase on Freenode IRC I relied on the simple fact that with two JSON document stores, one could simply slurp documents out of one and insert them into the other, and threw together this quick script.

It has no optimizations beyond retrieving documents in batches from CouchDB, and doesn't transfer revision histories or deleted-document "tombstones" at all. But we didn't care about that at the time, and maybe you don't either. What mattered to us was that it ripped through my dev database in a handful of minutes, and took just over fifteen minutes to migrate the million-and-a-half documents in production. So we were able to run it in the dead of night, swap over the code, and have the database switched in the time it takes to restart a web server process.
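For the curious, the shape of it is roughly this -- a minimal sketch rather than the script itself, assuming a CouchDB database at localhost:5984/mydb and leaving the actual Couchbase write as a placeholder, since the exact call depends on the driver version:

import json
try:
    from urllib.request import urlopen   # Python 3
    from urllib.parse import urlencode
except ImportError:
    from urllib import urlencode, urlopen  # Python 2

COUCHDB = 'http://localhost:5984/mydb'
BATCH = 1000

def store_in_couchbase(doc_id, doc):
    # Placeholder: insert the document with the Couchbase Python driver,
    # e.g. something along the lines of bucket.set(doc_id, doc).
    pass

def fetch_batch(startkey=None):
    '''Fetch the next batch of documents from CouchDB's _all_docs view.'''
    params = {'include_docs': 'true', 'limit': BATCH}
    if startkey is not None:
        params['startkey'] = json.dumps(startkey)
        params['skip'] = 1  # don't re-fetch the last doc of the previous batch
    url = '{}/_all_docs?{}'.format(COUCHDB, urlencode(params))
    return json.loads(urlopen(url).read().decode('utf-8'))['rows']

last_key = None
while True:
    rows = fetch_batch(last_key)
    if not rows:
        break
    for row in rows:
        doc = row['doc']
        doc.pop('_rev', None)  # Couchbase has no use for CouchDB's revision
        store_in_couchbase(row['id'], doc)
    last_key = rows[-1]['key']

Paging through _all_docs in batches keeps the number of round trips down, which is the only real concession to performance here, just as it was in the script we actually ran.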

My conclusions from all of this? Nothing spectacular -- just a validation of the old Agile principle that the solution should be the simplest thing that could possibly work.


