Command-line tools are fast; is Python faster?

March 21, 2016 - python dev dask data

A little while ago, I encountered a blog post that has stuck in my head ever since: Adam Drake's experiment with command-line tools for data processing. I've wanted to do a little experimenting of my own since then, and I finally got around to it this evening.

Instead of pulling out my shell-scripting skills, I turned to my preferred tool for any kind of data processing work - Python. I started by replicating Drake's original dataset: I looked up the chess data he used on GitHub, pulled it down, and flattened it into a single directory. (For anyone else who wants to mess with it, I've uploaded it to AWS.) Then the (highly unscientific) experimenting could begin!
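In case it's useful, here's roughly what that flattening step looked like. This is a minimal sketch using hypothetical paths (chess-games/ for the cloned repository, data/ for the flattened copy), not the exact commands I ran:

import glob
import os
import shutil

# Copy every .pgn from the cloned repository into a single flat data/
# directory, prefixing a counter so files with the same name in different
# subdirectories don't clobber each other.
os.makedirs("data", exist_ok=True)
for i, pgn in enumerate(glob.glob("chess-games/**/*.pgn", recursive=True)):
    target = os.path.join("data", "{:05d}-{}".format(i, os.path.basename(pgn)))
    shutil.copy(pgn, target)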

To establish an upper bound on how quickly we could process this data, I dumped the data to /dev/null:

$ time cat data/*.pgn > /dev/null

real    0m12.445s
user    0m0.066s
sys     0m2.426s

OK, that's faster than in Drake's two-year-old article, which is what we might expect given that I'm running this on a pretty beefy recent MacBook Pro. Alright, now let's see how fast Drake's awk-based approach handles it:

$ time find . -type f -name '*.pgn' -print0 | xargs -0 -n4 -P4 awk '/Result/ { split($0, a, "-"); res = substr(a[1], length(a[1]), 1); if (res == 1) white++; if (res == 0) black++; if (res == 2) draw++ } END { print white+black+draw, white, black, draw }' | awk '{games += $1; white += $2; black += $3; draw += $4; } END { print games, white, black, draw }'

9877256 3762840 2853647 3260769

real    2m0.802s
user    7m45.198s
sys     0m9.259s

Whoa, that's a lot slower than I expected! Just in case something's changed here, I tried one of the earlier versions in Drake's post:

$ time find . -type f -name '*.pgn' -print0 | xargs -0 -n1 -P4 grep -F "Result" | awk '{ split($0, a, "-"); res = substr(a[1], length(a[1]), 1); if (res == 1) white++; if (res == 0) black++; if (res == 2) draw++;} END { print NR, white, black, draw }'

9878269 3762709 2853512 3260562

real    0m44.245s
user    2m53.862s
sys     0m7.786s

Well, that's more like it!

Now we can see where a little playing with Python gets us. I've heard a lot about the great dask library, which provides a convenient API for doing large-scale data processing[1], so I spent a few minutes looking through the documentation for dask.bag. I implemented the same algorithm as in Drake's post -- keep only the Result lines from all the .pgn files, then snip the value out of each one to decide whether the game was a win for White, a win for Black, or a draw. Here's the short script that resulted (also in this gist):

import dask.bag

# The result of the match is found in the line formatted '[Result "W-B"]'
# where W and B are White's and Black's scores, each 1, 0, or 1/2.

def only_result_lines(l):
    return l[1:7] == 'Result'

def extract_result_value(l):
    '''For each result value, return a simple key that tells us whether White
    won, Black won, or it was a draw -- W, B, or D. We can also return the value
    '-' which means we couldn't figure out that information.'''
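    # Slice out the score between the quotes (e.g. '1-0' from '[Result "1-0"]'),
    # dropping the leading '[Result "' and the trailing '"]' plus line separator.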
    value = l[9:-4]
    results = value.split('-')
    if len(results) != 2:
        return '-'

    w, b = results
    if w == '1':
        return 'W'
    elif b == '1':
        return 'B'
    elif w == b:
        return 'D'
    else:
        return '-'


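# Build the pipeline lazily; nothing is read or computed until .compute() is
# called, at which point dask processes the files in parallel.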
b = dask.bag.from_filenames("data/*.pgn", encoding='iso8859-1', linesep='\r\n')
result_lines = b.filter(only_result_lines)
result_values = result_lines.map(extract_result_value)
win_loss = result_values.frequencies()
result = win_loss.compute()

print("win-loss ratio:")
total = 0
for line in result:
    print("{}: {}".format(*line))
    total += line[1]
print("total games:", total)

Now let's run it over our dataset:

$ time python win-loss.py
win-loss ratio:
W: 2336058
B: 1758120
D: 2060063
-: 845
total games: 6155086

real    0m27.046s
user    2m12.004s
sys     0m11.080s

More than a third faster than the shell-script approach -- and that's using Bag, dask's general-purpose (and slowest) collection, in a language often scorned for being awfully slow. I'm impressed and pleased: Python is my preferred toolkit, and dask will scale quite a ways. Step aside, showy shell scripts -- it turns out plain old Python is the winner this time!

[1]: Full disclosure: although this blog post has nothing to do with my day job, I do work with Continuum Analytics, and heard about dask through that work -- after all, Python-powered big data is our thing.