Command-line tools are fast; is Python faster?
March 21, 2016
A little while ago, I encountered a blog post that's been stuck in my head ever since -- Adam Drake's experiment with command-line tools for data processing. I've wanted to do a little experimenting of my own, and I finally got around to it this evening.
Instead of pulling out my shell-scripting skills, I turned to my preferred tool for doing any kind of data processing work - Python. I started out by replicating Drake's original dataset, looking up the chess dataset he used on GitHub, pulling it down, and flattening it into a single directory. (For anyone else who wants to mess with it, I've uploaded it to AWS). Then the (highly unscientific) experimenting could begin!
To establish an upper bound on how quickly we could process this data, I dumped the data to /dev/null:
OK, that's faster than in Drake's two-year-old article, which is what we might
expect given that I'm running this on a pretty beefy recent MacBook Pro.
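For anyone who would rather take the same kind of baseline from Python, here is a minimal sketch of the idea: read every file and throw the bytes away, so the timing reflects raw I/O and nothing else. The function name `drain_files`, the `*.pgn` pattern, and the 1 MiB chunk size are my own assumptions for illustration, not anything from Drake's post.

```python
import time
from pathlib import Path


def drain_files(directory, pattern="*.pgn", chunk_size=1 << 20):
    """Read every matching file and discard the contents.

    Returns (total_bytes, elapsed_seconds). This is a rough stand-in for
    dumping the files to /dev/null: no parsing, just reading.
    """
    total_bytes = 0
    start = time.perf_counter()
    for path in sorted(Path(directory).glob(pattern)):
        with open(path, "rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    return total_bytes, elapsed


if __name__ == "__main__":
    # Hypothetical location of the flattened dataset; adjust as needed.
    size, seconds = drain_files(".")
    print(f"read {size} bytes in {seconds:.2f}s")
```

Keep in mind that a second run will usually be served from the operating system's file cache, so repeated timings tend to look better than a cold read.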
Alright, now let's see how fast Drake's awk-based approach handles it:
Whoa, that's a lot slower than I expected! Just in case something's changed here, I tried one of the earlier versions in Drake's post:
Well, that's more like it!
Now we can see where a little playing with Python gets us. I've heard a lot
about the great dask library, which provides a convenient API for doing
large-scale data processing [1], so I spent a few minutes looking through the
documentation for dask.bag. I implemented the same algorithm as in Drake's
post -- filtering out only the result lines from all the .pgn files, then
simply snipping the values out of each one to decide whether the match
represents a win, loss, or draw. Here's the short script that resulted
(also in this gist):
[Script listing; only its comments are recoverable:]
# The result of the match is found in the line formatted '[Result "W-B"]'
# where W and B are 1, 0, or 1/2 representing win, loss, or draw.
'''For each result value, return a simple key that tells us whether White
won, Black won, or it was a draw -- W, B, or D. We can also return the value
'-' which means we couldn't figure out that information.'''
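As a rough illustration of the algorithm described above (keep only the result lines, then classify each one as W, B, D, or '-'), here is a minimal sketch. It is not the script from the gist: the function names are hypothetical, and the dask part assumes a reasonably recent dask in which `dask.bag.read_text`, `Bag.filter`, `Bag.map`, and `Bag.frequencies` are available. The dask import is kept inside its function so that the pure-Python helpers can be tried without dask installed.

```python
from collections import Counter

# Result lines look like: [Result "1-0"], [Result "0-1"], [Result "1/2-1/2"]
RESULT_PREFIX = '[Result "'


def is_result_line(line):
    """True for lines that carry the outcome of a game."""
    return line.startswith(RESULT_PREFIX)


def classify_result(line):
    """Map a result line to 'W', 'B', 'D', or '-' (unknown)."""
    value = line.strip()[len(RESULT_PREFIX):].split('"', 1)[0]
    white, sep, black = value.partition("-")
    if not sep:
        return "-"  # e.g. '*' for an unfinished game
    if white == "1" and black == "0":
        return "W"
    if white == "0" and black == "1":
        return "B"
    if white == "1/2" and black == "1/2":
        return "D"
    return "-"


def count_results(lines):
    """Pure-Python tally over any iterable of lines."""
    return Counter(classify_result(l) for l in lines if is_result_line(l))


def count_results_with_dask(pattern):
    """Same tally using dask.bag over files matching a glob pattern."""
    import dask.bag as db  # assumption: dask is installed

    lines = db.read_text(pattern)
    pairs = lines.filter(is_result_line).map(classify_result).frequencies()
    return dict(pairs.compute())
```

A call such as `count_results_with_dask("*.pgn")` would return a dict keyed by W, B, D, and '-'. If the files are not valid UTF-8, the `encoding` or `errors` arguments of `read_text` may need adjusting.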
Now let's run it over our dataset:
More than a third faster than the shell script approach -- and this is using the
Bag general-purpose processing class, the slowest approach, in a language
often scorned for being awfully slow. I'm impressed and pleased -- Python is my
preferred toolkit, and dask will scale quite a ways.
Step aside, showy shell scripts -- turns out plain old Python is the winner this
time!
[1]: Full disclosure: although this blog post has nothing to do with my day
job, I do work with Continuum Analytics, and heard about dask through that
work -- after all, Python-powered big data is our thing.