Did a presentation on my new Python library zoomascii at the San Diego Python Meetup. Check out the slides.
Python Meetup Presentation – Disk Spooling!
I gave a lightning talk at the SD Python Users Group meetup last night about adding disk-spooling to a Django query using generators and multi-processing. Check out the slides here:
Remote Fix for a Busted Keyboard
When I lived in New York, I was a volunteer for Big Brothers Big Sisters. One of the ways I helped out my little brother was by helping him keep his computer running – a Windows 7 PC that I put together for him. This has gotten harder now that I live on the west coast, but I still want to help him if I can. Typically when he has a problem I remote in with TeamViewer and fix it.
A week ago he wrote to me telling me his keyboard was broken. I figured he’d spilled something on it so I advised him to try another keyboard – I knew he had a spare. He told me that one was the same, and went into more detail – neither keyboard was completely broken, the Windows key and media keys worked, but he couldn’t type any letters or numbers.
After several long sessions of debugging via TeamViewer I had the following symptoms:
- Unable to type letters or numbers, but the keyboard otherwise worked.
- Drivers were fine, devices appeared correct in Device Manager.
- Switching to a PS/2 keyboard didn’t help.
- The problem persisted in Safe Mode.
- The visual keyboard worked and I could type when connected through TeamViewer.
I was about ready to give up when I thought to press him a little about what he was doing when the keyboard stopped working. Turns out he was trying to hack an online game – he hadn’t told me out of embarrassment I imagine. Now I had a pretty good idea what had probably happened – he’d run a downloaded hack that contained malicious code. I ran a few malware scanners and they didn’t find anything.
I did, however, have the hack itself, so out of complete desperation I opened it up in Emacs hexl-mode to take a look. It was a compiled Windows binary but there it was, hidden in among the compiled code:
System\CurrentControlSet\Control\Keyboard Layout
That looked like a registry key and sure enough it was! I loaded up regedit, found that key and deleted it, rebooted and he was typing again!
I’m writing this blog post for a couple reasons – 1) I’m super proud of figuring this out and 2) when someone else has a similar problem maybe Google will serve up this post and they’ll be saved a lot of trouble. I searched a lot and never saw any mention of this registry key!
The Upload System of My Dreams
Sometimes I have a great job, a job where I get to do exactly what I want in exactly the way I want to do it. And what is it, you might ask, that I want to do? I want to build the perfect data upload system. Why? A few reasons:
- It’s not an easy problem. Data upload is complicated by the fact that the most common format we support (CSV) isn’t even close to standardized. Line-endings, character sets and quoting are all likely to change. Since we can’t enforce much uniformity on the data we accept, our system has to be very flexible. Add to that the fact that supporting large volumes of data is also required and it’s got plenty of challenges.
- Doing it badly is acutely painful for our clients. And when our clients feel pain they naturally pass it along to us. I’d say at least 10% of our support requests have been related in some way to our upload system.
- It’s a great project to do test-driven development (TDD), which is my favorite way to work. Since data uploads are so deterministic it’s very easy to work on the problem in a straightforward TDD manner.
- It requires a high level of parallelism to run quickly. Parallel programming is a fun challenge in itself and the payoff is great when it works.
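To make the first point concrete, here is a minimal sketch of reading a CSV upload whose encoding, delimiter and quoting aren't known up front. It leans on csv.Sniffer from Python's standard library. The function names, the list of candidate encodings and the 4 KB sample size are assumptions for illustration, not what our upload system actually does.

```python
import csv
import io


def decode(raw, encodings=("utf-8", "latin-1")):
    """Try each candidate encoding in order and return the decoded text."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the upload")


def read_rows(raw):
    """Guess the CSV dialect from a sample, then parse every row with it."""
    text = decode(raw)
    # Assumption: the first 4 KB is representative of the whole file.
    dialect = csv.Sniffer().sniff(text[:4096], delimiters=",;\t|")
    # newline="" leaves line-ending handling (\n, \r\n, \r) to the csv module.
    return list(csv.reader(io.StringIO(text, newline=""), dialect))
```

Sniffing is a heuristic and can guess wrong on small or odd samples, so a real system would want a way for the user to override the guess.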
I’m pretty happy with the way the project turned out. I made a screencast showing off some of the new front-end features:
ActionKit Upload Improvements on Vimeo.
It’s also my first attempt at screencasting. Eesh, my voice.
The frontend uses jQuery with jQote to get updates from the running upload job and update the status display. The progress bar is canvas-based and uses RGraph.
The backend code uses Celery to queue upload jobs from our Django front-end. The jobs themselves use a multiprocessing-based job pool system which we first developed for our mail sender and which has since been abstracted out as a reusable component by my co-worker Randall Farmer (could be worth releasing on PyPI at some point, it has some unique features). Stopping an upload early works by sending a message with Carrot, the underlying AMQP client used by Celery to talk to RabbitMQ.
I tried a new approach this time with regard to the way errors and status are handled by the workers. Instead of trying to report info up to the parent, each worker writes status and errors directly to the database. The parent can query the database to get updates on the workers. This will hopefully help avoid some of the deadlocking problems inherent in systems that rely on bi-directional communication between parents and workers. It also made building the front-end easier: the status reported by the workers was easy to turn into JSON and send up to the client for display.
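The shape of that idea can be sketched in a few lines. This is not the ActionKit code: it assumes SQLite instead of our real database, uses threads as stand-ins for the multiprocessing workers to keep the example short, and the table and function names are invented. Each worker only ever writes to the database, and the parent only ever reads from it.

```python
import json
import os
import sqlite3
import tempfile
import threading

SCHEMA = """
CREATE TABLE worker_status (
    worker_id INTEGER PRIMARY KEY,
    rows_done INTEGER NOT NULL,
    errors    INTEGER NOT NULL,
    finished  INTEGER NOT NULL
)
"""


def worker(db_path, worker_id, rows):
    """Process rows, writing status to the database instead of the parent."""
    conn = sqlite3.connect(db_path, timeout=30)
    done = errors = 0
    for row in rows:
        if row.strip():
            done += 1
        else:
            errors += 1  # toy rule: a blank row counts as an error
    with conn:  # commits the insert on success
        conn.execute(
            "INSERT INTO worker_status VALUES (?, ?, ?, 1)",
            (worker_id, done, errors),
        )
    conn.close()


def run_upload(chunks):
    """Start one worker per chunk, then read their status back with a query."""
    fd, db_path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        conn = sqlite3.connect(db_path, timeout=30)
        with conn:
            conn.execute(SCHEMA)
        threads = [
            threading.Thread(target=worker, args=(db_path, i, chunk))
            for i, chunk in enumerate(chunks)
        ]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        # The parent's only view of the workers is this query.
        cursor = conn.execute(
            "SELECT worker_id, rows_done, errors, finished "
            "FROM worker_status ORDER BY worker_id"
        )
        names = [column[0] for column in cursor.description]
        status = [dict(zip(names, row)) for row in cursor.fetchall()]
        conn.close()
        return status
    finally:
        os.remove(db_path)


# The query result is already shaped for the front-end.
print(json.dumps(run_upload([["a", "b"], ["c", "", "d"]])))
```

A fuller version would have workers update their row as they go, with the parent polling while they run, but the one-way flow of information is the same.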
All in all, a fun project. I hope it works as well in practice as it has during development.
What’s the best data upload system you’ve used? Written? Lived through?
Python Makes Me Say God Damn
I’ve been coding in Python now for almost a year. Mostly it’s been great fun, but a few things continue to annoy me. Most of them are just a matter of taste, some related to my background as a Perl coder no doubt. In any case, I’m hoping writing about them will help me be mindful and might bring up some useful workarounds in response. And ranting is really its own reward.
NOTE: before you tell me I’m an idiot and Python is awesome, allow me to remind you that I like coding in Python and I’m quite sure I could produce a list like this for any language I’ve worked in. So, yes, Python is awesome and Python sucks. I’m sorry if I just blew your mind.
In no particular order:
The range() function is non-inclusive of the second term. If I say to you, give me the numbers from 1 to 5 are you going to say “1, 2, 3, 4”? Of course not. Even worse, Perl has trained me to expect inclusive ranges with the .. operator. So I constantly stub my toe on Python’s range(). It can lead to nasty bugs – sometimes counting one less than expected is obvious, and sometimes it isn’t!
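One possible workaround is a tiny wrapper. The name inclusive_range is made up for this sketch (it is not a built-in), and the code is written for Python 3, where range() returns a lazy object instead of a list.

```python
def inclusive_range(start, stop, step=1):
    """Like range(), but stop is included when the step lands on it."""
    if step == 0:
        raise ValueError("step must not be zero")
    # Push the end point one past stop in the direction of travel.
    return range(start, stop + (1 if step > 0 else -1), step)


print(list(range(1, 5)))            # the built-in stops before 5
print(list(inclusive_range(1, 5)))  # the wrapper includes 5
```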
I never know where to look for a method. Let’s say I’m having a hard time figuring out how to use some.awesome.Module’s foobar() method. I’ve checked the doc-string and it’s not helping, I need to use the source. At this point I’ve pretty much stopped trying to guess where it could be – I go straight to ack and start searching the tree under some/. It could be in some.py, some/__init__.py, some/awesome.py, some/awesome/__init__.py, etc. And if it’s a method with a common name – save(), for example – it can be very hard to figure out which save() is actually the one I’m looking for.
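One thing that can help is asking Python itself. The standard inspect module can report the file and line where a function, class or method was defined. The sketch below uses json.dumps as a stand-in for some.awesome.Module's foobar(), and where_is is a made-up helper name. It only works for objects written in Python; built-ins implemented in C have no source file to report.

```python
import inspect
import json


def where_is(obj):
    """Return (path, line number) of the Python source that defines obj."""
    path = inspect.getsourcefile(obj)
    _, lineno = inspect.getsourcelines(obj)
    return path, lineno


path, lineno = where_is(json.dumps)
print("%s:%d" % (path, lineno))
```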
Python is happy to do nothing. Consider this code:
[sourcecode language="python"]
foo = list(range(1, 100))
while len(foo) > 10:
    foo.pop
[/sourcecode]
It’s supposed to reduce foo until it’s got 10 elements (there are easier ways to do this, of course – not the point). Instead it loops forever, because foo.pop is just a reference to the pop method. To call it you must include parens:
[sourcecode language="python"]
foo = list(range(1, 100))
while len(foo) > 10:
    foo.pop()  # the parens are what actually call the method
[/sourcecode]
It makes a lot of sense to me that this is the way it is. But couldn’t Python emit a warning when I do this? The code does absolutely nothing useful – the method reference is returned and immediately discarded. Perl does handle this case, emitting a warning about the use of a scalar in void context. I never thought I’d miss that warning, but now I do!
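Python stays silent here, but a check for this is easy to sketch with the standard ast module. The function below (useless_statements is a made-up name) flags any statement that consists of nothing but a bare name or attribute reference, which is exactly what foo.pop without parens is. Linters perform similar checks; this is an illustration, not a replacement for one.

```python
import ast


def useless_statements(source):
    """Return line numbers of statements that only reference a name or attribute."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        # An expression statement whose whole value is a name or attribute
        # lookup computes a reference and then throws it away.
        if isinstance(node, ast.Expr) and isinstance(
            node.value, (ast.Name, ast.Attribute)
        ):
            hits.append(node.lineno)
    return sorted(hits)


buggy = "foo = list(range(1, 100))\nwhile len(foo) > 10:\n    foo.pop\n"
print(useless_statements(buggy))
```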
Strings are sequence types. At first glance this probably seems pretty harmless – you can iterate over the characters in a string. Probably useful, right? Well, not for me! I honestly can’t remember the last time I intentionally iterated over a string character by character in a language other than C. If I want to search a string I’ll use a regular expression or a call to index()/find() – faster and easier. So why does it irritate me? It leads to bugs, because it’s all too easy to accidentally put a string where a list should go and Python will then happily iterate over the string. For example, check out this bug (simplified a bit from the actual usage):
[sourcecode language="python"]
for name, value in request.GET.items():
    params[name] = value[0]
[/sourcecode]
The bug here is that value is a string, not an array of values. This bug resulted in each GET param getting truncated to a single character. And it completely escaped my attention because I tested it with parameters that were small numbers – 0, 1 and 5. So it got into live code that then failed when “2010-01-01” became “2”. So lame. I’d much rather Python said something useful like “You can’t index into a string dummy!”
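Short of changing the language, the complaint can be approximated with a guard. This is a sketch written for Python 3; first_of is an invented helper, not part of Python or Django. It refuses to index into a bare string, so the mistake fails loudly instead of quietly returning one character.

```python
def first_of(values):
    """Return the first item of a list or tuple, refusing bare strings."""
    if isinstance(values, (str, bytes)):
        raise TypeError("expected a list of values, got a string: %r" % (values,))
    return values[0]


print(first_of(["2010-01-01"]))  # a real list: the whole value comes back
print("2010-01-01"[0])           # a plain string index: just the first character
```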
Unicode support is a mess. This one really surprised me. I always thought Unicode in Perl was exceptionally bad, likely because it was always a second-class citizen. I thought for some reason that Python would have a better, saner system. Well, sadly no. In fact, it has pretty much all the same problems that Perl has. Consider this very common code:
[sourcecode language="text"]
body="""
Your mailing template for mailing %s has a syntax error:
%s
""" % ( mailing.id, exception)
[/sourcecode]
That’s the buggy version. It fails when the exception contains a non-ASCII character. There’s no way you can predict when this will be the case – if you could predict bugs that result in exceptions they wouldn’t be nearly as much fun, now would they? Here’s the fixed version:
[sourcecode language="text"]
body=u"""
Your mailing template for mailing %s has a syntax error:
%s
""" % ( mailing.id, exception)
[/sourcecode]
Did you spot the difference between code that works great and code that will make you hate yourself when it fails to show you a simple exception? It’s a single ‘u’ in front of the string quotes. You, the Python programmer, are supposed to remember to put that character in front of strings that could someday contain an object with non-ASCII data in it (straight-up unicode strings work by magic, see here for details). You’re supposed to figure it out and tell Python because for some reason upgrading the string at runtime would be the wrong thing to do. Please, shoot me now.
Actually, just wait, you can shoot me the next time I have to debug a random Unicode failure in our code-base. They pop up regularly when some object that nobody thought could have non-ASCII in it happens to get a non-ASCII character. Obviously better testing would help, but some things, like the contents of exceptions, are pretty hard to predict!
So that’s my list – got one of your own? There’s nothing like a good rant, go for it.
Emacs ups and downs
Every month or so I try to learn a new Emacs feature or extension – something beyond the usual buffer juggling and programming-language modes. Of course when you’re trying new things so frequently some of them are going to work better than others. Things I’ve tried that stuck:
- Keyboard Macros – probably the first “advanced” Emacs feature I learned and I use it all the time. I don’t save and name my macros as often as I should though.
- Tramp – tramp-mode allows me to run a fast local Emacs and edit files remotely with no setup required. Just open /server.example.com: and go. Underneath it uses SSH to access the files, and you can set it to use alternate methods (scp, sftp, rsync, etc). I do wish it was a little faster though, or multi-threaded so it didn’t block Emacs when saving over a slow link.
- Bookmarks – I bind M-b to bookmark-jump and I use it all the time. I have bookmarks for each project I’m working on and I use them with tramp-mode to get me onto the appropriate server.
- Yasnippet – a recent add, this is a module which provides a template system for Emacs. It comes with some useful boilerplate templates for various programming languages and you can easily add more. I use the class and def ones for Python periodically as well as ones I’ve added to set up a warn() call.
- browse-kill-ring – super useful to be able to pull up the full kill-ring and search for what you need. I have my kill-ring set to hold 100,000 entries, so if I’ve killed it in the current session I can be pretty sure I can get it back!
- auto-complete – mode-aware auto-completion. I’m still not sure this one is going to last, but it does help a lot sometimes, particularly when I’m coding deep in a Python file and I need to accurately type the name of an imported identifier from the top of the file. And I’m getting more used to hitting C-g when I need to keep what I’ve typed and not accept a completion. I think it’s most likely a keeper.
Of course, not every experiment is a success. Here’s a few notable recent ones that I’ve since abandoned:
- ido-mode – I wanted to like this one. Sometimes it’s a big time-saver, quickly navigating to files I’m trying to open in just a few keypresses. But just as often I’d find myself fighting with it, particularly when trying to create new files or navigate up a few levels. Ultimately I decided that the benefit of having a file path that’s editable the same way as normal text is just too much to give up.
- registers – this still seems like something I should be using. Surely the ability to remember locations and little bits of text and then replay them should come in handy. Alas, not often enough to actually remember the keystrokes on the rare occasions when I think to use them.
- rectangular selections – again, potentially very useful but I don’t need it often enough to remember the bindings. It doesn’t help that the default binds are so verbose, possibly I could learn to love this feature if I rebound it.
- Tags – I’ve set up TAGS file generation for several projects now, and each time I use it for a while and then fall back to grep and ack. I think the way tag searches work just doesn’t match the way I want to search for things – I want to quickly browse through a list of hits, not jump from file to file. Still, being able to jump from the use of a function directly to its definition seems like it should be very useful!
I’m always curious about how other people use Emacs – what features do you use most and what have you tried that didn’t work out?
New release: onlinepayment v1.0.0
I’ve finally gotten around to doing my first open-source Python release, v1.0.0 of onlinepayment:
http://pypi.python.org/pypi/onlinepayment/1.0.0
This module provides a wrapper around two payment processors so you can write code that works with both. It’s based on a Perl module which does the same thing – Business::OnlinePayment. It goes further than Business::OnlinePayment – providing error handling and, thanks to my co-worker Aaron Ross, recurring billing too.
This is also the first open-source code to result from ActionKit. I hope we’ll find other opportunities like this in the future, we’ve got a lot of useful code in the project.
On Python Present and Perl Past
I’ve been working in Python almost exclusively for the last 8 or 9 months. It’s been a fun challenge learning a new language, and being able to do it along with the rest of the We Also Walk Dogs crew has made it even better. A dip back into Perl has given me a chance to reflect on my progress with Python.
The past couple weekends I’ve been helping a friend with a small project – a textual analysis problem finding similarities between disparate documents in a large database. I immediately reached for Perl because I’ve done projects like this in Perl before and I knew all the tools I’d need. I indexed phrases from the docs using Digest::MD5, storing the index in MySQL with DBD::mysql. Then I whipped up a quick web-app with CGI::Application and HTML::Template, with a bit of help from HTML::FillInForm, Blueprint CSS and Config::General. With the exception of Blueprint (which I find indispensable these days), this is pretty much the toolkit I learned (and helped build) at Vanguard Media working for Jesse Erlbaum over ten years ago! It all worked great and the app was up and running in just a day and a half.
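For comparison, the phrase-indexing idea looks roughly like this in Python. This is a sketch and not a port of the Perl code: hashlib stands in for Digest::MD5, an in-memory dict stands in for the MySQL table, and the three-word phrase size and all function names are assumptions.

```python
import hashlib
from collections import defaultdict


def phrase_hashes(text, size=3):
    """Hash every run of `size` consecutive words in the text."""
    words = text.lower().split()
    return {
        hashlib.md5(" ".join(words[i:i + size]).encode("utf-8")).hexdigest()
        for i in range(len(words) - size + 1)
    }


def build_index(docs, size=3):
    """Map each phrase hash to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for digest in phrase_hashes(text, size):
            index[digest].add(doc_id)
    return index


def shared_phrase_counts(index):
    """Count how many phrases each pair of documents has in common."""
    counts = defaultdict(int)
    for doc_ids in index.values():
        ids = sorted(doc_ids)
        for i, first in enumerate(ids):
            for second in ids[i + 1:]:
                counts[(first, second)] += 1
    return dict(counts)
```

Documents that share many phrase hashes are candidates for a closer look. Storing fixed-length digests instead of the phrases themselves keeps the index compact.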
This experience makes it obvious to me that I still have a long way to go with Python. Perl syntax is second nature to me – I almost never make an error and everything works the way I expect it to. I’m not looking up basic Python syntax anymore but I am still making plenty of mistakes. More to the point, the tools I needed to use for this project are all still completely ingrained in my memory, and behave exactly the way I expect them to. Compared to the Python tools I use daily (Django, Celery and MySQLdb, for example), the difference is really impressive. My Python tools often surprise me and I find myself going back to the docs, and failing that the source, frequently.
It’s also interesting to think about how little has changed in the past 10 years. I can pick up the same tools I used then and construct something that most people would recognize as a modern web app. Mix in just a little jQuery and it would probably pass for Web 2.0. On the other hand, I think I can say I’ve gotten better as I’ve aged – this project would have taken me quite a bit longer 10 years ago, if I could have completed it at all. I probably would have gotten stuck on some completely insane plan like loading all the documents into memory at once. I really didn’t know how to properly use a database back then!
I am looking forward to making my first open-source Python release soon. Who knows, maybe there’s a book about writing Python packages for PyPI in my future!
My Setup
I was reading through an interview series called The Setup recently. Nerds are asked to describe their gear, both hardware and software, and then what their dream version would be. Fun stuff. Highlights for me were RMS, Aaron Swartz (one of our clients for ActionKit!) and Máirín Duffy (one of the few Linux users profiled aside from RMS). It’s all very Apple heavy, but still interesting to think about how different people set up their kit for fairly similar tasks. It made me want to put in my two cents, but I don’t think I’m famous enough to rate a slot on the site, so…
Who are you, and what do you do?
I’m Sam Tregar and I spend most of my time coding for We Also Walk Dogs, working with our many progressive political and non-profit clients. I work mostly in Python and Perl.
What hardware are you using?
My main work machine is a three year old Thinkpad T60p – a 15.4″ widescreen model with a very sharp 1680×1050 resolution, 2GB RAM and a 2GHz Core 2 Duo CPU. I recently replaced the hard drive with a fast SSD which was a huge upgrade, equivalent to getting a whole new machine for much less money. I keep the trackpad turned off – my typing style is so right-handed that I frequently palm-over the trackpad – and use the trackpoint exclusively. I’ve been using Thinkpads for a while now and it really comes down to the keyboard – all the keys are in the right place, and all the keys are big enough for my sausage-esque fingers.
At the office I used to have a sweet little Shuttle KPC machine with a Celeron 450 and 2GB of RAM. Then about 3 months past its one year warranty it stopped working. So now I’m using my wife’s discarded four year old Dell laptop – a 2.2GHz Core 2 Duo with 1GB of RAM. It used to have 2GB before one of the RAM slots mysteriously turned bad. I have it hooked up to a 21″ widescreen LCD running at 1680×1050. My keyboard is a Keytronic Lifetime with the all-important classic IBM layout, just like the Thinkpad. I use a Kensington Expert “Mouse” which is actually a trackball. It’s nice but I really miss the trackballs Logitech used to make.
I’ve also got a custom-built gaming rig. It’s an AMD Phenom II BE450 with 4GB of RAM and an Nvidia GeForce 216 GPU. Years ago I put out the cash for a fast WD Raptor 10k disk, but I think about replacing it with an SSD frequently. It’s in a fancy low-noise case, the Antec P180.
Rounding out the local network is a Shuttle KPC operating as a storage server and home for occasional personal-use web apps, like the app I use to tell me when to leave the house to make a train. It’s got a Celeron of some type, 2GB of RAM and 1TB of storage. I’m hoping it doesn’t suffer a similar fate to my late office model.
Oh, and I get my ass out of bed with a Chumby, the best Linux-powered alarm clock ever.
And what software?
I’ve been using Linux as a desktop OS since I was 15 years old – Slackware v1.1.2 on my 386DX40. I consider myself extremely lucky to have grown up with Linux and to be able to use it both on my machines and on virtually every server I work on (once in a blue moon I’ll work on a BSD or Sun box). My distro of choice is currently Fedora – I’m running Fedora 11 on most of my machines, but my office laptop has Fedora 12. The only machine I don’t run Linux on is the game machine, which is running Windows 7.
Far and away the most important app for me is GNU Emacs. I’ve been using Emacs for around 13 years now and I’m still learning new things. I have a 400+ line .emacs file with tons of custom functions, many of which I use daily. Every so often I play with a new editor but I honestly can’t imagine working without Emacs.
The other two programs I use when I code are Chrome (I’m a recent convert from Firefox, so I still flip back now and again) and Gnome Terminal, invariably connected to one or more screen sessions. I read my email with Gmail, and I use Pidgin for IM and IRC.
What would be your dream setup?
I hate synchronizing things. I want to be able to sit down at any machine and have all my customized setup magically available. I’ve got some of this now between Xmarks for Chrome+Firefox and a private Subversion repository for my dot-files, but there’s a lot that’s not accounted for. I’ve standardized all my hardware on a common screen resolution just to make this easier (1680×1050) – same (gigantic) font sizes, identical Gnome layout, etc.
Another frequent wish is that I could have a full-size keyboard, trackball and screen available while I’m working in my living room. I actually achieved this during college with a hilariously dangerous monitor arm attached to a futon. Imagine sitting with a 20″ CRT directly above your lap, the monitor arm and futon groaning audibly under the weight. Someday I may recreate that beautiful setup with a lightweight LCD. Ideally the screen would just float in front of me, held up by fairies. And hey, since we’re headed in that direction, drop the keyboard and trackball – I’ll just control it by telepathy. This is going to be great.
I always want a faster connection – lower latency and more bandwidth. It’s a mark of newfound fiscal responsibility that I haven’t yet ordered Optimum Ultra, the ultra-fast, ultra-expensive data plan from my cable company. If/when they drop the $300 install fee I’ll probably do it.
Reverse Engineering Precinct Maps
During the run-up to the 2008 election I had the opportunity to work on a challenging mapping problem, as part of my work for We Also Walk Dogs. Our client MoveOn.org planned to run a multi-state get-out-the-vote field campaign. This kind of campaign involves teams of volunteers going door-to-door having conversations with voters and recording the results. Later the same volunteers will help turn out voters to their polling places.
The early phase of this project focused on picking the areas where the campaign would operate – a process known as turf assignment. The traditional way to do this is to assign work by precinct. We needed a way to display precinct maps visually to aid in picking precincts and assigning them to offices.
You might think that there would be some publicly available resource for precinct maps in a consistent format covering every state. Sadly, this is not the case. In fact, for reasons that escape me to this day precinct maps are carefully guarded secrets controlled at the state level. It is sometimes possible to buy the precinct map for a given state, but not always and certainly not in a single standardized format.
What we did have was the voter file. The voter file is a publicly available list of every voter in every state (almost, nothing is perfect of course). Critically for our purposes it includes both the voter’s address and their precinct. We theorized that we could reverse engineer the precinct maps from the voter file data – essentially defining a precinct as a list of voters in that precinct and then drawing a shape which enclosed them.
To give you an idea of how I thought it would work, consider this map:
The dots on the map are voters assigned to three different precincts. My idea was to come up with some way to draw shapes around them which would approximate the actual precinct shape. For example:
My first attempt at solving this problem was to write a geometric algorithm which attempts to find what’s called the “convex hull” containing all the voters in a precinct. The simplest way to think about a convex hull is as the shape a rubber-band would make if you pulled it around all the points. In this case it might look something like:
Not bad, right? Sadly things in the real world aren’t so simple! It’s unfortunately all too common to have precincts more like this:
And when you try to draw a convex hull around these points you get:
And if you think that’s bad, imagine what that would look like if there were a few more precincts shown and they all had overlapping segments. It’s not unheard of for one precinct to entirely enclose another!
Even worse, the voter file data contains errors – some people are assigned precincts that are actually many miles away from where they should be. When you try to fit a convex hull around a precinct with just one bad address you get a shape with a very long spike. Ugly and ultimately unusable.
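To make the rubber-band idea and the spike problem concrete, here is a short convex hull sketch in Python. It uses Andrew's monotone chain, a standard textbook algorithm, which is not necessarily the one I wrote at the time. The coordinates are made up.

```python
def cross(o, a, b):
    """Positive when the turn o -> a -> b is counter-clockwise."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def half_hull(points):
    """Build one side of the hull, dropping points that make a non-left turn."""
    hull = []
    for p in points:
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull


def convex_hull(points):
    """Return the hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = half_hull(pts)
    upper = half_hull(reversed(pts))
    # The last point of each half is the first point of the other.
    return lower[:-1] + upper[:-1]


voters = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
tidy = convex_hull(voters)
# One badly geocoded voter far to the east stretches the hull into a spike.
spiky = convex_hull(voters + [(100, 1)])
```

In the tidy case the interior voter at (1, 1) is dropped and the hull is just the four corners. Add the single bad address and the hull must reach all the way out to (100, 1), which is the long spike described above.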
My first solution to this problem was to divide the map into a grid and color each grid box according to its composition, subdividing as necessary. The maps produced this way were actually surprisingly usable, given the brute-force nature of the algorithm. They looked something like:
This worked but it was still pretty far from ideal. In particular it looked nothing like the pretty paper precinct maps that people are used to looking at.
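The grid approach can be sketched as a recursive subdivision, quadtree style. This is a reconstruction of the idea and not the original code: it assumes voters are (x, y, precinct) tuples, that a box is done once every voter in it shares a precinct, and that mixed boxes at the depth limit take the majority precinct.

```python
from collections import Counter


def color_grid(voters, box, max_depth=6):
    """Return a list of (box, precinct) cells covering the voters in box.

    voters is a list of (x, y, precinct); box is (x0, y0, x1, y1).
    """
    x0, y0, x1, y1 = box
    inside = [v for v in voters if x0 <= v[0] < x1 and y0 <= v[1] < y1]
    if not inside:
        return []  # empty boxes stay uncolored
    counts = Counter(precinct for _, _, precinct in inside)
    if len(counts) == 1 or max_depth == 0:
        # Uniform box, or out of depth: color it with the majority precinct.
        return [(box, counts.most_common(1)[0][0])]
    # Mixed box: split into four quadrants and recurse.
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    cells = []
    for sub in ((x0, y0, mx, my), (mx, y0, x1, my),
                (x0, my, mx, y1), (mx, my, x1, y1)):
        cells.extend(color_grid(inside, sub, max_depth - 1))
    return cells
```

Drawing each returned box filled with its precinct's color gives the blocky maps described above.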
The final solution I arrived at after boning up on my math skills was something called a Voronoi diagram. In simple terms a Voronoi diagram forms shapes by including all the area which is closer to a given point than any other. It’s ideal for building up maps based on a set of points.
Here’s what the final results looked like:
As you can see the Voronoi algorithm is able to construct shapes to fit very complex constraints, and it even provided enough shape data to draw outlines around the shapes. Compared side-by-side with the real precinct maps (often just scans of paper maps, sadly) the generated maps were often very close.
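The definition of a Voronoi diagram translates almost directly into a brute-force sketch: for every cell of a raster, find the nearest voter and take that voter's precinct. This is only an approximation for illustration. A real Voronoi algorithm computes the polygon edges directly and is far more efficient, and this is not how the Perl module mentioned below works. The names and the (x, y, precinct) tuple layout are assumptions.

```python
def nearest_precinct(x, y, voters):
    """Return the precinct of the voter closest to the point (x, y)."""
    closest = min(voters, key=lambda v: (v[0] - x) ** 2 + (v[1] - y) ** 2)
    return closest[2]


def rasterize(voters, width, height):
    """Label every unit cell of a width x height grid by its nearest voter."""
    return [
        [nearest_precinct(col + 0.5, row + 0.5, voters) for col in range(width)]
        for row in range(height)
    ]


voters = [(1, 1, "A"), (5, 1, "B")]
for row in rasterize(voters, 6, 2):
    print("".join(row))
```

Because every location is claimed by its nearest voter, precincts with odd shapes, or precincts nested inside one another, come out naturally. A single misplaced voter produces a small stray patch instead of a map-wide spike.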
If you’re interested in using the Voronoi algorithm in your own code I was able to release the code for it as a Perl module on CPAN. You can download it here:
http://search.cpan.org/~samtregar/Math-Geometry-Voronoi/
I’m hoping the 2010 elections will give me a good excuse to dive back into this problem – there’s still so much that could be done to make precinct maps easier to use.