Sometimes I have a great job, a job where I get to do exactly what I want in exactly the way I want to do it. And what is it, you might ask, that I want to do? I want to build the perfect data upload system. Why? A few reasons:
- It’s not an easy problem. Data upload is complicated by the fact that the most common format we support (CSV) isn’t even close to standardized. Line endings, character sets and quoting all vary from file to file. Since we can’t enforce much uniformity on the data we accept, our system has to be very flexible. Add the requirement to handle large volumes of data and there are plenty of challenges.
- Doing it badly is acutely painful for our clients. And when our clients feel pain they naturally pass it along to us. I’d say at least 10% of our support requests have been related in some way to our upload system.
- It’s a great project for test-driven development (TDD), which is my favorite way to work. Since data uploads are so deterministic, it’s very easy to work on the problem in a straightforward TDD manner.
- It requires a high level of parallelism to run quickly. Parallel programming is a fun challenge in itself and the payoff is great when it works.
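To give a feel for the CSV variability mentioned in the first point, here is a minimal sketch using only Python's standard library. It is not the code from our upload system; the function name `read_rows`, the list of candidate encodings and the set of candidate delimiters are all assumptions made for illustration.

```python
import csv
import io


def read_rows(raw, encodings=("utf-8-sig", "latin-1")):
    """Decode raw upload bytes and parse them as CSV with a guessed dialect.

    Hypothetical helper: the encodings and delimiters tried here are
    illustrative assumptions, not a list taken from a real system.
    """
    # Try each candidate encoding in order. latin-1 can decode any byte
    # sequence, so it acts as a last-resort fallback.
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    else:
        raise ValueError("could not decode upload with any known encoding")

    # Guess the delimiter and quote character from a sample of the file.
    # If the sniffer cannot decide, fall back to plain comma-separated.
    try:
        dialect = csv.Sniffer().sniff(text[:4096], delimiters=",;\t|")
    except csv.Error:
        dialect = csv.excel

    # newline="" leaves line endings untouched so the csv module can
    # handle \n and \r\n itself.
    return list(csv.reader(io.StringIO(text, newline=""), dialect))
```

Calling `read_rows(b'name;email\r\n"Doe; Jane";jane@example.com\r\n')` should yield two rows, with the quoted semicolon kept inside the first field. The sniffer is a heuristic, so a production system would still need a way for the user to override its guesses.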
I’m pretty happy with the way the project turned out. I made a screencast showing off some of the new front-end features:
ActionKit Upload Improvements on Vimeo.
It’s also my first attempt at screencasting. Eesh, my voice.
The frontend uses jQuery with jQote to get updates from the running upload job and update the status display. The progress bar is canvas-based and uses RGraph.
The backend code uses Celery to queue upload jobs from our Django front-end. The jobs themselves use a multiprocessing-based job pool system which we first developed for our mail sender and which has since been abstracted out as a reusable component by my co-worker Randall Farmer (it could be worth releasing on PyPI at some point, since it has some unique features). Stopping an upload early works by sending a message with Carrot, the underlying AMQP client used by Celery to talk to RabbitMQ.
I tried a new approach this time with regard to the way errors and status are handled by the workers. Instead of trying to report info up to the parent, each worker writes status and errors directly to the database. The parent can query the database to get updates on the workers. This will hopefully help avoid some of the deadlocking problems inherent in systems that rely on bi-directional communication between parents and workers. It also made building the front-end easier: the status reported by the workers was easy to turn into JSON and send up to the client for display.
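The status-through-the-database idea can be sketched roughly like this. It is a simplified illustration, not our actual code: it uses SQLite instead of our real database, and the table name `worker_status`, its columns, and the functions `init_db`, `report` and `poll` are all hypothetical. Each function opens its own connection from a path, so it could be called from separate worker processes.

```python
import json
import sqlite3
from contextlib import closing

# Hypothetical schema: one row per worker, updated in place.
SCHEMA = """
CREATE TABLE IF NOT EXISTS worker_status (
    worker_id  INTEGER PRIMARY KEY,
    rows_done  INTEGER NOT NULL DEFAULT 0,
    errors     INTEGER NOT NULL DEFAULT 0,
    last_error TEXT
)
"""


def init_db(path):
    """Create the status table (run once by the parent before starting workers)."""
    with closing(sqlite3.connect(path)) as conn, conn:
        conn.execute(SCHEMA)


def report(path, worker_id, rows_done, error=None):
    """Called by a worker: record progress in its own row, never message the parent."""
    with closing(sqlite3.connect(path)) as conn, conn:
        conn.execute(
            "INSERT OR IGNORE INTO worker_status (worker_id) VALUES (?)",
            (worker_id,),
        )
        conn.execute(
            "UPDATE worker_status "
            "SET rows_done = ?, errors = errors + ?, "
            "    last_error = COALESCE(?, last_error) "
            "WHERE worker_id = ?",
            (rows_done, 1 if error else 0, error, worker_id),
        )


def poll(path):
    """Called by the parent: read every worker's row and return a JSON summary."""
    with closing(sqlite3.connect(path)) as conn:
        rows = conn.execute(
            "SELECT worker_id, rows_done, errors, last_error "
            "FROM worker_status ORDER BY worker_id"
        ).fetchall()
    workers = [
        {"worker_id": w, "rows_done": done, "errors": errs, "last_error": last}
        for (w, done, errs, last) in rows
    ]
    return json.dumps({
        "rows_done": sum(w["rows_done"] for w in workers),
        "errors": sum(w["errors"] for w in workers),
        "workers": workers,
    })
```

Because the communication is one-way (workers write, the parent only reads), neither side ever blocks waiting on the other, and the string returned by `poll` is already in a shape the front-end can display.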
All in all, a fun project. I hope it works as well in practice as it has during development.
What’s the best data upload system you’ve used? Written? Lived through?