The Upload System of My Dreams

Sometimes I have a great job, a job where I get to do exactly what I want in exactly the way I want to do it. And what is it, you might ask, that I want to do? I want to build the perfect data upload system. Why? A few reasons:

It’s not an easy problem. Data upload is complicated by the fact that the most common format we support (CSV) isn’t even close to standardized. Line endings, character sets and quoting all vary from file to file. Since we can’t enforce much uniformity on the data we accept, our system has to be very flexible. Add the requirement to support large volumes of data and there are plenty of challenges.
Doing it badly is acutely painful for our clients. And when our clients feel pain they naturally pass it along to us. I’d say at least 10% of our support requests have been related in some way to our upload system.
It’s a great project for test-driven development (TDD), which is my favorite way to work. Since data uploads are so deterministic, it’s very easy to work on the problem in a straightforward TDD manner.
It requires a high level of parallelism to run quickly. Parallel programming is a fun challenge in itself and the payoff is great when it works.
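
To give a feel for the CSV problem in the first point, here is a minimal sketch using only Python's standard library. It is not the ActionKit code; the function name read_rows, the fallback encoding list and the candidate delimiters are all assumptions made for illustration. It tries a couple of character sets, lets csv.Sniffer guess the delimiter, and leaves line-ending handling to the csv module.

```python
import csv
import io


def read_rows(raw, encodings=("utf-8-sig", "latin-1")):
    """Decode uploaded bytes and parse them with a guessed CSV dialect.

    Hypothetical helper: the encoding list and delimiter candidates are
    illustrative assumptions, not a description of any real upload system.
    """
    text = None
    for enc in encodings:
        try:
            text = raw.decode(enc)
            break
        except UnicodeDecodeError:
            continue  # try the next candidate character set
    if text is None:
        raise ValueError("could not decode upload with any known encoding")

    # Guess the delimiter from a sample of the file; fall back to the
    # Excel-style comma dialect if the sniffer cannot decide.
    try:
        dialect = csv.Sniffer().sniff(text[:4096], delimiters=",;\t|")
    except csv.Error:
        dialect = csv.excel

    # newline="" stops StringIO from translating line endings, so the
    # csv reader sees the original \r\n or \n terminators itself.
    return list(csv.reader(io.StringIO(text, newline=""), dialect))
```

Guessing like this only goes so far: a sniffer can be fooled by short or irregular files, so a real system would probably also let the user override the guess.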

I’m pretty happy with the way the project turned out. I made a screencast showing off some of the new front-end features:

ActionKit Upload Improvements on Vimeo.

It’s also my first attempt at screencasting. Eesh, my voice.

The front-end uses jQuery with jQote to get updates from the running upload job and update the status display. The progress bar is canvas-based and uses RGraph.

The backend code uses Celery to queue upload jobs from our Django front-end. The jobs themselves use a multiprocessing-based job pool system, which we first developed for our mail sender and which my co-worker Randall Farmer has since abstracted into a reusable component (it could be worth releasing on PyPI at some point, since it has some unique features). Stopping an upload early works by sending a message with Carrot, the underlying AMQP client used by Celery to talk to RabbitMQ.

I tried a new approach this time to the way the workers handle errors and status. Instead of trying to report info up to the parent, each worker writes status and errors directly to the database, and the parent queries the database to get updates on the workers. This will hopefully help avoid some of the deadlocking problems inherent in systems that rely on bi-directional communication between parents and workers. It also made building the front-end easier: the status reported by the workers was easy to turn into JSON and send up to the client for display.
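
The pattern is easy to sketch. What follows is a rough illustration under stated assumptions, not the real implementation: it uses SQLite in place of our actual database, and the table names, the function names (init_db, process_chunk, job_status) and the "must contain an @" validation rule are all made up for the example. Each worker records its own progress and errors, and the parent only ever reads.

```python
import json
import sqlite3

# Hypothetical schema: one progress row per worker, one row per error.
SCHEMA = """
CREATE TABLE IF NOT EXISTS worker_status (
    job_id INTEGER, worker_id INTEGER, rows_done INTEGER,
    PRIMARY KEY (job_id, worker_id)
);
CREATE TABLE IF NOT EXISTS upload_error (
    job_id INTEGER, line_no INTEGER, message TEXT
);
"""


def init_db(db_path):
    conn = sqlite3.connect(db_path)
    try:
        conn.executescript(SCHEMA)
    finally:
        conn.close()


def process_chunk(db_path, job_id, worker_id, rows):
    """Worker side: handle (line_no, email) pairs and report via the database."""
    conn = sqlite3.connect(db_path)  # every worker opens its own connection
    try:
        done = 0
        for line_no, email in rows:
            if "@" not in email:  # placeholder validation rule
                conn.execute(
                    "INSERT INTO upload_error VALUES (?, ?, ?)",
                    (job_id, line_no, "invalid email: %r" % email),
                )
            done += 1
            conn.execute(
                "INSERT OR REPLACE INTO worker_status VALUES (?, ?, ?)",
                (job_id, worker_id, done),
            )
            conn.commit()  # make progress visible to the parent right away
    finally:
        conn.close()


def job_status(db_path, job_id):
    """Parent side: read what the workers wrote and build JSON for the browser."""
    conn = sqlite3.connect(db_path)
    try:
        done = conn.execute(
            "SELECT COALESCE(SUM(rows_done), 0) FROM worker_status WHERE job_id = ?",
            (job_id,),
        ).fetchone()[0]
        errors = conn.execute(
            "SELECT line_no, message FROM upload_error WHERE job_id = ? ORDER BY line_no",
            (job_id,),
        ).fetchall()
    finally:
        conn.close()
    return json.dumps({
        "job_id": job_id,
        "rows_done": done,
        "errors": [{"line": n, "message": m} for n, m in errors],
    })
```

In a real job each process_chunk call would run inside a separate pool worker, while the web view calls job_status whenever the browser asks for an update. Nothing flows from worker back to parent except through the database, which is the point of the design.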

All in all, a fun project. I hope it works as well in practice as it has during development.

What’s the best data upload system you’ve used? Written? Lived through?

4 Comments

Marc says:

December 31, 2010 at 3:10 pm

Hi Sam,

Is JQuery doing any error checking or just display handling for the workers UI/UX? We need more inline editing! Not to mention a sweet lime green progress bar. ;)
Sam Tregar says:

December 31, 2010 at 5:14 pm

@Marc
It’s just handling the display. It makes AJAX requests to check for status updates and find new errors and warnings. When it finds new data it updates the display. Pretty standard really – RGraph was the only piece I hadn’t used before.
Aaron Swartz says:

January 6, 2011 at 9:23 am

I can’t tell from the video, but does the new upload system have stable URLs for a job? It’d be great if the URL of the page with the upload progress bar was something like

/upload/jobs/23

and then I could reload that URL if the Ajax updating got stuck or my browser crashed or I was moving to another machine.
Sam Tregar says:

January 6, 2011 at 10:51 am

@Aaron Swartz
Yes, it does.
