python distributed computing

Chris Angelico rosuav at gmail.com
Wed Aug 3 11:51:20 EDT 2011


On Wed, Aug 3, 2011 at 4:37 PM, julien godin <t4rtine at gmail.com> wrote:
> I have a HUUUUUGE (everything is relative) amount of data coming from
> UDP/514 (good old syslog), and one process alone could not handle it
> (it's full of regex, comparisons and counting, all at about 100K lines
> of data per second, and each line is about 1500 bytes long. I tried to
> handle it in one process: 100% core time in htop.)
> So I said to myself: why not distribute the computing?

You could brute-force it by forking out to multiple computers, but
that would entail network traffic, which might defeat the purpose. If
you have a multi-core CPU or a multi-CPU computer, a better option
would be to look into the multiprocessing module for some simple ways
to divide the work between cores/CPUs. Otherwise, there are a few
things to try.
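
Something along these lines might work (an untested sketch:
parse_line() and the stdin feed are stand-ins for your real logic and
your UDP reader) - one feeder fills a queue, and one worker per core
drains it:

    import multiprocessing
    import sys

    def parse_line(line):
        # Stand-in for the real regex/comparison/counting work.
        return 1 if 'ERROR' in line else 0

    def worker(inbox, results):
        # Pull lines until the None sentinel arrives, then report a total.
        total = 0
        for line in iter(inbox.get, None):
            total += parse_line(line)
        results.put(total)

    if __name__ == '__main__':
        inbox = multiprocessing.Queue(maxsize=10000)
        results = multiprocessing.Queue()
        procs = [multiprocessing.Process(target=worker, args=(inbox, results))
                 for _ in range(multiprocessing.cpu_count())]
        for p in procs:
            p.start()
        for line in sys.stdin:       # stand-in for your UDP/514 reader
            inbox.put(line)
        for p in procs:
            inbox.put(None)          # one sentinel per worker
        for p in procs:
            p.join()
        print(sum(results.get() for _ in procs))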

First and most important thing: Optimize your algorithms! Can you do
less work and still get the same result? For instance, can you replace
the regex with two or three simpler string methods? Profile your code
to find out where most of the time is being spent, and see if you can
recode those parts.
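
To make that concrete (the log line and patterns below are made up),
timeit lets you race the regex against plain string methods, and
running your whole script under "python -m cProfile yourscript.py"
will show where the time actually goes:

    import re
    import timeit

    line = "<13>Aug  3 11:51:20 host42 app: connection refused"
    pattern = re.compile(r'connection (refused|reset)')

    def with_regex():
        return pattern.search(line) is not None

    def with_str_methods():
        # Two plain substring tests instead of one alternation regex.
        return 'connection refused' in line or 'connection reset' in line

    print(timeit.timeit(with_regex, number=100000))
    print(timeit.timeit(with_str_methods, number=100000))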

Second: Try PyPy or Cython for higher-performance Python code.
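
Both are close to drop-in for pure Python. For instance (file and
function names made up), a loop like this runs unchanged under PyPy
("pypy logstats.py"), and Cython can compile the same source into a C
extension:

    # logstats.py - pure Python, no changes needed for PyPy's JIT.
    def count_matches(lines, needle):
        # Tight loops like this are exactly where a JIT or compiler helps.
        n = 0
        for line in lines:
            if needle in line:
                n += 1
        return n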

And third, if you still can't get the performance you need, consider
changing languages. Python isn't the best language for fast execution
- it's good for fast coding. Rewrite some or all of your code in C,
Pike, COBOL, raw assembly language... okay, maybe not those last two!
Since you profiled your code back in step 1, you'll know which parts
are the best candidates for a C rewrite.
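
You don't have to rewrite everything, either: with ctypes you can move
just the hot function into C and keep the rest in Python. Everything
below (fastparse.so and its match_count function) is hypothetical:

    import ctypes

    # Hypothetical shared library built from the C rewrite of the hot
    # spot, e.g.:  gcc -O2 -shared -fPIC -o fastparse.so fastparse.c
    fastparse = ctypes.CDLL('./fastparse.so')
    fastparse.match_count.argtypes = [ctypes.c_char_p]
    fastparse.match_count.restype = ctypes.c_int

    def parse_line(line):
        # Thin wrapper: Python keeps the I/O, C does the heavy lifting.
        return fastparse.match_count(line)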

I think there are quite a few options better than forking across
computers, although distributed computing IS a lot of fun.

Chris Angelico


