Is text processing with dicts a good use case for Python cross-compilers like Cython/Pyrex or ShedSkin?

Thu Dec 16 16:45:14 EST 2010

python at bdurham.com, 16.12.2010 21:03:
> Is text processing with dicts a good use case for Python
> cross-compilers like Cython/Pyrex or ShedSkin? (I've read the
> cross compiler claims about massive increases in pure numeric
> performance).

Cython is generally a good choice for string processing, simply because it 
can compile a lot of code, such as character iteration and comparison, down 
to plain C. Depending on what kind of operations you do, you can get 
speed-ups of 100x or more for that.

http://docs.cython.org/src/tutorial/strings.html

However, when it comes to dict lookups, Cython uses CPython's own dicts, which 
are already heavily optimised for string keys. So the speedup in that 
area will likely stay below 30%. Similarly, encoding and decoding use 
Python's codecs, so don't expect a major difference there.
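
A rough way to check this on your own data is to time the dict-heavy part 
separately from the character-level part before compiling anything. The 
sketch below is plain Python, not Cython. The function names `count_words` 
and `count_vowels` and the sample text are made up for illustration, it is 
written for a current Python 3, and the timings it prints will vary by machine.

```python
import timeit

def count_words(text):
    """Dict-heavy work: one dict lookup and update per word."""
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def count_vowels(text):
    """Character-level work: one membership test per character."""
    total = 0
    for ch in text:
        if ch in "aeiou":
            total += 1
    return total

# Hypothetical sample input; substitute a slice of your real data.
sample = "the quick brown fox jumps over the lazy dog " * 1000

if __name__ == "__main__":
    for func in (count_words, count_vowels):
        seconds = timeit.timeit(lambda: func(sample), number=20)
        print("%s: %.4f s for 20 runs" % (func.__name__, seconds))
```

If most of the time goes into the character loop, compilation is likely to 
pay off; if it goes into the dict updates, expect the smaller gain described 
above.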

> I have 3 use cases I'm considering for Python-to-C++
> cross-compilers for generating 32-bit Python extension modules
> for Python 2.7 for Windows.
>
> 1. Parsing UTF-8 files (basic Python with lots of string
> processing and dict lookups)

"Parsing" sounds like something that could easily benefit from Cython 
compilation.

> 2. Generating UTF-8 files from nested list/dict structures

That should be much faster in Cython, too, simply because iteration on 
builtin types is much faster than in Python.

> 3. Parsing large ASCII "CSV-like" files and using dicts to
> calculate simple statistics like running totals, min, max, etc.

Again, parsing will be much faster, especially when reading the input 
through raw C file I/O (which would also allow releasing the GIL, in case 
you want to use multi-threading). The rest may not win that much.
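
As a rough illustration of that kind of workload, here is a minimal sketch in 
plain Python. The function name `column_stats`, the field names, and the 
sample data are assumptions made up for this example, and it is written for a 
current Python 3. The row parsing and `float()` conversion are the part that 
compilation could speed up; the per-key dict updates are the part that 
probably won't gain much.

```python
import csv
import io

def column_stats(lines, key_field, value_field):
    """Return {key: {"total", "min", "max", "count"}} for CSV-like input."""
    stats = {}
    for row in csv.DictReader(lines):
        key = row[key_field]
        value = float(row[value_field])
        entry = stats.get(key)
        if entry is None:
            # First time this key is seen: initialise all statistics.
            stats[key] = {"total": value, "min": value,
                          "max": value, "count": 1}
        else:
            entry["total"] += value
            entry["count"] += 1
            if value < entry["min"]:
                entry["min"] = value
            if value > entry["max"]:
                entry["max"] = value
    return stats

# Hypothetical in-memory input standing in for a large file.
sample = io.StringIO("region,amount\neast,10\nwest,4\neast,2.5\nwest,7\n")
result = column_stats(sample, "region", "amount")
print(result["east"])
```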

A nice feature of Cython is that you do not have to go low-level right 
away. You can use all the niceness of Python, and only push the code closer 
to C level where your benchmarks point you. And if you really have to go 
all the way down to C, it's just a declaration away.
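
One way to find where the benchmarks point is to run the unmodified Python 
code under the standard library profiler first. The sketch below uses 
`cProfile` and `pstats`; the functions `tokenize`, `tally`, `process`, and 
`profile_report` are hypothetical stand-ins for your own code, written for a 
current Python 3.

```python
import cProfile
import io
import pstats

def tokenize(text):
    """Split the input into whitespace-separated tokens."""
    return text.split()

def tally(tokens):
    """Count occurrences of each token in a dict."""
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    return counts

def process(text):
    return tally(tokenize(text))

def profile_report(func, *args):
    """Run func(*args) under cProfile; return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(10)
    return result, buf.getvalue()

result, report = profile_report(process, "a b a c " * 50000)
print(report)
```

The functions near the top of the report are the candidates for static 
typing or a move to C-level code; everything else can stay as ordinary Python.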

> Are any of these text processing scenarios good use cases for
> tools like Cython, Pyrex, or ShedSkin? Are any of these
> specifically bad use cases for these tools?

Pyrex isn't worth trying here, simply because you'd have to invest a lot 
more work to make it as fast as what Cython gives you anyway. ShedSkin may 
be worth a try, depending on how well you get your ShedSkin module 
integrated with CPython. (It seems that it has support for building 
extension modules by now, but I have no idea how well that is fleshed out).

> We've tried Psyco and it has sped up some of our parsing
> utilities by 200%. But Psyco doesn't support Python 2.7 yet and
> we're committed to using Python 2.7 moving forward.

If 3x is not enough for you, I strongly suggest you try Cython. The C code 
that it generates compiles nicely in all major Python versions, currently 
from 2.3 to 3.2.

Stefan