Is text processing with dicts a good use case for Python cross-compilers like Cython/Pyrex or ShedSkin?
Stefan Behnel
stefan_ml at behnel.de
Thu Dec 16 16:45:14 EST 2010
python at bdurham.com, 16.12.2010 21:03:
> Is text processing with dicts a good use case for Python
> cross-compilers like Cython/Pyrex or ShedSkin? (I've read the
> cross compiler claims about massive increases in pure numeric
> performance).
Cython is generally a good choice for string processing, simply because it
can drop a lot of code into plain C, such as character iteration and
comparison. Depending on what kind of operations you do, you can get
speed-ups of 100x or more for that.
http://docs.cython.org/src/tutorial/strings.html
However, when it comes to dict lookups, it uses CPython's own dicts, which
are already heavily optimised for string lookups. So the speed-up in that
area will likely stay below 30%. Similarly, encoding and decoding use
Python's codecs, so don't expect a major difference there.
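
To illustrate the split between the two kinds of work, here is a minimal
pure-Python sketch (the function names are made up for this example, and it
is written to run uncompiled on a current Python). The per-character loop is
the sort of code a compiler can move into plain C, while the dict operations
end up in CPython's dict implementation either way.

```python
def count_words(text):
    """Count whitespace-separated words, case-insensitively."""
    counts = {}
    for word in text.split():
        key = word.lower()
        # The dict lookup and store below run inside CPython's C dict
        # implementation whether or not this module is compiled.
        counts[key] = counts.get(key, 0) + 1
    return counts


def count_vowels(text):
    """Count ASCII vowels with an explicit per-character loop."""
    total = 0
    for ch in text:
        # Character iteration and comparison: the kind of loop that
        # benefits most once the types are known at compile time.
        if ch in "aeiouAEIOU":
            total += 1
    return total
```

I would expect count_vowels() to gain far more from compilation than
count_words(), but that is an expectation to check with a benchmark, not a
measured result.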
> I have 3 use cases I'm considering for Python-to-C++
> cross-compilers for generating 32-bit Python extension modules
> for Python 2.7 for Windows.
>
> 1. Parsing UTF-8 files (basic Python with lots of string
> processing and dict lookups)
"Parsing" sounds like something that could easily benefit from Cython
compilation.
> 2. Generating UTF-8 files from nested list/dict structures
That should be much faster in Cython, too, simply because iteration on
builtin types is much faster than in Python.
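
As a rough picture of what such a generator tends to look like, here is a
small sketch that walks a nested list/dict structure and writes UTF-8 bytes.
The output layout (one indented line per key or value) and the name
write_nested are assumptions for illustration only, not your actual format.

```python
def write_nested(obj, out, depth=0):
    """Write a nested list/dict structure to a binary stream as UTF-8."""
    indent = "  " * depth
    if isinstance(obj, dict):
        # Sort keys so the output is reproducible.
        for key in sorted(obj):
            out.write((indent + str(key) + ":\n").encode("utf-8"))
            write_nested(obj[key], out, depth + 1)
    elif isinstance(obj, (list, tuple)):
        # List items stay at the current depth.
        for item in obj:
            write_nested(item, out, depth)
    else:
        out.write((indent + str(obj) + "\n").encode("utf-8"))
```

Nearly all of the time here goes into iterating over dicts and lists and
building small strings, which is the part I would expect compilation to
help with. The encode() calls still go through Python's codecs.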
> 3. Parsing large ASCII "CSV-like" files and using dict's to
> calculate simple statistics like running totals, min, max, etc.
Again, parsing will be much faster, especially when reading through C-level
file I/O instead of Python file objects (which would also enable releasing
the GIL, in case you want to use multi-threading). The dict-based statistics
may not win that much.
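
A minimal sketch of this third use case, assuming a comma-separated file
without a header row, with a key in the first column and a number in the
second (column_stats and its parameters are hypothetical names):

```python
def column_stats(lines, key_col=0, value_col=1, sep=","):
    """Return {key: [total, minimum, maximum, count]} for a CSV-like input."""
    stats = {}
    for line in lines:
        # Parsing step: splitting and converting fields.
        fields = line.rstrip("\r\n").split(sep)
        if len(fields) <= max(key_col, value_col):
            continue  # skip blank or short lines
        key = fields[key_col]
        value = float(fields[value_col])
        # Statistics step: plain dict lookups and updates.
        entry = stats.get(key)
        if entry is None:
            stats[key] = [value, value, value, 1]
        else:
            entry[0] += value
            if value < entry[1]:
                entry[1] = value
            if value > entry[2]:
                entry[2] = value
            entry[3] += 1
    return stats
```

The split-and-convert lines are the parsing part that should speed up. The
stats.get() and list updates are the dict-bound part where I would expect
the smaller gain.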
A nice feature of Cython is that you do not have to go low-level right
away. You can use all the niceness of Python, and only push the code closer
to C level where your benchmarks point you. And if you really have to go
all the way down to C, it's just a declaration away.
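
Finding out where the benchmarks point can start in plain Python with the
standard library profiler. The sketch below uses cProfile and pstats; the
tokenize/tally/process functions are stand-ins for your own code.

```python
import cProfile
import io
import pstats


def tokenize(text):
    return text.split()


def tally(tokens):
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts


def process(text):
    return tally(tokenize(text))


def profile_report(func, *args):
    """Run func(*args) under cProfile; return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show the ten entries with the highest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()
```

Calling profile_report(process, some_text) on a realistic input shows which
functions dominate the run time. Those are the candidates to move closer to
C first.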
> Are any of these text processing scenarios good use cases for
> tools like Cython, Pyrex, or ShedSkin? Are any of these
> specifically bad use cases for these tools?
Pyrex isn't worth trying here, simply because you'd have to invest a lot
more work to make it as fast as what Cython gives you anyway. ShedSkin may
be worth a try, depending on how well you get your ShedSkin module
integrated with CPython. (It seems that it has support for building
extension modules by now, but I have no idea how well that is fleshed out.)
> We've tried Psyco and it has sped up some of our parsing
> utilities by 200%. But Psyco doesn't support Python 2.7 yet and
> we're committed to using Python 2.7 moving forward.
If that 3x is not enough for you, I strongly suggest you try Cython. The C
code that it generates compiles cleanly against all major Python versions,
currently from 2.3 to 3.2.
Stefan