Implementing file reading in C/Python

Wed Jan 14 05:34:48 EST 2009

Johannes Bauer <dfnsonfsduifb at gmx.de> writes:

> Yup, I changed the Python code to behave the same way the C code did -
> however overall it's not much of an improvement: Takes about 15 minutes
> to execute (still factor 23).

Not sure this is completely fair if you're only looking for a pure
Python solution, but to be honest, looping through a gazillion
individual bytes of information sort of begs for trying to offload
that into a library that can execute faster, while maintaining the
convenience of Python outside of the pure number crunching.

I'd assume numeric/numpy might have applicable functions, but I don't
use those libraries much, whereas I've been using OpenCV recently for
a lot of image processing work, and it has matrix/histogram support,
which seems to be a good match for your needs.
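
For what it's worth, the numpy route would presumably look something
like the sketch below. I'm going from the numpy documentation here
(frombuffer, bincount, argmax) rather than from experience, it is
written in Python 3 terms where a file chunk is a bytes object, and
most_common_byte is just a name I made up for illustration. I have
not timed it against your data.

```python
import numpy as np

def most_common_byte(data):
    """Return the most frequent byte value (0-255) in a bytes chunk.

    Hypothetical helper, assuming `data` is a bytes object read from
    the file in binary mode.
    """
    # View the chunk as unsigned 8-bit integers without copying it.
    arr = np.frombuffer(data, dtype=np.uint8)
    # 256-bin histogram; the per-byte counting loop runs inside numpy.
    hist = np.bincount(arr, minlength=256)
    # argmax returns the first (lowest) index when several bins tie.
    return int(hist.argmax())

most = most_common_byte(b"abracadabra")   # 'a' occurs five times
```

The point is the same as with OpenCV below: the per-byte loop moves
out of the interpreter, and the rest of the script stays as it is.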

With OpenCV, assuming the library and the ctypes-opencv wrapper are
installed, add the following before the file I/O loop:

    from opencv import *

    # Histogram for each file chunk
    hist = cvCreateHist([256], CV_HIST_ARRAY, [(0,256)])

then, replace (using one of your posted methods as a sample):

    datamap = { }
    for i in data:
        datamap[i] = datamap.get(i, 0) + 1

    array = sorted([(b, a) for (a, b) in datamap.items()], reverse=True)
    most = ord(array[0][1])

with:

    matrix = cvMat(1, len(data), CV_8UC1, data)
    cvCalcHist([matrix], hist)
    most = cvGetMinMaxHistValue(hist,
                                min_val = False, max_val = False,
                                min_idx = False, max_idx = True)

should give you your results in a fraction of the time.  I didn't run
with a full size data file, but for a smaller one using smaller chunks
the OpenCV variant ran in about 1/10 of the time, and that was while
leaving all the remaining Python code in place.

Note that the results may not be identical to some of your other
methods when multiple values have the same count: the OpenCV
histogram min/max call will always pick the lower value in such
cases, whereas some of your code (such as the sample above) will pick
the upper value, and your original code depended on the order of
items returned by dict.items.
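
To make the tie case concrete, here is a small pure-Python sketch of
the two behaviours. It does not call OpenCV at all; the "lower value"
branch just imitates a histogram scanned from bin 0 upward, and the
sample data and variable names are made up for illustration. It
assumes Python 3, where iterating over a bytes object yields integers.

```python
from collections import Counter

data = b"aabbc"          # 'a' (97) and 'b' (98) both occur twice

counts = Counter(data)   # byte value -> number of occurrences

# Same idea as sorting (count, value) pairs in reverse order:
# on a tie the larger byte value sorts first.
upper = sorted(((n, b) for b, n in counts.items()), reverse=True)[0][1]

# Histogram-style scan from bin 0 upward: on a tie the first
# (smallest) byte value with the maximum count wins.
hist = [counts.get(b, 0) for b in range(256)]
lower = hist.index(max(hist))

print(upper, lower)      # prints "98 97"
```

If the exact choice matters for your application, it is probably
worth deciding on one rule and making every variant follow it before
comparing timings.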

This sort of small dedicated high performance choke point is probably
also perfect for something like Pyrex/Cython, although that would
require a compiler to build the extension for the histogram code.

-- David