fast text processing
Alexis Gallagher
public at alexisgallagher.com
Tue Feb 21 03:19:23 EST 2006
(I tried to post this yesterday but I think my ISP ate it. Apologies if
this is a double-post.)
Is it possible to do very fast string processing in Python? My
bioinformatics application needs to scan very large ASCII files (80GB+),
compare adjacent lines, and conditionally do some further processing. I
believe the disk I/O is the main bottleneck, so for now that's what I'm
optimizing. What I have now is roughly as follows (on Python 2.3.5).
filehandle = open("data", 'r', buffering=1000)
lastLine = filehandle.readline()
for currentLine in filehandle.readlines():
    # split each of the two adjacent lines into tokens
    lastTokens = lastLine.strip().split(delimiter)
    currentTokens = currentLine.strip().split(delimiter)
    lastGeno = extract(lastTokens[0])
    currentGeno = extract(currentTokens[0])
    # prepare for next iteration
    lastLine = currentLine
    if lastGeno == currentGeno:
        table.markEquivalent(int(lastTokens[1]), int(currentTokens[1]))
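
(Incidentally, one variant would be to iterate over the file object
directly instead of calling readlines() -- as I understand it the file
iterator pulls lines lazily with its own read-ahead buffer, so the whole
80GB never has to sit in memory at once, though I don't know whether it
actually changes the i/o picture. Same delimiter, extract, and table as
above.)

filehandle = open("data", 'r', buffering=1000)
lastLine = filehandle.readline()
for currentLine in filehandle:    # lazy line-by-line iteration, no readlines()
    lastTokens = lastLine.strip().split(delimiter)
    currentTokens = currentLine.strip().split(delimiter)
    if extract(lastTokens[0]) == extract(currentTokens[0]):
        table.markEquivalent(int(lastTokens[1]), int(currentTokens[1]))
    lastLine = currentLine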
So on every iteration I'm allocating a pile of short-lived string objects
(from strip() and split()) -- this seems wasteful. What's the best way to
speed this up? Can I switch to some fast byte-oriented string library? Are
there optimizing compilers? Are there better ways to prep the file handle?
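
(For instance, is it worth just passing a much larger buffer size to
open()? Something like the line below -- though I haven't measured whether
the buffer size makes any real difference:)

filehandle = open("data", 'r', buffering=1024*1024)    # ~1 MB buffer instead of 1000 bytes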
Perhaps this is a job for C, but I am of that soft generation which
fears memory management. I'd need to learn how to do buffered reading in
C, how to wrap the C in python, and how to let the C call back into
python to call markEquivalent(). It sounds painful. I _have_ done some
benchmark comparisons of only the underlying line-based file reading
against a Common Lisp version, but I doubt I'm using the optimal
construct in either language so I hesitate to trust my results, and
anyway the interlanguage bridge will be even more obscure in that case.
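
(For concreteness, by "only the underlying line-based file reading" I mean
a bare loop along these lines, which just pulls lines off the file and
counts them without any other processing:)

count = 0
filehandle = open("data", 'r', buffering=1000)
for line in filehandle:    # read line by line, discard everything
    count += 1
filehandle.close()
print count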
Much obliged for any help,
Alexis