[SciPy-User] numpy I/O question
Zachary Pincus
zachary.pincus at yale.edu
Sun Jan 2 11:21:05 EST 2011
> These files are pipe-streams but when they are dumped they are about
> 50M.
>
> Replacement that you described requires O(N) (where N is line
> length) but
> C++ operator>> requires O(1) for the same parsing.
Reading the file into an array is still an O(N) operation, so if all
you care about is big-O complexity, there's no difference between
doing an O(N) search-and-replace followed by an O(N) load operation
versus an O(1) parsing followed by an O(N) load operation. O(2N) =
O(N), right?
But if you care about constant factors, why are you even proposing
regexp matching?
Have you even tried writing up the simple case search-and-replace to
determine whether it's too slow?
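
For reference, the simple case could look something like the sketch
below. It assumes the whole file fits in memory and that the fix is a
plain string substitution; the function name and the `old`/`new`
arguments are placeholders of mine, not anything from your files.

```python
import io


def replace_in_memory(path, old, new):
    """Read the whole file, replace every `old` with `new`,
    and return the result wrapped in a file-like object."""
    with open(path, "r") as f:
        text = f.read()
    # str.replace is a single O(N) pass over the text; no regexps involved.
    return io.StringIO(text.replace(old, new))
```

The returned StringIO object can be handed to numpy.loadtxt in place
of a filename, and timing this on one of the 50M dumps would show
whether any further optimization is worth the effort.
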
If you actually need to optimize the file reading (unlikely), perhaps
the fastest option will be to use the subprocess module to open a
pipeline to sed and then feed the stdout of that to numpy.loadtxt --
sed is well-optimized to have low constant factors.
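
A rough sketch of that pipeline, assuming sed is on the PATH and that
the fix can be written as a sed script (the helper name and the
example script below are made up for illustration):

```python
import subprocess


def sed_lines(path, script):
    """Run `sed script path` and yield its output one line at a time."""
    # text=True decodes sed's stdout to str (Python 3.7+).
    with subprocess.Popen(["sed", script, path],
                          stdout=subprocess.PIPE, text=True) as proc:
        # Leaving the with-block closes the pipe and waits for sed to exit.
        yield from proc.stdout
```

Since numpy.loadtxt accepts a generator of lines, something like
numpy.loadtxt(sed_lines("data.txt", "s/;/ /g")) should then parse the
cleaned-up stream without the fixed text ever touching the disk.
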
Indeed, these days disks are such a bottleneck that it can be faster
to read a gzipped file from disk and decompress it on the fly and
parse the contents than just to read the plain file from disk. But as
you say the input format is out of your hands. (And again, if speed
matters so much, why are the files ASCII text and not binary? But if
speed doesn't matter, why the concern about asymptotic complexity?)
Anyway, if for religious reasons sed is unacceptable, another decent
option if the files are too large for memory (which 50M is
emphatically not) would be to open the text file in chunks, do the
search-and-replace, and then cough up those chunks within an iterator
that acts as a file-like-object.
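
Something along these lines would do as a sketch. It assumes the text
to be replaced never spans a line break, so each chunk is extended to
the end of its last line before substituting; the names and the
default chunk size are arbitrary choices of mine.

```python
import io


def replaced_lines(path, old, new, chunk_size=1 << 20):
    """Yield the lines of `path` with `old` replaced by `new`,
    holding roughly `chunk_size` characters in memory at a time."""
    with open(path, "r") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Finish the current line so no replacement target is
            # split across two chunks.
            chunk += f.readline()
            # Hand the fixed chunk back one line at a time.
            for line in io.StringIO(chunk.replace(old, new)):
                yield line
```

A generator like this is not a full file object, but numpy.loadtxt
accepts a generator of lines, so it can be passed in directly.
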
> I will be asked 'why should we use python which even can't parse as
> good as
> c++ does?' `sed` isn't a solution.
This sounds like a personal problem. Sed is a perfectly decent
solution for reformatting broken text files, as is reformatting the
files internally to python before passing them to a numpy routine
designed to be flexible and fast at handling *delimited* text.
The fact that C++ has a particular feature that happens to work well
with your buggy input files doesn't mean that "python can't parse as
well as c++" -- but hey, if you think c++ is in general a better tool
than python or sed or perl or whatever for processing text files, go
for it.