[SciPy-User] numpy I/O question

Sun Jan 2 11:21:05 EST 2011

> These files are pipe-streams but when they are dumped they are about  
> 50M.
>
> Replacement that you described requires O(N) (where N is line  
> length) but
> C++ operator>> requires O(1) for the same parsing.

Reading the file into an array is still an O(N) operation, so if all
you care about is big-O complexity, there's no difference between
doing an O(N) search-and-replace followed by an O(N) load operation
versus an O(1) parsing followed by an O(N) load operation. O(2N) =
O(N), right?

But if you care about constant factors, why are you even proposing  
regexp matching?

Have you even tried writing up the simple case search-and-replace to  
determine whether it's too slow?
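
For concreteness, a minimal sketch of that simple case might look like
the following. The delimiter names here are my assumptions, not
something from your files: I'm pretending the stray separator is ";"
and that whitespace is what loadtxt should see. The helper names
(clean_text, load_cleaned, time_replace) are made up for illustration.

```python
import io
import time


def clean_text(text, bad=";", good=" "):
    """Replace every occurrence of the stray delimiter in one pass."""
    # 'bad' and 'good' are assumed placeholders; substitute whatever
    # the real files actually contain.
    return text.replace(bad, good)


def load_cleaned(path, bad=";", good=" "):
    """Read the whole file, fix the delimiters, then parse with numpy."""
    import numpy as np  # imported here so clean_text() works without numpy

    with open(path) as f:
        text = clean_text(f.read(), bad, good)
    return np.loadtxt(io.StringIO(text))


def time_replace(n_lines=100_000):
    """Time only the replace step on synthetic data; returns seconds."""
    text = "1.0;2.0;3.0\n" * n_lines
    start = time.perf_counter()
    clean_text(text)
    return time.perf_counter() - start
```

Running time_replace() next to a timing of the loadtxt call on the same
data should show which of the two steps actually dominates.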

If you actually need to optimize the file reading (unlikely), perhaps  
the fastest option will be to use the subprocess module to open a  
pipeline to sed and then feed the stdout of that to numpy.loadtxt --  
sed is well-optimized to have low constant factors.
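
A rough sketch of the sed pipeline, under the same assumption that the
stray delimiter is ";" and should become a space (the function names
are hypothetical, and the sed expression is built naively, so it only
works when neither string contains "/" or regex metacharacters):

```python
import subprocess


def sed_command(bad=";", good=" "):
    """Build the sed argument list for a global substitution."""
    # Assumes 'bad' and 'good' contain no "/" and no regex metacharacters.
    return ["sed", "-e", f"s/{bad}/{good}/g"]


def load_via_sed(path, bad=";", good=" "):
    """Pipe the file through sed and hand sed's stdout to numpy.loadtxt."""
    import numpy as np  # imported here so sed_command() works without numpy

    cmd = sed_command(bad, good) + [path]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        # loadtxt accepts any file-like object that yields lines of text.
        data = np.loadtxt(proc.stdout)
    # Leaving the 'with' block waits for sed, so returncode is set here.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
    return data
```

If the data arrives on a pipe rather than from a file on disk, the same
idea should work by passing that stream as sed's stdin instead of
giving it a path.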

Indeed, these days disks are such a bottleneck that it can be faster  
to read a gzipped file from disk and decompress it on the fly and  
parse the contents than just to read the plain file from disk. But as  
you say the input format is out of your hands. (And again, if speed  
matters so much, why are the files ASCII text and not binary? But if  
speed doesn't matter, why the concern about asymptotic complexity?)

Anyway, if for religious reasons sed is unacceptable, another decent
option if the files are too large for memory (which 50M is
emphatically not) would be to read the text file in chunks, do the
search-and-replace on each chunk, and then cough up those chunks from
an iterator that acts as a file-like object.
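
Something along these lines would do, as a sketch rather than a
finished implementation. Again ";" and " " are assumed stand-ins for
the real delimiters, and cleaned_lines is a made-up name. A plain
generator of lines should be enough here, since numpy.loadtxt accepts
one in place of a file object.

```python
def cleaned_lines(fileobj, bad=";", good=" ", chunk_size=1 << 20):
    """Yield cleaned lines, reading at most chunk_size characters at a time."""
    tail = ""  # incomplete last line carried over from the previous chunk
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        lines = (tail + chunk).split("\n")
        # The final piece may be a partial line; hold it for the next chunk.
        tail = lines.pop()
        for line in lines:
            # Replacing per complete line means a multi-character 'bad'
            # can never be split across a chunk boundary.
            yield line.replace(bad, good)
    if tail:
        # The file did not end with a newline; flush the last line.
        yield tail.replace(bad, good)
```

Usage would be roughly numpy.loadtxt(cleaned_lines(open(path))), which
never holds more than one chunk of raw text in memory at a time (the
resulting array still has to fit, of course).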

> I will be asked 'why should we use python which even can't parse as  
> good as
> c++ does?' `sed` isn't a solution.

This sounds like a personal problem. Sed is a perfectly decent  
solution for reformatting broken text files, as is reformatting the  
files internally to python before passing them to a numpy routine  
designed to be flexible and fast at handling *delimited* text.

The fact that C++ has a particular feature that happens to work well  
with your buggy input files doesn't mean that "python can't parse as  
well as c++" -- but hey, if you think c++ is in general a better tool  
than python or sed or perl or whatever for processing text files, go  
for it.