speeding up reading files (possibly with cython)

Steven D'Aprano steve at pearwood.info
Sun Mar 8 00:27:00 EST 2009


per wrote:

> hi all,
> 
> i have a program that essentially loops through a text file that's
> about 800 MB in size containing tab separated data... my program
> parses this file and stores its fields in a dictionary of lists.
> 
> for line in file:
>   split_values = line.strip().split('\t')
>   # do stuff with split_values
> 
> currently, this is very slow in python, even if all i do is break up
> each line using split() and store its values in a dictionary, indexing
> by one of the tab separated values in the file.
> 
> is this just an overhead of python that's inevitable? do you guys
> think that switching to cython might speed this up, perhaps by
> optimizing the main for loop?  or is this not a viable option?

Any time I see large data structures, I always think of memory consumption
and paging. How much memory do you have? My back-of-the-envelope estimate
is that you need at least 1.2 GB to store the 800 MB of text, more if the
text is Unicode or if you're on a 64-bit system. If your computer only has
1GB of memory, it's going to be struggling; if it has 2GB, it might be a
little slow, especially if you're running other programs at the same time.

If that's the problem, the solution is: get more memory.
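One quick way to see where the memory goes is to measure the cost of a single
parsed line with sys.getsizeof (a rough sketch only; the sample line is made
up, it assumes Python 2.6 or later for sys.getsizeof, and it ignores the
dictionary's own overhead):

    import sys

    # A made-up sample line; your real rows will have their own fields.
    line = "field1\tfield2\tfield3\tfield4\n"
    split_values = line.strip().split('\t')

    # Rough per-line cost: the list object plus each string object it holds.
    overhead = sys.getsizeof(split_values) + sum(sys.getsizeof(s)
                                                 for s in split_values)
    print("%d bytes for one parsed line, not counting the dict itself"
          % overhead)

Each small Python string carries a few dozen bytes of per-object overhead on
top of its characters, which is why 800 MB of raw text can easily balloon
well past 1 GB once every field becomes its own string in a list.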

Apart from monitoring virtual memory use, another test you could do is to
see if the time taken to build the data structures scales approximately
linearly with the size of the data. That is, if it takes 2 seconds to read
80 MB of data and store it in lists, then it should take around 4 seconds to
do 160 MB and 20-30 seconds to do 800 MB. If your results are linear, then
there's probably nothing much you can do to speed it up, since the time is
probably dominated by file I/O.

On the other hand, if the time scales worse than linear, there may be hope
to speed it up.
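If you want to run that scaling test without cutting the file up by hand,
something like this rough timing harness would do. It's only a sketch: the
file name 'data.txt' and the choice of the first column as the dict key are
placeholders for whatever your real data uses.

    import time

    def time_parse(path, limit):
        # Parse just the first `limit` lines and return the elapsed time.
        table = {}
        start = time.time()
        f = open(path)
        for i, line in enumerate(f):
            if i >= limit:
                break
            split_values = line.strip().split('\t')
            # Index by the first field; substitute whichever column you key on.
            table.setdefault(split_values[0], []).append(split_values)
        f.close()
        return time.time() - start

    for limit in (100000, 200000, 400000, 800000):
        print("%8d lines: %.2f seconds" % (limit, time_parse('data.txt', limit)))

If the reported times roughly double as the line count doubles, you're
I/O-bound and Cython won't buy you much; if they grow faster than that,
something else (most likely paging) is getting in the way.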


-- 
Steven

