speeding up reading files (possibly with cython)

Sat Mar 7 20:16:51 EST 2009

On Mar 8, 9:06 am, per <perfr... at gmail.com> wrote:
> hi all,
>
> i have a program that essentially loops through a textfile file thats
> about 800 MB in size containing tab separated data... my program
> parses this file and stores its fields in a dictionary of lists.
>
> for line in file:
>   split_values = line.strip().split('\t')

line.strip() will strip all leading/trailing whitespace *including*
*tabs*, so empty fields at the start or end of a line silently
disappear. Not a good idea. Use line.rstrip('\n'); anything more is
losing data.
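
To make the difference concrete, here is a minimal sketch. The sample line is
made up for illustration; its first and last fields are empty.

```python
# Hypothetical input line: four fields, the first and last are empty.
line = '\tb\tc\t\n'

# strip() eats the leading and trailing tabs along with the newline.
fields_strip = line.strip().split('\t')

# rstrip('\n') removes only the line terminator.
fields_rstrip = line.rstrip('\n').split('\t')

print(fields_strip)   # ['b', 'c']
print(fields_rstrip)  # ['', 'b', 'c', '']
```

With strip() the row shrinks from four fields to two, so every later
column index is off for that line.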

>   # do stuff with split_values
>
> currently, this is very slow in python, even if all i do is break up
> each line using split() and store its values in a dictionary, indexing
> by one of the tab separated values in the file.
>
> is this just an overhead of python that's inevitable? do you guys
> think that switching to cython might speed this up, perhaps by
> optimizing the main for loop?  or is this not a viable option?

Not much point in using Cython IMO; loop overhead would be expected to
be a tiny part of the time.
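
If you want to check that on your own machine, a rough sketch along these
lines would do. The data is synthetic and held in memory (so disk I/O is not
measured), and the function names are just for illustration; the actual
numbers will vary.

```python
import timeit

# Synthetic stand-in for the real file: 100,000 tab-separated lines.
lines = ['key%d\tfoo\tbar\tbaz\n' % i for i in range(100000)]

def bare_loop():
    # Only the for-loop machinery, no work per line.
    for line in lines:
        pass

def split_and_store():
    # The loop plus the per-line work described in the question.
    table = {}
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        table[fields[0]] = fields[1:]
    return table

t_loop = timeit.timeit(bare_loop, number=5)
t_work = timeit.timeit(split_and_store, number=5)
print('bare loop: %.3fs  split+store: %.3fs' % (t_loop, t_work))
```

Comparing the two timings shows how much of the total is the loop itself
and how much is the splitting and dictionary work.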

Using the csv module is recommended. However a *WARNING*:

When you save as "Text (Tab delimited)", Excel unnecessarily quotes
fields that contain commas or quotes, and doubles any embedded quotes.

csv.reader(..., delimiter='\t') acts like Excel reading back its own
output and thus is likely to mangle any quotes that are actually part
of the data, if the writer did not use the same "protocol".

An 800MB file is unlikely to have been created by Excel :-) Presuming
your file was created using '\t'.join(list_of_strings) or equivalent,
you need to use csv.reader(..., delimiter='\t',
quoting=csv.QUOTE_NONE)

For example:
| >>> import csv
| >>> open('Excel_tab_delimited.txt', 'rb').read()
| 'normal\t"embedded,comma"\t"""Hello""embedded-quote"\r\n'
| >>> f = open('simple.tsv', 'wb')
| >>> f.write('normal\tembedded,comma\t"Hello"embedded-quote\r\n')
| >>> f.close()
| >>> list(csv.reader(open('Excel_tab_delimited.txt', 'rb'),
| ...                 delimiter='\t'))
| [['normal', 'embedded,comma', '"Hello"embedded-quote']]
| >>> list(csv.reader(open('simple.tsv', 'rb'), delimiter='\t'))
| [['normal', 'embedded,comma', 'Helloembedded-quote']]
| # Whoops!
| >>> list(csv.reader(open('simple.tsv', 'rb'), delimiter='\t',
| ...                 quoting=csv.QUOTE_NONE))
| [['normal', 'embedded,comma', '"Hello"embedded-quote']]
| # OK
| >>>
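
Putting it together for the dictionary-of-lists case, a sketch might look
like the following. It is written for Python 3 (the session above is Python
2), load_tsv and key_index are names I made up, and it assumes the key
column is unique; a repeated key would overwrite the earlier row.

```python
import csv
import io

def load_tsv(f, key_index=0):
    """Read tab-separated rows from f into a dict of lists keyed by one column."""
    table = {}
    # QUOTE_NONE: treat quote characters as ordinary data.
    reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:  # csv.reader yields [] for blank lines; skip them
            continue
        table[row[key_index]] = row
    return table

# In-memory stand-in for simple.tsv; newline='' leaves line endings to csv.
sample = io.StringIO('normal\tembedded,comma\t"Hello"embedded-quote\r\n',
                     newline='')
table = load_tsv(sample)
print(table['normal'])
# ['normal', 'embedded,comma', '"Hello"embedded-quote']
```

For a real file under Python 3 you would pass open(path, newline='')
instead of the StringIO object.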

HTH,
John