CSV performance

Peter Otten __peter__ at web.de
Mon Apr 27 16:49:29 CEST 2009

grocery_stocker wrote:

> On Apr 27, 5:15 am, Peter Otten <__pete... at web.de> wrote:
>> psaff... at googlemail.com wrote:
>> > I'm using the CSV library to process a large amount of data - 28
>> > files, each of 130MB. Just reading in the data from one file and
>> > filing it into very simple data structures (numpy arrays and a
>> > cstringio) takes around 10 seconds. If I just slurp one file into a
>> > string, it only takes about a second, so I/O is not the bottleneck. Is
>> > it really taking 9 seconds just to split the lines and set the
>> > variables?
>> > Is there some way I can improve the CSV performance?
>> My ideas:
>> (1) Disable cyclic garbage collection while you read the file into your
>> data structure:
>> import gc
>> gc.disable()
>> # create many small objects that you want to keep
>> gc.enable()
>> (2) If your data contains only numerical data without quotes use
>> numpy.fromfile()
> How would disabling the cyclic garbage collection make it go faster in
> this case?

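CPython starts a cyclic garbage collection whenever the number of container
objects allocated exceeds the number deallocated by a threshold. When you
create many objects and don't release any, the collector therefore runs again
and again, looking for reference cycles that aren't there. When you know that
you actually want to keep all those objects you can temporarily disable
garbage collection. E.g.: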

$ cat gcdemo.py
import time
import sys
import gc

def main(float=float):
    if "-d" in sys.argv:
        gc.disable()
        status = "disabled"
    else:
        status = "enabled"
    all = []
    append = all.append
    start = time.time()
    floats = ["1.234"] * 10
    assert len(set(map(id, map(float, floats)))) == len(floats)
    for _ in xrange(10**6):
        append(map(float, floats))
    print time.time() - start, "(garbage collection %s)" % status

main()

$ python gcdemo.py -d
11.6144971848 (garbage collection disabled)
$ python gcdemo.py
15.5317759514 (garbage collection enabled)
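
To check that the collector really is triggered repeatedly while such a
structure is being built, one can count its runs. The following is only a
rough sketch, written for Python 3 (unlike the demo above) and relying on the
gc.callbacks hook; count_collections and its parameters are hypothetical
names made up for this illustration.

```python
import gc


def count_collections(n_rows=50000, n_cols=10):
    """Build n_rows lists of floats that are all kept alive and
    return (number of rows, number of collector runs observed)."""
    runs = []

    def on_gc(phase, info):
        # The hook is called at the start and at the end of every
        # collection; only the "start" events are counted here.
        if phase == "start":
            runs.append(info["generation"])

    gc.callbacks.append(on_gc)
    try:
        rows = [[float("1.234") for _ in range(n_cols)]
                for _ in range(n_rows)]
    finally:
        # Always unregister the hook, even if building the rows fails.
        gc.callbacks.remove(on_gc)
    return len(rows), len(runs)


if __name__ == "__main__":
    print(count_collections())
```

With collection enabled the second number should be well above zero; after
gc.disable() it should be zero, because no automatic collections start.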

Of course I don't know whether this is actually a problem for the OP's code.
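
If it does turn out to matter, the disable/enable pair can be wrapped so that
collection is switched back on even when parsing raises an exception. A
minimal sketch for Python 3, assuming purely numeric, comma-separated lines;
gc_paused and load_rows are hypothetical helpers and not part of the csv or
gc modules.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Disable cyclic garbage collection inside the with-block and
    restore the previous state afterwards (hypothetical helper)."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        # Re-enable only if collection was on before entering the block.
        if was_enabled:
            gc.enable()


def load_rows(lines):
    """Parse comma-separated numeric lines into lists of floats while
    the collector is paused (assumes no quoting and no header line)."""
    with gc_paused():
        return [[float(field) for field in line.split(",")]
                for line in lines]


if __name__ == "__main__":
    print(load_rows(["1.5,2.5", "3,4"]))
```

Reference counting still frees objects as usual while the collector is
paused; only the search for reference cycles is suspended.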

