Processing large CSV files - how to maximise throughput?
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Thu Oct 24 23:19:53 EDT 2013
On Thu, 24 Oct 2013 18:38:21 -0700, Victor Hooi wrote:
> Hi,
>
> We have a directory of large CSV files that we'd like to process in
> Python.
>
> We process each input CSV, then generate a corresponding output CSV
> file.
>
> input CSV -> munging text, lookups etc. -> output CSV
>
> My question is, what's the most Pythonic way of handling this? (Which
> I'm assuming
Start with the simplest thing that could work:
for in_name, out_name in zip(input_list, output_list):
    with open(in_name) as infile, open(out_name, 'w') as outfile:
        for line in infile:
            outfile.write(munge(line))  # munge() = your per-line processing
sort of thing. If and only if it isn't fast enough, then try to speed it
up.
> For the reading, I'd do something like:
>
> with open('input.csv', 'r') as input, open('output.csv', 'w') as output:
>     csv_writer = DictWriter(output)
>     for line in DictReader(input):
>         # Do some processing for that line...
>         output = process_line(line)
>         # Write output to file
>         csv_writer.writerow(output)
Looks good to me!
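The only wrinkle is that DictWriter needs to be told the fieldnames up
front, and re-using the name "output" for both the file object and the
processed row is asking for confusion. A fleshed-out sketch (process_line
is just a stand-in for your real munging, and I've assumed the output
columns match the input ones):

import csv

def process_line(row):
    # Stand-in for your real string splits and regexes.
    return row

# Python 2, as per the docs link below, so csv files are opened in binary mode.
with open('input.csv', 'rb') as infile, open('output.csv', 'wb') as outfile:
    reader = csv.DictReader(infile)
    # DictWriter requires the output field names up front.
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        writer.writerow(process_line(row))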
> So for the reading, it'll iterate over the lines one by one, and won't
> read it all into memory, which is good.
>
> For the writing - my understanding is that it writes a line to the file
> object each loop iteration, however, this will only get flushed to disk
> every now and then, based on my system default buffer size, right?
Have you read the csv_writer documentation?
http://docs.python.org/2/library/csv.html#writer-objects
Unfortunately it is pretty light documentation, but it seems that
writer.writerow *probably* just calls write on the underlying file object
immediately.
But there's really no way to tell when the data hits the disk platter:
the Python file object could be doing caching, the OS could be doing
caching, the file system could be doing caching, and even the disk itself
could be doing caching. Really, the only way to be sure that the data has
hit the disk platter is to call os.sync(), and even then some hard drives
lie and report that they're synced when in fact the data is still in
volatile cache. Bad hard drive, no biscuit.
But, really, do you need to care? Better to buy better hard drives (e.g.
server grade), or use software RAID with two different brands (so their
failure characteristics will be different), or just be prepared to re-
process a batch of files if the power goes out mid-run. (You do have a
UPS, don't you?)
Anyway, I wouldn't bother about calling os.sync directly, the OS will
sync when it needs to. But if you need it, it's there. Or you can call
flush() on the output file, which is a bit less invasive than calling
os.sync. But really I wouldn't bother. Let the OS handle it.
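If you do ever need it, a minimal sketch looks something like this, using
os.fsync() (which, unlike os.sync(), exists on Python 2 as well):

import os

with open('output.csv', 'w') as outfile:
    outfile.write('spam,eggs,42\n')
    outfile.flush()              # push Python's own buffer out to the OS
    os.fsync(outfile.fileno())   # ask the OS to push its buffers to the disk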
> So if the output file is going to get large, there isn't anything I need
> to take into account for conserving memory?
I shouldn't think so.
> Also, if I'm trying to maximise throughput of the above, is there
> anything I could try? The processing in process_line is quite light -
> just a bunch of string splits and regexes.
>
> If I have multiple large CSV files to deal with, and I'm on a multi-core
> machine, is there anything else I can do to boost throughput?
Since this is likely to be I/O-bound, you could use threads. Each thread
is responsible for reading the file, processing it, then writing it back
again. Have you used threads before? If not, start here:
http://www.ibm.com/developerworks/aix/library/au-threadingpython/
http://pymotw.com/2/threading/
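A minimal sketch of the one-thread-per-file version, reusing munge() and
the input_list/output_list names from above (all of which are stand-ins
for your real code):

import threading

def process_file(in_name, out_name):
    # Read, munge and write one CSV file; the real work goes in munge().
    with open(in_name) as infile, open(out_name, 'w') as outfile:
        for line in infile:
            outfile.write(munge(line))

threads = []
for in_name, out_name in zip(input_list, output_list):
    t = threading.Thread(target=process_file, args=(in_name, out_name))
    t.start()
    threads.append(t)

for t in threads:
    t.join()

If you have a lot of files, you'd want to cap the number of simultaneous
threads rather than start one per file, but the idea is the same.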
If the amount of processing required becomes heavier, and the task
becomes CPU-bound, you can either:
- move to an implementation of Python without the GIL, like
IronPython or Jython;
- or use multiprocessing.
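With multiprocessing, a Pool of workers maps neatly onto a list of files.
A sketch, reusing the hypothetical process_file() from the threading
example above:

import multiprocessing

def process_pair(pair):
    # Pool.map passes a single argument, so unpack the (input, output) pair.
    in_name, out_name = pair
    process_file(in_name, out_name)

if __name__ == '__main__':   # the guard matters on Windows
    pool = multiprocessing.Pool()   # defaults to one worker per CPU core
    pool.map(process_pair, zip(input_list, output_list))
    pool.close()
    pool.join()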
--
Steven