Processing large CSV files - how to maximise throughput?
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Thu Oct 24 23:19:53 EDT 2013
On Thu, 24 Oct 2013 18:38:21 -0700, Victor Hooi wrote:
> Hi,
>
> We have a directory of large CSV files that we'd like to process in
> Python.
>
> We process each input CSV, then generate a corresponding output CSV
> file.
>
> input CSV -> munging text, lookups etc. -> output CSV
>
> My question is, what's the most Pythonic way of handling this? (Which
> I'm assuming
Start with the simplest thing that could work:
for in_name, out_name in zip(input_list, output_list):
    with open(in_name) as infile, open(out_name, 'w') as outfile:
        for line in infile:
            outfile.write(munge(line))  # munge() = your per-line processing
sort of thing. If and only if it isn't fast enough, then try to speed it
up.
> For the reading, I'd do something like:
>
> with open('input.csv', 'r') as input, open('output.csv', 'w') as output:
>     csv_writer = DictWriter(output)
>     for line in DictReader(input):
>         # Do some processing for that line...
>         output = process_line(line)
>         # Write output to file
>         csv_writer.writerow(output)
Looks good to me!
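The only wrinkle is that DictWriter needs to be told the fieldnames up
front, and re-using the name "output" for both the file object and the
processed row is asking for confusion. A fleshed-out sketch (process_line
is just a stand-in for your real munging, and I've assumed the output
columns match the input ones):

import csv

def process_line(row):
    # Stand-in for your real string splits and regexes.
    return row

# Python 2, as per the docs link below, so csv files are opened in binary mode.
with open('input.csv', 'rb') as infile, open('output.csv', 'wb') as outfile:
    reader = csv.DictReader(infile)
    # DictWriter requires the output field names up front.
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        writer.writerow(process_line(row))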
> So for the reading, it'll iterate over the lines one by one, and won't
> read it all into memory, which is good.
>
> For the writing - my understanding is that it writes a line to the file
> object each loop iteration, however, this will only get flushed to disk
> every now and then, based on my system default buffer size, right?
Have you read the csv_writer documentation?
http://docs.python.org/2/library/csv.html#writer-objects
Unfortunately it is pretty light documentation, but it seems that
writer.writerow *probably* just calls write on the underlying file object
immediately.
But there's really no way to tell when the data hits the disk platter:
the Python file object could be doing caching, the OS could be doing
caching, the file system could be doing caching, and even the disk itself
could be doing caching. Really, the only way to be sure that the data has
hit the disk platter is to call os.sync(), and even then some hard drives
lie and report that they're synced when in fact the data is still in
volatile cache. Bad hard drive, no biscuit.
But, really, do you need to care? Better to buy better hard drives (e.g.
server grade), or use software RAID with two different brands (so their
failure characteristics will be different), or just be prepared to re-
process a batch of files if the power goes out mid-run. (You do have a
UPS, don't you?)
Anyway, I wouldn't bother about calling os.sync directly, the OS will
sync when it needs to. But if you need it, it's there. Or you can call
flush() on the output file, which is a bit less invasive than calling
os.sync. But really I wouldn't bother. Let the OS handle it.
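If you do ever need it, a minimal sketch looks something like this, using
os.fsync() (which, unlike os.sync(), exists on Python 2 as well):

import os

with open('output.csv', 'w') as outfile:
    outfile.write('spam,eggs,42\n')
    outfile.flush()              # push Python's own buffer out to the OS
    os.fsync(outfile.fileno())   # ask the OS to push its buffers to the disk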
> So if the output file is going to get large, there isn't anything I need
> to take into account for conserving memory?
I shouldn't think so.
> Also, if I'm trying to maximise throughput of the above, is there
> anything I could try? The processing in process_line is quite light -
> just a bunch of string splits and regexes.
>
> If I have multiple large CSV files to deal with, and I'm on a multi-core
> machine, is there anything else I can do to boost throughput?
Since this is likely to be I/O-bound, you could use threads. Each thread
is responsible for reading the file, processing it, then writing it back
again. Have you used threads before? If not, start here:
http://www.ibm.com/developerworks/aix/library/au-threadingpython/
http://pymotw.com/2/threading/
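A minimal sketch of the one-thread-per-file version, reusing munge() and
the input_list/output_list names from above (all of which are stand-ins
for your real code):

import threading

def process_file(in_name, out_name):
    # Read, munge and write one CSV file; the real work goes in munge().
    with open(in_name) as infile, open(out_name, 'w') as outfile:
        for line in infile:
            outfile.write(munge(line))

threads = []
for in_name, out_name in zip(input_list, output_list):
    t = threading.Thread(target=process_file, args=(in_name, out_name))
    t.start()
    threads.append(t)

for t in threads:
    t.join()

If you have a lot of files, you'd want to cap the number of simultaneous
threads rather than start one per file, but the idea is the same.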
If the amount of processing required becomes heavier, and the task
becomes CPU-bound, you can either:
- move to an implementation of Python without the GIL, like
IronPython or Jython;
- or use multiprocessing.
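With multiprocessing, a Pool of workers maps neatly onto a list of files.
A sketch, reusing the hypothetical process_file() from the threading
example above:

import multiprocessing

def process_pair(pair):
    # Pool.map passes a single argument, so unpack the (input, output) pair.
    in_name, out_name = pair
    process_file(in_name, out_name)

if __name__ == '__main__':   # the guard matters on Windows
    pool = multiprocessing.Pool()   # defaults to one worker per CPU core
    pool.map(process_pair, zip(input_list, output_list))
    pool.close()
    pool.join()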
--
Steven