Processing large CSV files - how to maximise throughput?
Walter Hurry
walterhurry at lavabit.com
Sat Oct 26 04:53:03 EDT 2013
On Thu, 24 Oct 2013 18:38:21 -0700, Victor Hooi wrote:
> Hi,
>
> We have a directory of large CSV files that we'd like to process in
> Python.
>
> We process each input CSV, then generate a corresponding output CSV
> file.
>
> input CSV -> munging text, lookups etc. -> output CSV
>
> My question is, what's the most Pythonic way of handling this? (Which
> I'm assuming will also be the most efficient?)
>
> For the reading, I'd do something like:
>
> import csv
>
> with open('input.csv', 'r') as infile, open('output.csv', 'w') as outfile:
>     reader = csv.DictReader(infile)
>     csv_writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
>     csv_writer.writeheader()
>     for line in reader:
>         # Do some processing for that line...
>         result = process_line(line)
>         # Write the processed row to the output file
>         csv_writer.writerow(result)
>
> So for the reading, it'll iterate over the lines one by one, and won't
> read the whole file into memory, which is good.
>
> For the writing - my understanding is that it writes a line to the file
> object on each loop iteration; however, this only gets flushed to disk
> every now and then, based on my system's default buffer size, right?
>
> So if the output file is going to get large, is there anything I need
> to take into account to conserve memory?
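
As far as memory goes, no: the writes are buffered by Python and the OS,
so memory use stays bounded no matter how big the output file gets. If
you want tighter control over when data actually reaches disk, you can
pass an explicit buffer size to open() or flush yourself. A minimal
sketch (the 1 MB buffer size is just illustrative):

    import os

    # Third argument to open() is the buffer size (1 MB here).
    with open('output.csv', 'w', 1024 * 1024) as outfile:
        outfile.write('some,row,data\n')
        # Push buffered data to the OS, then on to disk, if you really
        # need it persisted at this point.
        outfile.flush()
        os.fsync(outfile.fileno())
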
>
> Also, if I'm trying to maximise throughput of the above, is there
> anything I could try? The processing in process_line is quite light -
> just a bunch of string splits and regexes.
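
One cheap win for per-line work like that is to compile the regexes once,
outside the loop, rather than passing pattern strings to re.sub()/re.search()
on every row. A rough sketch (the pattern and the 'description' field are
made up for illustration):

    import re

    # Compile once at module level, not once per row.
    WHITESPACE_RE = re.compile(r'\s+')

    def process_line(line):
        # 'line' is the dict that DictReader yields for one row.
        line['description'] = WHITESPACE_RE.sub(' ', line['description']).strip()
        return line
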
>
> If I have multiple large CSV files to deal with, and I'm on a multi-core
> machine, is there anything else I can do to boost throughput?
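
Since each file is independent, the easy way to use the extra cores is one
file per worker process, e.g. with multiprocessing.Pool. A minimal sketch,
assuming a process_file(path) function that does the whole read/munge/write
for a single input file:

    import glob
    import multiprocessing

    def process_file(path):
        # Read 'path', munge each row, write the corresponding output CSV.
        # (Same per-file logic as above; body omitted here.)
        return path

    if __name__ == '__main__':
        files = glob.glob('input_dir/*.csv')   # path pattern is illustrative
        pool = multiprocessing.Pool()          # one worker per core by default
        for finished in pool.imap_unordered(process_file, files):
            print(finished)
        pool.close()
        pool.join()
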
>
I'm guessing that the idea is to load the output CSV into a database.
If that's the case, why not load the input CSV into some kind of staging
table in the database first, and do the processing there?
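
For instance, with the sqlite3 module from the standard library (the table
and column names below are made up, and assume a two-column input file), you
could bulk-load the raw rows into a staging table and do the munging and
lookups in SQL:

    import csv
    import sqlite3

    conn = sqlite3.connect('staging.db')
    conn.execute('CREATE TABLE IF NOT EXISTS staging (col_a TEXT, col_b TEXT)')

    with open('input.csv') as infile:
        reader = csv.reader(infile)
        next(reader)                  # skip the header row
        conn.executemany('INSERT INTO staging VALUES (?, ?)', reader)

    conn.commit()
    # ...then run the transformations/joins against 'staging' in SQL.
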