parallel csv-file processing
Paul Rubin
Fri Nov 9 06:02:34 EST 2007
Michel Albert <exhuma at gmail.com> writes:
> buffer = []
> for line in reader:
>     buffer.append(line)
>     if len(buffer) == 1000:
>         f = job_server.submit(calc_scores, buffer)
>         buffer = []
>
> f = job_server.submit(calc_scores, buffer)
> buffer = []
>
> but would this not kill my memory if I start loading bigger slices
> into the "buffer" variable?
Why not pass the disk offsets to the job server (untested):
n = 1000
for i, _ in enumerate(reader):
    if i % n == 0:
        job_server.submit(calc_scores, reader.tell(), n)
the remote process seeks to the appropriate place and processes n lines
starting from there.
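
For completeness, a rough sketch of what the worker side could look like, assuming
calc_scores reopens the CSV file itself and is handed the byte offset plus the line
count; the path argument and the per-row "scoring" are placeholders, not anything
from the original post:

import csv

def calc_scores(offset, n, path='data.csv'):
    # Hypothetical worker: 'path' is a placeholder for wherever the CSV lives.
    # 'offset' should be a value previously returned by tell() on a file
    # opened the same way as this one.
    results = []
    f = open(path)
    try:
        f.seek(offset)                  # jump straight to this worker's chunk
        reader = csv.reader(f)
        for i, row in enumerate(reader):
            if i >= n:                  # process only this worker's n lines
                break
            results.append(len(row))    # placeholder for the real scoring
    finally:
        f.close()
    return results

That way each worker only ever holds its own n rows in memory, and the submitting
loop just hands out offsets instead of buffering rows itself.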