[Tutor] concurrent file reading using python

Prasad, Ramit ramit.prasad at jpmorgan.com
Mon Mar 26 21:02:52 CEST 2012


> I want to utilize the power of cores on my server and read big files
> (> 50Gb) simultaneously by seeking to N locations. Process each
> separate chunk and merge the output. Very similar to MapReduce
> concept.
> 
> What I want to know is the best way to read a file concurrently. I
> have read about file-handle.seek() and os.lseek(), but I'm not sure
> if that's the way to go. Any use cases would be of help.
> 
> PS: I did find some links on Stack Overflow, but it was not clear to
> me whether I had found the right solution.
>

Have you done any testing in this space? I would assume 
you would be memory/IO bound and not CPU bound. Using 
multiple cores would not help non-CPU bound tasks.

I would try to write an initial program that does what
you want without attempting to optimize, and then do some
profiling to see whether you are waiting on the CPU
or whether you are (as I suspect) waiting on the hard disk / memory.
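
A quick-and-dirty way to check is to time one pass that only reads
the file against one pass that reads and processes it. A minimal
sketch (the path and process_line below are only placeholders for
your file and your per-line work):

import time

def process_line(line):
    return len(line)  # placeholder for the real per-line work

def timed_pass(path, do_work):
    # One sequential pass over the file, optionally doing the work.
    start = time.time()
    with open(path, "rb") as f:
        for line in f:
            if do_work:
                process_line(line)
    return time.time() - start

if __name__ == "__main__":
    path = "bigfile.dat"  # placeholder path
    read_only = timed_pass(path, do_work=False)
    read_and_work = timed_pass(path, do_work=True)
    print("read only: %.1fs, read + process: %.1fs"
          % (read_only, read_and_work))

If the two times are close, the job is IO bound and more cores will
not buy you much; if the second is much larger, the CPU is the
bottleneck and splitting the work across processes may help. With a
file bigger than RAM the OS cache should not skew the second pass
much, but running the two passes as separate invocations is safer.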

Actually, if you only need small chunks of the file at
a time and you iterate over the file (for line in file-handle:)
instead of using file-handle.readlines(), you will
probably be only IO bound because of the way Python file
handling works.
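
Roughly, the difference is between these two patterns (the file
name and the per-line work are placeholders):

# readlines() pulls the whole file into a list of strings; for a
# 50 GB file that exhausts memory long before the CPU matters.
with open("bigfile.dat") as f:
    lines = f.readlines()

# Iterating over the file object streams one line at a time, so
# memory stays small and the loop is mostly limited by the disk.
with open("bigfile.dat") as f:
    for line in f:
        pass  # do the per-line work here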

But either way, test first, then optimize. :)
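
If profiling ever does show the CPU as the bottleneck, the
seek-based split you describe could look roughly like the sketch
below. It is only an outline that assumes line-oriented data; the
path, the per-line work, and the final merge step are placeholders
you would need to replace:

import multiprocessing
import os

def process_chunk(args):
    # args is (path, start, end): handle every line that *starts*
    # at a byte offset in (start, end], or [0, end] for the first
    # chunk.
    path, start, end = args
    count = 0  # placeholder result: number of lines in this chunk
    with open(path, "rb") as f:
        f.seek(start)
        if start != 0:
            # Seeking may land mid-line; that line belongs to the
            # previous chunk, so skip ahead to the next line start.
            f.readline()
        while True:
            pos = f.tell()
            if pos > end:
                break
            line = f.readline()
            if not line:
                break
            count += 1  # replace with the real per-line work
    return count

def split_and_process(path, workers):
    size = os.path.getsize(path)
    # Chunk boundaries; each worker's end is the next worker's
    # start, so every line is owned by exactly one worker.
    bounds = [size * i // workers for i in range(workers)] + [size]
    jobs = [(path, bounds[i], bounds[i + 1]) for i in range(workers)]
    pool = multiprocessing.Pool(workers)
    results = pool.map(process_chunk, jobs)
    pool.close()
    pool.join()
    return sum(results)  # placeholder "merge": total line count

if __name__ == "__main__":
    # Placeholder path and worker count.
    print(split_and_process("bigfile.dat", 4))

Each worker skips the partial line at the start of its byte range
and reads one line past its end offset, so every line is handled
exactly once even when the boundaries fall mid-line. On a single
spinning disk this mostly just makes the heads seek back and forth,
which is why profiling first matters.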

Ramit


Ramit Prasad | JPMorgan Chase Investment Bank | Currencies Technology
712 Main Street | Houston, TX 77002
work phone: 713 - 216 - 5423


