[Tutor] concurrent file reading using python

Abhishek Pratap abhishek.vit at gmail.com
Tue Mar 27 00:46:53 CEST 2012


Thanks Walter and Steven for the insight. I guess I will post my
question to the main Python mailing list and see if people have
anything to say.

-Abhi

On Mon, Mar 26, 2012 at 3:28 PM, Walter Prins <wprins at gmail.com> wrote:
> Abhi,
>
> On 26 March 2012 19:05, Abhishek Pratap <abhishek.vit at gmail.com> wrote:
>> I want to utilize the power of the cores on my server and read big
>> files (> 50 GB) simultaneously by seeking to N locations, then
>> process each chunk separately and merge the output. This is very
>> similar to the MapReduce concept.
>>
>> What I want to know is the best way to read a file concurrently. I
>> have read about file-handle.seek() and os.lseek(), but I am not sure
>> if that's the way to go. Any use cases would be of help.
>
> Your idea won't work.  Reading from disk is not a CPU-bound process;
> it's an I/O-bound process.  That is, the speed at which you can read
> from a conventional mechanical hard disk drive is not constrained by
> how fast your CPU is, but generally by how fast your disk(s) can read
> data from the disk surface, which is limited by the rotation speed and
> areal density of the data on the disk (and the seek time), and by how
> fast it can shovel the data down its I/O bus.  And *that* speed is
> still orders of magnitude slower than your RAM and your CPU.  So, in
> reality even just one of your cores will spend the vast majority of
> its time waiting for the disk when reading your 50GB file.  There's
> therefore *no* way to make your file reading faster by using more
> CPU cores -- the only way is by improving your disk I/O
> throughput.  You can for example stripe several hard disks together in
> RAID0 (but that increases the risk of data loss due to data being
> spread over multiple drives) and/or ensure you use a faster I/O
> subsystem (move to SATA3 if you're currently using SATA2 for example),
> and/or use faster hard disks (10,000 or 15,000 RPM instead of
> 7,200), or switch to SSD (solid state) disks.  Most of these options
> will cost you a fair bit of money though, so consider these thoughts
> in that light.
>
> Walter
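
For reference, the seek-based chunking asked about above could be
sketched roughly as follows. This is only an illustration, not a
recommendation from the thread: the function names (chunk_offsets,
count_lines, parallel_line_count) and the line-counting workload are
made up for the example, and it assumes a newline-delimited file. As
Walter points out, the disk remains the bottleneck for the read itself,
so a layout like this can only help when the per-chunk processing is
CPU-heavy.

```python
import os
from multiprocessing import Pool


def chunk_offsets(path, n_chunks):
    """Return up to n_chunks (start, end) byte ranges aligned to line starts."""
    size = os.path.getsize(path)
    if size == 0:
        return []
    step = max(1, size // n_chunks)
    bounds = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            target = i * step
            if target <= bounds[-1]:
                continue              # a long line already carried us past here
            if target >= size:
                break
            f.seek(target)
            f.readline()              # move forward to the next line start
            pos = f.tell()
            if pos >= size:
                break
            bounds.append(pos)
    bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))


def count_lines(task):
    """Hypothetical per-chunk work: count newlines in bytes [start, end)."""
    path, start, end = task
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        remaining = end - start
        while remaining > 0:
            block = f.read(min(1 << 20, remaining))   # read at most 1 MiB
            if not block:
                break
            count += block.count(b"\n")
            remaining -= len(block)
    return count


def parallel_line_count(path, n_workers=4):
    """Map count_lines over the chunks in worker processes, then merge."""
    tasks = [(path, start, end) for start, end in chunk_offsets(path, n_workers)]
    if not tasks:
        return 0
    with Pool(n_workers) as pool:
        return sum(pool.map(count_lines, tasks))
```

Each worker opens its own file handle, so the seeks do not interfere
with one another. On platforms that start worker processes by spawning
a fresh interpreter, the call to parallel_line_count() should sit under
an `if __name__ == "__main__":` guard.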

