list comprehension help

rkmr.em at
Mon Mar 19 05:58:33 CET 2007

On 3/18/07, Alex Martelli <aleax at> wrote:
> George Sakkis <george.sakkis at> wrote:
> > On Mar 18, 12:11 pm, "rkmr... at" <rkmr... at> wrote:
> > > I need to process a really huge text file (4GB) and this is what I
> > > need to do. It takes forever to complete this. I read somewhere that
> > > "list comprehension" can speed things up. Can you point out how to do it?
> > > f = open('file.txt','r')
> > > for line in f:
> > >         db[line.split(' ')[0]] = line.split(' ')[-1]
> > >         db.sync()
> > You got several good suggestions; one that has not been mentioned but
> > makes a big (or even the biggest) difference for large/huge files is
> > the buffering parameter of open(). Set it to the largest value you can
> > afford to keep the I/O as low as possible. I'm processing 15-25 GB
> > files (you see "huge" is really relative ;-)) on 2-4GB RAM boxes and
> > setting a big buffer (1GB or more) reduces the wall time by 30 to 50%
> > compared to the default value. BerkeleyDB should have a buffering
> Out of curiosity, what OS and FS are you using?  On a well-tuned FS and

Fedora Core 4 and ext3. Is there something I should do to the FS?

> OS combo that does "read-ahead" properly, I would not expect such
> improvements for moving from large to huge buffering (unless some other
> pesky process is perking up once in a while and sending the disk heads
> on a quest to never-never land).  IOW, if I observed this performance
> behavior on a server machine I'm responsible for, I'd look for
> system-level optimizations (unless I know I'm being forced by myopic
> beancounters to run inappropriate OSs/FSs, in which case I'd spend the
> time polishing my resume instead) - maybe tuning the OS (or mount?)
> parameters, maybe finding a way to satisfy the "other pesky process"
> without flapping disk heads all over the prairie, etc, etc.
> The delay of filling a "1 GB or more" buffer before actual processing
> can begin _should_ defeat any gains over, say, a 1 MB buffer -- unless,
> that is, something bad is seriously interfering with the normal
> read-ahead system level optimization... and in that case I'd normally be
> more interested in finding and squashing the "something bad", than in
> trying to work around it by overprovisioning application bufferspace!-)

Which should I do? How much buffer should I allocate? I have a box
with 2GB of memory.
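
For reference, the suggestions in this thread could be combined roughly as in the sketch below. It is only a sketch under stated assumptions, not a measured fix: `load_pairs`, `buffer_size` and `sync_every` are hypothetical names introduced here, `db` is assumed to be any dict-like object with a `sync()` method as in the loop quoted above, and the 1 MB default simply follows the "say, a 1 MB buffer" figure mentioned above. It splits each line once instead of twice, passes an explicit buffering value to open(), and syncs every N records instead of after every line.

```python
def load_pairs(path, db, buffer_size=1024 * 1024, sync_every=100000):
    """Store first field -> last field of every line of `path` in `db`.

    Assumptions (not from the original post): `db` is dict-like and has
    a sync() method; buffer_size and sync_every are illustrative values
    that would need to be tuned by measurement.
    """
    count = 0
    # An explicit buffering value asks open() for a read buffer of that size.
    f = open(path, 'r', buffer_size)
    try:
        for line in f:
            # Split once and reuse the result; the original loop split twice.
            fields = line.split(' ')
            # As in the original loop, the last field keeps its newline.
            db[fields[0]] = fields[-1]
            count += 1
            # Flush to disk periodically rather than on every record.
            if count % sync_every == 0:
                db.sync()
    finally:
        f.close()
    # Final sync so the tail of the file is written out too.
    db.sync()
    return count
```

Whether a larger `buffer_size` helps at all depends on the OS read-ahead behaviour discussed above, so it seems worth timing a few values on a slice of the real file before committing to one.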
