[Python-Dev] Python 3.0.1 (io-in-c)

Wed Jan 28 11:55:16 CET 2009

Hello,

Raymond Hettinger <python <at> rcn.com> writes:
> 
> >                                        MB/S     MB/S    MB/S
> >                                        in C  in py3k  in 2.7 C/3k 2.7/3k
> > ** Text append **
> >  10M write 1e6 units at a time        261.00 218.000 1540.000 1.20  7.06
> >  20K write one unit at a time           0.983  0.081    1.33 12.08 16.34
> > 400K write 20 units at a time          16.000  1.510   22.90 10.60 15.17
> > 400K write 4096 units at a time       236.00 118.000 1244.000 2.00 10.54
> 
> Do you know why the text-appends fell off so much in the 1st and last cases?

When writing large chunks of text (4096, 1e6), bookkeeping costs become
marginal and encoding costs dominate. 2.x has no encoding costs, which
explains why it's so much faster.

A quick test tells me utf-8 encoding runs at 280 MB/s on this dataset (the
400 KB text file). Since the C version already writes at 236-261 MB/s in
those two cases, there is not much left to optimize on large writes.
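
For reference, a rough way to measure raw encoding throughput could look like
the sketch below. It is not the test used for the figure above: the helper
name encode_throughput_mb_s, the sample text and the repeat count are all
made up for illustration, and the result depends heavily on the machine and
on the text contents.

```python
import time

def encode_throughput_mb_s(text, encoding="utf-8", repeat=20):
    """Return the approximate encoding speed, in MB of encoded output per second."""
    nbytes = len(text.encode(encoding))  # size of one encoded copy
    start = time.perf_counter()
    for _ in range(repeat):
        text.encode(encoding)
    # Guard against a zero interval on very small inputs.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return (nbytes * repeat) / elapsed / 1e6

if __name__ == "__main__":
    # Hypothetical sample: roughly 400 KB of ASCII-only lines.
    sample = "some text for the benchmark\n" * 15000
    print("%.1f MB/s" % encode_throughput_mb_s(sample))
```

Text with many non-ASCII characters would likely give a different number,
so such a measurement is only an upper bound for the dataset it was run on.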

> > ** Text input **
> >  10M read whole contents at once       89.700 68.700  966.000 1.31 14.06
> >  20K read whole contents at once      108.000 70.500 1196.000 1.53 16.96
>            ...
> > 400K read one line at a time           71.700  3.690  207.00 19.43 56.10
>           ...
> > 400K read whole contents at once      112.000 81.000  841.000 1.38 10.38
> > 400K seek forward 1000 units at a time 87.400 67.300  589.000 1.30  8.75
> > 400K seek forward one unit at a time    0.090  0.071    0.873 1.28 12.31
> 
> Looks like most of these still have substantial falloffs in performance.
> Is this part still a work in progress or is this as good as it's going to get?

There is nothing obvious left to optimize in the read() department.
Decoding and newline translation costs dominate. Decoding has already been
optimized for the most popular encodings in py3k:
http://mail.python.org/pipermail/python-checkins/2009-January/077024.html

Newline translation follows a fast path depending on various heuristics.

I also took particular care of the "read one line at a time" scenario because
it's the most likely idiom when reading a text file. I think there is hardly
anything left to optimize on this one. Your eyes are welcome, though.

Note that the benchmark is run with the following default settings for text
I/O: utf-8 encoding, universal newlines enabled, text containing only "\n" 
newlines.
You can play with settings here:
http://svn.python.org/view/sandbox/trunk/iobench/
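
As a sketch of what these settings correspond to at the open() level (this
is not the iobench code itself; the function name count_lines and the
temporary file are hypothetical), reading one line at a time with utf-8
decoding and universal newlines looks roughly like this:

```python
import os
import tempfile

def count_lines(path):
    """Iterate over a text file one line at a time and count the lines."""
    # newline=None turns on universal newlines: "\r\n" and "\r" are
    # translated to "\n" when reading.
    with open(path, "r", encoding="utf-8", newline=None) as f:
        n = 0
        for line in f:
            n += 1
    return n

if __name__ == "__main__":
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # Write a file containing only "\n" newlines, as in the benchmark.
        with open(path, "w", encoding="utf-8", newline="\n") as f:
            f.write("spam\n" * 1000)
        print(count_lines(path))
    finally:
        os.remove(path)
```

Changing the encoding or newline arguments here is a quick way to see how
much each of them contributes on a given file.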

Text seek() and tell(), on the other hand, are known to be slow and could
perhaps be improved. The assumption, however, is that they won't be used
much on text files.
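
To illustrate what is involved, here is a small sketch of a text-mode
tell()/seek() round trip (the helper name reread_from_mark is hypothetical
and not taken from the benchmark). In text mode, tell() returns an opaque
number that is only meant to be passed back to seek(); as far as I
understand, it has to account for the decoder state, which is part of why it
costs more than on a binary file.

```python
import io

def reread_from_mark(f):
    """Read a line, remember the position, read the next line, then reread it."""
    f.readline()           # skip the first line
    mark = f.tell()        # opaque cookie, not a character index
    second = f.readline()
    f.seek(mark)           # go back to the remembered position
    return second, f.readline()

if __name__ == "__main__":
    raw = io.BytesIO("h\u00e9llo\nw\u00f6rld\nlast\n".encode("utf-8"))
    f = io.TextIOWrapper(raw, encoding="utf-8", newline=None)
    print(reread_from_mark(f))
```

Note that readline() is used rather than iteration, since tell() is disabled
while a file is being iterated over with next().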

> > ** Text overwrite **
> >  20K modify one unit at a time          0.296  0.072    1.320 4.09 18.26
> > 400K modify 20 units at a time          5.690  1.360   22.500 4.18 16.54
> > 400K modify 4096 units at a time      151.000 88.300  509.000 1.71  5.76
> 
> Same question on this batch.

There seems to be some additional overhead in this case. Perhaps it could be
improved; I'll have to take a look. But I doubt overwriting chunks of text
is a common scenario.
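
For readers unsure what "overwriting" means here, the following is a minimal
sketch of an in-place overwrite in text mode. It is an assumption about the
kind of operation being measured, not the benchmark's actual code, and the
helper name overwrite_start is made up. It only behaves predictably when the
old and new text encode to the same number of bytes, e.g. ASCII-only text.

```python
import os
import tempfile

def overwrite_start(path, chunk):
    """Overwrite the beginning of an existing text file in place."""
    # "r+" opens for reading and writing without truncating the file.
    with open(path, "r+", encoding="utf-8", newline="\n") as f:
        f.seek(0)
        f.write(chunk)  # replaces existing data rather than inserting

if __name__ == "__main__":
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8", newline="\n") as f:
            f.write("aaaaaaaa\n")
        overwrite_start(path, "bbb")
        with open(path, "r", encoding="utf-8") as f:
            print(repr(f.read()))
    finally:
        os.remove(path)
```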

Regards

Antoine.