Abuse of subject, was Re: Abuse of Big Oh notation
wxjmfauth at gmail.com
Tue Aug 21 13:16:06 EDT 2012
On Tuesday, 21 August 2012 at 09:52:09 UTC+2, Peter Otten wrote:
> wxjmfauth at gmail.com wrote:
>
> > By chance and luckily, first attempt.
> >
> > c:\python32\python -m timeit "('€'*100+'€'*100).replace('€', 'œ')"
> > 1000000 loops, best of 3: 1.48 usec per loop
> > c:\python33\python -m timeit "('€'*100+'€'*100).replace('€', 'œ')"
> > 100000 loops, best of 3: 7.62 usec per loop
>
> OK, that is roughly a factor of 5. Let's see what I get:
>
> $ python3.2 -m timeit '("€"*100+"€"*100).replace("€", "œ")'
> 100000 loops, best of 3: 1.8 usec per loop
> $ python3.3 -m timeit '("€"*100+"€"*100).replace("€", "œ")'
> 10000 loops, best of 3: 9.11 usec per loop
>
> That is a factor of 5, too. So I can replicate your measurement on an
> AMD64 Linux system with self-built 3.3 versus system 3.2.
>
> > Note:
> > The characters used are not members of the latin-1 coding
> > scheme (btw an *unusable* coding).
> > They are, however, characters in cp1252 and mac-roman.
>
> You seem to imply that the slowdown is connected to the inability of latin-1
> to encode "œ" and "€" (to take the examples relevant to the above
> microbenchmark). So let's repeat with latin-1 characters:
>
> $ python3.2 -m timeit '("ä"*100+"ä"*100).replace("ä", "ß")'
> 100000 loops, best of 3: 1.76 usec per loop
> $ python3.3 -m timeit '("ä"*100+"ä"*100).replace("ä", "ß")'
> 10000 loops, best of 3: 10.3 usec per loop
>
> Hm, the slowdown is even a tad bigger. So we can safely dismiss your theory
> that an unfortunate choice of the 8-bit encoding is causing it. Do you
> agree?
- I do not care too much about the exact numbers; this is
an attempt to show the principles.
- The reason for considering latin-1 a bad coding is that
it is simply unusable for some scripts/languages. That has
mainly to do with the encoding of source/text files, and
it is not really the point here.
- Now, the technical aspect. This "coding" (latin-1)
may be considered, in a sense, as the pseudo-coding covering
the Unicode code point range 128..255. Unfortunately,
this "coding" is not very optimal (or can be seen as such)
when you work with the full range of Unicode, but it is fine
when one works purely in latin-1, with only 256
characters.
The range 128..255 is always the critical part
(all codings considered), and it probably contains
the most frequently used characters.
I hope that was not too confusing.
I have no proof for my theory, but with my experience in
this field, I strongly suspect this is the bottleneck.
Same OS as before.
Py 3.2.3
>>> timeit.repeat("('€'*100+'€'*100).replace('€', 'œ')")
[1.5384088242603358, 1.532421642233382, 1.5327445924545433]
>>> timeit.repeat("('ä'*100+'ä'*100).replace('ä', 'ß')")
[1.561762063667686, 1.5443503206462594, 1.5458670051605168]
3.3.0b2
>>> timeit.repeat("('€'*100+'€'*100).replace('€', 'œ')")
[7.701523104134512, 7.720358191179441, 7.614549852683501]
>>> timeit.repeat("('ä'*100+'ä'*100).replace('ä', 'ß')")
[4.887939423990709, 4.868787294350611, 4.865697999795991]
Quite mysterious!
Either way, it is a regression.
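For anyone who wants to rerun both cases in one go, here is a small self-contained script (a sketch; the absolute timings will of course differ per machine and interpreter build):

```python
import timeit

# Time str.replace for a BMP character ('€', U+20AC) and a latin-1
# character ('ä', U+00E4), mirroring the measurements above.
for old, new in (("€", "œ"), ("ä", "ß")):
    stmt = "({0!r}*100+{0!r}*100).replace({0!r}, {1!r})".format(old, new)
    best = min(timeit.repeat(stmt, number=100000))
    print("%s : best of %.3f s per 100000 loops" % (stmt, best))
```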
jmf
More information about the Python-list mailing list