Flexible string representation, unicode, typography, ...

Wed Aug 29 00:15:31 EDT 2012

On Tue, Aug 28, 2012 at 8:42 PM, rusi <rustompmody at gmail.com> wrote:
> In summary:
> 1. The problem is not on jmf's computer
> 2. It is not windows-only
> 3. It is not directly related to latin-1 encodable or not
>
> The only question which is not yet clear is this:
> Given a typical string operation that is complexity O(n), in more
> detail it is going to be O(a + bn)
> If only a is worse going 3.2 to 3.3, it may be a small issue.
> If b is worse by even a tiny amount, it is likely to be a significant
> regression for some use-cases.

As has been pointed out repeatedly already, this is a microbenchmark.
jmf is focusing in one one particular area (string construction) where
Python 3.3 happens to be slower than Python 3.2, ignoring the fact
that real code usually does lots of things other than building
strings, many of which are slower to begin with.  In the real-world
benchmarks that I've seen, 3.3 is as fast as or faster than 3.2.
Here's a much more realistic benchmark that nonetheless still focuses
on strings: word counting.

Source: http://pastebin.com/RDeDsgPd

C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc"
"wc.wc('unilang8.htm')"
1000 loops, best of 3: 310 usec per loop

C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc"
"wc.wc('unilang8.htm')"
1000 loops, best of 3: 302 usec per loop

"unilang8.htm" is an arbitrary UTF-8 document containing a broad swath
of Unicode characters that I pulled off the web.  Even though this
program is still mostly string processing, Python 3.3 wins.  Of
course, that's not really a very good test -- since it reads the file
on every pass, it probably spends more time in I/O than it does in
actual processing.  Let's try it again with prepared string data:

C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t =
open('unilang8.htm', 'r', encoding
='utf-8').read()" "wc.wc_str(t)"
10000 loops, best of 3: 87.3 usec per loop

C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t =
open('unilang8.htm', 'r', encoding
='utf-8').read()" "wc.wc_str(t)"
10000 loops, best of 3: 84.6 usec per loop

Nope, 3.3 still wins.  And just for the sake of my own curiosity, I
decided to try it again using str.split() instead of a StringIO.
Since str.split() creates more strings, I expect Python 3.2 might
actually win this time.

C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t =
open('unilang8.htm', 'r', encoding
='utf-8').read()" "wc.wc_split(t)"
10000 loops, best of 3: 88 usec per loop

C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t =
open('unilang8.htm', 'r', encoding
='utf-8').read()" "wc.wc_split(t)"
10000 loops, best of 3: 76.5 usec per loop

Interestingly, although Python 3.2 performs the splits in about the
same time as the StringIO operation, Python 3.3 is significantly
*faster* using str.split(), at least on this data set.

> So doing some arm-chair thinking (I dont know the code and difficulty
> involved):
>
> Clearly there are 3 string-engines in the python 3 world:
> - 3.2 narrow
> - 3.2 wide
> - 3.3 (flexible)
>
> How difficult would it be to giving the choice of string engine as a
> command-line flag?
> This would avoid the nuisance of having two binaries -- narrow and
> wide.

Quite difficult.  Even if we avoid having two or three separate
binaries, we would still have separate binary representations of the
string structs.  It makes the maintainability of the software go down
instead of up.

> And it would give the python programmer a choice of efficiency
> profiles.

So instead of having just one test for my Unicode-handling code, I'll
now have to run that same test *three times* -- once for each possible
string engine option.  Choice isn't always a good thing.

Cheers,
Ian