Is Unicode support so hard...

Chris Angelico rosuav at gmail.com
Sat Apr 20 14:14:50 EDT 2013


On Sun, Apr 21, 2013 at 3:22 AM, Ned Batchelder <ned at nedbatchelder.com> wrote:
> I'm totally confused about what you are saying.  What does "make a better
> Unicode than Unicode" mean?  Are you saying that Python is guilty of this?
> In what way?  Can you provide specifics?  Or are you saying that you like
> how Python has implemented it?  "FSR is failing ... a delight"?  I don't
> know what you mean.

You're not familiar with jmf? He's one of our resident trolls. Allow
me to summarize Python 3's Unicode support...

From 3.0 up to and including 3.2.x, Python could be built as either
"narrow" or "wide". A wide build consumes four bytes per character in
every string, which is rather wasteful (very few strings actually
NEED that); a narrow build gets some things wrong. (I'm using Python
2.7 here, as I don't have a narrow-build 3.x handy; the same
considerations apply, though.)

Python 2.7.4 (default, Apr  6 2013, 19:54:46) [MSC v.1500 32 bit
(Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> len(u"asdf\U00012345qwer")
10
>>> u"asdf\U00012345qwer"[8]
u'e'

In a narrow build, strings are stored in UTF-16, so astral characters
count as two; that's why the length above is 10 rather than 9, and
why index 8 yields 'e' instead of 'r'. It also means a program will
behave differently depending on how its interpreter was built. (Other
languages, such as ECMAScript, actually *mandate* UTF-16, which at
least means you can depend on this otherwise-bizarre behaviour
regardless of what platform you're on.) Either way, I have to say,
it's counter-intuitive.
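A quick way to tell which kind of build you've got (continuing the
same narrow 2.7 session as above; sys.maxunicode is simply the
highest code point a single character can hold):

>>> import sys
>>> sys.maxunicode   # 65535 on a narrow build, 1114111 on a wide one
65535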

Enter Python 3.3 and PEP 393 strings. Now *EVERY* Python build is,
conceptually, wide. (I'm not sure how PEP 393 applies to other Pythons
- Jython, PyPy, etc - so assume that whenever I refer to Python, I'm
restricting this to CPython.) The underlying representation might be
more efficient, but to the script, it's exactly the same as a wide
build. If a string has no characters that demand more width, it'll be
stored nice and narrow. (It's the same technique that Pike has been
using for a while, so it's a proven system; in any case, we know that
this is going to work, it's just a question of performance - it adds a
fixed overhead.) Great! We save memory in Python programs. Wonderful!
Right?
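For comparison, here's a sketch of the same strings on a 3.3 build
(not a pasted session, but this is exactly what PEP 393 semantics
prescribe):

>>> len("asdf\U00012345qwer")
9
>>> "asdf\U00012345qwer"[8]
'r'

Same operations, sane answers, on every build.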

Enter jmf. No, it's not wonderful, because OBVIOUSLY Python is now
America-centric, because now the full Unicode range is divided into
"these ones get stored in 1 byte per char, these in 2, these in 4".
Clearly that's making life way worse for everyone else. Also, compared
to the narrow build that jmf was previously using, this uses heaps
MORE space in the stupid micro-benchmarks that he keeps on trotting
out, because he has just one astral character in a sea of ASCII. And
that's totally what programs are doing all the time, too. Never mind
that basic operations like length, slicing, etc are no longer buggy,
no, Python has taken a terrible step backwards here.
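For the record, the shape of that micro-benchmark looks something
like this (a sketch; exact byte counts vary by platform and CPython
version, and sys.getsizeof includes the object header):

import sys

ascii_only = "a" * 1000
one_astral = "a" * 999 + "\U00012345"

print(sys.getsizeof(ascii_only))  # roughly 1000 bytes plus overhead:
                                  # 1 byte per char
print(sys.getsizeof(one_astral))  # roughly 4000 bytes plus overhead:
                                  # the single astral char promotes
                                  # the whole string to 4 bytes/char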

Oh, and check this out:

>>> def munge(s):
	"""Move characters around in a string."""
	# Split into quarters and swap the middle two.
	l = len(s) // 4
	return s[:l] + s[l*2:l*3] + s[l:l*2] + s[l*3:]

>>> munge("asdfqwerzxcv1234")
'asdfzxcvqwer1234'

Looks fine.

>>> munge(u"asd\U00012345we\U00034567xc\U00023456bla")
u'asd\U00012167xc\U00023745we\U00034456bla'

Where'd those characters come from? I was just moving stuff around,
right? I can't get new characters out of it... can I? I can: the
slice boundaries fell in the middle of surrogate pairs, and the
mismatched halves then paired back up as brand-new astral characters.
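For contrast, a sketch of the same call on a 3.3 build (again not a
pasted session, just what one-code-point-per-character strings give
you - the quarters are merely reordered):

>>> print(ascii(munge("asd\U00012345we\U00034567xc\U00023456bla")))
'asd\U00034567xc\U00012345we\U00023456bla'

Same characters in, same characters out.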

Flash forward to the present day, and jmf has hijacked so many threads to
moan about PEP 393 that I'm actually happy about this one, simply
because he gave it a new subject line and one appropriate to a
discussion about Unicode.

ChrisA


