[Python-3000] String comparison

Rauli Ruohonen rauli.ruohonen at gmail.com
Thu Jun 14 14:34:06 CEST 2007


On 6/14/07, Guido van Rossum <guido at python.org> wrote:
> On 6/13/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> > A code point is something that has a 1:1 relationship with a logical
> > character (in particular, a Unicode character).

As the word "character" is ambiguous, I'd put it this way:

- code point: the smallest unit Unicode deals with that's independent of
  encoding. Takes values in range(0, 0x110000)
- grapheme (or "grapheme cluster"): what users think of as a character. May
  consist of multiple code points; e.g. "ö" can be represented with either one
  or two code points, as the snippet below shows. What counts as a grapheme
  also depends on the language the user speaks
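
For concreteness, here's the "ö" case in the interpreter (a minimal
illustration using the stdlib unicodedata module; u'\xf6' is the precomposed
form, u'o\u0308' the decomposed one):

>>> import unicodedata
>>> one = u'\xf6'      # LATIN SMALL LETTER O WITH DIAERESIS, one code point
>>> two = u'o\u0308'   # 'o' followed by COMBINING DIAERESIS, two code points
>>> len(one), len(two)
(1, 2)
>>> one == two
False
>>> unicodedata.normalize('NFC', two) == one
True

Both spellings are the same grapheme to the user, but comparison works on
code points, so only normalization makes them compare equal.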

> It sounds like we really use code units, not code points (except when
> building with the 4-byte Unicode option, when they are equivalent).

Not quite equivalent in current Python. From some past discussions I thought
this was by design, but now having seen this odd behavior, maybe it isn't:

>>> sys.maxunicode
1114111
>>> x = u'\ud840\udc21'
>>> marshal.loads(marshal.dumps(x)) == x
False
>>> pickle.loads(pickle.dumps(x, 2)) == x
False
>>> pickle.loads(pickle.dumps(x, 1)) == x
False
>>> pickle.loads(pickle.dumps(x)) == x
True
>>>

Pickling should work the same way regardless of protocol, right? And
probably should not modify the objects it pickles if it can help it.
The reason the above happens is that binary pickles use UTF-8 to encode
unicode, and this is what happens with codecs:

>>> u'\ud840\udc21' == u'\U00020021'
False
>>> u'\ud840\udc21'.encode('utf-8').decode('utf-8')
u'\U00020021'
>>> u'\ud840\udc21'.encode('punycode').decode('punycode')
u'\ud840\udc21'
>>> u'\ud840\udc21'.encode('utf-16').decode('utf-16')
u'\U00020021'
>>> u'\U00020021'.encode('utf-16').decode('utf-16')
u'\U00020021'
>>> u'\ud840\udc21'.encode('big5hkscs').decode('big5hkscs')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'big5hkscs' codec can't encode character u'\ud840' in position 0: illegal multibyte sequence
>>> u'\U00020021'.encode('big5hkscs').decode('big5hkscs')
u'\U00020021'
>>>

Should codecs treat u'\ud840\udc21' and u'\U00020021' the same even on
UCS-4 builds (like current UTF-8 and UTF-16 codecs do) or not (like current
punycode and big5hkscs codecs do)?
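
For what it's worth, the joining that the UTF-8 and UTF-16 codecs perform
could be spelled out roughly like this (a minimal sketch, not anything in the
stdlib; join_surrogates is a made-up name, and it assumes a UCS-4 build where
unichr() accepts values above 0xffff):

def join_surrogates(s):
    # Combine each high/low surrogate pair into the single code point it
    # encodes, leaving everything else (including lone surrogates) alone.
    out, i = [], 0
    while i < len(s):
        c = s[i]
        if (u'\ud800' <= c <= u'\udbff' and i + 1 < len(s)
                and u'\udc00' <= s[i + 1] <= u'\udfff'):
            hi, lo = ord(c), ord(s[i + 1])
            out.append(unichr(0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00)))
            i += 2
        else:
            out.append(c)
            i += 1
    return u''.join(out)

print join_surrogates(u'\ud840\udc21') == u'\U00020021'   # prints True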

