python-unicode doesn't support >65535 symbols?

Martin v. Löwis martin at v.loewis.de
Thu Nov 27 14:12:06 EST 2003


"Rainer Deyke" <rainerd at eldwood.com> writes:

> Python makes the mistake of exposing the internal representation instead of
> the logical value of unicode objects.  This means that, aside from space
> optimization, unicode objects have no advantage over UTF-8 encoded plain
> strings for storing unicode text.

That is not true. First, it is not "Python" as such, but a specific
Python configuration: only "narrow" builds use UCS-2 internally, whereas
"wide Unicode" builds use UCS-4.

In either build, len() and indexing address code units, not
characters; that much is true.
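
To illustrate the difference between code units and characters, here is a
minimal sketch. It assumes a current Python 3 interpreter, where the
narrow/wide build distinction no longer exists, so the two builds are only
simulated by encoding to UTF-16 and UTF-32. The helper name code_units is
hypothetical and not part of any library.

```python
def code_units(text, encoding, unit_size):
    """Count code units of `text` in a BOM-less encoding with fixed unit size."""
    return len(text.encode(encoding)) // unit_size

bmp = "\u00f6"         # U+00F6, inside the BMP
astral = "\U00010000"  # U+10000, the first code point outside the BMP

# UTF-16 stands in for a UCS-2 ("narrow") build: the astral character
# needs a surrogate pair, i.e. two code units.
utf16_units = (code_units(bmp, "utf-16-le", 2), code_units(astral, "utf-16-le", 2))

# UTF-32 stands in for a UCS-4 ("wide") build: always one unit per character.
utf32_units = (code_units(bmp, "utf-32-le", 4), code_units(astral, "utf-32-le", 4))

# UTF-8 byte strings: even a BMP character may take more than one unit.
utf8_units = (code_units(bmp, "utf-8", 1), code_units(astral, "utf-8", 1))

print(utf16_units)  # (1, 2)
print(utf32_units)  # (1, 1)
print(utf8_units)   # (2, 4)
```

Only the UTF-32 counts match the number of characters in every case; the
UTF-16 counts match as long as the text stays within the BMP.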

However, it is not true that there is no advantage over UTF-8 encoded
byte strings. Instead, there are several advantages:
- In a UCS-4 build, Unicode characters and code units are in a 1:1 
  relationship
- In a UCS-2 build, Unicode characters and code units are in a 1:1
  relationship as long as the application only ever processes BMP
  characters (code points up to U+FFFF, i.e. 65535).
- In either case, a Unicode object has inherent information about the
  character set, which a UTF-8 byte string does not have. IOW, you know
  what a Unicode object is, but you don't know (inherently) whether a
  byte string is UTF-8.
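
The last point can be sketched as follows, again assuming Python 3, where
bytes plays the role of the byte string and str that of the Unicode object.
The sample text and variable names are made up for illustration.

```python
# A byte string carries no record of the encoding that produced it.
raw = "L\u00f6wis".encode("utf-8")  # six bytes: b'L\xc3\xb6wis'

# The same bytes decode without error under two different assumptions,
# and nothing in `raw` itself says which one is right.
as_utf8 = raw.decode("utf-8")      # 5 characters, U+00F6 restored
as_latin1 = raw.decode("latin-1")  # 6 characters, U+00C3 followed by U+00B6

print(len(raw), len(as_utf8), len(as_latin1))  # 6 5 6
print(as_utf8 == as_latin1)                    # False
```

Once the text is held in a Unicode object, that ambiguity is gone: the
object is known to contain characters, not bytes in some unstated encoding.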

Regards,
Martin




