[Python-Dev] 2.2 Unicode questions
Mon, 23 Jul 2001 12:36:38 +0200
Fredrik Lundh wrote:
> MAL wrote:
> > To simplify the picture: the implementation itself only sees
> > UCS-2 or UCS-4 depending on the compile time option and these
> > do not treat surrogates in any special way except reserve
> > code points for their usage. Accordingly, unichr() should not
> > create UTF-16 but UCS-2 for narrow builds and UCS-4 on wide
> > builds
> you didn't answer my question: is there any reason why
> unichr(0xXXXXXXXX) shouldn't return exactly the same
> thing as "\UXXXXXXXX" ?
> in 2.0 and 2.1, it doesn't. in 2.2, it does.
> > (unichr() is a constructor for code units, not code points).
Doesn't this answer your question? The point I wanted to
make was that unichr() is a constructor for a single code unit,
just as chr() is a constructor for a single code unit. In
that sense the storage format used by the implementation defines
the outcome: on UCS-2 builds it can only create UCS-2 values,
while on UCS-4 builds UCS-4 values are possible as well.
The question of u"\UXXXXXXXX" creating surrogates on UCS-2
builds is different: \UXXXXXXXX is an encoding of a Unicode
code point, so on UCS-2 builds the codec has to decide whether
to map it to two code units or to raise an exception.
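
As a minimal sketch of the "two code units" option, the arithmetic
a codec could apply on a 16-bit build looks roughly like this. The
helper name to_utf16_code_units() is hypothetical and is not part
of Python; the code is written for a current Python 3 interpreter
purely to show the mapping, not how any Python release does it.

```python
def to_utf16_code_units(code_point):
    """Split one Unicode code point into 16-bit code units.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if code_point < 0x10000:
        # Fits into a single 16-bit code unit.
        return [code_point]
    # Above the 16-bit range: split into a high and a low surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


bmp_units = to_utf16_code_units(0x20AC)
astral_units = to_utf16_code_units(0x10400)
```

The alternative mentioned above, raising an exception, would
replace the surrogate branch with a raise statement.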
> really? according to the documentation, it creates unicode
> *characters*. so does \U, according to the documentation.
> imo, it makes more sense to let "characters" mean code points
> than code units, but that's me.
The term "character" is vastly overloaded. There are three
different interpretations: graphemes (what a user usually
sees as a character on her display), code points (what
Unicode encodes) and code units (what the implementation
uses as the atom for storing code points).
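
To make the three levels concrete, here is a rough sketch that
counts each of them for one string. It assumes a current Python 3
interpreter, where len() counts code points, so the code unit
count is obtained by encoding to UTF-16. The grapheme count is only
an approximation (it skips combining marks); count_levels() is a
hypothetical name.

```python
import unicodedata


def count_levels(text):
    """Return (approx. graphemes, code points, UTF-16 code units)."""
    code_points = len(text)
    # Each UTF-16 code unit occupies two bytes.
    code_units = len(text.encode("utf-16-le")) // 2
    # Approximation: treat every non-combining code point as the
    # start of a new grapheme. Real segmentation rules are richer.
    approx_graphemes = sum(
        1 for ch in text if unicodedata.combining(ch) == 0
    )
    return approx_graphemes, code_points, code_units


# "e" + COMBINING ACUTE ACCENT, followed by U+10400.
sample = "e\u0301\U00010400"
levels = count_levels(sample)
```

For this sample the three counts differ: 2 approximate graphemes,
3 code points and 4 code units.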
Since Python exposes code units (u[i] gives you direct access
to the implementation defined storage area) and makes no
assumption about surrogates, it would not be a good idea to
suddenly introduce a break in the meaning of the outcome of
indexing into a Unicode string (u[i]) and of len(unichr()).
I know that the name unichr() does not help in this situation;
a more accurate name would be unicodeunit().
> the important thing here is to
> figure out if \U and unichr are the same thing, and fix the code
> and the documentation to do/say what we mean.
Note that apart from agreeing on a common meaning, we should
also think about the consequences of breaking len(unichr())==1.
For example, when building a Unicode string with unichr() you'd
expect to find the generated code unit at the position where you
appended it to the Unicode object.
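
A small model of that consequence, assuming a plain list of
16-bit code units stands in for the storage of a narrow build
and that the constructor is allowed to emit surrogate pairs.
The helpers code_units() and append_and_check() are hypothetical,
and the sketch targets a current Python 3 interpreter.

```python
import struct


def code_units(code_point):
    """Return the UTF-16 code units for one non-surrogate code point."""
    data = chr(code_point).encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def append_and_check(storage, code_point):
    """Append a code point; report whether storage[position] holds it."""
    position = len(storage)
    storage.extend(code_units(code_point))
    return storage[position] == code_point


storage = []
bmp_ok = append_and_check(storage, 0x20AC)      # one code unit
astral_ok = append_and_check(storage, 0x10400)  # two code units
```

Here bmp_ok is true, but astral_ok is false: the slot at the
append position holds a high surrogate, and every later index is
shifted by one.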
CEO eGenix.com Software GmbH
Consulting & Company: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/