[I18n-sig] Re: [Python-Dev] PEP 261, Rev 1.3 - Support for "wide" Unicode characters

M.-A. Lemburg mal@lemburg.com
Mon, 02 Jul 2001 21:12:45 +0200


Paul Prescod wrote:
> 
> "M.-A. Lemburg" wrote:
> >
> >...
> > >     Character
> > >
> > >         Used by itself, means the addressable units of a Python
> > >         Unicode string.
> >
> > Please add: also known as "code unit".
> 
> I'm not entirely comfortable with that. As you yourself pointed out, the
> same Python Unicode object can be interpreted as either a series of
> single-width code points *or* as a UTF-16 string where the characters
> are code units. You could also interpret it as a BASE64'd region or an
> XML document... It all depends on how you look at it.

Well, that's what code unit tries to capture too: it's the basic storage
unit used by the implementation for storing characters. Never mind, it's
just a detail...
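
To make the code unit / code point distinction concrete, here is a minimal
sketch. It assumes a current Python 3 interpreter (not the 2001 one under
discussion), and the helper name utf16_code_units is made up for this
illustration. It shows the same string viewed as UTF-16 code units and as
code points.

```python
def utf16_code_units(text):
    """Return the 16-bit code units of text when stored as UTF-16."""
    data = text.encode("utf-16-le")
    # Each code unit occupies two bytes in little-endian order.
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


s = "a\U00010000"                    # 'a' followed by the first non-BMP code point
units = utf16_code_units(s)          # the view as 16-bit storage units
code_points = [ord(ch) for ch in s]  # the view as code points

print([hex(u) for u in units])
print([hex(cp) for cp in code_points])
```

The non-BMP code point takes two code units in the UTF-16 view but is a
single item in the code point view, which is the ambiguity the
terminology has to deal with.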
 
> > ....
> > >     Surrogate pair
> > >
> > >         Two physical characters that represent a single logical
> >
> > Eeek... two code units (or have you ever seen a physical character
> > walking around ;-)
> 
> No, that's sort of my point. The user can decide to adopt the convention
> of looking at the two characters as code units or they can ignore that
> interpretation and look at them as two code points. It's all relative,
> man. Dig it? That's why I use the word "convention" below:

Ok.
 
> > >         character. Part of a convention for representing 32-bit
> > >         code points in terms of two 16-bit code points.
> 
> "Surrogates are all in your head. Python doesn't know or care about
> them!"
> 
> I'll change this to:
> 
>     Surrogate pair
> 
>         Two Python Unicode characters that represent a single logical
>         Unicode code point. Part of a convention for representing
>         32-bit code points in terms of two 16-bit code points. Python
>         has limited support for reading, writing and constructing
>         strings that use this convention (described below). Otherwise
>         Python ignores the convention.

Good.
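
The convention itself is plain arithmetic, as defined by the UTF-16
encoding form. A short sketch follows; the function names
to_surrogate_pair and from_surrogate_pair are chosen for this example
and are not part of the PEP or of Python.

```python
def to_surrogate_pair(code_point):
    """Split a code point above the BMP into (high, low) 16-bit surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the range U+10000..U+10FFFF")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1D11E)   # U+1D11E MUSICAL SYMBOL G CLEF
print(hex(high), hex(low))
print(hex(from_surrogate_pair(high, low)))
```

Nothing in the two 16-bit values marks them as a pair other than the
ranges they fall in, which is why it is a matter of convention whether a
program treats them as one logical character or as two.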
 
> > No need to pass this information to the codec: simply write
> > a new one and give it a clear name, e.g. "ucs-2" will generate
> > errors while "utf-16-le" converts them to surrogates.
> 
> That's a good point, but what if I want a UTF-8 codec that doesn't
> generate surrogates? Or even a UCS4 one?

With Walter's patch for callback error handlers, you should be able to
provide handlers which implement whatever you see fit. 
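
As a rough idea of what such a handler could look like, here is a sketch
written against the codecs.register_error() interface that current
Python 3 provides. Whether it matches the patch as it stood at the time
is an assumption, and the handler name "codepoint-escape" is invented
for the example.

```python
import codecs


def codepoint_escape(error):
    """Replace each unencodable character with a <U+XXXX> marker."""
    if not isinstance(error, UnicodeEncodeError):
        raise error  # this sketch only handles encoding errors
    bad = error.object[error.start:error.end]
    replacement = "".join("<U+%04X>" % ord(ch) for ch in bad)
    # Return the replacement text and the position at which to resume.
    return replacement, error.end


# "codepoint-escape" is a hypothetical name, not a built-in handler.
codecs.register_error("codepoint-escape", codepoint_escape)

encoded = "G clef: \U0001D11E".encode("ascii", errors="codepoint-escape")
print(encoded)
```

The codec stays unchanged; the policy for characters it cannot handle
lives in the handler, so a caller who wants an error, a substitution or
some other treatment picks a handler rather than a different codec.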
 
I think that codecs should work the same on all platforms and always
apply the needed conversion for the platform in question; I could be
wrong, though... it's really only a minor issue.

> > Plus perhaps the Mark Davis paper at:
> >
> > http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/
> 
> Okay.
> 
> > > Copyright
> > >
> > >     This document has been placed in the public domain.
> >
> > Good work, Paul !
> 
> Thanks for your help. You did help me to clarify many things even though
> I argued with you as I was doing it.

Thank you for taking the suggestions into account.

-- 
Marc-Andre Lemburg
________________________________________________________________________
Business:                                        http://www.lemburg.com/
Python Pages:                             http://www.lemburg.com/python/