[I18n-sig] Re: [Python-Dev] PEP 261, Rev 1.3 - Support for "wide"
Unicode characters
M.-A. Lemburg
mal@lemburg.com
Mon, 02 Jul 2001 21:12:45 +0200
Paul Prescod wrote:
>
> "M.-A. Lemburg" wrote:
> >
> >...
> > > Character
> > >
> > > Used by itself, means the addressable units of a Python
> > > Unicode string.
> >
> > Please add: also known as "code unit".
>
> I'm not entirely comfortable with that. As you yourself pointed out, the
> same Python Unicode object can be interpreted as either a series of
> single-width code points *or* as a UTF-16 string where the characters
> are code units. You could also interpret it as a BASE64'd region or an
> XML document... It all depends on how you look at it.
Well, that's what code unit tries to capture too: it's the basic storage
unit used by the implementation for storing characters. Never mind, it's
just a detail...
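The code unit versus code point distinction above can be made concrete with a small sketch. It assumes a modern Python 3 interpreter, where len() counts code points; the 2001 builds discussed in this thread could behave differently. The helper name utf16_code_units is made up for this illustration.

```python
def utf16_code_units(text):
    """Return how many 16-bit code units are needed to store text as UTF-16."""
    # "utf-16-le" emits no byte order mark, so every 2 bytes is one code unit.
    return len(text.encode("utf-16-le")) // 2


# LATIN SMALL LETTER A followed by MUSICAL SYMBOL G CLEF (outside the BMP).
sample = "a\U0001D11E"

code_points = len(sample)              # counted in code points on Python 3
code_units = utf16_code_units(sample)  # counted in 16-bit storage units
```

The two counts differ as soon as the string contains a character outside the Basic Multilingual Plane, which is where the choice of terminology starts to matter.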
> > ....
> > > Surrogate pair
> > >
> > > Two physical characters that represent a single logical
> >
> > Eeek... two code units (or have you ever seen a physical character
> > walking around ;-)
>
> No, that's sort of my point. The user can decide to adopt the convention
> of looking at the two characters as code units or they can ignore that
> interpretation and look at them as two code points. It's all relative,
> man. Dig it? That's why I use the word "convention" below:
Ok.
> > > character. Part of a convention for representing 32-bit
> > > code points in terms of two 16-bit code points.
>
> "Surrogates are all in your head. Python doesn't know or care about
> them!"
>
> I'll change this to:
>
> Surrogate pair
>
> Two Python Unicode characters that represent a single logical
> Unicode code point. Part of a convention for representing
> 32-bit code points in terms of two 16-bit code points. Python
> has limited support for reading, writing and constructing
> strings that use this convention (described below). Otherwise
> Python ignores the convention.
Good.
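For reference, the convention in that definition is plain arithmetic on the code point value. The following is only a sketch of the standard UTF-16 calculation; the function names to_surrogate_pair and from_surrogate_pair are invented here and are not part of the PEP or of Python.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) 16-bit surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF use surrogates")
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, U+1D11E maps to the pair (0xD834, 0xDD1E), and the second function reverses that mapping.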
> > No need to pass this information to the codec: simply write
> > a new one and give it a clear name, e.g. "ucs-2" will generate
> > errors while "utf-16-le" converts them to surrogates.
>
> That's a good point, but what if I want a UTF-8 codec that doesn't
> generate surrogates? Or even a UCS4 one?
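The separate-codec idea quoted above can be sketched roughly as follows. This is an illustration only: Python ships no codec named "ucs-2", so encode_ucs2_strict is a hypothetical stand-in written as a plain function, and it assumes a modern Python 3 interpreter.

```python
def encode_ucs2_strict(text):
    """Encode BMP-only text as little-endian 16-bit units, else raise."""
    for index, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            # Hypothetical "ucs-2" behaviour: refuse anything that would
            # need a surrogate pair instead of silently producing one.
            raise UnicodeEncodeError(
                "ucs-2", text, index, index + 1, "code point outside the BMP"
            )
    # For BMP-only text, UTF-16-LE and UCS-2 (little-endian) bytes coincide.
    return text.encode("utf-16-le")


bmp_bytes = encode_ucs2_strict("ab")
# The real "utf-16-le" codec converts the same kind of input to surrogates.
pair_bytes = "\U0001D11E".encode("utf-16-le")
```

Here the choice between raising an error and generating surrogates is carried by which encoder is called, not by an extra argument.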
With Walter's patch for callback error handlers, you should be able to
provide handlers which implement whatever you see fit.
I think that codecs should work the same on all platforms and always
apply the needed conversion for the platform in question; could be wrong
though... it's really only a minor issue.
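Callback error handlers of the kind mentioned above are available in later Python versions through codecs.register_error. A minimal sketch, assuming Python 3; the handler name "sketch-codepoint" and its output format are invented for this example.

```python
import codecs


def codepoint_handler(exc):
    """Replace each unencodable character with a <U+XXXX> marker."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc  # this sketch only handles encoding errors
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("<U+%04X>" % ord(ch) for ch in bad)
    # Return the replacement text and the position at which to resume.
    return replacement, exc.end


codecs.register_error("sketch-codepoint", codepoint_handler)

encoded = "a\u20acb".encode("ascii", "sketch-codepoint")
```

The same handler can be passed to any codec by name, so the policy for characters a codec cannot represent lives in the handler and not in the codec itself.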
> > Plus perhaps the Mark Davis paper at:
> >
> > http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/
>
> Okay.
>
> > > Copyright
> > >
> > > This document has been placed in the public domain.
> >
> > Good work, Paul !
>
> Thanks for your help. You did help me to clarify many things even though
> I argued with you as I was doing it.
Thank you for taking the suggestions into account.
--
Marc-Andre Lemburg
________________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/