[Python-Dev] Support for "wide" Unicode characters
Paul Prescod
paulp@ActiveState.com
Fri, 29 Jun 2001 18:16:25 -0700
"M.-A. Lemburg" wrote:
>
>...
>
> I'd suggest not to use the term character in this PEP at all;
> this is also what Mark Davis recommends in his paper on Unicode.
That's fine, but Python does have a concept of a character, and I'm going
to use the term "character" when discussing that concept.
> Also, a link to the Unicode glossary would be a good thing.
Funny how these little PEPs grow...
>...
> Why not make the codec used by Python to convert Unicode
> literals to Unicode strings an option just like the default
> encoding ?
>
> That way we could have a version of the unicode-escape codec
> which supports surrogates and one which doesn't.
Adding more and more knobs to tweak just adds up to Python code being
non-portable from one machine to another.
> > ISSUE: Should Python allow the construction of characters
> > that do not correspond to Unicode characters?
> > Unassigned Unicode characters should obviously be legal
> > (because they could be assigned at any time). But
> > code points above TOPCHAR are guaranteed never to
> > be used by Unicode. Should we allow access to them
> > anyhow?
>
> I wouldn't count on that last point ;-)
>
> Please note that you are mixing terms: you don't construct
> characters, you construct code points. Whether the concatenation
> of these code points makes a valid Unicode character string
> is an issue which applications and codecs have to decide.
unichr() does not construct code points. It constructs 1-char Python
Unicode strings...also known as Python Unicode characters.
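For readers on a current interpreter, here is a small sketch of that point. It assumes Python 3.3 or later, where unichr() has been renamed to chr() and every build behaves like a "wide" build; it is not the Python 2 API under discussion in this thread. It also assumes that TOPCHAR in the quoted text corresponds to 0x10FFFF, the top of the Unicode code space.

```python
import sys

# chr() is the Python 3 spelling of unichr(): it returns a str of length 1,
# i.e. what this message calls a "Python Unicode character".
bmp_char = chr(0x0041)      # 'A', inside the Basic Multilingual Plane
wide_char = chr(0x10400)    # a code point beyond U+FFFF

print(type(wide_char).__name__, len(bmp_char), len(wide_char))

# The largest code point the interpreter accepts is exposed as sys.maxunicode.
print(hex(sys.maxunicode))

# Asking for anything above that limit raises instead of building a string.
rejected = False
try:
    chr(sys.maxunicode + 1)
except ValueError:
    rejected = True
print("above the limit rejected:", rejected)
```

On such an interpreter the question of constructing characters above the top of the code space is answered with an exception, which is only one of the possible answers to the ISSUE quoted above.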
> ... Whether the concatenation
> of these code points makes a valid Unicode character string
> is an issue which applications and codecs have to decide.
The concatenation of true code points would *always* make a valid
Unicode string, right? It's code units that cannot be blindly
concatenated.
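A rough illustration of the difference between code points and code units, written for Python 3 rather than the interpreters of this thread. The helper names utf16_code_units and from_code_units are made up for this sketch, and surrogate code points are left out of the "true code point" case on purpose.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints (hypothetical helper)."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

def from_code_units(units):
    """Decode a list of UTF-16 code units back into a str (hypothetical helper)."""
    data = struct.pack(">%dH" % len(units), *units)
    return data.decode("utf-16-be")

# One code point above U+FFFF becomes two code units: a surrogate pair.
units = utf16_code_units(chr(0x10400))   # [0xD801, 0xDC00]

# Concatenating non-surrogate code points always yields an encodable string.
joined = "".join(chr(cp) for cp in (0x41, 0x10400, 0x42))
round_trip = from_code_units(utf16_code_units(joined))

# Concatenating code units blindly does not: half a pair is not valid UTF-16.
try:
    from_code_units(units[:1])
    half_pair_ok = True
except UnicodeDecodeError:
    half_pair_ok = False
```

Slicing the list of code units in the middle of the pair is the kind of blind manipulation that produces an invalid string, while the code-point-level join does not.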
>...
> We should provide a new module which provides a few handy
> utilities though: functions which provide code point-,
> character-, word- and line- based indexing into Unicode
> strings.
Okay, I'll add:
    It has been proposed that there should be a module for working
    with UTF-16 strings in narrow Python builds through some sort of
    abstraction that handles surrogates for you. If someone wants
    to implement that, it will be another PEP.
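To give a flavour of what such an abstraction might do, here is a minimal sketch, assuming the string is available as a sequence of UTF-16 code units (plain ints). The function names iter_code_points and code_point_length are hypothetical and do not come from any proposed module; unpaired surrogates are simply passed through, which is one policy among several.

```python
def iter_code_points(units):
    """Yield code points from an iterable of UTF-16 code units (ints).

    A high surrogate followed by a low surrogate is combined into one
    code point; an unpaired surrogate is yielded unchanged.
    """
    pending = None  # a high surrogate waiting for its partner
    for unit in units:
        if pending is not None:
            if 0xDC00 <= unit <= 0xDFFF:
                # Complete the pair and emit a single supplementary code point.
                yield 0x10000 + ((pending - 0xD800) << 10) + (unit - 0xDC00)
                pending = None
                continue
            # No partner arrived: emit the lone high surrogate as-is.
            yield pending
            pending = None
        if 0xD800 <= unit <= 0xDBFF:
            pending = unit
        else:
            yield unit
    if pending is not None:
        yield pending

def code_point_length(units):
    """Count code points rather than code units."""
    return sum(1 for _ in iter_code_points(units))

narrow = [0x41, 0xD801, 0xDC00, 0x42]      # 'A', U+10400 as a pair, 'B'
points = list(iter_code_points(narrow))    # [0x41, 0x10400, 0x42]
```

Here len(narrow) counts four code units while code_point_length(narrow) counts three code points, which is the gap a surrogate-aware indexing layer would have to hide.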