[Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
Toby Dickenson
tdickenson@geminidataloggers.com
Fri, 05 May 2000 10:07:46 +0100
On Thu, 4 May 2000 22:22:38 +0100, Just van Rossum
<just@letterror.com> wrote:
>(Boy, is it quiet here all of a sudden ;-)
>
>Sorry for the duplication of stuff, but I'd like to reiterate my points, to
>separate them from my implementation proposal, as that's just what it is:
>an implementation detail.
>
>These things are important to me:
>- get rid of the Unicode-ness of wide strings, in order to
>- make narrow and wide strings as similar as possible
>- implicit conversion between narrow and wide strings should
> happen purely on the basis of the character codes; no
> assumption at all should be made about the encoding, ie.
> what the character code _means_.
>- downcasting from wide to narrow may raise OverflowError if
> there are characters in the wide string that are > 255
>- str(s) should always return s if s is a string, whether narrow
> or wide
>- file objects need to be responsible for handling wide strings
>- the above two points should make it possible for
>- if no encoding is known, Unicode is the default, whether
> narrow or wide
>
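The conversion rules in the list above can be sketched roughly as follows. This is only an illustration, not Just's implementation: the names `widen` and `downcast` are hypothetical, and modern Python `bytes` and `str` stand in for the narrow and wide string types under discussion.

```python
def widen(narrow: bytes) -> str:
    """Widen purely by character code: code N becomes code N.

    No encoding is consulted, so nothing is assumed about what
    the codes mean.
    """
    return "".join(chr(code) for code in narrow)


def downcast(wide: str) -> bytes:
    """Narrow a wide string, refusing any character code above 255."""
    for ch in wide:
        if ord(ch) > 255:
            raise OverflowError(
                "character code %d does not fit in a narrow string" % ord(ch)
            )
    return bytes(ord(ch) for ch in wide)
```

Under these assumptions a round trip through `widen` and `downcast` leaves the character codes unchanged, and only codes above 255 make the downcast fail.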
>The above points seem to have the following consequences:
>- the 'u' in \uXXXX notation no longer makes much sense,
> since it is not necessary for the character to be a Unicode
> code point: it's just a 2-byte int. \wXXXX might be an option.
>- the u"" notation is no longer necessary: if a string literal
> contains a character > 255 the string should automatically
> become a wide string.
>- narrow strings should also have an encode() method.
>- the builtin unicode() function might be redundant if:
> - it is possible to specify a source encoding. I'm not sure if
> this is best done through an extra argument for encode()
> or that it should be a new method, eg. transcode().
> - s.encode() or s.transcode() are allowed to output a wide
> string, as in aNarrowString.encode("UCS-2") and
> s.transcode("Mac-Roman", "UCS-2").
One other pleasant consequence:
- String comparisons work character-by-character, even if the
representations of those characters have different widths.
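A minimal sketch of what that comparison rule could look like, assuming modern Python `bytes` as a stand-in for narrow strings and `str` for wide ones. The helper names `codes` and `compare` are made up for illustration and are not part of any proposal in this thread.

```python
def codes(s):
    """Return the character codes of a narrow (bytes) or wide (str) string."""
    if isinstance(s, bytes):
        return tuple(s)  # iterating bytes yields ints 0..255
    return tuple(ord(ch) for ch in s)


def compare(a, b):
    """Three-way comparison on character codes, ignoring storage width.

    Returns -1, 0 or 1, in the style of the old cmp() builtin.
    """
    ca, cb = codes(a), codes(b)
    return (ca > cb) - (ca < cb)
```

With this rule a narrow "abc" and a wide "abc" compare equal, because only the codes are looked at and never the width of the storage.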
>My proposal to extend the "old" string type to be able to contain wide
>strings is of course largely unrelated to all this. Yet it may provide some
>additional C compatibility (especially now that silent conversion to utf-8
>is out) as well as a workaround for the
>str()-having-to-return-a-narrow-string bottleneck.
Toby Dickenson
tdickenson@geminidataloggers.com