[Python-Dev] Re: [I18n-sig] Unicode strings: an alternative

Toby Dickenson tdickenson@geminidataloggers.com
Fri, 05 May 2000 10:07:46 +0100


On Thu, 4 May 2000 22:22:38 +0100, Just van Rossum
<just@letterror.com> wrote:

>(Boy, is it quiet here all of a sudden ;-)
>
>Sorry for the duplication of stuff, but I'd like to reiterate my points, to
>separate them from my implementation proposal, as that's just what it is:
>an implementation detail.
>
>These things are important to me:
>- get rid of the Unicode-ness of wide strings, in order to
>- make narrow and wide strings as similar as possible
>- implicit conversion between narrow and wide strings should
>  happen purely on the basis of the character codes; no
>  assumption at all should be made about the encoding, ie.
>  what the character code _means_.
>- downcasting from wide to narrow may raise OverflowError if
>  there are characters in the wide string that are > 255
>- str(s) should always return s if s is a string, whether narrow
>  or wide
>- file objects need to be responsible for handling wide strings
>- the above two points should make it possible for
>- if no encoding is known, Unicode is the default, whether
>  narrow or wide
>
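To make the conversion rules quoted above concrete, here is a minimal sketch in present-day Python. It is not part of the proposal: the names widen() and narrow_down() are hypothetical, and a wide string is modelled simply as a list of integer character codes so that no encoding is implied.

```python
def widen(narrow):
    """Widen a narrow string (bytes) purely by character code.

    Each byte value becomes the same integer code; nothing is assumed
    about what the code means.
    """
    return list(narrow)


def narrow_down(wide):
    """Downcast a wide string (list of character codes) to bytes.

    Raises OverflowError if any code does not fit in one byte,
    as the quoted proposal suggests.
    """
    for code in wide:
        if code > 255:
            raise OverflowError("character code %d does not fit in a narrow string" % code)
    return bytes(wide)
```

Under these assumptions, narrow_down(widen(s)) gives back s for any narrow string s, while a wide string containing a code such as 0x20AC cannot be downcast.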
>The above points seem to have the following consequences:
>- the 'u' in \uXXXX notation no longer makes much sense,
>  since it is not necessary for the character to be a Unicode
>  code point: it's just a 2-byte int. \wXXXX might be an option.
>- the u"" notation is no longer necessary: if a string literal
>  contains a character > 255 the string should automatically
>  become a wide string.
>- narrow strings should also have an encode() method.
>- the builtin unicode() function might be redundant if:
>  - it is possible to specify a source encoding. I'm not sure if
>    this is best done through an extra argument for encode()
>    or that it should be a new method, eg. transcode().

>  - s.encode() or s.transcode() are allowed to output a wide
>    string, as in aNarrowString.encode("UCS-2") and
>    s.transcode("Mac-Roman", "UCS-2").

One other pleasant consequence:

- String comparisons work character by character, even if the
  representations of those characters have different widths.
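
A rough illustration of that point, written in present-day Python rather than anything proposed in this thread: the two arrays below hold the same character codes at different storage widths, and the hypothetical helper same_characters() compares them code by code.

```python
from array import array

# The same four characters stored at two different widths.
narrow = array("B", b"spam")                   # unsigned char, 1 byte per item
wide = array("H", [ord(c) for c in "spam"])    # unsigned short, at least 2 bytes per item


def same_characters(a, b):
    """Compare two strings character by character, ignoring storage width."""
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))
```

Here same_characters(narrow, wide) is true even though narrow.itemsize and wide.itemsize differ, which is the behaviour the comparison rule asks for.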

>My proposal to extend the "old" string type to be able to contain wide
>strings is of course largely unrelated to all this. Yet it may provide some
>additional C compatibility (especially now that silent conversion to utf-8
>is out) as well as a workaround for the
>str()-having-to-return-a-narrow-string bottleneck.


Toby Dickenson
tdickenson@geminidataloggers.com