[Python-Dev] Re: [I18n-sig] Unicode strings: an alternative
Just van Rossum
just@letterror.com
Thu, 4 May 2000 22:22:38 +0100
(Boy, is it quiet here all of a sudden ;-)
Sorry for the duplication of stuff, but I'd like to reiterate my points, to
separate them from my implementation proposal, as that's just what it is:
an implementation detail.
These things are important to me:
- get rid of the Unicode-ness of wide strings, in order to
- make narrow and wide strings as similar as possible
- implicit conversion between narrow and wide strings should
happen purely on the basis of the character codes; no
assumption at all should be made about the encoding, i.e.
what the character code _means_.
- downcasting from wide to narrow may raise OverflowError if
there are characters in the wide string that are > 255
- str(s) should always return s if s is a string, whether narrow
or wide
- file objects need to be responsible for handling wide strings
- the above two points should make it possible for
- if no encoding is known, Unicode is the default, whether
narrow or wide
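A minimal sketch of the conversion rules listed above, not part of the original proposal's implementation. It assumes present-day Python 3, with bytes standing in for a narrow string and str standing in for a wide string; the names upcast() and downcast() are hypothetical. The conversion looks only at the character codes and never consults an encoding.

```python
def upcast(narrow):
    """Widen a narrow string (bytes) code for code; no encoding is assumed."""
    return "".join(chr(code) for code in narrow)


def downcast(wide):
    """Narrow a wide string (str) code for code.

    Raises OverflowError if any character code is > 255, as the
    list above suggests.
    """
    codes = [ord(ch) for ch in wide]
    for code in codes:
        if code > 255:
            raise OverflowError(
                "character code %d does not fit in a narrow string" % code
            )
    return bytes(codes)
```

With these helpers, downcast(upcast(s)) returns s for any narrow string s, while downcast() on a string containing a code such as 0x100 fails loudly instead of guessing an encoding.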
The above points seem to have the following consequences:
- the 'u' in \uXXXX notation no longer makes much sense,
since it is not necessary for the character to be a Unicode
code point: it's just a 2-byte int. \wXXXX might be an option.
- the u"" notation is no longer necessary: if a string literal
contains a character > 255 the string should automatically
become a wide string.
- narrow strings should also have an encode() method.
- the builtin unicode() function might be redundant if:
- it is possible to specify a source encoding. I'm not sure if
this is best done through an extra argument for encode()
or whether it should be a new method, e.g. transcode().
- s.encode() or s.transcode() are allowed to output a wide
string, as in aNarrowString.encode("UCS-2") and
s.transcode("Mac-Roman", "UCS-2").
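A rough sketch of what the suggested transcode() could look like, written as a free function in present-day Python 3 rather than as a string method. Everything here is an assumption made for illustration: the function name and argument order, the use of bytes for both input and output, and the codec names. Python ships no codec called "UCS-2", so "utf-16-be" is used as a stand-in that yields 2-byte units for characters in the Basic Multilingual Plane.

```python
def transcode(data, source_encoding, target_encoding):
    """Reinterpret data from one encoding into another.

    data is a bytes object in source_encoding; the result is a bytes
    object in target_encoding. Both names must be codecs known to Python.
    """
    # Decode to abstract characters first, then encode to the target.
    return data.decode(source_encoding).encode(target_encoding)


# Mac-Roman byte 0x8E is e-acute (U+00E9); as a 2-byte unit it is 00 E9.
wide = transcode(b"\x8e", "mac_roman", "utf-16-be")
```

The point of naming the source encoding explicitly is that the meaning of the character codes enters only here, at the caller's request, and never during implicit narrow/wide conversion.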
My proposal to extend the "old" string type to be able to contain wide
strings is of course largely unrelated to all this. Yet it may provide some
additional C compatibility (especially now that silent conversion to utf-8
is out) as well as a workaround for the
str()-having-to-return-a-narrow-string bottleneck.
Just