[I18n-sig] Unicode strings: an alternative

Just van Rossum just@letterror.com
Thu, 4 May 2000 22:22:38 +0100

(Boy, is it quiet here all of a sudden ;-)

Sorry for the duplication of stuff, but I'd like to reiterate my points, to
separate them from my implementation proposal, as that's just what it is:
an implementation detail.

These things are important to me:
- get rid of the Unicode-ness of wide strings, in order to
- make narrow and wide strings as similar as possible
- implicit conversion between narrow and wide strings should
  happen purely on the basis of the character codes; no
  assumption at all should be made about the encoding, i.e.
  what the character code _means_.
- downcasting from wide to narrow may raise OverflowError if
  there are characters in the wide string that are > 255
- str(s) should always return s if s is a string, whether narrow
  or wide
- file objects need to be responsible for handling wide strings
- the above two points should make it possible for
- if no encoding is known, Unicode is the default, whether
  narrow or wide
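
To make the conversion rule concrete, here is a minimal sketch in present-day
Python, where bytes stands in for a narrow string and str for a wide one. The
names widen() and narrow() are hypothetical, chosen only for illustration;
they are not part of any proposal or existing API. The point is that only the
character codes are copied, and no encoding is consulted.

```python
def widen(narrow_string: bytes) -> str:
    """Narrow -> wide: copy each character code unchanged."""
    return "".join(chr(code) for code in narrow_string)


def narrow(wide_string: str) -> bytes:
    """Wide -> narrow: copy each character code, failing on codes > 255."""
    codes = [ord(ch) for ch in wide_string]
    for code in codes:
        if code > 255:
            # The behaviour asked for above: downcasting may overflow.
            raise OverflowError("character code %d does not fit in a narrow string" % code)
    return bytes(codes)
```

With these definitions, narrow(widen(s)) gives back s for any narrow string,
while narrow() applied to a wide string containing a code above 255 raises
OverflowError.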

The above points seem to have the following consequences:
- the 'u' in \uXXXX notation no longer makes much sense,
  since it is not necessary for the character to be a Unicode
  code point: it's just a 2-byte int. \wXXXX might be an option.
- the u"" notation is no longer necessary: if a string literal
  contains a character > 255 the string should automatically
  become a wide string.
- narrow strings should also have an encode() method.
- the builtin unicode() function might be redundant if:
  - it is possible to specify a source encoding. I'm not sure
    whether this is best done through an extra argument for encode()
    or whether it should be a new method, e.g. transcode().
  - s.encode() or s.transcode() are allowed to output a wide
    string, as in aNarrowString.encode("UCS-2") and
    s.transcode("Mac-Roman", "UCS-2").
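
As a rough illustration of what a transcode() with an explicit source encoding
might do, here is a sketch in present-day Python. It is written as a free
function rather than a string method, and both the name and the signature are
assumptions. Python has no codec called "UCS-2", so "utf-16-be" stands in for
it here; the two agree for characters that fit in 16 bits.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Interpret data as `source`-encoded text and re-encode it as `target`."""
    return data.decode(source).encode(target)


# One Mac-Roman byte for e-acute becomes a single 2-byte unit.
mac_bytes = "\xe9".encode("mac_roman")
wide = transcode(mac_bytes, "mac_roman", "utf-16-be")
```

Unlike the implicit conversion, this is the one place where the meaning of the
character codes matters, which is why the encodings have to be named.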

My proposal to extend the "old" string type to be able to contain wide
strings is of course largely unrelated to all this. Yet it may provide some
additional C compatibility (especially now that silent conversion to utf-8
is out) as well as a workaround for the
str()-having-to-return-a-narrow-string bottleneck.