[I18n-sig] Unicode strings: an alternative

Guido van Rossum guido@python.org
Wed, 03 May 2000 17:22:59 -0400

> Today I had a relatively simple idea that unites wide strings and narrow
> strings in a way that is more backward compatible at the C level. It's quite
> possible this has already been considered and rejected for reasons that are
> not yet obvious to me, but I'll give it a shot anyway.
> The main concept is not to provide a new string type but to extend the
> existing string object like so:
> - wide strings are stored as if they were narrow strings, simply using two
> bytes for each Unicode character.
> - there's a flag that specifies whether the string is narrow or wide.
> - the ob_size field is the _physical_ length of the data; if the string is
> wide, len(s) will return ob_size/2, all other string operations will have
> to do similar things.
> - there can possibly be an encoding attribute which may specify the used
> encoding, if known.
> Admittedly, this is tricky and involves quite a bit of effort to implement,
> since all string methods need to have a narrow/wide switch. To make it worse,
> it hardly offers anything the current solution doesn't. However, it offers
> one IMHO _big_ advantage: C code that just passes strings along does not
> need to change: wide strings can be seen as narrow strings without any
> loss. This allows for __str__() & str() and friends to work with unicode
> strings without any change.

This seems to have some nice properties, but I think it would cause
problems for existing C code that tries to *interpret* the bytes of a
string: it could very well do the wrong thing for wide strings (since
old C code doesn't check for the "wide" flag).  I'm not sure how much
C code there is that merely passes strings along...  Most C code using
strings makes use of the strings (e.g. open() falls in this category
in my eyes).

--Guido van Rossum (home page: http://www.python.org/~guido/)