[I18n-sig] Unicode strings: an alternative

Wed, 3 May 2000 20:55:24 +0100

Today I had a relatively simple idea that unites wide strings and narrow
strings in a way that is more backward compatible at the C level. It's quite
possible this has already been considered and rejected for reasons that are
not yet obvious to me, but I'll give it a shot anyway.

The main concept is not to provide a new string type but to extend the
existing string object like so:
- wide strings are stored as if they were narrow strings, simply using two
bytes for each Unicode character.
- there's a flag that specifies whether the string is narrow or wide.
- the ob_size field is the _physical_ length of the data in bytes; if the
string is wide, len(s) will return ob_size/2, and all other string operations
will have to make the same adjustment.
- there could also be an encoding attribute that records the encoding used,
if known.
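
To make the layout above concrete, here is a rough Python model of it. This
is only a sketch, not how the real string object is written: the class name
FlaggedString, the from_text() helper and the choice of little-endian byte
order are all my own assumptions for illustration.

```python
class FlaggedString:
    """Hypothetical model of the proposed string object (not real CPython)."""

    def __init__(self, data, wide=False, encoding=None):
        if wide and len(data) % 2:
            raise ValueError("wide data must have an even number of bytes")
        self.data = bytes(data)    # raw storage, laid out like a narrow string
        self.wide = wide           # the narrow/wide flag
        self.encoding = encoding   # optional encoding attribute, None if unknown

    @property
    def ob_size(self):
        # Physical length in bytes, regardless of the flag.
        return len(self.data)

    def __len__(self):
        # Logical length: half the physical length for wide strings.
        return self.ob_size // 2 if self.wide else self.ob_size

    def __getitem__(self, index):
        # Every operation needs the same narrow/wide adjustment as len().
        if not isinstance(index, int):
            raise TypeError("this sketch supports integer indices only")
        length = len(self)
        if index < 0:
            index += length
        if not 0 <= index < length:
            raise IndexError("string index out of range")
        width = 2 if self.wide else 1
        start = index * width
        return FlaggedString(self.data[start:start + width],
                             self.wide, self.encoding)

    @classmethod
    def from_text(cls, text):
        # Assumption: exactly two bytes per character, little-endian,
        # so anything above U+FFFF is rejected in this sketch.
        if any(ord(ch) > 0xFFFF for ch in text):
            raise ValueError("character does not fit in two bytes")
        return cls(text.encode("utf-16-le"), wide=True)


narrow = FlaggedString(b"hello")
wide = FlaggedString.from_text("h\u00e9llo")
print(narrow.ob_size, len(narrow))
print(wide.ob_size, len(wide))
```

The point to notice is that ob_size and len() agree for the narrow string
and differ by a factor of two for the wide one.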

Admittedly, this is tricky and involves quite a bit of effort to implement,
since all string methods need a narrow/wide switch. To make matters worse,
it hardly offers anything the current solution doesn't. However, it offers
one IMHO _big_ advantage: C code that just passes strings along does not
need to change, because wide strings can be seen as narrow strings without
any loss. This allows __str__(), str() and friends to work with Unicode
strings without any change.
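
As a small illustration of the pass-through argument, here is a sketch in
Python. The function legacy_passthrough() is hypothetical and merely stands
in for C code that copies a string's buffer; it assumes such code copies
ob_size bytes rather than stopping at the first zero byte, and the
little-endian byte order is again my own assumption.

```python
def legacy_passthrough(raw):
    """Stand-in for C code that copies a buffer without interpreting it."""
    buf = bytearray(len(raw))   # allocate ob_size bytes
    buf[:] = raw                # plain byte copy, no knowledge of wide/narrow
    return bytes(buf)


text = "Gr\u00fc\u00dfe"
# Assumption: two bytes per character, little-endian.
wide_bytes = text.encode("utf-16-le")

copied = legacy_passthrough(wide_bytes)

# The copy is byte-for-byte identical, so nothing is lost in transit.
print(copied == wide_bytes)
print(copied.decode("utf-16-le") == text)
```

The copying code never looks at the flag; only the code at either end, which
knows the string is wide, has to interpret the bytes.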

Any thoughts?

Just