[Python-Dev] unifying str and unicode

Mon Oct 3 19:39:57 CEST 2005

Hi,

Josiah:
> > How can you be sure that something that is /semantically textual/ will
> > always remain "pure ASCII" ? That's contradictory, unless your software
> > never goes out of the anglo-saxon world (and even...).
> 
> Non-unicode text input widgets.

You misunderstood my statement.
I didn't mean:
  - how can you /technically enforce/ no unicode text at all
but:
  - how can you be sure that your users will never /want/ to enter some
    text that can't be represented with the current 8-bit charset?

Of course the answer to the latter is: you can't.

Fredrik:
> Under the default encoding (and quite a few other encodings), that's true for
> plain ascii strings and Unicode strings.

If I have a unicode string containing legal characters above 0x7F,
and I pass it to a function which converts it to str, the
conversion fails.

If I have an 8-bit string containing legal non-ascii characters
(for example the name of a file as returned by the filesystem, which I
of course have no control over), and I give it to a function which
does an implicit conversion to unicode, the conversion fails.

Here is an example so that you really understand. I am using a French
locale (iso-8859-15); let's just try to enter a French word and see what
happens when converting to unicode:

-> As a string constant:

>>> s = "été"
>>> s
'\xe9t\xe9'
>>> u = unicode(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)

-> By asking for input:

>>> s = raw_input()
été
>>> s
'\xe9t\xe9'
>>> unicode(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)

It should work, but it fails miserably.
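
To make both failure directions concrete, here is a minimal sketch. It is
written for a present-day interpreter, where I assume bytes stands in for
the 8-bit str and str stands in for unicode; the explicit 'ascii' codec
plays the role of the default encoding seen in the session above. It is an
illustration under those assumptions, not a transcript of the 2.x behaviour.

```python
# Assumption: bytes plays the role of the 8-bit str, str the role of unicode.
raw = b"\xe9t\xe9"  # the French word from the session, as iso-8859-15 bytes

# Decoding with the ascii codec (standing in for the default encoding) fails.
try:
    raw.decode("ascii")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Naming the real charset explicitly succeeds.
text = raw.decode("iso-8859-15")

# The opposite direction fails in the same way with the ascii codec.
try:
    text.encode("ascii")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# With the right charset, the round trip is lossless.
round_trip = text.encode("iso-8859-15")
```

Nothing in the data itself says which charset applies; the program has to
know it from somewhere else, which is exactly the burden I am describing.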

In the current situation, if the programmer doesn't carefully plan for
these cases by manually managing conversions (which of course he can do,
but it's boring and bothersome, not to mention that many programmers
do not even understand the issue!), some users will see the program die
with a nasty exception, just because they happen to need a bit more than
the plain Latin alphabet without diacritics.
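
For what "manually managing conversions" amounts to, here is a rough sketch
of the kind of boundary helpers a programmer ends up writing. The names
to_text and to_bytes are hypothetical, the iso-8859-15 default is only an
assumption matching my locale, and bytes / str again stand in for the 8-bit
str and unicode on a present-day interpreter.

```python
# Hypothetical helpers, not part of any library.
# Assumption: iso-8859-15 is the charset of the external data.

def to_text(value, encoding="iso-8859-15"):
    """Return text, decoding byte strings with an explicit charset."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="iso-8859-15"):
    """Return a byte string, encoding text with an explicit charset."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


# Decode once on the way in, work on text, encode once on the way out.
name = to_text(b"\xe9t\xe9")
label = name.upper()
out = to_bytes(label)
```

Every input and output point of the program needs this treatment, and
forgetting a single one brings the exception back.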

(even the standard Python library is bitten: witness the weird
getcwd() / getcwdu() pair...)

I find it surprising that you claim there is no difficulty when
everything points to the contrary. See for example how often confused
developers ask for help on mailing lists...

Regards

Antoine.