Re: [I18n-sig] Re: [Python-Dev] Re: Unicode debate

29 Apr 2000

...
[GvR, on string.encoding]
...
Marc-Andre took this idea a bit further, but I think it's not
practical given the current implementation: there are too many places
where the C code would have to be changed in order to propagate the
string encoding information,
[JvR]
...
I may be missing something, but the encoding attr just travels with the string
object, no? Like I said in my reply to MAL, I think it's undesirable to do
*anything* with the encoding attr if not in combination with a unicode
string.
But just propagating affects every string op -- s+s, s*n, s[i], s[:],
s.strip(), s.split(), s.lower(), ...
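A minimal sketch of the propagation problem, written in modern Python 3
rather than the interpreter under discussion. TaggedBytes and its
encoding attribute are hypothetical names invented for illustration;
nothing like them exists in the standard library. Only concatenation is
overridden here, so every other operation silently drops the tag:

```python
class TaggedBytes(bytes):
    """Hypothetical 8-bit string that carries an encoding tag."""

    def __new__(cls, data, encoding="latin-1"):
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def __add__(self, other):
        # Without this override, s + s would return a plain bytes object.
        return TaggedBytes(super().__add__(other), self.encoding)


s = TaggedBytes(b"caf\xe9", "latin-1")

joined = s + s        # overridden: the tag survives
sliced = s[:2]        # not overridden: plain bytes, tag lost
upper = s.upper()     # not overridden: plain bytes, tag lost

print(type(joined).__name__, joined.encoding)
print(type(sliced).__name__, hasattr(sliced, "encoding"))
print(type(upper).__name__, hasattr(upper, "encoding"))
```

Keeping the tag alive would mean giving the same treatment to slicing,
indexing, repetition and every string method, which is the cost being
pointed at here.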
...
...
and there are too many sources of strings
with unknown encodings to make it very useful.
That's why the default encoding must be settable as well, as Fredrik
suggested.
I'm open to debate about this.  There's just something about a
changeable global default encoding that worries me -- like any global
property, it requires conventions and defensive programming to make
things work in larger programs.  For example, a module that deals with
Latin-1 strings can't just set the default encoding to Latin-1: it
might be imported by a program that needs it to be UTF-8.  This model
is currently used by the locale in C, where all locale properties are
global, and it doesn't work well.  For example, Python needs to jump
through a lot of hoops so that Python numeric literals use "." for the
decimal indicator even if the user's locale specifies "," -- we can't
change Python to swap the meaning of "." and "," in all contexts.

So I think that a changeable default encoding is of limited value.
That's different from being able to set the *source file* encoding --
this only affects Unicode string literals.
...
...
Plus, it would slow down 8-bit string ops.
Not if you ignore it most of the time, and just pass it along when
concatenating.
And slicing, and indexing, and...
...
...
I have a better idea: rather than carrying around 8-bit strings with
an encoding, use Unicode literals in your source code.
Explain that to newbies... My guess is that they will want simple 8-bit
strings in their native encoding. Dunno.
If they are happy with their native 8-bit encoding, there's no need
for them to ever use Unicode objects in their program, so they should
be fine.  8-bit strings aren't ever interpreted or encoded except when
mixed with Unicode objects.
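A small illustration, assuming Python 3, where bytes plays the role of
the 8-bit string. One difference from the design under discussion is
worth stating: there, mixing the two types triggers an implicit
conversion, whereas Python 3 raises TypeError and asks for an explicit
decode. Either way, the mix is the only point where an encoding matters:

```python
raw = b"caf\xe9"      # Latin-1 bytes, but Python is never told that

# Pure 8-bit operations work byte by byte; no codec is consulted.
doubled = raw + raw
head = raw[:3]

# Mixing with a Unicode string is where the encoding question comes up.
try:
    raw + "!"
    mixed_ok = True
except TypeError:
    mixed_ok = False

# An explicit decode states which encoding the bytes are in.
text = raw.decode("latin-1") + "!"

print(doubled, head, mixed_ok, repr(text))
```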
...
...
If the source
encoding is known, these will be converted using the appropriate
codec.
If you object to having to write u"..." all the time, we could say
that "..." is a Unicode literal if it contains any characters with the
top bit on (of course the source file encoding would be used just like
for u"...").
Only if "\377" would still yield an 8-bit string, for binary goop...
Correct.
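To show why the escape needs to stay 8-bit, here is a short Python 3
sketch in which a bytes literal stands in for the 8-bit string. The
variable names are illustrative only. The byte written as "\377" has no
single meaning as text, so it is best left uninterpreted:

```python
goop = b"\377"        # octal escape for the byte value 255

# As raw data it is simply one byte.
size = len(goop)
value = goop[0]

# As text, its meaning depends entirely on the codec chosen.
as_latin1 = goop.decode("latin-1")    # U+00FF
try:
    goop.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False                # 0xFF never occurs in valid UTF-8

print(size, value, repr(as_latin1), valid_utf8)
```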

--Guido van Rossum (home page: http://www.python.org/~guido/)