
[GvR, on string.encoding]
Marc-Andre took this idea a bit further, but I think it's not practical given the current implementation: there are too many places where the C code would have to be changed in order to propagate the string encoding information.
[JvR]
I may be missing something, but the encoding attr just travels with the string object, no? Like I said in my reply to MAL, I think it's undesirable to do *anything* with the encoding attr except in combination with a unicode string.
But just propagating affects every string op -- s+s, s*n, s[i], s[:], s.strip(), s.split(), s.lower(), ...
and there are too many sources of strings with unknown encodings to make it very useful.
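The propagation burden being described can be sketched concretely. This is a hypothetical illustration (the `EncodedStr` class and its names are mine, not anything proposed in the thread): a str subclass carrying an encoding attribute must override every operation that returns a new string, or the attribute silently disappears.

```python
# Hypothetical sketch: a str subclass carrying an "encoding" attribute.
# Every operation that builds a new string needs an override just to
# keep the attribute alive -- s+s, s[i], s[:], s.lower(), and so on.

class EncodedStr(str):
    def __new__(cls, value, encoding="ascii"):
        self = super().__new__(cls, value)
        self.encoding = encoding
        return self

    def _wrap(self, value):
        # Re-attach the encoding to any derived string.
        return EncodedStr(value, self.encoding)

    def __add__(self, other):
        return self._wrap(str.__add__(self, other))

    def __getitem__(self, index):  # covers both s[i] and s[i:j]
        return self._wrap(str.__getitem__(self, index))

    def lower(self):
        return self._wrap(str.lower(self))
    # ...and likewise for __mul__, strip(), split(), upper(), join(), ...

s = EncodedStr("Hello", encoding="latin-1")
t = s[1:4].lower()
print(t, t.encoding)  # ell latin-1
```

Any method left un-overridden (and there are dozens on str) would return a plain string with the encoding lost, which is the "every string op" problem in a nutshell.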
That's why the default encoding must be settable as well, as Fredrik suggested.
I'm open for debate about this. There's just something about a changeable global default encoding that worries me -- like any global property, it requires conventions and defensive programming to make things work in larger programs. For example, a module that deals with Latin-1 strings can't just set the default encoding to Latin-1: it might be imported by a program that needs it to be UTF-8. This model is currently used by the locale in C, where all locale properties are global, and it doesn't work well. For example, Python needs to go through a lot of hoops so that Python numeric literals use "." for the decimal indicator even if the user's locale specifies "," -- we can't change Python to swap the meaning of "." and "," in all contexts. So I think that a changeable default encoding is of limited value. That's different from being able to set the *source file* encoding -- this only affects Unicode string literals.
Plus, it would slow down 8-bit string ops.
Not if you ignore it most of the time, and just pass it along when concatenating.
And slicing, and indexing, and...
I have a better idea: rather than carrying around 8-bit strings with an encoding, use Unicode literals in your source code.
Explain that to newbies... My guess is that they will want simple 8-bit strings in their native encoding. Dunno.
If they are happy with their native 8-bit encoding, there's no need for them to ever use Unicode objects in their program, so they should be fine. 8-bit strings aren't ever interpreted or encoded except when mixed with Unicode objects.
If the source encoding is known, these will be converted using the appropriate codec.
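The rule being described (interpret an 8-bit string only at the point where it must be combined with Unicode, using the known encoding to select the codec) can be illustrated with today's `codecs` module; the byte values here are just an example:

```python
# Sketch of the coercion rule: an 8-bit (byte) string is left alone
# until it has to become Unicode, and only then is the appropriate
# codec applied, based on the known source encoding.
import codecs

raw = b"caf\xe9"                  # 8-bit string: Latin-1 bytes for "café"
decoder = codecs.getdecoder("latin-1")
text, consumed = decoder(raw)     # decode only at the mixing point
print(text)                       # café
```

Until that decode happens, `raw` is just uninterpreted bytes, which is exactly why pure 8-bit programs never pay for Unicode.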
If you object to having to write u"..." all the time, we could say that "..." is a Unicode literal if it contains any characters with the top bit on (of course the source file encoding would be used just like for u"...").
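The proposed "top bit on" test is simple to state in code; this small helper is hypothetical (my name, not anything from the thread), and its last case shows exactly the binary-data objection raised in the next reply:

```python
# Hypothetical check for the proposed rule: treat "..." as a Unicode
# literal only if the literal contains a byte with the top bit set.
def has_top_bit(raw: bytes) -> bool:
    return any(b >= 0x80 for b in raw)

print(has_top_bit(b"hello"))    # False -> stays a plain 8-bit string
print(has_top_bit(b"caf\xe9"))  # True  -> would become a Unicode literal
print(has_top_bit(b"\377"))     # True  -- yet "\377" must stay 8-bit for binary goop
```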
Only if "\377" would still yield an 8-bit string, for binary goop...
Correct. --Guido van Rossum (home page: http://www.python.org/~guido/)