[Python-3000] Unicode and OS strings
"Martin v. Löwis"
martin at v.loewis.de
Thu Sep 13 19:08:40 CEST 2007
> Yes, I have noticed this too. Environment variables, command line
> arguments, locale properties, TZ names, and so on, are often given as
> 8-bit strings in who knows what encoding. I'm not sure what the
> solution is, but we need one.
One "universal" solution is to use Unicode private-use-area
characters. We could come up with some error handler which replaces
undecodable characters with a PUA character; on encoding, the
same error handler encodes the PUA characters again as bytes.
We would need a block of 256 PUA characters for that.
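Roughly like this (just a sketch; the block at U+F700..U+F7FF and the
name 'pua-escape' are placeholders I made up for illustration):

import codecs

PUA_BASE = 0xF700   # assumed: an otherwise unused block of 256 PUA code points

def pua_escape(exc):
    if isinstance(exc, UnicodeDecodeError):
        # Replace each undecodable byte with one PUA character.
        bad = exc.object[exc.start:exc.end]
        return ''.join(chr(PUA_BASE + b) for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        # Turn escaped PUA characters back into the original bytes.
        bad = exc.object[exc.start:exc.end]
        if all(PUA_BASE <= ord(c) < PUA_BASE + 0x100 for c in bad):
            return bytes(ord(c) - PUA_BASE for c in bad), exc.end
    raise exc

codecs.register_error('pua-escape', pua_escape)

raw = b'caf\xe9'                                  # not valid ASCII (or UTF-8)
text = raw.decode('ascii', 'pua-escape')          # 'caf\uf7e9'
assert text.encode('ascii', 'pua-escape') == raw  # round-trips

Note that the round trip only works here because ASCII cannot encode
the PUA characters itself; a plain UTF-8 codec would happily encode
them, which is exactly the ambiguity below.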
Of course, if the input data already contains PUA characters,
there would be an ambiguity. We can rule this out for most codecs,
as they cannot produce PUA characters at all. The major exception
is UTF-8, for which we would need to create a UTF-8-noPUA codec,
to be used at all system interfaces that should speak UTF-8 but
might carry arbitrary bytes.
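A crude approximation of that noPUA codec, assuming the same reserved
block as in the sketch above; a genuine PUA character from that block
gets escaped byte-by-byte, so every reserved code point ends up
standing for exactly one original byte:

PUA_BASE = 0xF700                # must match the error handler's block
RESERVED = range(PUA_BASE, PUA_BASE + 0x100)

def _escape_reserved(text):
    # A genuine PUA character from the reserved block is escaped byte-by-byte,
    # so it cannot be mistaken for an escaped undecodable byte later on.
    return ''.join(
        ''.join(chr(PUA_BASE + b) for b in ch.encode('utf-8'))
        if ord(ch) in RESERVED else ch
        for ch in text)

def utf8_nopua_decode(data):
    out, pos = [], 0
    while pos < len(data):
        try:
            out.append(_escape_reserved(data[pos:].decode('utf-8')))
            break
        except UnicodeDecodeError as exc:
            # Keep the valid prefix, escape the offending bytes, carry on.
            good = data[pos:pos + exc.start].decode('utf-8')
            bad = data[pos + exc.start:pos + exc.end]
            out.append(_escape_reserved(good) +
                       ''.join(chr(PUA_BASE + b) for b in bad))
            pos += exc.end
    return ''.join(out)

def utf8_nopua_encode(text):
    # Every reserved code point maps back to one byte; the rest is plain UTF-8.
    return b''.join(
        bytes([ord(ch) - PUA_BASE]) if ord(ch) in RESERVED
        else ch.encode('utf-8')
        for ch in text)

for raw in (b'\xff\xfe', b'caf\xc3\xa9', b'\xef\x9f\xbf'):
    assert utf8_nopua_encode(utf8_nopua_decode(raw)) == raw

A real implementation would of course be a proper codec (with
incremental and stream variants); this only shows the mapping.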
We would make a list of all interfaces that use the PUA error
handler: file names, environment variables, command line
arguments.
> I'm guessing one thing we need to do is
> research how various systems decide what encoding to use. Even on OSX,
> I managed to create an environment variable containing non-ASCII
> non-UTF-8 bytes.
Unix-ish systems just don't decide; they pass the bytes on to the
application. When displaying them, they show things like question
marks. At the API level, it's just a null-terminated char*.
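For illustration (POSIX only, and just a sketch using the bytes-level
environment access):

import os, subprocess, sys

env = dict(os.environb)            # raw bytes view of the environment
env[b'GREETING'] = b'caf\xe9'      # Latin-1 bytes, not valid UTF-8

subprocess.run(
    [sys.executable, '-c', "import os; print(os.environb[b'GREETING'])"],
    env=env)
# prints b'caf\xe9' -- the bytes arrive untouched, no encoding attached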
> I believe Tcl/Tk used to have some kind of heuristic where they would
> try UTF-8 first and if that failed used Latin-1 for the bytes that
> aren't valid UTF-8, but I'm not at all sure that that's the right
> solution in places where Latin-1 is not spoken.
Indeed not - here lies mojibake.
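A toy version of that heuristic, fed bytes from an assumed EUC-JP
locale, shows where it goes wrong:

def guess_decode(data):
    # The Tcl-style fallback: try UTF-8 first, otherwise pretend it is Latin-1.
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return data.decode('latin-1')   # never fails, but may be nonsense

raw = '日本語'.encode('euc-jp')   # b'\xc6\xfc\xcb\xdc\xb8\xec'
print(guess_decode(raw))          # 'ÆüËÜ¸ì' -- classic mojibake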
Regards,
Martin