[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Sat Apr 25 11:00:24 CEST 2009

2009/4/25 James Y Knight <foom at fuhm.net>:
> On Apr 24, 2009, at 6:05 PM, Paul Moore wrote:
>>
>> - Windows systems where broken Unicode (lone surrogates or whatever)
>> isn't involved
>> - Unix systems where the user's stated filesystem encoding is correct
>>
>> Can you honestly say that this isn't the vast majority of real-world
>> environments? (IIRC, you are based in Japan, so it may well be true
>> that the likelihood of problems is a lot higher where you are than
>> where I am - the UK - but I suspect that averaging out, things are
>> generally as above).
>
> In my experience, it is normal on most unix systems that some programs
> (mostly daemons) are running in default "POSIX" locale, others (most user
> programs) are running in the "en_US.utf-8" locale, and some luddite users
> have set themselves to "en_US.8859-1". All running on the same system.

OK, thanks for the data point.

Following on from that, would this (under Martin's proposal) result in
programs receiving encoded strings, or just semantically-incorrect
ones?

Specifically, the 8859-1 case cannot result in encoded strings, as
8859-1 can decode every byte string (possibly garbled, but at least
validly). The utf8 case can hit unrepresentable bytes, but only if
there are bytes greater than 0x7F in filenames. Is the "POSIX" case
ASCII? If so, then the same logic applies (any byte >= 0x80 is
unrepresentable).
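
To make the three cases concrete, here is a rough sketch using the
"surrogateescape" error handler that Python 3.1 and later provide for
this PEP. The filename "café" and the helper names decode_filename and
has_escaped_bytes are made up for illustration, and the sketch assumes
the "POSIX" locale really does mean ASCII, which is exactly the open
question above.

```python
# Hypothetical filename "café", as written to disk under two different locales.
latin1_bytes = "caf\u00e9".encode("iso-8859-1")  # b"caf\xe9"
utf8_bytes = "caf\u00e9".encode("utf-8")         # b"caf\xc3\xa9"


def decode_filename(data: bytes, encoding: str) -> str:
    """Decode the way PEP 383 proposes: each undecodable byte 0xNN
    becomes the lone surrogate U+DCNN instead of raising an error."""
    return data.decode(encoding, errors="surrogateescape")


def has_escaped_bytes(name: str) -> bool:
    """True if the string carries escaped bytes (U+DC80..U+DCFF)."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in name)


# Read the 8859-1 filename from processes running in each locale.
# "ascii" stands in for the POSIX locale (an assumption, see above).
for encoding in ("iso-8859-1", "utf-8", "ascii"):
    name = decode_filename(latin1_bytes, encoding)
    print(encoding, ascii(name), has_escaped_bytes(name))
    # The original bytes can always be recovered with the same handler.
    assert name.encode(encoding, errors="surrogateescape") == latin1_bytes

# The reverse mismatch: a UTF-8 filename read in an 8859-1 locale decodes
# without any escapes, but the result is not the name the user typed.
garbled = decode_filename(utf8_bytes, "iso-8859-1")
print(ascii(garbled), has_escaped_bytes(garbled))
```

In this sketch only the utf-8 and ascii decodes produce escaped
bytes. The 8859-1 decode never does, which is the "semantically
incorrect but valid" outcome.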

So, the next question is - do people on such systems frequently use
high-bit characters in filenames?

Paul.

PS Unfortunately, I suspect that the biggest group of people likely to
be hit badly by this is people using non-latin scripts. And arguing
probabilities without real data is optimistic at best. But those
people are also the *least* likely people to contribute on an
English-speaking list, I guess :-( (Sincere apologies if everyone but
me on this list happens to actually be fluent English-speaking
Russians :-))