[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Sat Apr 25 17:00:17 CEST 2009

> OK, looks like my analysis matches yours, except that I wasn't sure if
> the third case (a string that "likely wasn't intended") could result
> in exceptions. From what you're saying, it sounds like it would
> actually be similar to the second case - I'm not clear on how
> surrogates work, though.

Decoding is guaranteed to succeed. The result is also guaranteed to
re-encode successfully, and to yield the same byte string.
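
As a minimal sketch of that round trip: I'm assuming here that the
handler is available under the name "surrogateescape" (the name it
carries in released Python 3; this message calls it python-escape),
and the file name is made up for illustration.

```python
# Assumption: "surrogateescape" is the released name of the handler
# discussed in this thread; the byte string is a made-up file name.
raw = b"caf\xe9.txt"  # latin-1 bytes, not valid UTF-8

# Decoding never fails: the undecodable byte 0xE9 becomes U+DCE9.
name = raw.decode("utf-8", "surrogateescape")

# Re-encoding with the same handler gives back the original bytes.
round_tripped = name.encode("utf-8", "surrogateescape")

print(ascii(name))
print(round_tripped == raw)
```

The decoded string is only a carrier for the original bytes; it is
not meant to be displayed as-is.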

If you pass any other string to the encoder, you may still get
exceptions. For example, if the filesystem encoding is latin-1,
passing u"\u20ac" will continue to raise an exception, even under the
python-escape error handler - that error handler will only handle
surrogates.
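
A small sketch of the difference, again assuming the handler is
spelled "surrogateescape" as in released Python 3 rather than
python-escape:

```python
# Assumption: "surrogateescape" stands in for the "python-escape"
# handler named in this message.

# An escaped byte (lone low surrogate U+DCE9) is handled.
handled = "caf\udce9".encode("latin-1", "surrogateescape")

# The euro sign is an ordinary character that latin-1 cannot
# represent, so the handler does not help and the error propagates.
try:
    "\u20ac".encode("latin-1", "surrogateescape")
except UnicodeEncodeError as exc:
    error = exc
else:
    error = None

print(handled)
print(type(error).__name__)
```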

There isn't really that much trickery to surrogates. They *have*
to come in pairs to be meaningful, with the first one in the range
D800..DBFF (high surrogate), and the second in the range DC00..DFFF
(low surrogate). A lone low surrogate is not meaningful; the escaping
exploits this by representing each undecodable byte as a lone low
surrogate.
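
To illustrate, here is a sketch of what a single escaped byte looks
like. It assumes the released handler name "surrogateescape" and the
byte-to-code mapping that handler uses (byte 0xNN becomes U+DCNN):

```python
# Assumption: "surrogateescape" is the released name of the handler,
# and it maps an undecodable byte 0xNN to the code point U+DCNN.
escaped = b"\xff".decode("ascii", "surrogateescape")
code = ord(escaped)

# The result is one code in the low-surrogate range DC00..DFFF.
is_low_surrogate = 0xDC00 <= code <= 0xDFFF

# A strict encoder rejects it, because a lone surrogate is not a
# character.
try:
    escaped.encode("utf-8")
except UnicodeEncodeError:
    strict_encode_ok = False
else:
    strict_encode_ok = True

print(hex(code), is_low_surrogate, strict_encode_ok)
```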

Proper surrogate pairs encode characters outside the BMP, for use with
UTF-16: each code contributes 10 bits (just count how many codes there
are in D800..DBFF, or in DC00..DFFF: 1024 each). Together, a pair
encodes 20 bits, allowing for 2**20 characters, starting at U+10000.
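
The arithmetic can be sketched as follows; the function names are
mine, purely for illustration, and the check against the utf-16-be
codec is just a sanity test.

```python
def to_surrogates(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low)."""
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits
    return high, low


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))

# Sanity check against Python's own UTF-16 encoder.
expected = "\U0001F600".encode("utf-16-be")
print(expected == high.to_bytes(2, "big") + low.to_bytes(2, "big"))
```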

>> When they find that the files they created are inaccessible to others,
>> they will often stop using funny characters.
> 
> Which sounds fairly practical - and the irony of someone with a "funny
> character" in his surname telling me this hasn't escaped me :-)

Sure: my Unix account name was always "loewis", and even on Windows,
our admins didn't dare to put the umlaut into the account name - it
would be difficult to log in with a US keyboard, for example. People
who use non-ASCII characters in filenames around here are primarily
non-IT people who aren't aware that these characters are different
from the rest.

I recognize that for other languages (without trivial transliterations)
the problem is more severe, and people are more likely to create
files with Cyrillic, or Japanese, names (say) if the system accepts
them at all.

Regards,
Martin