[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

"Martin v. Löwis" martin at v.loewis.de
Tue Apr 28 19:00:37 CEST 2009

> Since the serialization of the Unicode string is likely to use UTF-8,
> and the string for  such a file will include half surrogates, the
> application may raise an exception when encoding the names for a
> configuration file. These encoding exceptions will be as rare as the
> unusual names (which the careful I18N aware developer has probably
> eradicated from his system), and thus will appear late.

There are trade-offs to any solution; if there was a solution without
trade-offs, it would be implemented already.

The Python UTF-8 codec will happily encode half-surrogates; people argue
that it is a bug that it does so, however, it would help in this
specific case.

An alternative that doesn't suffer from the risk of not being able to
store decoded strings would have been the use of PUA characters, but
people rejected it because of the potential ambiguities. So they clearly
dislike one risk more than the other. UTF-8b is primarily meant as
an in-memory representation.

> Or say de/serialization succeeds. Since the resulting Unicode string
> differs depending on the encoding (which is a good thing; it is
> supposed to make most cases mostly readable), when the filesystem
> encoding changes (say from legacy to UTF-8), the "name" changes, and
> deserialized references to it become stale.

That problem has nothing to do with the PEP. If the encoding changes,
LRU entries may get stale even if there were no encoding errors at
all. Suppose the old encoding was Latin-1, and the new encoding is
KOI8-R, then all file names are decodable before and afterwards, yet
the string representation changes. Applications that want to protect
themselves against that happening need to store byte representations
of the file names, not character representations. Depending on the
configuration file format, that may or may not be possible.

I find the case pretty artificial, though: if the locale encoding
changes, all file names will look incorrect to the user, so he'll
quickly switch back, or rename all the files. As an application
supporting a LRU list, I would remove/hide all entries that don't
correlate to existing files - after all, the user may have as well
deleted the file in the LRU list.


More information about the Python-Dev mailing list