[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Tue Apr 28 16:00:50 CEST 2009

On Mon, Apr 27, 2009 at 23:43, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Nobody said we were at the stage of *saving* the [attachment]!

But speaking of saving files, I think that's the biggest hole in this
proposal, and it has been nagging at the back of my mind. This PEP
intends to allow easy access to filenames and other environment
strings which are not restricted to known encodings. What happens if
the detected encoding changes? There may be difficulties serializing
and deserializing these names, such as for a most-recently-used (MRU)
list.

Since the serialization of the Unicode string is likely to use UTF-8,
and the string for such a file will include half surrogates, the
application may raise an exception when encoding the names for a
configuration file. These encoding exceptions will be as rare as the
unusual names (which the careful I18N-aware developer has probably
eradicated from his own system), and thus will surface late, most
likely on a user's machine rather than in testing.
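
A minimal sketch of that failure, assuming the PEP's error handler is
spelled "surrogateescape" (the name used in released Python 3) and
using a made-up Latin-1 filename:

```python
# Hypothetical filename: Latin-1 bytes that are not valid UTF-8.
raw = b"caf\xe9.txt"

# Under a UTF-8 filesystem encoding, the undecodable byte 0xE9
# becomes the half surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")

# A config writer that encodes with plain (strict) UTF-8 fails.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Only the same handler turns the string back into the original bytes.
roundtrip = name.encode("utf-8", "surrogateescape")
```

The strict encode only fails for names that actually contain an
escaped byte, which is why the problem stays hidden on a clean system.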

Or say de/serialization succeeds. Since the resulting Unicode string
differs depending on the encoding (which is a good thing; it is
supposed to make most cases mostly readable), when the filesystem
encoding changes (say from legacy to UTF-8), the "name" changes, and
deserialized references to it become stale.
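
A rough illustration of the stale reference, again assuming the
handler name "surrogateescape" and a hypothetical filename; the two
encodings stand in for the "before" and "after" locale:

```python
# The bytes of one file on disk (hypothetical, written under Latin-1).
raw = b"caf\xe9.txt"

# The string the application saw, and saved, under the legacy locale.
saved = raw.decode("iso-8859-1", "surrogateescape")

# The string the same file gets once the locale has become UTF-8.
current = raw.decode("utf-8", "surrogateescape")

# Using the saved string under the new locale names different bytes,
# so the saved reference no longer points at the file on disk.
stale_bytes = saved.encode("utf-8", "surrogateescape")
```

Here saved and current differ even though the file on disk never
changed, and stale_bytes is not equal to raw.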

This can probably be handled through careful use of the same
encoding/decoding scheme when serializing, where relevant, but that
sounds like we've just moved the problem from fs/environment access to
serialization. Is that good enough? For the other uses, the API knows
that it is handling environment data, but serialization code probably
will not. Should this PEP make recommendations about how to save
filenames in configuration files?
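
One possible shape for such a recommendation, purely as a sketch: save
the original bytes rather than the decoded string. The helper names
dump_name and load_name are hypothetical, the handler name
"surrogateescape" is the one from released Python 3, and base64 is
just one ASCII-safe way to store arbitrary bytes in a text file.

```python
import base64

def dump_name(name, fs_encoding):
    """Recover the on-disk bytes of a name and make them ASCII-safe."""
    raw = name.encode(fs_encoding, "surrogateescape")
    return base64.b64encode(raw).decode("ascii")

def load_name(text, fs_encoding):
    """Decode saved bytes with whatever encoding is in effect now."""
    raw = base64.b64decode(text.encode("ascii"))
    return raw.decode(fs_encoding, "surrogateescape")

# Saved under a legacy locale, loaded after a switch to UTF-8.
raw = b"caf\xe9.txt"
old_name = raw.decode("iso-8859-1", "surrogateescape")
saved = dump_name(old_name, "iso-8859-1")
new_name = load_name(saved, "utf-8")
```

The loaded string differs from the one that was saved, but it encodes
back to the same bytes, so it still refers to the same file. The cost
is that the configuration file is no longer human-readable, and the
writer has to know which encoding produced the string in the first
place.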

-- 
Michael Urman