Python 3 encoding question: Read a filename from stdin, subsequently open that filename

Nobody nobody at nowhere.com
Wed Dec 1 03:43:51 EST 2010


On Wed, 01 Dec 2010 02:14:09 +0000, MRAB wrote:

> If the filenames are to be shown to a user then there needs to be a
> mapping between bytes and glyphs. That's an encoding. If different
> users use different encodings then exchange of textual data becomes
> difficult.

OTOH, the exchange of binary data is unaffected. In the worst case, users
see a few wrong glyphs, but the software doesn't care.
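
For the case in the subject line, a minimal sketch of treating the filename as binary data end to end, so that no encoding is ever consulted. The helper names are made up for illustration, and a BytesIO stands in for sys.stdin.buffer; it assumes one newline-terminated filename per line.

```python
import io
import os
import tempfile


def read_filename_bytes(stream):
    """Read one newline-terminated filename from a binary stream, as raw bytes."""
    return stream.readline().rstrip(b"\n")


def read_file_named_by(stream):
    """Open the file named on `stream` without ever decoding the name."""
    name = read_filename_bytes(stream)
    with open(name, "rb") as f:  # open() accepts a bytes path
        return f.read()


# Demo: create a temporary file, then "type" its name on a fake stdin.
fd, path = tempfile.mkstemp()
os.write(fd, b"payload")
os.close(fd)
demo_result = read_file_named_by(io.BytesIO(os.fsencode(path) + b"\n"))
os.remove(path)
```

In a real script the stream would be sys.stdin.buffer rather than a BytesIO. The name only needs decoding if it has to be displayed, and a wrong guess there costs a few wrong glyphs, not a failed open().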

> That's where encodings which can be used globally come in.
> By the time Python 4 is released I'd be surprised if Unix hadn't
> standardised on a single encoding like UTF-8.

That's probably not a serious option in parts of the world which don't use
a Latin-based alphabet, i.e. outside western Europe and its former
colonies. In countries with non-Latin alphabets, existing encodings are
often too heavily entrenched.

There's also a lot of legacy software which can only handle single-byte
encodings, and not much incentive to fix it if 98% of your market can get
by with an ISO-8859-<whatever> locale (making software work in e.g. CJK
locales often requires a lot more work than just dealing with encodings).

And it doesn't help that Windows has negligible support for UTF-8. It's
either UTF-16-LE (i.e. the in-memory format dumped directly to file) or
one of Microsoft's non-standard encodings. At least the latter are mostly
compatible with the corresponding ISO-8859-* encoding.
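
To make "mostly compatible" concrete, here is a small check using Python's own codecs, taking cp1252 and ISO-8859-1 as one example pair (my choice, other pairs may differ in detail). As far as Python's codec tables go, the two agree on ASCII and on 0xA0-0xFF and differ in the 0x80-0x9F block.

```python
# Bytes 0x00-0x7F and 0xA0-0xFF: compare how the two codecs decode them.
ascii_range = bytes(range(0x80))
upper_range = bytes(range(0xA0, 0x100))

same_ascii = ascii_range.decode("cp1252") == ascii_range.decode("latin-1")
same_upper = upper_range.decode("cp1252") == upper_range.decode("latin-1")

# Byte 0x80 is one place where they disagree: cp1252 puts the euro sign
# there, while ISO-8859-1 maps it to the C1 control character U+0080.
byte_80_cp1252 = b"\x80".decode("cp1252")
byte_80_latin1 = b"\x80".decode("latin-1")

# UTF-16-LE is simply each 16-bit code unit written low byte first.
utf16le_a = "A".encode("utf-16-le")
```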

Finally, ISO-8859-* encoding/decoding can't fail. The result might
be complete gibberish, but converting to gibberish then back to bytes
won't lose information.
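
A quick illustration of the round trip, using ISO-8859-1 only. Python's codecs for some of the other ISO-8859 parts leave a few byte values unmapped, I believe, so this sketch demonstrates the point for Latin-1 rather than for every part of the family.

```python
every_byte = bytes(range(256))

# Decoding as ISO-8859-1 maps byte N to code point N, so it never fails.
text = every_byte.decode("iso-8859-1")

# Encoding the (possibly gibberish) text gives back the original bytes.
round_trip = text.encode("iso-8859-1")

# For contrast, the same bytes are not valid UTF-8.
try:
    every_byte.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```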



