what encoding is this? How can I tell? How can I translate?

Martin von Loewis loewis at informatik.hu-berlin.de
Tue Sep 25 11:43:52 EDT 2001


Skip Montanaro <skip at pobox.com> writes:

> So if I understand what you're saying, 213 (well within the range of 256)
> gets mapped to 0x2019 on input, which then can't be mapped to latin-1 on
> output.  That means a whole bunch of common encodings can't cleanly be
> mapped to latin-1, such as the cp1252 thing I see so many mail messages
> written in.

The notion of "commonness" of encodings is a difficult one. Is it
cp1252 that is more common, or is it JIS 0201 (or some other CJK
encoding)?

Regardless, most of the Latin-based encodings use the full range of
256 bytes, yet they all differ from latin-1 (unless they are identical
to latin-1). Therefore, every one of them that differs has characters
that cannot be transformed to latin-1.
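
The round trip described above can be sketched with Python's standard
codecs. This is only an illustration: it assumes the input was Mac Roman,
since that codec maps byte 213 to U+2019 (cp1252 places the same character
at byte 146 instead). The helper name can_encode is made up for this
example.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the target encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

# Assumption: the mail was Mac Roman, where byte 213 is a curly apostrophe.
raw = bytes([213])
text = raw.decode("mac_roman")

print(hex(ord(text)))               # 0x2019
print(can_encode(text, "latin-1"))  # False: latin-1 has no such character
print(can_encode(text, "utf-8"))    # True: UTF-8 covers all of Unicode

# cp1252 holds the same character at a different byte value.
assert bytes([146]).decode("cp1252") == text
```

Decoding always succeeds here; it is only the final encode step to latin-1
that has nowhere to put the character.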

> Maybe the encodings package should provide some sort of "crippled" encoding
> that attempts to make these heuristic transformations, mapping everything
> possible into range(256).  If not, I'm still left with a sed or tr hack.

Indeed, recoding software often has "transliteration"
encodings. I.e. you'd use an iso-8859-1/translit encoder from Unicode,
and it would be capable of converting a wide variety of Unicode
characters. Transliteration is applicable beyond converting to
Latin-1, e.g. when converting "ö" to ASCII, it is common to
transliterate this as "oe" (at least in Germany).
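
A minimal sketch of the idea in pure Python, using a codec error handler
rather than a full transliterating codec. The table, the handler function
and the name "translit-sketch" are all invented for this example; a real
transliteration database is far larger and is language-dependent.

```python
import codecs

# Tiny illustrative table (an assumption, not a real transliteration database).
TRANSLIT = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u00f6": "oe",   # o with diaeresis, German convention
    "\u00e4": "ae",
    "\u00fc": "ue",
    "\u00df": "ss",
}

def translit_errors(exc):
    """Replace unencodable characters using TRANSLIT, falling back to '?'."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join(TRANSLIT.get(ch, "?") for ch in bad)
    # Resume encoding right after the characters that were replaced.
    return replacement, exc.end

codecs.register_error("translit-sketch", translit_errors)

print("sch\u00f6n".encode("ascii", "translit-sketch"))    # b'schoen'
print("it\u2019s".encode("latin-1", "translit-sketch"))   # b"it's"
```

The handler is only consulted for characters the target encoding lacks, so
"ö" is left alone when encoding to latin-1 and becomes "oe" only for ASCII.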

glibc 2.2 includes a number of transliteration codecs. The iconv codec
module (sf.net/projects/python-codecs, in the practicecodecs) exposes
all the glibc codecs to Python. So if you have a recent Linux system,
you may want to give that a try.

Finding a good transliteration database is hard work, and Bruno Haible
did a great job when writing the glibc transliteration. While it would
be possible to include codecs for transliteration into the standard
Python library, it also would be quite time-consuming to do that.

Regards,
Martin



