What encoding is this? How can I tell? How can I translate?

Skip Montanaro skip at pobox.com
Tue Sep 25 10:22:52 EDT 2001


    >> I can infer that what looks like a capital "O" underneath a tilde in
    >> XEmacs (ordinal 213, hex 0xd5) is supposed to be an apostrophe, so I
    >> could do some hack filtering to convert this, but a quick scan for
    >> "d5" in the Python encodings directory suggests it is mac_latin2 (not
    >> sure what that is officially).

    Carey> I would have picked it as being mac-roman, unless it's from
    Carey> somewhere in Eastern Europe that Latin-2 covers.

Actually, the mail came from the good old U S of A, so probably mac-roman is
right.  "Mac-roman" conjures up images of Italy for me, not the US.

    Carey> The character would be U+2019, "RIGHT SINGLE QUOTATION MARK".
    Carey> There's no equivalent to this in latin-1, so the closest would
    Carey> probably be U+0027, "APOSTROPHE", i.e. "'".

So if I understand what you're saying, byte 213 (0xd5, well within
range(256)) gets mapped to U+2019 on input, which then can't be mapped to
latin-1 on output.  That means text in a whole bunch of common encodings
can't cleanly be converted to latin-1, such as the cp1252 thing I see so
many mail messages written in.
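
To make that round trip concrete, here is a minimal sketch.  It assumes
Python 3 (where str is Unicode) rather than the interpreter current when
this was posted, and the sample bytes are made up for illustration:

```python
# Hypothetical sample: "don't" with the Mac apostrophe byte 0xd5 (decimal 213).
raw = b"don\xd5t"

# On input, mac_roman maps byte 0xd5 to U+2019 RIGHT SINGLE QUOTATION MARK.
text = raw.decode("mac_roman")
print(hex(ord(text[3])))

# On output, latin-1 only covers U+0000..U+00FF, so U+2019 has no slot there.
try:
    text.encode("latin-1")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)

# cp1252 runs into the same wall: its byte 0x92 decodes to the same U+2019.
same_char = b"\x92".decode("cp1252") == text[3]
print(same_char)
```

The failure is on the encode side only; the decode step is lossless.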

Maybe the encodings package should provide some sort of "crippled" encoding
that attempts to make these heuristic transformations, mapping everything
possible into range(256).  If not, I'm still left with a sed or tr hack.

Fat-lot-of-good-unicode-is-doing-ly, y'rs,

-- 
Skip Montanaro (skip at pobox.com)
http://www.mojam.com/
http://www.musi-cal.com/



