Processing text data with different encodings

Random832 random832 at fastmail.com
Tue Jun 28 11:52:18 EDT 2016


On Tue, Jun 28, 2016, at 06:25, Chris Angelico wrote:
> For the OP's situation, frankly, I doubt there'll be anything other
> than UTF-8, Latin-1, and CP-1252. The chances that someone casually
> mixes CP-1252 with (say) CP-1254 would be vanishingly small. So the
> simple decode of "UTF-8, or failing that, 1252" is probably going to
> give correct results for most of the content. The trick is figuring
> out a correct boundary for the check; line-by-line may be sufficient,
> or it may not.

For completeness, this can be done character-by-character (i.e. try to
decode a UTF-8 character, if it fails decode the offending byte as 1252)
with an error handler:

import codecs

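def cp1252_errors(exception):
    # Only decoding errors carry bytes we can reinterpret; re-raise the rest.
    if not isinstance(exception, UnicodeDecodeError):
        raise exception
    data, idx = exception.object, exception.start
    byte = data[idx:idx+1]
    try:
        return byte.decode('windows-1252'), idx+1
    except UnicodeDecodeError:
        # Python's cp1252 doesn't accept 0x81, etc.; map those via Latin-1
        return byte.decode('latin1'), idx+1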

codecs.register_error('cp1252', cp1252_errors)

assert (b"t\xe9st\xc3\xadng".decode('utf-8', errors='cp1252') ==
        "t\u00e9st\u00edng")

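For comparison, here is a rough sketch of the line-by-line "UTF-8, or
failing that, 1252" fallback from the quoted message. The names
decode_line and decode_lines are made up for illustration, and using
errors='replace' for the bytes Python's cp1252 leaves undefined is an
assumption, not something from the thread:

```python
def decode_line(raw):
    """Decode one line as UTF-8, or failing that, as Windows-1252."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Assumption: bytes that Python's cp1252 leaves undefined
        # (0x81, 0x8d, 0x8f, 0x90, 0x9d) become U+FFFD.
        return raw.decode('cp1252', errors='replace')


def decode_lines(data):
    """Apply the fallback separately to each newline-delimited line."""
    return [decode_line(line) for line in data.split(b'\n')]


# Each line uses a single encoding, so the line boundary is enough.
assert decode_lines(b"caf\xe9\nni\xc3\xb1o") == ["caf\u00e9", "ni\u00f1o"]

# Both encodings on one line: the whole line falls back to cp1252,
# and the UTF-8 sequence C3 AD turns into two characters of mojibake.
assert decode_line(b"t\xe9st\xc3\xadng") == "t\u00e9st\u00c3\u00adng"
```

The second assertion is the case where a line is too coarse a boundary.
The error handler above only reinterprets the offending byte, so the
UTF-8 sequence on the same line survives.
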
This is probably sufficient for most purposes; byte sequences that
happen to be valid UTF-8 characters but mean something sensible in
cp-1252 are rare. Just be thankful that that's all you have to deal
with - the equivalent problem for Japanese encodings, for instance, is
much harder (you'd probably want the boundary to be "per run of
non-ASCII* characters" if lines don't suffice, and detecting the
difference between UTF-8, Shift-JIS, and EUC-JP is nontrivial). There's
a reason the word "mojibake" comes from Japanese.

*well, JIS X 0201, which is ASCII but for 0x5C and 0x7E. And unless
you've got ISO-2022 codes to provide context for that, you've just got
to guess what those two bytes mean. Fortunately (for some value of
"fortunately"), many environments' fonts display the relevant ASCII
characters as their JIS alternatives, taking that choice away from you.
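
To see why trial decoding alone can't settle the question, a small
sketch that reports every candidate encoding that accepts a byte string
without error. The helper name trial_decode is hypothetical, and which
encodings accept a given input depends entirely on the bytes, so no
particular result is claimed for the Japanese case:

```python
def trial_decode(data, encodings):
    """Return {encoding: text} for each encoding that decodes data cleanly."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            pass  # this encoding rejects the bytes; not a candidate
    return results


# The same two bytes are valid in both encodings but mean different things.
assert trial_decode(b"\xc3\xa9", ['utf-8', 'cp1252']) == {
    'utf-8': "\u00e9",          # one character, e with acute accent
    'cp1252': "\u00c3\u00a9",   # two characters of mojibake
}

# For Japanese text there are more plausible candidates to weigh.
japanese = "\u65e5\u672c\u8a9e".encode('utf-8')
print(sorted(trial_decode(japanese, ['utf-8', 'shift_jis', 'euc_jp'])))
```

When more than one encoding succeeds, something beyond "did it decode"
has to pick the winner, which is where the guessing starts.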

