Processing text data with different encodings

Steven D'Aprano steve at
Tue Jun 28 11:11:31 EDT 2016

On Tue, 28 Jun 2016 10:30 pm, Michael Welle wrote:

> I changed the code from my initial mail to:
> LOGGER = logging.getLogger()
> LOGGER.addHandler(logging.FileHandler("tmp.txt", encoding="utf-8"))
> for l in sys.stdin.buffer:
>     l = l.decode('utf-8')
>     LOGGER.critical(l)

I imagine you're running this over input known to contain UTF-8 text?
Because if you run it over your emails with non-UTF-8 content, you'll get an
exception: decode() raises UnicodeDecodeError on the first invalid byte.

I would try this:

for l in sys.stdin.buffer:
    l = l.decode('utf-8', errors='surrogateescape')
    print(repr(l))  # or log it, whichever you prefer

Simulating that at the interactive prompt gives this output:

py> buffer = []
py> buffer.append('abüd\n'.encode('utf-8'))
py> buffer.append('abüd\n'.encode('utf-8'))
py> buffer.append('abüd\n'.encode('latin-1'))
py> buffer.append('abüd\n'.encode('utf-8'))
py> buffer
[b'ab\xc3\xbcd\n', b'ab\xc3\xbcd\n', b'ab\xfcd\n', b'ab\xc3\xbcd\n']
py> for l in buffer:  # stands in for sys.stdin.buffer
...     l = l.decode('utf-8', errors='surrogateescape')
...     print(repr(l))
...
'abüd\n'
'abüd\n'
'ab\udcfcd\n'
'abüd\n'

See the second-last line? The \udcfc code point is a surrogate, encoding
the "bad byte" \xfc. See the codecs docs on error handlers (and PEP 383,
which introduced surrogateescape) for further details.
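
One property of surrogateescape that may be worth seeing in action: the
undecodable byte is kept as a lone surrogate, so encoding with the same
error handler should give back the original bytes exactly, while a plain
strict encode (what open(..., encoding="utf-8") uses unless told otherwise)
refuses the surrogate. A minimal sketch; the names raw, text, restored and
strict_ok are only for illustration:

```python
raw = b'ab\xfcd\n'  # Latin-1 bytes, not valid UTF-8

text = raw.decode('utf-8', errors='surrogateescape')
print(repr(text))  # the bad byte shows up as \udcfc

# Encoding with the same error handler restores the original bytes.
restored = text.encode('utf-8', errors='surrogateescape')
print(restored == raw)

# A strict encode rejects the lone surrogate.
try:
    text.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)
```

So if the decoded lines end up in a UTF-8 file, the writer needs a
non-strict error handler too, or the write will fail on those lines.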

Alternatively, you could try:

for l in sys.stdin.buffer:
    try:
        l = l.decode('utf-8', errors='strict')
    except UnicodeDecodeError:
        l = l.decode('latin1')  # May generate mojibake.
    print(repr(l))  # or log it, whichever you prefer

This version should give satisfactory results if the email actually does
contain lines of Latin-1 (or Windows-1252 if you prefer) mixed in with the
UTF-8. If not, it will generate mojibake, which may be acceptable to your
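
To make the trade-off concrete, here is a rough sketch of the same fallback
wrapped in a helper and run over three sample lines. The name decode_line
and the sample data are made up for illustration; the third line is encoded
as Windows-1252, so it shows the mojibake case:

```python
def decode_line(raw):
    """Decode bytes as UTF-8, falling back to Latin-1 on failure."""
    try:
        return raw.decode('utf-8', errors='strict')
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail,
        # but the result is only right if the line really was Latin-1.
        return raw.decode('latin1')

lines = [
    'abüd\n'.encode('utf-8'),
    'abüd\n'.encode('latin-1'),
    'ab€d\n'.encode('cp1252'),  # neither valid UTF-8 nor Latin-1 text
]
decoded = [decode_line(raw) for raw in lines]
for s in decoded:
    print(repr(s))
```

The first two lines come back as 'abüd\n'. The third decodes without error,
but the euro sign turns into the control character '\x80', which is the
kind of mojibake to expect when the fallback guess is wrong.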

“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.
