Processing text data with different encodings
Steven D'Aprano
steve at pearwood.info
Tue Jun 28 11:11:31 EDT 2016
On Tue, 28 Jun 2016 10:30 pm, Michael Welle wrote:
> I changed the code from my initial mail to:
>
> LOGGER = logging.getLogger()
> LOGGER.addHandler(logging.FileHandler("tmp.txt", encoding="utf-8"))
>
> for l in sys.stdin.buffer:
>     l = l.decode('utf-8')
>     LOGGER.critical(l)
I imagine you're running this over input known to contain only UTF-8 text?
If you run it over your emails with non-UTF-8 content, the decode call
will raise UnicodeDecodeError.
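To make the failure concrete, here is a minimal sketch. The sample bytes are made up for illustration: they are 'abüd\n' encoded as Latin-1, so the byte 0xFC appears on its own, which is not valid UTF-8.

```python
# Hypothetical sample line: 'abüd\n' encoded as Latin-1 rather than UTF-8.
line = b'ab\xfcd\n'

try:
    # Strict decoding is the default, so this is what the loop above does.
    line.decode('utf-8')
    failed = False
except UnicodeDecodeError as err:
    failed = True
    # err.start is the offset of the first byte that could not be decoded.
    print('cannot decode byte at offset', err.start, ':', err.reason)
```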
I would try this:
for l in sys.stdin.buffer:
    l = l.decode('utf-8', errors='surrogateescape')
    print(repr(l))  # or log it, whichever you prefer
Simulating that with a list of byte strings standing in for
sys.stdin.buffer, you'll see the output:
py> buffer = []
py> buffer.append('abüd\n'.encode('utf-8'))
py> buffer.append('abüd\n'.encode('utf-8'))
py> buffer.append('abüd\n'.encode('latin-1'))
py> buffer.append('abüd\n'.encode('utf-8'))
py> buffer
[b'ab\xc3\xbcd\n', b'ab\xc3\xbcd\n', b'ab\xfcd\n', b'ab\xc3\xbcd\n']
py> for l in buffer:  # sys.stdin.buffer:
...     l = l.decode('utf-8', errors='surrogateescape')
...     print(repr(l))
...
'abüd\n'
'abüd\n'
'ab\udcfcd\n'
'abüd\n'
See the second-last line? The \udcfc code point is a lone surrogate that
encodes the "bad byte" \xfc. See the documentation of the surrogateescape
error handler for further details.
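One consequence worth seeing is that the escaped text can be turned back into the original bytes, as long as you encode with the same error handler. A small sketch, using the same made-up Latin-1 line as in the session above:

```python
raw = b'ab\xfcd\n'  # Latin-1 bytes; the lone 0xFC is not valid UTF-8

text = raw.decode('utf-8', errors='surrogateescape')
# The bad byte 0xFC becomes the lone surrogate U+DCFC (0xDC00 + 0xFC).
assert text == 'ab\udcfcd\n'

# Encoding with the same handler turns the surrogate back into the byte.
restored = text.encode('utf-8', errors='surrogateescape')
assert restored == raw

# Without the handler, a lone surrogate cannot be encoded as UTF-8.
try:
    text.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

That last part matters if you pass the decoded string on to anything that encodes strictly as UTF-8: it will not accept the surrogate as-is. Printing the repr, as above, sidesteps that because the surrogate is shown as an escape.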
Alternatively, you could try:
for l in sys.stdin.buffer:
    try:
        l = l.decode('utf-8', errors='strict')
    except UnicodeDecodeError:
        l = l.decode('latin1')  # May generate mojibake.
    print(repr(l))  # or log it, whichever you prefer
This version should give satisfactory results if the email actually does
contain lines of Latin-1 (or Windows-1252 if you prefer) mixed in with the
UTF-8. If not, it will generate mojibake, which may be acceptable to your
users.
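If you want to reuse that fallback logic, it could be wrapped in a small helper. This is only a sketch: decode_line and its parameter names are invented here for illustration, and the sample lines are made up. Latin-1 maps every byte value to a code point, so the fallback itself never raises.

```python
def decode_line(raw, primary='utf-8', fallback='latin-1'):
    """Decode raw bytes as `primary`, falling back to `fallback` on error.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    try:
        return raw.decode(primary, errors='strict')
    except UnicodeDecodeError:
        # May generate mojibake if the bytes are in some other encoding.
        return raw.decode(fallback)


# Made-up input: the same text once as UTF-8 and once as Latin-1.
lines = [b'ab\xc3\xbcd\n', b'ab\xfcd\n']
decoded = [decode_line(raw) for raw in lines]

for text in decoded:
    print(repr(text))
```

Both sample lines come out as the same string, because the second one really is Latin-1. Bytes in some third encoding would decode without error too, just to the wrong characters.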
--
Steven
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.