Processing text data with different encodings
Peter Otten
__peter__ at web.de
Tue Jun 28 04:30:20 EDT 2016
Michael Welle wrote:
> Hello,
>
> I want to use Python 3 to process data that unfortunately comes with
> different encodings. So far I have found ascii, iso-8859, utf-8,
> windows-1252 and maybe some more in the same file (don't ask...). I read
> the data via sys.stdin and the idea is to read a line, detect the
> current encoding, hit it until it looks like utf-8 and then go on with
> the next line of input:
>
>
> import cchardet
>
> for line in sys.stdin.buffer:
>     encoding = cchardet.detect(line)['encoding']
>     line = line.decode(encoding, 'ignore')\
>         .encode('UTF-8').decode('UTF-8', 'ignore')
Here the last decode('UTF-8', 'ignore') merely undoes the preceding
encode('UTF-8'); therefore

line = line.decode(encoding, 'ignore')

should suffice. Does cchardet ever return an encoding that fails to decode
the line? Only in that case would the 'ignore' error handler make sense. I
expect that
for line in sys.stdin.buffer:
    encoding = cchardet.detect(line)['encoding']
    line = line.decode(encoding)
will work if you don't want to use the alternative suggested by Chris.
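A detector-free variant is also conceivable for the encodings you list: try a
fixed cascade of codecs per line. This is only a sketch under my own
assumptions (UTF-8 first, because a line in another 8-bit encoding is almost
never valid UTF-8 by accident; Latin-1 last, because it accepts every byte):

```python
# Codecs to try, in order. UTF-8 subsumes ASCII; cp1252 covers most
# Windows-mojibake lines; latin-1 maps every byte, so it cannot fail.
CODECS = ("utf-8", "cp1252", "latin-1")

def decode_line(raw: bytes) -> str:
    """Decode one raw line with the first codec that accepts it."""
    for codec in CODECS:
        try:
            return raw.decode(codec)
        except UnicodeDecodeError:
            pass
    raise AssertionError("unreachable: latin-1 accepts any byte string")

print(decode_line("Grüße".encode("cp1252")))  # cp1252 input still decodes
```

Such a cascade can silently pick the wrong codec for ambiguous lines, which is
the trade-off against running a detector on every line.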
> After that, line should be a string. The logging module and some others
> choke on line: UnicodeEncodeError: 'charmap' codec can't encode
> character. What would be the right approach to tackle that problem
> (assuming that I can't change the input data)?
It looks like you are trying to write the unicode you have generated above
into a file using iso-8859-1 or similar:
$ cat log_unicode.py
import logging
LOGGER = logging.getLogger()
LOGGER.addHandler(logging.FileHandler("tmp.txt", encoding="ISO-8859-1"))
LOGGER.critical("\N{PILE OF POO}")
$ python3 log_unicode.py
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.4/logging/__init__.py", line 980, in emit
    stream.write(msg)
UnicodeEncodeError: 'latin-1' codec can't encode character '\U0001f4a9' in
position 0: ordinal not in range(256)
Call stack:
  File "log_unicode.py", line 5, in <module>
    LOGGER.critical("\N{PILE OF POO}")
Message: '💩'
Arguments: ()
If my assumption is correct, you can either change the target file's encoding
to UTF-8 or change the error handling strategy to "ignore" or something else.
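The first option, writing the log file as UTF-8, is a one-line change to the
script above (the tmp.txt name is kept here only for symmetry with that
example):

```python
import logging

LOGGER = logging.getLogger()
# UTF-8 can encode every code point, so no UnicodeEncodeError is possible.
LOGGER.addHandler(logging.FileHandler("tmp.txt", encoding="UTF-8"))
LOGGER.critical("\N{PILE OF POO}")
```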
I didn't find an official way, so here's a minimal example:
$ rm tmp.txt
$ cat log_unicode.py
import logging

class FileHandler(logging.FileHandler):
    def _open(self):
        return open(
            self.baseFilename, self.mode, encoding=self.encoding,
            errors="xmlcharrefreplace")

LOGGER = logging.getLogger()
LOGGER.addHandler(FileHandler("tmp.txt", encoding="ISO-8859-1"))
LOGGER.critical("\N{PILE OF POO}")
$ python3 log_unicode.py
$ cat tmp.txt
&#128169;
A real program would of course override the initializer...
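(A note for readers on current Python: since 3.9, logging.FileHandler itself
accepts an errors argument, so the subclass above is only needed on older
versions:)

```python
import logging

LOGGER = logging.getLogger()
# errors= is forwarded to the underlying open() call (Python 3.9+).
LOGGER.addHandler(logging.FileHandler(
    "tmp.txt", encoding="ISO-8859-1", errors="xmlcharrefreplace"))
LOGGER.critical("\N{PILE OF POO}")  # written as &#128169;
```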