read from file with mixed encodings in Python3
Peter Otten
__peter__ at web.de
Mon Nov 7 09:42:47 EST 2011
Jaroslav Dobrek wrote:
> Hello,
>
> in Python3, I often have this problem: I want to do something with
> every line of a file. Like Python3, I presuppose that every line is
> encoded in utf-8. If this isn't the case, I would like Python3 to do
> something specific (like skipping the line, writing the line to
> standard error, ...)
>
> Like so:
>
> try:
>     ....
> except UnicodeDecodeError:
>     ...
>
> Yet, there is no place for this construction. If I simply do:
>
> for line in f:
>     print(line)
>
> this will result in a UnicodeDecodeError if some line is not utf-8,
> but I can't tell Python3 to stop:
>
> This will not work:
>
> for line in f:
>     try:
>         print(line)
>     except UnicodeDecodeError:
>         ...
>
> because the UnicodeDecodeError is caused in the "for line in f"-part.
>
> How can I catch such exceptions?
>
> Note that recoding the file before opening it is not an option,
> because often files contain many different strings in many different
> encodings.
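I don't see those files often, but I think they are all seriously broken.
There's no reliable way to recover the information from files with unknown
mixed encodings. However, here's an approach that may sometimes work: open
the file in binary mode and decode each line yourself, falling back to
Latin-1 when UTF-8 fails:
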
>>> with open("tmp.txt", "rb") as f:
...     for line in f:
...         try:
...             line = "UTF-8 " + line.decode("utf-8")
...         except UnicodeDecodeError:
...             line = "Latin-1 " + line.decode("latin-1")
...         print(line, end="")
...
UTF-8 äöü
Latin-1 äöü
UTF-8 äöü
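
The same idea can be wrapped in a small generator that also covers the
"skip the line and write to standard error" case from the question. This is
only a sketch: decode_lines, its parameters and the in-memory sample are
made-up names for illustration, not part of any library. Note that decoding
as Latin-1 never raises, because every byte maps to some code point, so the
fallback is a guess rather than real detection.

```python
import sys


def decode_lines(raw_lines, encoding="utf-8", fallback=None):
    """Yield (line_number, text) for each bytes line in raw_lines.

    Hypothetical helper: a line that cannot be decoded with `encoding`
    is decoded with `fallback` if one is given; otherwise it is reported
    on standard error and skipped.
    """
    for number, raw in enumerate(raw_lines, start=1):
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError as exc:
            if fallback is None:
                print(f"line {number}: skipped ({exc.reason})",
                      file=sys.stderr)
                continue
            text = raw.decode(fallback)
        yield number, text


# In-memory stand-in for a file object opened with open(path, "rb");
# iterating over such a file yields bytes lines in the same way.
sample = [
    "äöü\n".encode("utf-8"),
    "äöü\n".encode("latin-1"),
    "äöü\n".encode("utf-8"),
]

# Without a fallback, the Latin-1 line is reported and dropped.
decoded = list(decode_lines(sample))
for number, text in decoded:
    print(number, text, end="")
```

With a real file you would pass the binary file object itself, e.g.
decode_lines(f) inside "with open("tmp.txt", "rb") as f:", and add
fallback="latin-1" to get the behaviour of the session above.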