Python3: Sane way to deal with broken encodings

Martin v. Loewis martin at v.loewis.de
Tue Dec 8 13:26:28 EST 2009


> Thus my Python script dies a horrible death:
> 
>   File "./update_db", line 67, in <module>
>     for line in open(tempfile, "r"):
>   File "/usr/local/lib/python3.1/codecs.py", line 300, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position
> 3286: unexpected code byte
> 
> This is well and ok usually, but I'd like to be able to tell Python:
> "Don't worry, some idiot encoded that file, just skip over such
> parts/replace them by some character sequence".
> 
> Is that possible? If so, how?

As Benjamin says: if you pass errors='replace' to open(), each byte
that cannot be decoded is replaced with U+FFFD (the Unicode replacement
character); if you pass errors='ignore', such bytes are silently
dropped.
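
A minimal sketch of the two error handlers, in case it helps. The sample
bytes and the temporary file are made up for illustration (they are not
the poster's data); the stray 0x92 byte mirrors the one in the traceback.

```python
import os
import tempfile

# Hypothetical sample: plain ASCII plus one stray 0x92 byte, which is
# not valid UTF-8 on its own.
raw = b"it\x92s fine\n"

# Write the bytes to a throwaway file so open() has something to read.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(raw)

# errors='replace': the bad byte becomes U+FFFD.
with open(path, "r", encoding="utf-8", errors="replace") as f:
    replaced = f.read()

# errors='ignore': the bad byte is dropped.
with open(path, "r", encoding="utf-8", errors="ignore") as f:
    ignored = f.read()

os.remove(path)

print(repr(replaced))  # 'it\ufffds fine\n'
print(repr(ignored))   # 'its fine\n'
```

Note that 'ignore' loses the information that something was wrong at
that position, whereas 'replace' leaves a visible marker in the text.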

Alternatively, you can open the files in binary mode ('rb'), so that
no decoding is attempted at all and you get bytes instead of strings,
or you can specify latin-1 as the encoding. Since latin-1 assigns a
character to every possible byte value, decoding can never fail
(though the result is only correct if the file really is latin-1).
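
Roughly, these two alternatives look as follows. Again the sample bytes
and the temporary file are assumptions made for the sketch, and
newline="" is passed only so that the text-mode read does not translate
line endings.

```python
import os
import tempfile

# Hypothetical sample with the same stray 0x92 byte as in the traceback.
raw = b"it\x92s fine\n"

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(raw)

# Binary mode: no decoding happens, read() returns a bytes object.
with open(path, "rb") as f:
    data = f.read()

# latin-1: every byte maps to the code point with the same number, so
# decoding cannot raise UnicodeDecodeError.
with open(path, "r", encoding="latin-1", newline="") as f:
    text = f.read()

os.remove(path)

print(type(data).__name__)           # bytes
print(repr(text))                    # 'it\x92s fine\n'
print(text.encode("latin-1") == raw) # True: the original bytes survive
```

The byte 0x92 ends up as U+0092, a control character, which is probably
not what the file's author meant; but since the mapping is reversible,
nothing is lost and the text can be re-encoded to the original bytes.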

Regards,
Martin


