Python3: Sane way to deal with broken encodings

Sun Dec 6 15:04:44 EST 2009

Johannes Bauer wrote:
> Dear all,
> 
> I've some applications which fetch HTML documents off the web, parse
> their content and do stuff with it. Every once in a while it happens
> that the web site administrators put up files which are encoded in a
> wrong manner.
> 
> Thus my Python script dies a horrible death:
> 
>   File "./update_db", line 67, in <module>
>     for line in open(tempfile, "r"):
>   File "/usr/local/lib/python3.1/codecs.py", line 300, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position
> 3286: unexpected code byte
> 
> This is usually fine, but I'd like to be able to tell Python:
> "Don't worry, some idiot encoded that file, just skip over such
> parts/replace them by some character sequence".
> 
> Is that possible? If so, how?

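This might get you started. The help text is from Python 2's str.decode;
in Python 3 the method lives on bytes (bytes.decode), and open() accepts
the same errors argument: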

"""
>>> help(str.decode)
decode(...)
    S.decode([encoding[,errors]]) -> object

    Decodes S using the codec registered for encoding. encoding defaults
    to the default encoding. errors may be given to set a different error
    handling scheme. Default is 'strict' meaning that encoding errors raise
    a UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
    as well as any other name registered with codecs.register_error that is
    able to handle UnicodeDecodeErrors.
"""
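
Applied to Python 3, the errors argument described above could be used
roughly as follows. This is only a sketch: the helper names
decode_leniently and read_lines_leniently are made up for illustration,
and it assumes the files are meant to be UTF-8 with the occasional stray
byte such as 0x92.

```python
import os
import tempfile


def decode_leniently(data, encoding="utf-8", errors="replace"):
    """Decode bytes without raising on invalid sequences.

    errors="replace" substitutes U+FFFD for each undecodable byte;
    errors="ignore" drops such bytes silently.
    """
    return data.decode(encoding, errors=errors)


def read_lines_leniently(path, encoding="utf-8", errors="replace"):
    """Read a text file line by line, tolerating broken encodings.

    open() takes the same errors argument as bytes.decode().
    """
    with open(path, "r", encoding=encoding, errors=errors) as handle:
        return [line.rstrip("\n") for line in handle]


if __name__ == "__main__":
    # Hypothetical sample data: 0x92 is not valid as a UTF-8 start byte.
    broken = b"It\x92s broken\nbut readable\n"

    print(ascii(decode_leniently(broken, errors="replace")))
    print(ascii(decode_leniently(broken, errors="ignore")))

    # Write the bytes to a temporary file and read them back leniently.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, broken)
        os.close(fd)
        for line in read_lines_leniently(path):
            print(ascii(line))
    finally:
        os.remove(path)
```

With "replace" you keep a visible marker where data was damaged, which
makes later debugging easier; "ignore" loses that information. If the
byte 0x92 shows up a lot, the file may really be Windows-1252 (where
0x92 is a right single quotation mark), in which case passing
encoding="cp1252" might be the better fix, but that is a guess about
your data.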

HTH