Capturing the bad codes that raise UnicodeError exceptions during decoding
Malcolm Greene
python at bdurham.com
Thu Aug 4 17:03:06 EDT 2016
Wow!!! A huge thank you to all who replied to this thread!
Chris: You gave me some ideas I will apply in the future.
MRAB: Thanks for exposing me to the extended attributes of the UnicodeError object (e.start, e.end, e.object).
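For future readers, here is a minimal sketch of what those attributes look like when a decode fails. The sample bytes are made up for illustration; only the attribute names (e.start, e.end, e.object, plus the standard e.encoding and e.reason) come from the exception itself.

```python
# Illustrative input: 0xff can never start a valid UTF-8 sequence.
raw = b'abc\xffdef'

info = None
try:
    raw.decode('utf-8')
except UnicodeDecodeError as e:
    # e.object is the original bytes; e.start:e.end marks the offending slice.
    bad_bytes = e.object[e.start:e.end]
    info = (e.encoding, e.start, e.end, bad_bytes, e.reason)
    print(info)
```

The tuple is built inside the except block because Python 3 unbinds the name e once the block exits.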
Mike: Cool example! I like how _cleanlines() recursively calls itself to keep cleaning up a line after an error is handled. Your code solved the mystery of how to recover from a UnicodeError and keep decoding.
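The recover-and-continue idea can be sketched roughly as follows. This is not Mike's code; decode_skipping_errors is a hypothetical helper written for this note, and it assumes an encoding such as UTF-8 where everything before e.start is known to decode cleanly.

```python
def decode_skipping_errors(data, encoding='utf-8', bad=None):
    """Hypothetical helper: decode data, replacing each undecodable
    byte run with '<?>' and collecting those runs in a list."""
    if bad is None:
        bad = []
    try:
        return data.decode(encoding), bad
    except UnicodeDecodeError as e:
        # Remember the offending bytes for later inspection.
        bad.append(e.object[e.start:e.end])
        # The bytes before e.start were valid, so decode them directly.
        head = data[:e.start].decode(encoding)
        # Recurse on the remainder to handle any further errors.
        tail, bad = decode_skipping_errors(data[e.end:], encoding, bad)
        return head + '<?>' + tail, bad


text, bad_runs = decode_skipping_errors(b'abc\xffdef\xfe')
print(text, bad_runs)
```

Each bad byte run adds one level of recursion, so input with a very large number of errors would be better served by a loop or by a registered error handler.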
Random832: Your suggestion to write a custom codecs handler was great. Sample below for future readers reviewing this thread.
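# simple codecs custom error handler
import codecs

def custom_unicode_error_handler(e):
    # e.object is the input being decoded; e.start:e.end spans the bad bytes
    bad_bytes = e.object[e.start:e.end]
    print('Bad bytes: ' + bad_bytes.hex())
    # return the replacement text and the position at which to resume decoding
    return ('<?>', e.end)

codecs.register_error('custom_unicode_error_handler',
                      custom_unicode_error_handler)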
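Once registered, the handler is selected by passing its name as the errors argument to decode(). A small usage sketch follows; the handler is repeated so the snippet runs on its own, and the input bytes are invented for illustration.

```python
import codecs

def custom_unicode_error_handler(e):
    # Same handler as above, repeated so this snippet is self-contained.
    bad_bytes = e.object[e.start:e.end]
    print('Bad bytes: ' + bad_bytes.hex())
    return ('<?>', e.end)

codecs.register_error('custom_unicode_error_handler',
                      custom_unicode_error_handler)

# Illustrative input: 0xff is not a valid UTF-8 start byte.
raw = b'abc\xffdef'
text = raw.decode('utf-8', errors='custom_unicode_error_handler')
print(text)
```

The same name should also work with open(..., errors='custom_unicode_error_handler') when reading a file in text mode, since the errors argument there accepts any name registered through codecs.register_error().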
Malcolm