Capturing the bad codes that raise UnicodeError exceptions during decoding
Malcolm Greene
python at bdurham.com
Thu Aug 4 15:22:48 EDT 2016
Hi Chris,
Thanks for your suggestions. I would like to capture the specific bad
codes *before* they get replaced. So if a line of text has 10 bad codes
(each one raising a UnicodeError), I would like to track each exception's
bad code but still return a valid decoded line when finished.
My goal is to count the total number of UnicodeError exceptions within a
file (as a data quality metric) and track the frequency of the specific
bad codes (via a collections.Counter) to see if there's a pattern that
can be traced to a bad upstream process.
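
Roughly speaking, this is the kind of thing I have in mind (just an
untested sketch using codecs.register_error; the handler name
"countreplace", the counter variable, and the file name are placeholders):

import codecs
import collections

bad_code_counts = collections.Counter()

def count_and_replace(exc):
    # exc is a UnicodeDecodeError; the offending bytes are
    # exc.object[exc.start:exc.end]
    bad_code_counts[exc.object[exc.start:exc.end]] += 1
    # substitute U+FFFD and resume decoding after the bad bytes
    return ("\ufffd", exc.end)

codecs.register_error("countreplace", count_and_replace)

with open("dirty.csv", encoding="utf-8", errors="countreplace") as f:
    for line in f:
        pass  # every line comes back decoded; the bad codes were counted

print("total bad codes:", sum(bad_code_counts.values()))
print(bad_code_counts.most_common(10))

The same errors="countreplace" argument could be passed to the open()
call that feeds csv.DictReader, so the counting would happen below the
CSV layer.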
Malcolm
<snipped>
Remove them? Not sure what you mean, exactly; but would an
errors="backslashreplace" decode do the job? Something like (assuming
you use Python 3):
import csv

def read_dirty_file(fn):
    with open(fn, encoding="utf-8", errors="backslashreplace") as f:
        for row in csv.DictReader(f):
            process(row)
You'll get Unicode text, but any bytes that don't make sense in UTF-8
will be represented as eg \x80, with an actual backslash. Or use
errors="replace" to hide them all behind U+FFFD, or other forms of
error handling. That'll get done at a higher level than the CSV
reader, like you suggest.
</snipped>
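
For reference, a quick illustration of the two error modes mentioned
above, applied to a single stray 0x80 byte (a made-up sample, not output
from any real file):

raw = b"abc\x80def"
# backslashreplace keeps the bad byte visible as a literal \x80 escape
print(raw.decode("utf-8", errors="backslashreplace"))   # -> abc\x80def
# replace hides it behind the U+FFFD replacement character
print(raw.decode("utf-8", errors="replace"))            # -> abc<U+FFFD>def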