Capturing the bad codes that raise UnicodeError exceptions during decoding
MRAB
python at mrabarnett.plus.com
Thu Aug 4 15:33:12 EDT 2016
On 2016-08-04 20:22, Malcolm Greene wrote:
> Hi Chris,
>
> Thanks for your suggestions. I would like to capture the specific bad
> codes *before* they get replaced. So if a line of text has 10 bad codes
> (each one raising UnicodeError), I would like to track each exception's
> bad code but still return a valid decode line when finished.
>
> My goal is to count the total number of UnicodeError exceptions within a
> file (as a data quality metric) and track the frequency of specific bad
> codes (via a collections.Counter dict) to see if there's a pattern that
> can be traced to a bad upstream process.
>
You could catch the UnicodeDecodeError exception and look at its attributes:
try:
    b'\x80'.decode('utf-8')
except UnicodeDecodeError as e:
    print('Failed to decode')
    print('e.start is', e.start)
    print('e.end is', e.end)
else:
    print('Decoded successfully')
It prints:
Failed to decode
e.start is 0
e.end is 1
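
To record every bad code on a line and still end up with a decoded string, one option is a custom error handler registered with codecs.register_error. The following is only a minimal sketch: the handler name 'count_and_replace' and the bad_bytes counter are made up for illustration, and it assumes you are happy to substitute U+FFFD for each bad sequence. The exception's object attribute holds the bytes being decoded, so e.object[e.start:e.end] gives the offending bytes.

```python
import codecs
from collections import Counter

# Frequency of each undecodable byte sequence seen so far
# (hypothetical name, used only in this sketch).
bad_bytes = Counter()

def count_and_replace(exc):
    # The codec calls this once for each undecodable span.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    # exc.object is the input being decoded; slice out the bad bytes.
    bad_bytes[exc.object[exc.start:exc.end]] += 1
    # Substitute U+FFFD and resume decoding just after the bad span.
    return ('\ufffd', exc.end)

codecs.register_error('count_and_replace', count_and_replace)

line = b'abc\x80def\xff\x80'
text = line.decode('utf-8', errors='count_and_replace')
total_errors = sum(bad_bytes.values())
```

Once registered, the same handler name can also be passed as the errors argument to open(), so each bad sequence in the file is counted as it is read.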
> Malcolm
>
> <snipped>
> Remove them? Not sure what you mean, exactly; but would an
> errors="backslashreplace" decode do the job? Something like (assuming
> you use Python 3):
>
> def read_dirty_file(fn):
>     with open(fn, encoding="utf-8", errors="backslashreplace") as f:
>         for row in csv.DictReader(f):
>             process(row)
>
> You'll get Unicode text, but any bytes that don't make sense in UTF-8
> will be represented as eg \x80, with an actual backslash. Or use
> errors="replace" to hide them all behind U+FFFD, or other forms of
> error handling. That'll get done at a higher level than the CSV
> reader, like you suggest.
> </snipped>
>
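For comparison, here is a small sketch of what the two built-in handlers mentioned in the quoted text produce for the same input (the variable names are just for illustration; decoding with backslashreplace needs Python 3.5 or later):

```python
data = b'ab\x80cd'

# The bad byte becomes a literal backslash escape in the resulting text.
escaped = data.decode('utf-8', errors='backslashreplace')

# The bad byte becomes the replacement character U+FFFD.
replaced = data.decode('utf-8', errors='replace')

print(ascii(escaped))
print(ascii(replaced))
```

Neither handler keeps a count of the bad bytes; both only change what ends up in the decoded text.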