Capturing the bad codes that raise UnicodeError exceptions during decoding
Michael Selik
michael.selik at gmail.com
Thu Aug 4 16:11:05 EDT 2016
On Thu, Aug 4, 2016 at 3:24 PM Malcolm Greene <python at bdurham.com> wrote:
> Hi Chris,
>
> Thanks for your suggestions. I would like to capture the specific bad
> codes *before* they get replaced. So if a line of text has 10 bad codes
> (each one raising UnicodeError), I would like to track each exception's
> bad code but still return a valid decoded line when finished.
>
> My goal is to count the total number of UnicodeError exceptions within a
> file (as a data quality metric) and track the frequency of specific bad
> codes (via a collections.Counter dict) to see if there's a pattern that
> can be traced to a bad upstream process.
>
Give this a shot (below). It seems to do what you want.
import csv
from collections import Counter
from io import BytesIO

def _cleanline(line, counts=Counter()):
    # Decode as much of the line as possible; when a bad byte sequence
    # is hit, tally it and recurse on the rest of the line.
    try:
        return line.decode()
    except UnicodeDecodeError as e:
        counts[line[e.start:e.end]] += 1
        return line[:e.start].decode() + _cleanline(line[e.end:], counts)

def cleanlines(fp):
    '''
    Convert data to text; track decoding errors.

    ``fp`` is an open file-like iterable of lines.
    '''
    cleanlines.errors = Counter()
    for line in fp:
        yield _cleanline(line, cleanlines.errors)

f = BytesIO(b'''\
this,is line,one
line two,has junk,\xffin it
so does,\xfa\xffline,three
''')

for row in csv.reader(cleanlines(f)):
    print(row)

print(cleanlines.errors.most_common())
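For the sample input above, the counter should report b'\xff' twice and
b'\xfa' once, and every row still comes back as valid text.

If you'd rather not recurse, you can push the same bookkeeping into the
codec machinery with codecs.register_error. Here's a rough, untested
sketch along those lines (the handler name 'count' and the bad_bytes
counter are names I made up):

import codecs
from collections import Counter

bad_bytes = Counter()

def count_and_skip(exc):
    # Tally the offending byte sequence, then resume decoding just past
    # it; returning '' drops the bad bytes from the result.
    bad_bytes[exc.object[exc.start:exc.end]] += 1
    return ('', exc.end)

codecs.register_error('count', count_and_skip)

print(b'so does,\xfa\xffline,three'.decode('utf-8', errors='count'))
print(bad_bytes.most_common())

One caveat: the handler accumulates into module-level state, so reset
bad_bytes between files if you want per-file metrics.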