catch UnicodeDecodeError

Thu Jul 26 08:17:44 EDT 2012

On 07/26/2012 01:15 PM, Stefan Behnel wrote:
>> exits with a UnicodeDecodeError.
> ... that tells you the exact code line where the error occurred.

Which property of a UnicodeDecodeError does include that information?

On cPython 2.7 and 3.2, I see only start and end, both of which refer to
the number of bytes read so far.

I used the followin test script:

e = None
try:
    b'a\xc3\xa4\nb\xff0'.decode('utf-8')
except UnicodeDecodeError as ude:
    e = ude
print(e.start) # 5 for this input, 3 for the input b'a\nb\xff0'
print(dir(e))

But even if you would somehow determine a line number, this would only
work if the actual encoding uses 0xa for newline. Most encodings (101
out of 108 applicable ones in cPython 3.2) do include 0x0a in their
representation of '\n', but multi-byte encodings routinely include 0x0a
bytes in their representation of non-newline characters. Therefore, the
most you can do is calculate an upper bound for the line number.

- Philipp

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 198 bytes
Desc: OpenPGP digital signature
URL: <http://mail.python.org/pipermail/python-list/attachments/20120726/135ce8e7/attachment.sig>