[New-bugs-announce] [issue38800] Resume position for UTF-8 codec error handler not working

Thu Nov 14 09:30:24 EST 2019

New submission from Pedro Gimeno <pgpyb-4414 at personal.formauri.es>:

When implementing an error handler, it must return a tuple consisting of a substitution string and a position where to resume decoding. In the case of the UTF-8 codec, the resume position is ignored, and it always resumes immediately after the character that caused the error.

To reproduce, use this code:

import codecs
codecs.register_error('err', lambda err: (b'x', err.end + 1))
assert repr(u'\uDD00yz'.encode('utf8', errors='err')) == b'xz'

The above code fails the assertion because the result is b'xyz'.

It works OK for some other codecs. I have not tried to make an exhaustive list of which ones work and which ones don't, therefore this problem might apply to others.

----------
messages: 356610
nosy: pgimeno
priority: normal
severity: normal
status: open
title: Resume position for UTF-8 codec error handler not working
type: behavior

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue38800>
_______________________________________