Why are some unicode error handlers "encode only"?
Walter Dörwald
walter at livinglogic.de
Sun Mar 11 12:10:12 EDT 2012
On 11.03.12 15:37, Steven D'Aprano wrote:
> At least two standard error handlers are documented as working for
> encoding only:
>
> xmlcharrefreplace
> backslashreplace
>
> See http://docs.python.org/library/codecs.html#codec-base-classes
>
> and http://docs.python.org/py3k/library/codecs.html
>
> Why is this? I don't see why they shouldn't work for decoding as well.
Because xmlcharrefreplace and backslashreplace are *error* handlers.
However, the byte sequence b'&#12345;' does *not* contain any bytes that
are undecodable for e.g. the ASCII codec. So there are no errors to
handle.
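
A minimal sketch of that asymmetry, assuming Python 3 and using the ASCII codec purely as an illustration (the variable names are made up for this example):

```python
text = "\u3039"  # one non-ASCII character, code point 12345

# Encoding: ASCII cannot represent the character, so the error handler runs.
encoded = text.encode("ascii", "xmlcharrefreplace")
print(encoded)  # b'&#12345;'

# Decoding: every byte is plain ASCII, so no error handler is ever invoked,
# even with the strictest setting.
decoded = encoded.decode("ascii", "strict")
print(decoded)  # &#12345;
```

The decoded result is just the eight characters of the reference, not the original character.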
> Consider this example using Python 3.2:
>
> >>> b"aaa--\xe9z--\xe9!--bbb".decode("cp932")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'cp932' codec can't decode bytes in position 9-10:
> illegal multibyte sequence
>
> The two bytes b'\xe9!' are an illegal multibyte sequence for CP-932 (also
> known as MS-KANJI or SHIFT-JIS). Is there some reason why this shouldn't
> or can't be supported?
However, the byte sequence b'\xe9!' is not something that the
backslashreplace error handler would have produced. b'\\xe9!' (a sequence
of 5 bytes) would have been, and it would probably decode without any
problems with the cp932 codec.
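
As a rough check of the point above (a sketch only, encoding to ASCII so that the handler is actually triggered):

```python
# 'é' cannot be encoded as ASCII, so backslashreplace substitutes an escape.
escaped = "\xe9!".encode("ascii", "backslashreplace")
print(escaped)       # b'\\xe9!'
print(len(escaped))  # 5

# The escaped form consists only of ASCII bytes; decoding it with cp932
# should raise no error.
roundtrip = escaped.decode("cp932")
```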
> # This doesn't actually work.
> b"aaa--\xe9z--\xe9!--bbb".decode("cp932", "backslashreplace")
> => r'aaa--騷--\xe9\x21--bbb'
>
> and similarly for xmlcharrefreplace.
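
For illustration only: the behaviour asked for above can be approximated with a user-defined handler registered through codecs.register_error. The handler name and function below are hypothetical, and the ASCII codec is used so that the result is easy to verify by hand:

```python
import codecs

def escape_undecodable(exc):
    # Hypothetical handler: only decoding errors are handled here.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Render each undecodable byte as a \xNN escape.
    replacement = "".join("\\x{:02x}".format(byte) for byte in bad)
    # Resume decoding right after the offending bytes.
    return replacement, exc.end

codecs.register_error("escape_undecodable", escape_undecodable)

result = b"caf\xe9!".decode("ascii", "escape_undecodable")
print(result)  # caf\xe9!
```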
This would require a postprocess step *after* the bytes have been
decoded. This is IMHO out of scope for Python's codec machinery.
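
Such a postprocessing step could look roughly like the following sketch. The helper name unescape_charrefs is made up for this example, and it only handles decimal references:

```python
import re

def unescape_charrefs(text):
    """Replace decimal references such as '&#12345;' with the character."""
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)

# Step 1: ordinary decoding; nothing goes wrong here.
decoded = b"&#12345;".decode("ascii")

# Step 2: a separate pass over the decoded string undoes the escaping.
restored = unescape_charrefs(decoded)
```

The second step works on str, not on bytes, which is why it sits outside the codec itself.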
Servus,
Walter