[New-bugs-announce] [issue13916] disallow the "surrogatepass" handler for non utf-* encodings

Kang-Hao (Kenny) Lu report at bugs.python.org
Wed Feb 1 00:51:11 CET 2012


New submission from Kang-Hao (Kenny) Lu <kennyluck at csail.mit.edu>:

Currently the "surrogatepass" handler always encodes surrogates as UTF-8, so the result of, say, "\udc80".encode("latin-1", "surrogatepass").decode("latin-1") might be unexpected, and I don't even know what, say, "\udc80\udc80".encode("big5", "surrogatepass").decode("big5") would return. Even though the documentation says that "surrogatepass" is specific to utf-8, the current behavior is arguably not too harmful, thanks to PyBytesObject's trailing '\0' (so that the check ((p[0] & 0xf0) == 0xe0 || (p[1] & 0xc0) == 0x80 || (p[2] & 0xc0) == 0x80) in PyCodec_SurrogatePassErrors would not crash).
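To make the concern concrete, here is a minimal sketch (not taken from the tracker) of what the handler's UTF-8 output looks like when it is read back through latin-1. The variable names are illustrative only, and what the direct latin-1 call does depends on the Python version, so the sketch catches the exception instead of assuming one outcome.

```python
# U+DC80 is a lone low surrogate; strict UTF-8 encoding rejects it.
lone = "\udc80"

# With "surrogatepass", the UTF-8 codec emits the three-byte form.
utf8_bytes = lone.encode("utf-8", "surrogatepass")
print(utf8_bytes)  # b'\xed\xb2\x80'

# The UTF-8 round trip restores the surrogate.
round_trip = utf8_bytes.decode("utf-8", "surrogatepass")

# If the same three bytes are emitted for a latin-1 target, decoding
# them as latin-1 yields three unrelated characters, not the surrogate.
as_latin1 = utf8_bytes.decode("latin-1")
print(len(as_latin1), repr(as_latin1))

# The direct call may return bytes or raise, depending on the
# interpreter version; no particular outcome is assumed here.
try:
    direct = lone.encode("latin-1", "surrogatepass")
except UnicodeEncodeError as exc:
    direct = None
    print("latin-1 + surrogatepass raised:", exc.reason)
else:
    print("latin-1 + surrogatepass returned:", direct)
```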

However, I suggest that the system either 1) raise LookupError early, or 2) raise the original Unicode(Decode|Encode)Error as soon as PyCodec_SurrogatePassErrors is called. I prefer the former.
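As a rough illustration of option 1, the following Python-level sketch rejects the handler before any encoding takes place. It is not the proposed C change: the helper name encode_surrogatepass and the SURROGATEPASS_CODECS set are hypothetical, and the list of accepted codecs is an assumption based on the utf-8, utf-16 and utf-32 families mentioned in this report.

```python
import codecs

# Assumption: the codecs for which "surrogatepass" would be allowed.
SURROGATEPASS_CODECS = frozenset({
    "utf-8",
    "utf-16", "utf-16-le", "utf-16-be",
    "utf-32", "utf-32-le", "utf-32-be",
})

def encode_surrogatepass(text, encoding):
    """Hypothetical helper: encode with "surrogatepass", failing early
    with LookupError when the codec is not a supported UTF codec."""
    # codecs.lookup() normalizes aliases (e.g. "utf8" -> "utf-8") and
    # itself raises LookupError for unknown encodings.
    name = codecs.lookup(encoding).name
    if name not in SURROGATEPASS_CODECS:
        raise LookupError(
            "'surrogatepass' is not supported for encoding %r" % name
        )
    return text.encode(encoding, "surrogatepass")
```

With this helper, encode_surrogatepass("\udc80", "utf8") returns the UTF-8 bytes, while passing "latin-1" or "big5" fails with LookupError before the codec is ever invoked.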

Having this could shorten PyCodec_SurrogatePassErrors significantly in the patch I will shortly submit for issue #12892, as all the error conditions for utf-8, utf-16 and utf-32 are predictable* and almost all the conditionals could be removed. (The * statement is arguable if someone initializes interp->codec_search_path before _PyCodecRegistry_Init and the utf-16/32 encoders are overwritten. I don't think we need to worry about this too much, though. Or am I wrong here?)

----------
components: Unicode
messages: 152416
nosy: ezio.melotti, kennyluck
priority: normal
severity: normal
status: open
title: disallow the "surrogatepass" handler for non utf-* encodings
type: behavior
versions: Python 3.1, Python 3.2, Python 3.3

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue13916>
_______________________________________

