[issue8092] utf8, backslashreplace and surrogates

STINNER Victor report at bugs.python.org
Tue Apr 20 23:50:26 CEST 2010


STINNER Victor <victor.stinner at haypocalc.com> added the comment:

Oh no :-( I just realized that I removed the first message of this issue (msg100687)! Here is a copy/paste of that message:
---
This issue is a regression introduced by r72208, the fix for issue #3672.

The attached patch fixes PyUnicode_EncodeUTF8() when unicode_encode_call_errorhandler() returns a unicode string (e.g. with the backslashreplace error handler). I don't know the unicodeobject.c code very well, so my patch is probably far from perfect.

I suppose that the maximum length of an escaped character is 8 bytes (xmlcharrefreplace error handler for U+DFFF, i.e. "&#57343;"). When the first lone surrogate is found, the buffer is reallocated to size*8 bytes. The escaped characters have to be ASCII, otherwise a UnicodeEncodeError is raised.

Note: unicode_encode_ucs1() doesn't have a hardcoded limit for the maximum length of an escaped string. Its code might be reused in PyUnicode_EncodeUTF8() to remove the hardcoded limit.
---
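
For illustration, here is a small Python-level sketch of the behaviour the message describes. It is not the patch itself. It assumes a Python 3 interpreter in which this issue is fixed, and escape_lone_surrogate() is a hypothetical helper name, not something from unicodeobject.c. It checks how long the replacement for a lone surrogate can get with the two error handlers mentioned above.

```python
# Sketch only: assumes a Python 3 interpreter where issue8092 is fixed.
# A lone surrogate cannot be encoded to UTF-8 in strict mode, so the
# error handler supplies the replacement text instead.

def escape_lone_surrogate(code_point, errors):
    """Encode one lone surrogate to UTF-8 with the given error handler.

    Hypothetical helper for this example, not part of CPython.
    """
    return chr(code_point).encode("utf-8", errors)

# Every code point in the surrogate range U+D800..U+DFFF.
surrogates = range(0xD800, 0xE000)

# Longest replacement each handler produces for any lone surrogate.
worst_backslash = max(
    len(escape_lone_surrogate(cp, "backslashreplace")) for cp in surrogates
)
worst_xmlcharref = max(
    len(escape_lone_surrogate(cp, "xmlcharrefreplace")) for cp in surrogates
)

print(escape_lone_surrogate(0xDFFF, "backslashreplace"))
print(escape_lone_surrogate(0xDFFF, "xmlcharrefreplace"))
print(worst_backslash, worst_xmlcharref)
```

With these two built-in handlers the replacement never exceeds 8 bytes per lone surrogate, which matches the size*8 estimate. A custom error handler could return a longer string, so a fixed factor is only safe for the handlers checked here.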

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue8092>
_______________________________________

