[Python-Dev] unicode_internal codec and the PEP 393
Victor Stinner
victor.stinner at haypocalc.com
Wed Nov 9 11:14:50 CET 2011
Hi,
The unicode_internal decoder doesn't decode surrogate pairs, so
test_unicode.UnicodeTest.test_codecs() is failing on Windows (16-bit wchar_t).
I don't know if this codec is still relevant with PEP 393, because the
internal representation now depends on the maximum character (Py_UCS1*,
Py_UCS2* or Py_UCS4*), whereas it had a fixed size with Python <= 3.2
(Py_UNICODE*).
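To illustrate the variable-width representation, here is a rough sketch that
estimates the storage used per character by comparing sys.getsizeof() for two
strings that differ by one character. It assumes a CPython build with PEP 393
(3.3 or later); bytes_per_char is a helper name invented for this example, and
the measurement relies on an implementation detail, not on a documented
guarantee.

```python
import sys

def bytes_per_char(sample):
    """Estimate per-character storage by growing a string by one character.

    `sample` is a single character; the result reflects the CPython
    implementation detail described in PEP 393 (assumption: CPython >= 3.3).
    """
    return sys.getsizeof(sample * 1001) - sys.getsizeof(sample * 1000)

print(bytes_per_char("a"))           # character in the Latin-1 range
print(bytes_per_char("\u0100"))      # BMP character above U+00FF
print(bytes_per_char("\U00010000"))  # non-BMP character
```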
Should we:
* Drop this codec (public and documented, but I don't know if it is used)
* Use wchar_t* (Py_UNICODE*) to provide a result similar to Python 3.2, and
so fix the decoder to handle surrogate pairs
* Use the real representation (Py_UCS1*, Py_UCS2* or Py_UCS4* string)
?
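For the second option, the decoder would have to combine each high/low
surrogate pair into one code point. The following is only a pure-Python
sketch of that arithmetic, not the C code of the codec; join_surrogates is a
hypothetical helper name, and it takes a list of 16-bit code units as input.

```python
def join_surrogates(units):
    """Combine UTF-16 surrogate pairs in a sequence of 16-bit code units.

    Hypothetical helper for illustration; lone surrogates are kept as-is.
    """
    out = []
    i = 0
    n = len(units)
    while i < n:
        hi = units[i]
        is_pair = (
            0xD800 <= hi <= 0xDBFF
            and i + 1 < n
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            lo = units[i + 1]
            # Standard UTF-16 formula for code points above U+FFFF
            code = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
            out.append(chr(code))
            i += 2
        else:
            out.append(chr(hi))
            i += 1
    return "".join(out)

# The first two pairs from the failing test output
print(ascii(join_surrogates([0xD800, 0xDC01, 0xD840, 0xDC02])))
```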
The failure on Windows:
FAIL: test_codecs (test.test_unicode.UnicodeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\Buildslave\3.x.moore-windows\build\lib\test\test_unicode.py", line 1408, in test_codecs
    self.assertEqual(str(u.encode(encoding),encoding), u)
AssertionError: '\ud800\udc01\ud840\udc02\ud880\udc03\ud8c0\udc04\ud900\udc05' != '\U00030003\U00040004\U00050005'
Victor