unicode_internal codec and PEP 393

Hi,

The unicode_internal decoder doesn't decode surrogate pairs, so test_unicode.UnicodeTest.test_codecs() is failing on Windows (16-bit wchar_t). I don't know if this codec is still relevant with PEP 393, because the internal representation now depends on the maximum character (Py_UCS1*, Py_UCS2* or Py_UCS4*), whereas it was a fixed size with Python <= 3.2 (Py_UNICODE*).

Should we:

* Drop this codec (public and documented, but I don't know if it is used)
* Use wchar_t* (Py_UNICODE*) to provide a result similar to Python 3.2, and so fix the decoder to handle surrogate pairs
* Use the real representation (Py_UCS1*, Py_UCS2* or Py_UCS4* string)?

The failure on Windows:

FAIL: test_codecs (test.test_unicode.UnicodeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\Buildslave\3.x.moore-windows\build\lib\test\test_unicode.py", line 1408, in test_codecs
    self.assertEqual(str(u.encode(encoding), encoding), u)
AssertionError: '\ud800\udc01\ud840\udc02\ud880\udc03\ud8c0\udc04\ud900\udc05' != '\U00030003\U00040004\U00050005'

Victor
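The failure above is the decoder emitting each 16-bit code unit as its own character instead of combining high/low surrogate pairs. As a minimal illustration of what the combination step does (my own sketch, not CPython's actual decoder code):

```python
def combine_surrogate_pair(high: int, low: int) -> int:
    # Combine a UTF-16 high surrogate (0xD800-0xDBFF) and low surrogate
    # (0xDC00-0xDFFF) into the astral code point they encode together.
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# The first pair from the failing assertion, '\ud800\udc01':
assert combine_surrogate_pair(0xD800, 0xDC01) == 0x10001
```

A correct 16-bit decoder would apply this to every surrogate pair, which is exactly what unicode_internal was not doing here.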

The unicode_internal decoder doesn't decode surrogate pairs, so test_unicode.UnicodeTest.test_codecs() is failing on Windows (16-bit wchar_t). I don't know if this codec is still relevant with PEP 393, because the internal representation now depends on the maximum character (Py_UCS1*, Py_UCS2* or Py_UCS4*), whereas it was a fixed size with Python <= 3.2 (Py_UNICODE*).
The current status is the way it is because we (Torsten and me) didn't bother figuring out the purpose of the internal codec.
Should we:
* Drop this codec (public and documented, but I don't know if it is used)
* Use wchar_t* (Py_UNICODE*) to provide a result similar to Python 3.2, and so fix the decoder to handle surrogate pairs
* Use the real representation (Py_UCS1*, Py_UCS2* or Py_UCS4* string)
It's described as "Return the internal representation of the operand". That would suggest that the last choice (i.e. return the real internal representation) would be best, except that this doesn't round-trip. Adding a prefix byte indicating the kind (and perhaps also the ASCII flag) would then be closest to the real representation.

As that is likely not very useful, and might break some applications of the encoding (if there are any at all) which might expect to pass unicode-internal strings across Python versions, I would then also deprecate the encoding.

Regards,
Martin
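Martin's prefix-byte idea could look like the following sketch. The one-byte kind header and little-endian unit order are my assumptions for illustration, not a format anyone proposed on the list:

```python
def encode_kind_prefixed(s: str) -> bytes:
    # Hypothetical round-trippable "internal" form: one byte giving the
    # PEP 393 kind (code-unit width: 1, 2 or 4 bytes), followed by the
    # code points packed at that width, little-endian.
    maxchar = max(map(ord, s), default=0)
    kind = 1 if maxchar < 0x100 else 2 if maxchar < 0x10000 else 4
    units = b"".join(ord(c).to_bytes(kind, "little") for c in s)
    return bytes([kind]) + units

def decode_kind_prefixed(data: bytes) -> str:
    # Read the kind byte back, then unpack fixed-width code units.
    kind, body = data[0], data[1:]
    return "".join(chr(int.from_bytes(body[i:i + kind], "little"))
                   for i in range(0, len(body), kind))

# Unlike a bare dump of the internal buffer, this round-trips even
# when two strings use different kinds:
s = "a\u00e9\u20ac\U00010001"
assert decode_kind_prefixed(encode_kind_prefixed(s)) == s
```

This shows why a bare buffer dump doesn't round-trip: without the kind byte, the decoder cannot tell whether the payload is 1-, 2- or 4-byte units.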

On Wednesday 9 November 2011 22:03:52, you wrote:
Should we:
* Drop this codec (public and documented, but I don't know if it is used)
* Use wchar_t* (Py_UNICODE*) to provide a result similar to Python 3.2, and
so fix the decoder to handle surrogate pairs
* Use the real representation (Py_UCS1*, Py_UCS2* or Py_UCS4* string)
It's described as "Return the internal representation of the operand". That would suggest that the last choice (i.e. return the real internal representation) would be best, except that this doesn't round-trip. Adding a prefix byte indicating the kind (and perhaps also the ASCII flag) would then be closest to the real representation.
As that is likely not very useful, and might break some applications of the encoding (if there are any at all) which might expect to pass unicode-internal strings across Python versions, I would then also deprecate the encoding.
After a quick search on Google Code Search (before it disappears!), I don't think that "encoding" a Unicode string to its internal PEP 393 representation would satisfy any program. It looks like wchar_t* is a better candidate. Programs maybe use unicode_internal to decode strings coming from libraries using wchar_t* (and no PyUnicodeObject).

taskcoach, drag & drop code using wxPython:

    data = self.__thunderbirdMailDataObject.GetData()
    # We expect the data to be encoded with 'unicode_internal',
    # but on Fedora it can also be 'utf-16', be prepared:
    try:
        data = data.decode('unicode_internal')
    except UnicodeDecodeError:
        data = data.decode('utf-16')

=> thunderbirdMailDataObject.GetData() result type should be a Unicode string, not bytes

hydrat, tokenizer:

    def bytes(str):
        return filter(lambda x: x != '\x00', str.encode('unicode_internal'))

=> this algorithm is really strange...

djebel, fscache/rst.py:

    class RstDocument(object):
        ...
        def __init__(self, path, options={}):
            opts = {'input_encoding': 'euc-jp',
                    'output_encoding': 'unicode_internal',
                    'doctitle_xform': True,
                    'file_insertion_enabled': True}
            ...
            doctree = core.publish_doctree(source=file(path, 'rb').read(),
                                           ..., settings_overrides=opts)
            ...
            content = parts['html_body'] or u''
            if not isinstance(content, unicode):
                content = unicode(content, 'unicode_internal')
            if not isinstance(title, unicode):
                title = unicode(title, 'unicode_internal')
            ...

=> I don't understand this code

Victor

After a quick search on Google codesearch (before it disappears!), I don't think that "encoding" a Unicode string to its internal PEP-393 representation would satisfy any program. It looks like wchar_t* is a better candidate.
Ok. Making it Py_UNICODE, documenting that, and deprecating the encoding sounds fine to me as well.

Regards,
Martin
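The Py_UNICODE behaviour agreed on here is effectively native-endian UTF-16 where wchar_t is 16 bits (Windows) and UTF-32 elsewhere. A rough sketch of that equivalence (my own illustration of the idea, not the codec's implementation):

```python
import ctypes
import sys

# Pick the standard codec that matches this platform's wchar_t width.
# On a 16-bit-wchar_t platform, Py_UNICODE data is UTF-16 with
# surrogate pairs; on 32-bit-wchar_t platforms it is UTF-32.
if ctypes.sizeof(ctypes.c_wchar) == 2:
    wchar_codec = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
else:
    wchar_codec = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"

# An astral character: one wchar_t on UTF-32 platforms, a surrogate
# pair on UTF-16 platforms -- either way the round-trip must hold.
s = "\U00010001"
data = s.encode(wchar_codec)
assert data.decode(wchar_codec) == s
```

This is why fixing the decoder to handle surrogate pairs makes the Windows result agree with Python 3.2: a wchar_t-based unicode_internal is just UTF-16 there, and UTF-16 combines surrogate pairs.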

On 09/11/2011 23:45, "Martin v. Löwis" wrote:
After a quick search on Google codesearch (before it disappears!), I don't think that "encoding" a Unicode string to its internal PEP-393 representation would satisfy any program. It looks like wchar_t* is a better candidate.
Ok. Making it Py_UNICODE, documenting that, and deprecating the encoding sounds fine to me as well.
Done.

Victor
participants (2)
-
"Martin v. Löwis"
-
Victor Stinner