[Python-Dev] unicode_internal codec and the PEP 393

Wed Nov 9 22:49:35 CET 2011

On Wednesday 9 November 2011 22:03:52, you wrote:

> 
> > Should we:
> >  * Drop this codec (public and documented, but I don't know if it is
> >    used)
> >  * Use wchar_t* (Py_UNICODE*) to provide a result similar to
> >    Python 3.2, and so fix the decoder to handle surrogate pairs
> >  * Use the real representation (Py_UCS1*, Py_UCS2 or Py_UCS4* string)
> 
> It's described as "Return the internal representation of the operand".
> That would suggest that the last choice (i.e. return the real internal
> representation) would be best, except that this doesn't round-trip.
> Adding a prefix byte indicating the kind (and perhaps also the ASCII
> flag) would then be closest to the real representation.
> 
> As that is likely not very useful, and might break some applications
> of the encoding (if there are any at all) which might expect to
> pass unicode-internal strings across Python versions, I would then
> also deprecate the encoding.

After a quick search on Google Code Search (before it disappears!), I don't
think that "encoding" a Unicode string to its internal PEP 393 representation
would satisfy any program. wchar_t* looks like a better candidate: programs
probably use unicode_internal to decode strings coming from libraries that
use wchar_t* (and not PyUnicodeObject).
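
To make that use case concrete, here is a minimal Python 3 sketch of decoding
a raw wchar_t buffer without unicode_internal. It assumes that wchar_t holds
UTF-16 or UTF-32 code units in native byte order; wchar_codec() and
decode_wchar_bytes() are hypothetical helper names, not part of CPython.

```python
import ctypes
import sys


def wchar_codec():
    """Return the name of a codec matching the platform's wchar_t layout.

    Assumption: wchar_t stores UTF-16 (2 bytes) or UTF-32 (4 bytes)
    code units in native byte order.
    """
    size = ctypes.sizeof(ctypes.c_wchar)
    suffix = "le" if sys.byteorder == "little" else "be"
    if size == 2:
        return "utf-16-" + suffix
    if size == 4:
        return "utf-32-" + suffix
    raise ValueError("unsupported wchar_t size: %d" % size)


def decode_wchar_bytes(data):
    """Decode raw wchar_t memory (as bytes) into a str."""
    return data.decode(wchar_codec())


# Simulate a library handing back a NUL-terminated wchar_t* buffer.
buf = ctypes.create_unicode_buffer("h\xe9llo")
raw = ctypes.string_at(ctypes.addressof(buf), ctypes.sizeof(buf))
text = decode_wchar_bytes(raw).rstrip("\0")
```

Naming the codec explicitly keeps the result independent of how the
interpreter stores strings internally.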

taskcoach, drag & drop code using wxPython:

     data = self.__thunderbirdMailDataObject.GetData()
     # We expect the data to be encoded with 'unicode_internal',
     # but on Fedora it can also be 'utf-16', be prepared:
     try:
         data = data.decode('unicode_internal')
     except UnicodeDecodeError:
         data = data.decode('utf-16')

=> thunderbirdMailDataObject.GetData() should return a Unicode string, not
bytes
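
For illustration only, the same fallback could be written without
unicode_internal, as in the Python 3 sketch below. to_text() is a
hypothetical helper, and the default codec list ('utf-32', then 'utf-16') is
an assumption about what the toolkit may return, not something taken from
taskcoach or wxPython.

```python
def to_text(data, encodings=("utf-32", "utf-16")):
    """Return data as str, trying each candidate codec in order.

    A str is returned unchanged; bytes are decoded with the first
    codec that accepts them. The codec list is an assumption.
    """
    if isinstance(data, str):
        return data
    last_error = None
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError as exc:
            last_error = exc
    if last_error is None:
        raise ValueError("no candidate encodings given")
    raise last_error


subject = to_text("abc".encode("utf-16"))
```

Guessing the encoding from decode errors stays fragile, so this is only a
stopgap until the API returns a Unicode string directly.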

hydrat, tokenizer:

     def bytes(str):
         return filter(lambda x: x != '\x00', str.encode('unicode_internal'))

=> this algorithm is really strange...
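
One possible reason it looks strange: dropping the NUL bytes is lossy, so
different strings can give the same result. The Python 3 sketch below uses
'utf-32-le' as a stand-in for a 4-byte wchar_t unicode_internal (an
assumption); strip_nul_bytes() is a hypothetical name for the hydrat helper.

```python
def strip_nul_bytes(text, encoding="utf-32-le"):
    """Encode text and drop every zero byte, like the hydrat helper.

    'utf-32-le' stands in for unicode_internal with a 4-byte wchar_t
    on a little-endian machine (an assumption).
    """
    raw = text.encode(encoding)
    return bytes(b for b in raw if b != 0)


# 'a' followed by U+0100 collides with the single character U+0161.
first = strip_nul_bytes("a\u0100")
second = strip_nul_bytes("\u0161")
```

Both calls return b'a\x01', so the original characters cannot be recovered
from the output.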

djebel, fscache/rst.py:

     class RstDocument(object):
         ...
         def __init__(self, path, options={}):
             opts = {'input_encoding': 'euc-jp',
                     'output_encoding': 'unicode_internal',
                     'doctitle_xform': True,
                     'file_insertion_enabled': True}
             ...
             doctree = core.publish_doctree(source=file(path, 'rb').read(),
                                            ...,
                                            settings_overrides=opts)
             ...
             content = parts['html_body'] or u''
             if not isinstance(content, unicode):
                 content = unicode(content, 'unicode_internal')
             if not isinstance(title, unicode):
                 title = unicode(title, 'unicode_internal')
             ...

=> I don't understand this code

Victor