[Python-Dev] unicode_internal codec and the PEP 393
Victor Stinner
victor.stinner at haypocalc.com
Wed Nov 9 22:49:35 CET 2011
On Wednesday, 9 November 2011 at 22:03:52, you wrote:
>
> > Should we:
> > * Drop this codec (public and documented, but I don't know if it is
> >   used)
> > * Use wchar_t* (Py_UNICODE*) to provide a result similar to
> >   Python 3.2, and so fix the decoder to handle surrogate pairs
> > * Use the real representation (Py_UCS1*, Py_UCS2 or Py_UCS4* string)
>
> It's described as "Return the internal representation of the operand".
> That would suggest that the last choice (i.e. return the real internal
> representation) would be best, except that this doesn't round-trip.
> Adding a prefix byte indicating the kind (and perhaps also the ASCII
> flag) would then be closest to the real representation.
>
> As that is likely not very useful, and might break some applications
> of the encoding (if there are any at all) which might expect to
> pass unicode-internal strings across Python versions, I would then
> also deprecate the encoding.
After a quick search on Google codesearch (before it disappears!), I don't
think that "encoding" a Unicode string to its internal PEP-393 representation
would satisfy any program. wchar_t* looks like a better candidate.
Programs probably use unicode_internal to decode strings coming from libraries
that use wchar_t* (and not PyUnicodeObject).
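
To illustrate the wchar_t* use case, here is a minimal sketch of decoding a
native wchar_t buffer without unicode_internal, by choosing an explicit UTF
codec from the platform's wchar_t width and byte order. The helper names
wchar_codec and decode_wchar_bytes are hypothetical, and the sketch assumes
the library stores UTF-16 in a 2-byte wchar_t and UTF-32 in a 4-byte one.

```python
import ctypes
import sys


def wchar_codec():
    """Return the UTF codec name matching the native wchar_t layout.

    Assumption: a 2-byte wchar_t holds UTF-16, a 4-byte wchar_t holds UTF-32.
    """
    width = ctypes.sizeof(ctypes.c_wchar)
    base = "utf-16" if width == 2 else "utf-32"
    suffix = "-le" if sys.byteorder == "little" else "-be"
    return base + suffix


def decode_wchar_bytes(raw):
    """Decode raw bytes that hold a native wchar_t array."""
    return raw.decode(wchar_codec())


# Build a real wchar_t buffer with ctypes and read back its raw memory.
buf = ctypes.create_unicode_buffer("h\u00e9llo")
raw = bytes(buf)  # includes the terminating NUL wchar_t
decoded = decode_wchar_bytes(raw).rstrip("\x00")
assert decoded == "h\u00e9llo"
```

With an explicit codec, surrogate pairs in a 2-byte wchar_t are handled by the
UTF-16 decoder, so the caller does not depend on Python's internal layout.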
taskcoach, drag & drop code using wxPython:

    data = self.__thunderbirdMailDataObject.GetData()
    # We expect the data to be encoded with 'unicode_internal',
    # but on Fedora it can also be 'utf-16', be prepared:
    try:
        data = data.decode('unicode_internal')
    except UnicodeDecodeError:
        data = data.decode('utf-16')

=> thunderbirdMailDataObject.GetData() should return a Unicode string, not
bytes

hydrat, tokenizer:

    def bytes(str):
        return filter(lambda x: x != '\x00', str.encode('unicode_internal'))

=> this algorithm is really strange...
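
A rough sketch of why this looks strange: the function encodes the string to
its internal representation and then drops every NUL byte. The code below
emulates that on Python 3, assuming a wide (UCS-4) little-endian build, by
using the 'utf-32-le' codec in place of unicode_internal. The name
strip_nul_bytes is hypothetical.

```python
def strip_nul_bytes(text, codec="utf-32-le"):
    """Encode text and drop all zero bytes, like the hydrat tokenizer.

    Assumption: 'utf-32-le' stands in for unicode_internal on a wide,
    little-endian Python 2 build.
    """
    raw = text.encode(codec)
    return bytes(b for b in raw if b != 0)


# Pure ASCII text survives unchanged.
assert strip_nul_bytes("abc") == b"abc"

# Other characters collapse to their non-zero bytes, so distinct
# characters can become indistinguishable: U+0100 and U+0001 both
# yield the single byte 0x01.
assert strip_nul_bytes("\u0100") == strip_nul_bytes("\u0001") == b"\x01"
```

The result also depends on byte order and code unit width, so the same input
would give different bytes on a different build.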

djebel, fscache/rst.py

    class RstDocument(object):
        ...
        def __init__(self, path, options={}):
            opts = {'input_encoding': 'euc-jp',
                    'output_encoding': 'unicode_internal',
                    'doctitle_xform': True,
                    'file_insertion_enabled': True}
            ...
            doctree = core.publish_doctree(source=file(path, 'rb').read(),
                                           ...,
                                           settings_overrides=opts)
            ...
            content = parts['html_body'] or u''
            if not isinstance(content, unicode):
                content = unicode(content, 'unicode_internal')
            if not isinstance(title, unicode):
                title = unicode(title, 'unicode_internal')
            ...

=> I don't understand this code
Victor