[Patches] [ python-Patches-446754 ] Enhanced unicode constructor
noreply@sourceforge.net
Fri, 21 Sep 2001 07:56:26 -0700
Patches item #446754, was opened at 2001-08-01 05:59
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=305470&aid=446754&group_id=5470
Category: None
Group: None
Status: Closed
Resolution: Rejected
Priority: 5
Submitted By: Walter Dörwald (doerwalter)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Enhanced unicode constructor
Initial Comment:
This patch (against descr-branch) uses a slightly
enhanced version of PyObject_Unicode instead of
PyUnicode_FromEncodedObject in the unicode constructor
(Objects/unicodeobject.c, unicode_new), which gives the
unicode constructor the same functionality as the str
constructor: creating string representations with the
__str__ method/tp_str slot. Example:
Python 2.2a1 (#10, Aug 1 2001, 14:26:06)
[GCC 2.95.2 19991024 (release)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> str("u"), unicode("u")
('u', u'u')
>>> str(u"u"), unicode(u"u")
('u', u'u')
>>> str(None), unicode(None)
('None', u'None')
>>> str(42), unicode(42)
('42', u'42')
>>> str(23.), unicode(23.)
('23.0', u'23.0')
>>> str([1,2,3]), unicode([1,2,3])
('[1, 2, 3]', u'[1, 2, 3]')
>>> str({"u": 23, u"ü": 42}), unicode({"u": 23, u"ü": 42})
("{'u': 23, u'\xfc': 42}", u"{'u': 23, u'\\xfc': 42}")
>>> class foo:
...     def __init__(self, x):
...         self.x = x
...     def __str__(self):
...         return self.x
...
>>> str(foo("bar")), unicode(foo("bar"))
('bar', u'bar')
>>> str(foo(u"bar")), unicode(foo(u"bar"))
('bar', u'bar')
Passing the encoding and errors arguments still works,
and they will be used for any 8-bit string returned from
__str__.
Perhaps, for symmetry, encoding and errors arguments
should be added to the str constructor too; they would
be used to encode the object when __str__ returns a
unicode object.
One problem is that unicode([u"ü"]) returns
u"[u'\\xfc']" because __repr__ returns an 8-bit
escape-encoded string. It would be better if the result was
u"[u'\xfc']", but this would require a
PyObject_UnicodeRepr (and/or changing the list tp_str
slot, and many others, to return Unicode).
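
The escaping difference described above can be reproduced in present-day Python 3, as a rough analogy rather than the 2.2 behaviour itself. The sketch assumes that the built-in ascii() is a fair stand-in for the old 8-bit, escape-encoded __repr__, while repr() plays the role of the hoped-for Unicode-aware representation. The variable names are illustrative only.

```python
# Python 3 analogy, not Python 2.2 code.
# Assumption: ascii() stands in for the 8-bit, escape-encoded __repr__ of
# Python 2. Python 3 reprs also drop the u prefix, so the strings differ
# slightly from the ones quoted in the message.
items = ["\xfc"]  # a list holding the single character U+00FC (u-umlaut)

escaped = ascii(items)   # non-ASCII characters are backslash-escaped
readable = repr(items)   # non-ASCII characters are kept as real characters

print(escaped)                      # ['\xfc'] as literal backslash, x, f, c
print(len(escaped), len(readable))  # 8 5
```

The escaped form carries four ASCII characters where the readable form carries one real character, which is the same gap as between u"[u'\\xfc']" and u"[u'\xfc']".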
----------------------------------------------------------------------
>Comment By: Walter Dörwald (doerwalter)
Date: 2001-09-21 07:56
Message:
Logged In: YES
user_id=89016
I'm not on python-dev.
http://mail.python.org/mailman/listinfo/python-dev
says: "Subscription is by invitation only"
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2001-09-21 07:28
Message:
Logged In: YES
user_id=38388
Let's move this discussion to python-dev. I've already started a thread on it.
----------------------------------------------------------------------
Comment By: Guido van Rossum (gvanrossum)
Date: 2001-09-21 07:13
Message:
Logged In: YES
user_id=6380
I'm not keen on extending the str() signature. str() takes
other argument types besides Unicode strings, but the
encoding arguments only make sense for Unicode strings -
ergo, I think the encoding should be a Unicode string method
(which it is).
I do think that PyObject_Unicode() in C should be closely
tied to unicode() in Python.
----------------------------------------------------------------------
Comment By: Walter Dörwald (doerwalter)
Date: 2001-09-21 05:23
Message:
Logged In: YES
user_id=89016
I've read the following in Misc/NEWS
"PyUnicode_FromEncodedObject() now works very much like
PyObject_Str(obj) in that it tries to use __str__/tp_str on
the object if the object is not a string or buffer. This
makes unicode() behave like str() when applied to non-
string/buffer objects."
I think this is the wrong way: The string constructor uses
PyObject_Str, so the unicode constructor should use
PyObject_Unicode (i.e. the extended version).
As PyObject_UnicodeEx has additional arguments, maybe we
should have a new function PyObject_StrEx with these
additional arguments too. Then the string constructor could
have these arguments too:
>>> print str(u"\xfc\u2014", "latin-1", "replace")
ü?
This would really make string and unicode symmetric:
unicode() creates a string either directly or via
__str__/tp_str; if it encounters an 8-bit string, it uses the
encoding and errors parameters to *decode* it. Likewise,
str() creates an 8-bit string either directly or via
__str__/tp_str; if it encounters a unicode object, it uses
the encoding and errors parameters to *encode* it.
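
A minimal sketch of the symmetric behaviour described above, written for present-day Python 3, where bytes plays the role of the 8-bit string and str the role of the unicode object. The helper names to_text and to_bytes and the class Foo are hypothetical and are not part of the patch; the "ascii" default is likewise an assumption made for the sketch.

```python
def to_text(obj, encoding="ascii", errors="strict"):
    """Model of the proposed unicode(): return text, decoding 8-bit results."""
    if not isinstance(obj, (str, bytes)):
        # Call __str__ directly: the built-in str() would reject a bytes result.
        obj = type(obj).__str__(obj)
    if isinstance(obj, bytes):
        return obj.decode(encoding, errors)
    return obj


def to_bytes(obj, encoding="ascii", errors="strict"):
    """Model of the proposed str(): return 8-bit data, encoding text results."""
    if not isinstance(obj, (str, bytes)):
        obj = type(obj).__str__(obj)
    if isinstance(obj, str):
        return obj.encode(encoding, errors)
    return obj


class Foo:
    """Hypothetical stand-in for the foo class from the first message."""

    def __init__(self, x):
        self.x = x

    def __str__(self):
        return self.x


# __str__ returns 8-bit data, so to_text decodes it with the given encoding.
text_from_bytes = to_text(Foo(b"\xfc"), "latin-1")
# A text object is encoded; the unencodable U+2014 is replaced by "?".
bytes_from_text = to_bytes("\xfc\u2014", "latin-1", "replace")
# Other objects go through __str__ first, then get encoded.
bytes_from_int = to_bytes(42)
```

Here bytes_from_text is b"\xfc?", which corresponds to the "ü?" output of the str(u"\xfc\u2014", "latin-1", "replace") example.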
Would this change to the str constructor result in any
backwards incompatibilities?
----------------------------------------------------------------------
Comment By: Walter Dörwald (doerwalter)
Date: 2001-09-20 05:50
Message:
Logged In: YES
user_id=89016
OK, the new patch no longer changes the PyObject_Unicode
signature. Now there are two functions: PyObject_Unicode
with the old signature, and PyObject_UnicodeEx, which has
the additional encoding and errors arguments.
PyObject_Unicode(x) simply calls
PyObject_UnicodeEx(x, NULL, NULL).
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2001-09-20 03:56
Message:
Logged In: YES
user_id=38388
Rejecting this patch: see patch #413333 -- this is basically the same request and the patch is not usable since it
breaks the Unicode C API. I still like the idea of making str() and unicode() have similar semantics.
----------------------------------------------------------------------
Comment By: Walter Dörwald (doerwalter)
Date: 2001-08-01 06:55
Message:
Logged In: YES
user_id=89016
Oops, sorry I accidentally hit the submit button! :-/
The foo class is the one from the first message.
str(...) always does a unicodeescape encoding when it
encounters a Unicode object, so you'll get escape
characters, but this will not be decoded when constructing
the unicode object. I.e. u"..." != unicode(str(u"...")).
Basically the unicode type should have the same
functionality as the str type to ease unicode migration.
----------------------------------------------------------------------
Comment By: Walter Dörwald (doerwalter)
Date: 2001-08-01 06:49
Message:
Logged In: YES
user_id=89016
>>> unicode(u"ü")
u'\xfc'
*>>> unicode(str(u"ü"))
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
>>> unicode(foo(u"ü"))
u'\xfc'
*>>> unicode(str(foo(u"ü")))
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
----------------------------------------------------------------------
Comment By: Guido van Rossum (gvanrossum)
Date: 2001-08-01 06:45
Message:
Logged In: YES
user_id=6380
What does this have to offer over unicode(str(x))?
----------------------------------------------------------------------