[Patches] [ python-Patches-446754 ] Enhanced unicode constructor

noreply@sourceforge.net noreply@sourceforge.net
Mon, 24 Sep 2001 09:46:45 -0700


Patches item #446754, was opened at 2001-08-01 05:59
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=305470&aid=446754&group_id=5470

Category: None
Group: None
Status: Closed
Resolution: Rejected
Priority: 5
Submitted By: Walter Dörwald (doerwalter)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Enhanced unicode constructor

Initial Comment:
This patch (against descr-branch) uses a slightly 
enhanced version of PyObject_Unicode instead of 
PyUnicode_FromEncodedObject in the unicode constructor 
(Objects/unicodeobject.c/unicode_new), which gives the 
unicode constructor the same functionality as the str 
constructor: creating string representations with the 
__str__ method/tp_str slot. Example:

Python 2.2a1 (#10, Aug  1 2001, 14:26:06) 
[GCC 2.95.2 19991024 (release)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> str("u"), unicode("u")
('u', u'u')
>>> str(u"u"), unicode(u"u")
('u', u'u')
>>> str(None), unicode(None)
('None', u'None')
>>> str(42), unicode(42)
('42', u'42')
>>> str(23.), unicode(23.)
('23.0', u'23.0')
>>> str([1,2,3]), unicode([1,2,3])
('[1, 2, 3]', u'[1, 2, 3]')
>>> str({"u": 23, u"ü": 42}), unicode({"u": 23, u"ü": 42})
("{'u': 23, u'\xfc': 42}", u"{'u': 23, u'\\xfc': 42}")
>>> class foo:
...    def __init__(self, x):
...       self.x = x
...    def __str__(self):
...       return self.x
... 
>>> str(foo("bar")), unicode(foo("bar"))
('bar', u'bar')
>>> str(foo(u"bar")), unicode(foo(u"bar"))
('bar', u'bar')

Passing the encoding and errors arguments still works,
and they will be used for any 8-bit string returned from 
__str__.
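
The lookup order described above can be modeled roughly as follows. This
is only a sketch in Python 3, not the patch itself: bytes stands in for
the 8-bit string, str stands in for the unicode type, and the names
unicode_ex and Foo as well as the "ascii" default encoding are
assumptions made for illustration.

```python
# Python 3 model: `bytes` plays the role of the Python 2 8-bit string,
# `str` plays the role of the Python 2 unicode type.
def unicode_ex(obj, encoding="ascii", errors="strict"):
    """Hypothetical model of the proposed unicode(obj, encoding, errors)."""
    if isinstance(obj, str):
        return obj                        # already text: return unchanged
    if not isinstance(obj, bytes):
        # Call the type's __str__ slot directly; unlike str(), this does
        # not reject a bytes return value.
        obj = type(obj).__str__(obj)
        if isinstance(obj, str):
            return obj
    # An 8-bit string, passed in or returned by __str__, gets decoded.
    return obj.decode(encoding, errors)


class Foo:
    """Hypothetical class whose __str__ returns whatever it was given."""

    def __init__(self, x):
        self.x = x

    def __str__(self):
        return self.x


print(unicode_ex(42))                                  # 42
print(ascii(unicode_ex(Foo(b"b\xe4r"), "latin-1")))    # 'b\xe4r'
```

The encoding and errors arguments only come into play on the last
line of unicode_ex, i.e. when an 8-bit string has to be decoded.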

Perhaps, for symmetry, encoding and errors arguments 
should be added to the str constructor too; they would 
be used to encode a unicode object returned from 
__str__.

One problem is that unicode([u"ü"]) returns 
u"[u'\\xfc']", because __repr__ returns an 8-bit 
escape-encoded string. It would be better if the result 
were u"[u'\xfc']", but this would require a 
PyObject_UnicodeRepr (and/or changing the list tp_str 
slot, and many others, to return Unicode).
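
The escaping problem described above can be seen by analogy in
Python 3, where ascii() escapes non-ASCII characters much like the
8-bit repr does, while repr() keeps them. This is an analogy only;
Python 3 has no u prefix, so the strings are shorter than the ones
quoted above.

```python
# Python 3 analogy: ascii() yields an escape-encoded representation,
# repr() yields one that keeps the non-ASCII character itself.
items = ["\xfc"]               # a list holding the single character U+00FC

escaped = ascii(items)         # contains backslash, x, f, c as 4 characters
unescaped = repr(items)        # contains the real U+00FC character

print(len(escaped), len(unescaped))   # 8 5
print("\\" in escaped)                # True
print("\\" in unescaped)              # False
```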


----------------------------------------------------------------------

>Comment By: Guido van Rossum (gvanrossum)
Date: 2001-09-24 09:46

Message:
Logged In: YES 
user_id=6380

The administrators have been sleeping.

I've approved your request (and about a dozen others, and
rejected 9 spams :-).

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2001-09-24 09:42

Message:
Logged In: YES 
user_id=89016

I tried subscribing from 
http://mail.python.org/mailman/listinfo/python-dev, but 
so far nothing has happened. Can someone please subscribe me 
to python-dev? (email address is walter@livinglogic.de)

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2001-09-21 08:03

Message:
Logged In: YES 
user_id=6380

Consider yourself invited. :-)

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2001-09-21 07:56

Message:
Logged In: YES 
user_id=89016

I'm not on python-dev. 
http://mail.python.org/mailman/listinfo/python-dev 
says: "Subscription is by invitation only"

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2001-09-21 07:28

Message:
Logged In: YES 
user_id=38388

Let's move this discussion to python-dev. I've already started a thread on it.

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2001-09-21 07:13

Message:
Logged In: YES 
user_id=6380

I'm not keen on extending the str() signature. str() takes
other argument types besides Unicode strings, but the
encoding arguments only make sense for Unicode strings -
ergo, I think the encoding should be a Unicode string method
(which it is).

I do think that PyObject_Unicode() in C should be closely
tied to unicode() in Python.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2001-09-21 05:23

Message:
Logged In: YES 
user_id=89016

I've read the following in Misc/NEWS

"PyUnicode_FromEncodedObject() now works very much like 
PyObject_Str(obj) in that it tries to use __str__/tp_str on 
the object if the object is not a string or buffer. This 
makes unicode() behave like str() when applied to non-
string/buffer objects."

I think this is the wrong way: The string constructor uses 
PyObject_Str, so the unicode constructor should use 
PyObject_Unicode (i.e. the extended version).

As PyObject_UnicodeEx has additional arguments, maybe we 
should have a new function PyObject_StrEx with these 
additional arguments too. Then the string constructor could 
have these arguments too:

>>> print str(u"\xfc\u2014", "latin-1", "replace")
ü?

This would really make string and unicode symmetric:
unicode() creates a Unicode string either directly or via 
__str__/tp_str; if it encounters an 8-bit string, it uses the 
encoding and errors parameters to *decode* it. Likewise, 
str() creates an 8-bit string either directly or via 
__str__/tp_str; if it encounters a unicode object, it uses 
the encoding and errors parameters to *encode* it.
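
The encoding direction of this symmetry could be modeled like this.
Again this is only a Python 3 sketch: bytes stands in for the 8-bit
string and str for the unicode type, and str_ex and its "ascii"
default are hypothetical names and values, not an existing API.

```python
# Python 3 model: `bytes` plays the role of the Python 2 8-bit string,
# `str` plays the role of the Python 2 unicode type.
def str_ex(obj, encoding="ascii", errors="strict"):
    """Hypothetical model of the proposed str(obj, encoding, errors)."""
    if isinstance(obj, bytes):
        return obj                        # already 8-bit: return unchanged
    if not isinstance(obj, str):
        # Call the type's __str__ slot directly so a bytes result is allowed.
        obj = type(obj).__str__(obj)
        if isinstance(obj, bytes):
            return obj
    # A unicode object, passed in or returned by __str__, gets encoded.
    return obj.encode(encoding, errors)


print(str_ex("\xfc\u2014", "latin-1", "replace"))   # b'\xfc?'
print(str_ex(42))                                   # b'42'
```

U+2014 has no Latin-1 equivalent, so the "replace" error handler
substitutes a question mark, matching the example output above.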

Would this change to the str constructor result in any 
backwards incompatibilities?

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2001-09-20 05:50

Message:
Logged In: YES 
user_id=89016

OK, the new patch no longer changes the PyObject_Unicode 
signature. Now there are two functions: PyObject_Unicode 
with the old signature, and PyObject_UnicodeEx, which has 
the additional encoding and errors arguments. 
PyObject_Unicode(x) simply calls 
PyObject_UnicodeEx(x, NULL, NULL).


----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2001-09-20 03:56

Message:
Logged In: YES 
user_id=38388

Rejecting this patch: see patch #413333 -- this is basically the same request and the patch is not usable since it 
breaks the Unicode C API. I still like the idea of making str() and unicode() have similar semantics.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2001-08-01 06:55

Message:
Logged In: YES 
user_id=89016

Oops, sorry I accidentally hit the submit button! :-/
The foo class is the one from the first message.

str(...) always does a unicodeescape encoding when it 
encounters a Unicode object, so you'll get escape 
characters, but this will not be decoded when constructing 
the unicode object. I.e. u"..." != unicode(str(u"...")).
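
Assuming the escape behavior described above, the failed round trip
can be reproduced with explicit codecs in Python 3, where bytes
stands in for the 8-bit string. This is a sketch of the effect, not
of what str() and unicode() do internally.

```python
text = "\xfc"                              # the single character U+00FC

# Escape-encode to an 8-bit string: backslash, x, f, c (4 bytes).
escaped = text.encode("unicode_escape")
print(escaped)                             # b'\\xfc'

# A plain ASCII decode keeps the escape sequence as literal characters.
naive = escaped.decode("ascii")
print(naive == text)                       # False

# Only decoding with the matching codec restores the original text.
print(escaped.decode("unicode_escape") == text)   # True
```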

Basically the unicode type should have the same 
functionality as the str type to ease unicode migration.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2001-08-01 06:49

Message:
Logged In: YES 
user_id=89016

>>> unicode(u"ü")
u'\xfc'
*>>> unicode(str(u"ü"))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)

>>> unicode(foo(u"ü"))
u'\xfc'
*>>> unicode(str(foo(u"ü")))    
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2001-08-01 06:45

Message:
Logged In: YES 
user_id=6380

What does this have to offer over unicode(str(x))?

----------------------------------------------------------------------
