[Python-bugs-list] [ python-Bugs-460020 ] bug or feature: unicode() and subclasses

noreply@sourceforge.net noreply@sourceforge.net
Tue, 11 Sep 2001 14:55:10 -0700


Bugs item #460020, was opened at 2001-09-09 08:41
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=460020&group_id=5470

Category: Type/class unification
Group: None
Status: Open
Resolution: Accepted
Priority: 5
Submitted By: Walter Dörwald (doerwalter)
Assigned to: Tim Peters (tim_one)
Summary: bug or feature: unicode() and subclasses

Initial Comment:
The unicode constructor returns the object passed in, 
when an instance of a subclass of unicode is passed in:
--
class U(unicode):
   pass

u1 = U(u"foo")
print type(u1)
u2 = unicode(u1)
print type(u2) 
--
this gives
--
<type '__main__.U'>
<type '__main__.U'>
--
instead of
--
<type '__main__.U'>
<type 'unicode'>
--
as it probably should be (the unicode constructor 
should construct unicode objects). With the current 
behaviour it is nearly impossible to construct a 
unicode object with the value of an instance of a 
unicode subclass, because most methods are optimized 
to return the original object if possible, e.g.
--
print type(unicode.__getslice__(u1, 0, 3))
print type(unicode.__getslice__(u1, 0, 2))
--
gives
--
<type '__main__.U'>
<type 'unicode'>
--
This should be made consistent, so that either a 
unicode object is always returned, or all methods use 
a "virtual constructor", i.e. create an object of the 
type passed in. This would simplify deriving classes 
from unicode as far fewer methods have to be 
overridden.

But first of all the constructor should be fixed, so 
that the argument is returned unmodified only when it 
is an instance of unicode and not of a unicode 
subclass.
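
The requested behaviour can be checked on a current interpreter. The following is a minimal sketch, assuming Python 3, where str plays the role of the Python 2 unicode type; it is not the original test case, and only the class name U is taken from the report.

```python
# Python 3 sketch: str stands in for the Python 2 unicode type (an assumption).
class U(str):
    pass


u1 = U("foo")

# The base-class constructor builds a plain str, not a U.
u2 = str(u1)
print(type(u1).__name__, type(u2).__name__)

# A slice covering the whole string and a shorter slice both give plain str,
# so the two results no longer differ in type.
whole = u1[0:3]
part = u1[0:2]
print(type(whole).__name__, type(part).__name__)
```

Both the constructor and the two slices return the base type here, which is the "always return a unicode object" option described above.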


----------------------------------------------------------------------

>Comment By: Tim Peters (tim_one)
Date: 2001-09-11 14:55

Message:
Logged In: YES 
user_id=31435

For F a subclass of float, disabled the

+F(whatever)

optimization, in

Lib/test/test_descr.py; new revision: 1.43
Objects/floatobject.c; new revision: 2.98

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-09-11 14:45

Message:
Logged In: YES 
user_id=31435

For I a subclass of int, disabled the

+I(whatever)
I(0) << whatever
I(0) >> whatever
I(whatever) << 0
I(whatever) >> 0

optimizations, in

Lib/test/test_descr.py; new revision: 1.42
Objects/intobject.c; new revision: 2.74
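
A small sketch of what these disabled shortcuts mean in practice, assuming Python 3 (not the test code from test_descr.py). The subclass names I and F follow the comments; the concrete operand values are illustrative. The float case (+F(whatever)) is included for comparison.

```python
# Python 3 sketch; I and F are illustrative subclasses named after the comments.
class I(int):
    pass


class F(float):
    pass


results = {
    "+I(5)": +I(5),
    "I(0) << 3": I(0) << 3,
    "I(0) >> 3": I(0) >> 3,
    "I(5) << 0": I(5) << 0,
    "I(5) >> 0": I(5) >> 0,
    "+F(1.5)": +F(1.5),
}

for expr, value in results.items():
    # Each result is an instance of the exact base type, never I or F,
    # even though the value itself is unchanged by the operation.
    print(expr, "->", type(value).__name__, value)
```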

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-09-11 12:50

Message:
Logged In: YES 
user_id=31435

Here we go again.  For tuples, hunted down and disabled
t[:], t*0 and t*1 optimizations when t is of a tuple
subclass type:

Lib/test/test_descr.py; new revision: 1.41
Objects/tupleobject.c; new revision: 2.60

More later (this is time-consuming work).
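
The tuple cases can be sketched the same way, assuming Python 3. The subclass name T and the sample values are illustrative, not taken from the checked-in tests.

```python
# Python 3 sketch; T is an illustrative tuple subclass.
class T(tuple):
    pass


t = T((1, 2, 3))

full_slice = t[:]    # whole-tuple slice
times_zero = t * 0   # repetition by zero
times_one = t * 1    # repetition by one
rebuilt = tuple(t)   # base-class constructor

for name, value in [("t[:]", full_slice), ("t*0", times_zero),
                    ("t*1", times_one), ("tuple(t)", rebuilt)]:
    # None of these hand back the T instance itself.
    print(name, "->", type(value).__name__, value)
```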

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-09-11 09:59

Message:
Logged In: YES 
user_id=31435

Oh well -- it's stuck at "Accepted".

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-09-11 09:56

Message:
Logged In: YES 
user_id=31435

Trying to change Resolution to something sensible 
("Accepted" doesn't make sense).

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2001-09-11 07:49

Message:
Logged In: YES 
user_id=6380

> Python uses it, e.g. in Lib/UserString.py:
[and other cases]

Yes, and I'm no longer comfortable with such code, for
exactly the reason I mentioned, unless it's an explicit and
intentional part of the class specification. :-(

Doing this consistently for all built-in types would cause
too much upheaval -- we'd have to change every single
built-in operation.

But the other interpretation stands:  unicode (and other)
operations should only optimize by returning "self" when
self is a strict instance of the type.
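
The rule in the last paragraph can be written out in pure Python. This is only a sketch under assumptions: the real checks live in the C implementations, the helper name whole_copy is hypothetical, and str (Python 3) stands in for unicode.

```python
def whole_copy(s):
    """Hypothetical helper: return a value equal to s, following the
    'return self only for a strict instance' rule."""
    if type(s) is str:
        # Exact base type: handing back the same object is a safe shortcut.
        return s
    # Subclass instance: build a fresh object of the base type instead.
    # Joining the characters is one way to get a new plain str.
    return "".join(s)


class U(str):
    pass


plain = "foo"
sub = U("foo")

same = whole_copy(plain)   # shortcut taken
fresh = whole_copy(sub)    # shortcut refused
print(same is plain, type(fresh).__name__)
```

Note the test is type(s) is str rather than isinstance(s, str); isinstance would also accept subclass instances, which is exactly the case the rule excludes.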

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2001-09-11 07:03

Message:
Logged In: YES 
user_id=89016

> You're asking for the impossible though.
> I don't think any other OO language supports
> this automatically (although I
> could be wrong). 

Python uses it, e.g. in Lib/UserString.py:
   def rstrip(self): return self.__class__(self.data.rstrip())

So if someone derives a new class X from UserString, 
calling X("y ").rstrip() returns an X object. The only 
assumption that UserString makes, is that the derived class 
has a constructor that can handle at least the same 
arguments as UserString.__init__.

This "virtual constructor" is used in several places:
grep -l "self.__class__(" `find -name '*.py' | grep -v Mac`
returns:
./dist/src/Lib/UserString.py
./dist/src/Lib/copy.py
./dist/src/Lib/MimeWriter.py
./dist/src/Lib/test/test_descr.py
./dist/src/Lib/xml/sax/xmlreader.py
./dist/src/Lib/UserList.py
./dist/src/Demo/pdist/rcvs.py
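
A self-contained sketch of the "virtual constructor" pattern described above, assuming Python 3. Wrapper is a hypothetical stand-in for UserString, reduced to one method; it is not the library's code.

```python
class Wrapper:
    """Hypothetical stand-in for UserString: wraps a str in self.data."""

    def __init__(self, data):
        self.data = str(data)

    def rstrip(self):
        # "Virtual constructor": build the result with the runtime class,
        # so subclasses get instances of themselves back.
        return self.__class__(self.data.rstrip())


class X(Wrapper):
    pass


result = X("y ").rstrip()
print(type(result).__name__, repr(result.data))
```

This works only as long as X accepts the same constructor arguments as Wrapper, which is the assumption mentioned above.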


----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2001-09-11 05:04

Message:
Logged In: YES 
user_id=6380

Apologies. I missed half of what you were asking. It's
impossible for U(...)[0:2] to return a U instance, but I
agree that at least then it should *always* return a
unicode instance.

So this is still open. For Tim: the problem is that a slice
(or other) operation may decide to return the original
object unchanged; this should (probably?) only be done when
the original object is exactly a unicode instance. I'm
afraid that we'll have to systematically look through all
144 Unicode methods to see where they exhibit this behavior.

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2001-09-11 05:01

Message:
Logged In: YES 
user_id=6380

You're asking for the impossible though. I don't think any
other OO language supports this automatically (although I
could be wrong). The problem is, what to do with a subclass
of unicode like this:

class U(unicode):
  def __init__(self, arg):
    self.orig = arg

How is U("foobar")[0:3] going to know what argument to pass
in to __init__? The base class simply can't know what
additional invariants the subclass imposes.
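
The difficulty can be made concrete with a variation on the class above. This is a sketch under assumptions: it uses Python 3 with str in place of unicode, the subclass takes an extra required argument (origin) rather than reusing the single argument, and virtual_slice is a hypothetical function showing what a base-class "virtual constructor" would have to do.

```python
class U(str):
    # Hypothetical subclass whose constructor needs an extra argument.
    def __new__(cls, arg, origin):
        self = super().__new__(cls, arg)
        self.origin = origin
        return self


def virtual_slice(s, start, stop):
    """Hypothetical base-class slice that rebuilds via type(s)."""
    plain = str.__getitem__(s, slice(start, stop))
    return type(s)(plain)


u = U("foobar", origin="config.ini")

try:
    virtual_slice(u, 0, 3)
    error = None
except TypeError as exc:
    # The base class has no value to supply for 'origin'.
    error = exc

print(type(error).__name__)
print(type(u[0:3]).__name__)  # the built-in slice returns the base type
```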

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2001-09-11 04:31

Message:
Logged In: YES 
user_id=89016

Thanks for the quick fix, but the second problem still 
remains:
---
class U(unicode):
   pass

u = U(u"foo")

print type(u[0:3])
print type(u[0:2])
---
This gives:
---
<type '__main__.U'>
<type 'unicode'>
---
I think this should be changed to either always return a 
unicode object, or to always return an instance of the real 
class passed in. (This should be done for all unicode 
methods that return a new unicode object). The second 
solution would simplify creating derived classes, because 
all the methods that return unicode objects would 
automatically return the derived type, so these methods 
don't have to be overridden.


----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-09-10 20:09

Message:
Logged In: YES 
user_id=31435

unicode() repaired in

Include/unicodeobject.h; new revision: 2.33
Lib/test/test_descr.py; new revision: 1.39
Objects/unicodeobject.c; new revision: 2.111

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-09-10 18:43

Message:
Logged In: YES 
user_id=31435

str() repaired (yes, unicode is next <wink>), in

Include/stringobject.h; new revision: 2.31
Lib/test/test_descr.py; new revision: 1.37
Objects/object.c; new revision: 2.146
Objects/stringobject.c; new revision: 2.130

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-09-10 16:39

Message:
Logged In: YES 
user_id=31435

tuple() repaired, in

Include/tupleobject.h; new revision: 2.27
Lib/test/test_descr.py; new revision: 1.36
Objects/abstract.c; new revision: 2.77

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-09-10 14:29

Message:
Logged In: YES 
user_id=31435

float() also repaired, in

Include/floatobject.h; new revision: 2.20
Lib/test/test_descr.py; new revision: 1.34
Objects/abstract.c; new revision: 2.76


----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-09-10 13:57

Message:
Logged In: YES 
user_id=31435

Partially repaired (for int and long) in:

Include/intobject.h; new revision: 2.24
Include/longintrepr.h; new revision: 2.12
Include/longobject.h; new revision: 2.24
Lib/test/test_descr.py; new revision: 1.33
Objects/abstract.c; new revision: 2.75
Objects/longobject.c; new revision: 1.104


----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-09-10 13:45

Message:
Logged In: YES 
user_id=31435

Reassigned to me.

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2001-09-10 07:48

Message:
Logged In: YES 
user_id=6380

Good catch! Other types also suffer from this, e.g. int.

Added to my to-do list.

----------------------------------------------------------------------
