[Python-bugs-list] [ python-Bugs-460020 ] bug or feature: unicode() and subclasses
noreply@sourceforge.net
Tue, 11 Sep 2001 15:32:42 -0700
Bugs item #460020, was opened at 2001-09-09 08:41
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=460020&group_id=5470
Category: Type/class unification
Group: None
Status: Open
Resolution: Accepted
Priority: 5
Submitted By: Walter Dörwald (doerwalter)
Assigned to: Tim Peters (tim_one)
Summary: bug or feature: unicode() and subclasses
Initial Comment:
The unicode constructor returns the object passed in,
when an instance of a subclass of unicode is passed in:
--
class U(unicode):
    pass
u1 = U(u"foo")
print type(u1)
u2 = unicode(u1)
print type(u2)
--
this gives
--
<type '__main__.U'>
<type '__main__.U'>
--
instead of
--
<type '__main__.U'>
<type 'unicode'>
--
as it probably should be (The unicode constructor
should construct unicode objects). With the current
behaviour it is nearly impossible to construct a
unicode object with the value of an instance of a
unicode subclass, because most methods are optimized
to return the original object if possible, e.g.
--
print type(unicode.__getslice__(u1, 0, 3))
print type(unicode.__getslice__(u1, 0, 2))
--
gives
--
<type '__main__.U'>
<type 'unicode'>
--
This should be made consistent, so that either a
unicode object is always returned, or all methods use
a "virtual constructor", i.e. create an object of the
type passed in. This would simplify deriving classes
from unicode, as far fewer methods would have to be
overridden.
But first of all the constructor should be fixed, so
that the argument is returned unmodified only when it
is an instance of unicode and not of a unicode
subclass.
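
The constructor rule requested above can be sketched roughly as follows. This is only an illustration, assuming a current Python 3 interpreter in which `str` plays the role that `unicode` plays in this report; the helper name `exact_str` is hypothetical and not part of any library.

```python
class U(str):
    """Trivial subclass, mirroring class U(unicode) from the report."""
    pass


def exact_str(value):
    """Hypothetical helper: hand back `value` itself only if it is exactly a str.

    An instance of a str subclass is copied into a plain str instead.
    """
    if type(value) is str:
        return value
    # str.__str__ bypasses any __str__ override in the subclass and
    # produces a plain str with the same characters.
    return str.__str__(value)


u1 = U("foo")
u2 = exact_str(u1)
plain = "bar"

print(type(u1).__name__)         # the subclass
print(type(u2).__name__)         # the base type
print(exact_str(plain) is plain) # exact instances are returned unchanged
```

The identity shortcut is kept for exact instances, so the common case stays cheap.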
----------------------------------------------------------------------
>Comment By: Tim Peters (tim_one)
Date: 2001-09-11 15:32
Message:
Logged In: YES
user_id=31435
A number of similar long optimizations were disabled for
long subclasses, in
Lib/test/test_descr.py; new revision: 1.44
Objects/longobject.c; new revision: 1.105
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2001-09-11 14:55
Message:
Logged In: YES
user_id=31435
For F a subclass of float, disabled the
+F(whatever)
optimization, in
Lib/test/test_descr.py; new revision: 1.43
Objects/floatobject.c; new revision: 2.98
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2001-09-11 14:45
Message:
Logged In: YES
user_id=31435
For I a subclass of int, disabled the
+I(whatever)
I(0) << whatever
I(0) >> whatever
I(whatever) << 0
I(whatever) >> 0
optimizations, in
Lib/test/test_descr.py; new revision: 1.42
Objects/intobject.c; new revision: 2.74
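
A quick way to check the listed operations is sketched below. It assumes a current Python 3 interpreter (where `int` covers both the int and long types of this report); the class name `I` follows the comment above, and the operand values are arbitrary choices for illustration.

```python
class I(int):
    """Trivial int subclass, as in the comment above."""
    pass


# Each expression is one of the shortcuts listed above; with the
# optimization disabled, the result should be a plain int.
results = {
    "+I(5)": +I(5),
    "I(0) << 3": I(0) << 3,
    "I(0) >> 3": I(0) >> 3,
    "I(5) << 0": I(5) << 0,
    "I(5) >> 0": I(5) >> 0,
}

for label, value in results.items():
    print(label, "->", type(value).__name__, value)
```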
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2001-09-11 12:50
Message:
Logged In: YES
user_id=31435
Here we go again. For tuples, hunted down and disabled the
t[:], t*0 and t*1 optimizations when t is of a tuple
subclass type:
Lib/test/test_descr.py; new revision: 1.41
Objects/tupleobject.c; new revision: 2.60
More later (this is time-consuming work).
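
The three tuple cases can be checked with a short sketch like the one below, assuming a current Python 3 interpreter; the class name `T` and the sample values are made up for illustration.

```python
class T(tuple):
    """Trivial tuple subclass."""
    pass


t = T((1, 2, 3))

full_slice = t[:]   # would otherwise be a candidate for returning t itself
times_zero = t * 0
times_one = t * 1   # likewise a candidate for returning t itself

for name, value in [("t[:]", full_slice), ("t*0", times_zero), ("t*1", times_one)]:
    # Report the result type and whether the original object was handed back.
    print(name, type(value).__name__, value, value is t)
```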
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2001-09-11 09:59
Message:
Logged In: YES
user_id=31435
Oh well -- it's stuck at "Accepted".
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2001-09-11 09:56
Message:
Logged In: YES
user_id=31435
Trying to change Resolution to something sensible
("Accepted" doesn't make sense).
----------------------------------------------------------------------
Comment By: Guido van Rossum (gvanrossum)
Date: 2001-09-11 07:49
Message:
Logged In: YES
user_id=6380
> Python uses it, e.g. in Lib/UserString.py:
[and other cases]
Yes, and I'm no longer comfortable with such code, for
exactly the reason I mentioned, unless it's an explicit and
intentional part of the class specification. :-(
Doing this consistently for all built-in types would cause
too much upheaval -- we'd have to change every single
built-in operation.
But the other interpretation stands: unicode (and other)
operations should only optimize by returning "self" when
self is a strict instance of the type.
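
The rule in the last paragraph can be modelled in pure Python roughly as below. This is a sketch under assumptions, not the C implementation: it targets a current Python 3 interpreter with `str` standing in for `unicode`, and the function name `model_slice` is hypothetical.

```python
class U(str):
    """Trivial subclass used to exercise the rule."""
    pass


def model_slice(s, start, stop):
    """Hypothetical model of s[start:stop] for non-negative indices.

    The "return self" shortcut is taken only when the slice covers the
    whole string AND s is exactly a str, never for a subclass instance.
    """
    covers_everything = start == 0 and stop >= len(s)
    if covers_everything and type(s) is str:
        return s
    # Otherwise delegate to the base-type slice, which yields a plain str.
    return str.__getitem__(s, slice(start, stop))


plain = "foo"
sub = U("foo")

print(model_slice(plain, 0, 3) is plain)        # shortcut taken
print(type(model_slice(sub, 0, 3)).__name__)    # no shortcut for the subclass
print(type(model_slice(sub, 0, 2)).__name__)
```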
----------------------------------------------------------------------
Comment By: Walter Dörwald (doerwalter)
Date: 2001-09-11 07:03
Message:
Logged In: YES
user_id=89016
> You're asking for the impossible though.
> I don't think any other OO language supports
> this automatically (although I
> could be wrong).
Python uses it, e.g. in Lib/UserString.py:
def rstrip(self): return self.__class__(self.data.rstrip())
So if someone derives a new class X from UserString,
calling X("y ").rstrip() returns an X object. The only
assumption that UserString makes, is that the derived class
has a constructor that can handle at least the same
arguments as UserString.__init__.
This "virtual constructor" is used in several places:
grep -l "self.__class__(" `find -name '*.py' | grep -v Mac`
returns:
./dist/src/Lib/UserString.py
./dist/src/Lib/copy.py
./dist/src/Lib/MimeWriter.py
./dist/src/Lib/test/test_descr.py
./dist/src/Lib/xml/sax/xmlreader.py
./dist/src/Lib/UserList.py
./dist/src/Demo/pdist/rcvs.py
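
The "virtual constructor" pattern described above can be sketched with a stripped-down wrapper class. `MiniUserString` is a hypothetical stand-in written for this illustration (it is not the real UserString), and the code assumes a current Python 3 interpreter.

```python
class MiniUserString:
    """Hypothetical, minimal wrapper in the style of UserString."""

    def __init__(self, seq):
        self.data = str(seq)

    def rstrip(self):
        # Virtual constructor: build the result with the class of self,
        # so subclasses get instances of their own type back.
        return self.__class__(self.data.rstrip())


class X(MiniUserString):
    pass


result = X("y ").rstrip()
print(type(result).__name__, repr(result.data))
```

As noted in the comment above, this only works as long as the derived class keeps a constructor that accepts the same arguments as the base class.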
----------------------------------------------------------------------
Comment By: Guido van Rossum (gvanrossum)
Date: 2001-09-11 05:04
Message:
Logged In: YES
user_id=6380
Apologies. I missed half of what you were asking. It's
impossible for U(...)[0:2] to return a U instance, but I
agree that at least then it should *always* return a
unicode instance.
So this is still open. For Tim: the problem is that a slice
(or other) operation may decide to return the original
object unchanged; this should (probably?) only be done when
the original object is exactly a unicode instance. I'm
afraid that we'll have to systematically look through all
144 Unicode methods to see where they exhibit this behavior.
----------------------------------------------------------------------
Comment By: Guido van Rossum (gvanrossum)
Date: 2001-09-11 05:01
Message:
Logged In: YES
user_id=6380
You're asking for the impossible though. I don't think any
other OO language supports this automatically (although I
could be wrong). The problem is, what to do with a subclass
of unicode like this:
class U(unicode):
    def __init__(self, arg):
        self.orig = arg
How is U("foobar")[0:3] going to know what argument to pass
in to __init__? The base class simply can't know what
additional invariants the subclass imposes.
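
The difficulty can be made concrete with a variation on the example above. This sketch assumes a current Python 3 interpreter with `str` in place of `unicode`; the class `Tagged`, its `tag` argument, and the function `naive_virtual_slice` are all hypothetical names invented for the illustration.

```python
class Tagged(str):
    """Hypothetical subclass whose constructor needs an extra argument."""

    def __new__(cls, text, tag):
        self = super().__new__(cls, text)
        self.tag = tag
        return self


def naive_virtual_slice(s, start, stop):
    """Hypothetical slice that rebuilds its result via the class of s."""
    piece = str.__getitem__(s, slice(start, stop))
    # The base type has no idea what to pass for the extra argument.
    return type(s)(piece)


t = Tagged("foobar", tag="greeting")

error = None
try:
    naive_virtual_slice(t, 0, 3)
except TypeError as exc:
    error = exc
    print("virtual constructor failed:", type(exc).__name__)

# The built-in slice sidesteps the problem by returning a plain str.
print(type(t[0:3]).__name__, t[0:3])
```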
----------------------------------------------------------------------
Comment By: Walter Dörwald (doerwalter)
Date: 2001-09-11 04:31
Message:
Logged In: YES
user_id=89016
Thanks for the quick fix, but the second problem still
remains:
---
class U(unicode):
    pass
u = U(u"foo")
print type(u[0:3])
print type(u[0:2])
---
This gives:
---
<type '__main__.U'>
<type 'unicode'>
---
I think this should be changed to either always return a
unicode object, or to always return an instance of the real
class passed in. (This should be done for all unicode
methods that return a new unicode object). The second
solution would simplify creating derived classes, because
all the methods that return unicode objects would
automatically return the derived type, so these methods
would not have to be overridden.
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2001-09-10 20:09
Message:
Logged In: YES
user_id=31435
unicode() repaired in
Include/unicodeobject.h; new revision: 2.33
Lib/test/test_descr.py; new revision: 1.39
Objects/unicodeobject.c; new revision: 2.111
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2001-09-10 18:43
Message:
Logged In: YES
user_id=31435
str() repaired (yes, unicode is next <wink>), in
Include/stringobject.h; new revision: 2.31
Lib/test/test_descr.py; new revision: 1.37
Objects/object.c; new revision: 2.146
Objects/stringobject.c; new revision: 2.130
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2001-09-10 16:39
Message:
Logged In: YES
user_id=31435
tuple() repaired, in
Include/tupleobject.h; new revision: 2.27
Lib/test/test_descr.py; new revision: 1.36
Objects/abstract.c; new revision: 2.77
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2001-09-10 14:29
Message:
Logged In: YES
user_id=31435
float() also repaired, in
Include/floatobject.h; new revision: 2.20
Lib/test/test_descr.py; new revision: 1.34
Objects/abstract.c; new revision: 2.76
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2001-09-10 13:57
Message:
Logged In: YES
user_id=31435
Partially repaired (for int and long) in:
Include/intobject.h; new revision: 2.24
Include/longintrepr.h; new revision: 2.12
Include/longobject.h; new revision: 2.24
Lib/test/test_descr.py; new revision: 1.33
Objects/abstract.c; new revision: 2.75
Objects/longobject.c; new revision: 1.104
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2001-09-10 13:45
Message:
Logged In: YES
user_id=31435
Reassigned to me.
----------------------------------------------------------------------
Comment By: Guido van Rossum (gvanrossum)
Date: 2001-09-10 07:48
Message:
Logged In: YES
user_id=6380
Good catch! Other types also suffer from this, e.g. int.
Added to my to-do list.
----------------------------------------------------------------------