[Python-bugs-list] [ python-Bugs-665835 ] filter() treatment of str and tuple inconsistent

Tue, 04 Feb 2003 09:15:17 -0800

Bugs item #665835, was opened at 2003-01-10 17:36
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=665835&group_id=5470

Category: Python Interpreter Core
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Walter Dörwald (doerwalter)
Assigned to: Raymond Hettinger (rhettinger)
Summary: filter() treatment of str and tuple inconsistent

Initial Comment:
class tuple2(tuple):
   def __getitem__(self, index):
      return 2*tuple.__getitem__(self, index)

class str2(str):
   def __getitem__(self, index):
      return chr(ord(str.__getitem__(self, index))+1)

print filter(lambda x: x>1, tuple2((1, 2)))
print filter(lambda x: x>"a", str2("ab"))

this prints:
(2,)
bc

i.e. the overridden __getitem__ is ignored in the
first case, but honored in the second.

----------------------------------------------------------------------

>Comment By: Walter Dörwald (doerwalter)
Date: 2003-02-04 18:15

Message:
Logged In: YES 
user_id=89016

OK, the problem of __getitem__ not returning str or unicode
is fixed. Unfortunately the result is rather ugly. With the
following class:

class u(unicode):
   def __getitem__(self, index):
      return u(2*unicode.__getitem__(self, index))

filter returns neither a list nor a u object but a plain
unicode object, defeating the whole purpose of the special
treatment of str/unicode. If we removed the special treatment,
this problem would go away; furthermore, __getitem__ returning
objects that are not str/unicode instances would no longer be
a problem.

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2003-02-03 23:36

Message:
Logged In: YES 
user_id=6380

Walter: if you can fix the bug in your latest message here,
go ahead and check it in. Seems like a case of a missing test.

Raymond: it turns out that the iterator in Python 2.2 has
the same problem with lists -- it special-cases lists. But
for tuples, the iterator uses PySequence_GetItem; the fast
tuple iterator in Python 2.3 introduces the problem for
tuples though.

I actually don't think there would be much disagreement that
this behavior (ignoring __getitem__) is a bug. There may be
disagreement over how important it is to fix it. Personally,
I've generally been on the side of "it needn't be fixed if
it slows down the common case", as long as a workaround
(like overriding __iter__ alongside the __getitem__
override) exists.
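
A minimal sketch of the workaround mentioned above, written for Python 3
(an assumption; the thread itself concerns Python 2.2/2.3, where filter()
returned a tuple or string rather than an iterator). The class names
Tuple2 and Tuple2WithIter are hypothetical and only mirror the tuple2
example from the initial report:

```python
class Tuple2(tuple):
    """Tuple subclass whose indexing doubles each element."""

    def __getitem__(self, index):
        return 2 * tuple.__getitem__(self, index)


class Tuple2WithIter(Tuple2):
    """Same override, plus an __iter__ that routes through __getitem__."""

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]


# The built-in tuple iterator reads the stored items directly,
# so the __getitem__ override is not consulted here.
bypassed = list(filter(lambda x: x > 1, Tuple2((1, 2))))

# The explicit __iter__ makes iteration see the doubled values.
honored = list(filter(lambda x: x > 1, Tuple2WithIter((1, 2))))
```

The cost is that every subclass author has to remember to override both
methods, which is the burden this thread is debating.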

But I draw the line at being backwards incompatible with
Python 2.2. Therefore I think the tuple iterator (and
probably also the string iterator) needs to be fixed, and I
still think that it would be best if the list iterator were
also fixed. One way to do this would be for the tp_iter
implementation to check whether
self->ob_type->tp_as_sequence->sq_item is not equal to the
list_item function (this is a good check to detect a
__getitem__ override) and then return an instance of the
generic sequence iterator instead of the list-specific iterator.
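
The check described above lives in C (tp_iter comparing sq_item against
list_item). A rough Python 3 analogue, offered only as a sketch of the
idea and not as the actual implementation, compares the class's
__getitem__ against list.__getitem__ and falls back to an index-based
iterator when they differ. The helper names generic_sequence_iter and
override_aware_iter are hypothetical:

```python
def generic_sequence_iter(seq):
    """Iterate by calling seq[0], seq[1], ... until IndexError."""
    index = 0
    while True:
        try:
            item = seq[index]
        except IndexError:
            return
        yield item
        index += 1


def override_aware_iter(seq):
    """Use the generic iterator only when a list subclass overrides __getitem__."""
    if isinstance(seq, list) and type(seq).__getitem__ is not list.__getitem__:
        return generic_sequence_iter(seq)
    # No override detected: the fast built-in iterator is safe to use.
    return iter(seq)


class List2(list):
    """List subclass whose indexing doubles each element."""

    def __getitem__(self, index):
        return 2 * list.__getitem__(self, index)


class PlainList(list):
    """List subclass with no __getitem__ override."""
```

With this sketch, list(override_aware_iter(List2([1, 2]))) sees the
doubled values, while plain lists and subclasses without an override
keep the fast path.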

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2003-01-27 13:24

Message:
Logged In: YES 
user_id=89016

Another problem with filter() is that filterstring() (and
the new filterunicode()) blindly assume that
tp_as_sequence->sq_item returns a str or unicode object with
len==1. This might fail with str or unicode subclasses:
----
class badstr(str):
   def __getitem__(self, index):
      return 42

s = filter(lambda x: x>=42, badstr("1234"))
print len(s), repr(s)
----
This prints
4 '\x00\x00\x00\x00'
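
To illustrate the kind of check that is missing, here is a pure-Python
sketch (Python 3 syntax, an assumption) of a string filter that validates
what __getitem__ returns before using it. The function filter_string and
the class BadStr are hypothetical stand-ins for the C function
filterstring() and the badstr class above; the error message is invented
for the example:

```python
class BadStr(str):
    """str subclass whose indexing returns a non-string."""

    def __getitem__(self, index):
        return 42


def filter_string(func, s):
    """Filter s item by item, rejecting items that are not 1-character strings."""
    kept = []
    for i in range(len(s)):
        item = s[i]
        if not isinstance(item, str) or len(item) != 1:
            raise TypeError(
                "__getitem__ must return a str of length 1, got %r" % (item,)
            )
        # func=None means "keep items that are true", as with filter(None, ...).
        keep = item if func is None else func(item)
        if keep:
            kept.append(item)
    return "".join(kept)
```

Here filter_string(lambda c: c > "a", "abc") yields "bc", while passing
BadStr("1234") raises TypeError instead of silently building a string
from garbage.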

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2003-01-27 02:13

Message:
Logged In: YES 
user_id=80475

One other thought:  A major reason for implementing 
__iter__ in the first place is that objects were overriding 
__getitem__ and disregarding the index -- the __getitem__ 
interface just didn't make sense for iteration in some 
situations.  __iter__ was supposed to provide enormous 
flexibility in various ways to loop over a collection (inorder, 
preorder, postorder, priorityorder, sortedorder, hashorder, 
randomorder, etc).  Making iter() default to using 
__getitem__ was only supposed to be an expedient for 
backwards compatibility.  Always using __getitem__ 
diminishes the flexibility and speed advantages.
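
As a small illustration of that flexibility (a sketch in Python 3, with a
hypothetical Node class that is not from this thread), a binary tree can
expose in-order traversal through __iter__ and offer other orders as
separate generator methods, none of which map naturally onto integer
indexing:

```python
class Node:
    """Binary tree node; iteration order is chosen by the class, not by an index."""

    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

    def __iter__(self):
        # In-order traversal: left subtree, this node, right subtree.
        if self.left is not None:
            yield from self.left
        yield self.value
        if self.right is not None:
            yield from self.right

    def preorder(self):
        # Pre-order traversal: this node first, then the subtrees.
        yield self.value
        if self.left is not None:
            yield from self.left.preorder()
        if self.right is not None:
            yield from self.right.preorder()


tree = Node(2, Node(1), Node(3))
```

Iterating tree directly gives 1, 2, 3, and tree.preorder() gives 2, 1, 3.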

Maybe the discussion belongs on python-dev. I'm sure a 
number of people feel strongly one way or the other.  The 
question might as well be addressed head-on before 2.3 
goes out the door. 

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2003-01-27 01:54

Message:
Logged In: YES 
user_id=80475

I understand.  Ideally, *all* methods would respond to a 
single overridden method, but I think this is just a fact of 
life in object oriented programming.

I can't remember where you gave an example of a 
d.__getitem__() subclass override, but you were careful to 
point out that other methods, like d.get() also needed to 
be overridden so that the modified access applied 
everywhere.  Likewise, __iter__() or any other object 
access method must be assumed to access the underlying 
data structure directly and must be overridden.   For 
instance, creating a dictionary with case insensitive 
lookups entails overriding __getitem__(k), get(k,default), 
and pop(k) -- no one of them can be presumed to inform 
the others.
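
A minimal sketch of that case-insensitive dictionary, in Python 3 and
under the assumption that keys are strings and are stored lowercased
(the class name CaseInsensitiveDict is hypothetical). Each access method
is overridden separately because none of them goes through the others:

```python
class CaseInsensitiveDict(dict):
    """dict subclass with case-insensitive string keys (stored lowercased)."""

    def __setitem__(self, key, value):
        dict.__setitem__(self, key.lower(), value)

    def __getitem__(self, key):
        return dict.__getitem__(self, key.lower())

    def get(self, key, default=None):
        return dict.get(self, key.lower(), default)

    def pop(self, key, *default):
        return dict.pop(self, key.lower(), *default)


d = CaseInsensitiveDict()
d["Content-Type"] = "text/plain"
```

Methods left alone still see the raw storage: "Content-Type" in d is
False here, because __contains__ was not overridden.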

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2003-01-27 01:17

Message:
Logged In: YES 
user_id=6380

Hm... that means that iter() of *any* built-in type subclass
overriding __getitem__ bypasses the override, unless the
subclass also overrides __iter__. This sounds like a step in
the wrong direction. I think the built-in iterators should
be aware of subclasses overriding __getitem__ one way or
another. I hadn't realized this when we started the trend of
creating faster iterators for built-in types. :-(

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2003-01-25 17:45

Message:
Logged In: YES 
user_id=80475

None of the existing iterators (incl dicts, lists, tuples, and 
files) use __getitem__.  Most likely, user defined iterators 
also access the data structure directly (for flexibility and 
speed). Also, anything that uses PyTuple_GET_ITEM 
bypasses __getitem__.

If string/unicode iterators are added, they should also go
directly to the underlying data; otherwise, there is no point 
to it.

Also, the proposal to change filtertuple() doesn't solve
inconsistencies within filterstring(), which uses __getitem__ 
when there is a function call, but bypasses it when the 
function parameter is Py_None.

I think the right answer is to change filterstring() to use an 
iterator and to implement string/unicode iterators that 
access the data directly (not using __getitem__).

FYI for Tim:  MvL noticed and fixed the unicode vs string 
difference.  His patch, SF #636005, has not been applied 
yet.

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2003-01-25 14:51

Message:
Logged In: YES 
user_id=6380

(But in addition to that, I don't mind having a custom
string iterator -- as long as it calls __getitem__ properly.
Hm, shouldn't the tuple iterator call __getitem__ properly too?)

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2003-01-25 14:45

Message:
Logged In: YES 
user_id=31435

Just noting that filter() is unique in special-casing the type 
of the input.  It's always been surprising that way, and, 
e.g., filtering a string produces a string, but filtering a 
Unicode string produces a list.

map() and reduce() don't play games like that, and always 
use the iteration protocol to march over their inputs.

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2003-01-25 14:36

Message:
Logged In: YES 
user_id=6380

I don't know which Python sources Raymond has been reading,
but in the sources I've got in front of me, there are
special cases for strings and tuples, and these *don't* use
iter(). It so happens that the tuple special-case calls
PyTuple_GetItem(), which doesn't call your __getitem__,
while the string special-case calls the sq_item slot
function, which (in your case) will be a wrapper that calls
your __getitem__.

A minimal fix would be to only call filtertuple for strict
tuples -- although this changes the output type, but I don't
think one should count on filter() of a tuple subclass
returning a tuple (and it can't be made to return an
instance of the subclass either -- we don't know the
constructor signature).
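
The "strict tuples only" idea can be sketched in Python 3 as follows.
This is an illustration of the dispatch rule, not the C code in
bltinmodule.c; filter_sketch and TupleSub are hypothetical names, and
the list fallback mirrors what Python 2's filter() returned for generic
sequences:

```python
def filter_sketch(func, seq):
    """Return a tuple only for exact tuples; anything else yields a list."""
    if func is None:
        func = bool
    if type(seq) is tuple:
        # Exact tuple: safe to take the fast path and return a tuple.
        return tuple(x for x in seq if func(x))
    # Subclasses and other iterables go through the generic protocol.
    return [x for x in seq if func(x)]


class TupleSub(tuple):
    """Tuple subclass; its constructor signature is unknown to filter()."""
```

The exact-type test (type(seq) is tuple rather than isinstance) is what
keeps subclasses off the special-cased path.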

Similar fixes probably need to be made to map() and maybe
reduce().

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2003-01-25 04:47

Message:
Logged In: YES 
user_id=80475

The problem isn't with filter() which correctly calls iter() in 
both cases.

Tuple objects have their own iterator which loops over 
elements directly and has no intervening calls to 
__getitem__().

String objects do not define a custom iterator, so iter() 
wraps itself around consecutive calls to __getitem__().

The resolution is to provide string objects with their own 
iterator. As a side benefit, iteration will run just a tiny bit 
faster.  The same applies to unicode objects.
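
The fallback described above, where iter() drives an object through
consecutive __getitem__ calls, can be seen with a class that defines only
__getitem__. The sketch below uses Python 3 (an assumption; in Python 3,
str already has its own iterator, so a str subclass behaves like the
tuple case). Squares and Str2 are hypothetical names:

```python
class Squares:
    """Defines only __getitem__; iter() falls back to indexing from 0 upward."""

    def __init__(self, n):
        self.n = n

    def __getitem__(self, index):
        if index >= self.n:
            raise IndexError(index)  # Signals the end of iteration.
        return index * index


class Str2(str):
    """str subclass whose indexing shifts each character by one code point."""

    def __getitem__(self, index):
        return chr(ord(str.__getitem__(self, index)) + 1)


via_getitem = list(Squares(4))     # iter() calls __getitem__(0), (1), ...
via_str_iter = list(Str2("ab"))    # str's own iterator skips the override
```

Indexing Str2("ab")[0] still gives "b", so direct indexing and iteration
disagree once a type-specific iterator exists.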

Guido, do you care about this and want me to fix it, or 
would you like to close it as "won't fix"?

----------------------------------------------------------------------
