[Python-Dev] string.find() again (was Re: timsort for jython)

Guido van Rossum guido@python.org
Tue, 06 Aug 2002 08:30:35 -0400


[Samuele]
> If
> 
> "thon" in "python"
> 
> then why not
> 
> [1,2] in [0,1,2,3]
> 
> (it's a purely rhetorical question)
> 
> in general I don't think it is a good idea
> to have "in" be a membership vs subset/subseq
> operator depending on non ambiguity, convenience
> or simply implementer taste,
> because truly there are data types (ex. sets)
> that would need both and disambiguated.
> 
> Either python grows a new subset/subseq operator
> but probably this is overkill (keyword issue, new
> __magic__ method, not meaningful, convenient
> for a lot of types)
> 
> or strings (etc) should simply grow a new
> method with an appropriate name.

I recognize this as related to the argument that Ping was (still is?)
making against "for x in <iterator>"; but not because the same
operator "in" is involved.

It has to do with polymorphism (functions that accept different types
of arguments; it's somewhat different from operator overloading).

Suppose we have an operator @.  (Take operator in a wide enough sense,
including other bits of grammar, like "for".)  If there's only one
type (or one narrow set of related types) for which @ makes sense,
human readers of a program will use @ as a clue about the type of the
arguments, and (if correct) that will help reasoning about the
expression in which it occurs.

ABC uses this property of operators to do type inference: if an ABC
expression contains "a+b", a and b must be numbers; and so on.

Python chose to allow operators to be overloaded by different types
with different meanings, and the language gives a+b a very different
meaning for numbers than for sequences, for example.  (And an
important invariant is lost in this example: for numbers, a+b == b+a,
but not so for sequences!)
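
A minimal sketch of the lost invariant, using plain ints and lists (the
variable names are only illustrative):

```python
# Numbers: + is addition, so the order of the operands does not matter.
a, b = 2, 3
numbers_commute = (a + b == b + a)

# Sequences: + is concatenation, so the order of the operands matters.
s, t = [1, 2], [3, 4]
lists_commute = (s + t == t + s)

print(numbers_commute)  # True
print(lists_commute)    # False
```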

Is this a problem?

The ease with which we get used to "key in dict" makes me think it is
not.  While Python doesn't require you to declare the types of your
arguments, the type (or set of allowed types) for arguments is usually
strongly known in the mind of the programmer, and most often strong
hints are given either by the choice of argument name or by
documentation.

While it's possible in theory, in practice nobody writes polymorphic
code that uses + and * on its arguments and yet accepts both numbers
and strings.

The reality is that some types are more related than others, and the
substitutability property only makes sense for types that are
sufficiently related.  We *do* write code that accepts any kind of
sequence, including strings.  We do *not* write code that accepts any
kind of container (sequence or mapping), even though some operations
apply to both kinds of container (len, a[b], and since 2.2, x in a).
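
To illustrate why sharing these operations does not make sequences and
mappings substitutable, here is a small sketch.  The helper probe() is
hypothetical, made up for this example; it applies len, a[b] and x in a
to whatever container it is given:

```python
def probe(container, key):
    """Apply the three shared operations and return their results."""
    return len(container), container[key], key in container

# For a list, the key is an index, but 'in' looks at the elements.
seq_result = probe(["a", "b", "c"], 0)

# For a dict, the key is a key, and 'in' also looks at the keys.
map_result = probe({"a": 1, "b": 2}, "a")

print(seq_result)  # (3, 'a', False)
print(map_result)  # (2, 1, True)
```

The same three expressions run on both containers, yet 'key in
container' answers a different question in each case, which is one
reason code is rarely written to accept either kind.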

In code that applies to all (or even just some) kinds of sequences,
the 'in' operator will continue to stand for membership.  This won't
cause a problem with strings: correct code using 'in' for membership
will never use seq1 in seq2, it will use item in seq, where the type
of item is "whatever the type of seq[0] is, if it exists."  When the
seq is a string, item will be a one-char string -- not a "type" in
Python's type system, but certainly a useful concept.
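
A sketch of what such generic code might look like; contains_item() is
a hypothetical helper, not something from the standard library:

```python
def contains_item(seq, item):
    """Membership test that works for any sequence, strings included."""
    return item in seq

in_list = contains_item([0, 1, 2, 3], 2)
in_tuple = contains_item((0, 1), 5)
in_string = contains_item("python", "y")

# Indexing a string yields a one-char string, which is the natural
# "item" to test for membership.
first = "python"[0]

print(in_list, in_tuple, in_string)  # True False True
print(first, len(first))             # p 1
```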

But there's also lots of code that deals only with strings.  This will
normally be completely clear to the casual reader: either because
string literals are used, compared, etc., or because values are
obtained from functions known to return strings (such as
file.readline()), or because methods unique to strings (e.g. s.lower())
are used, and so on.  Strings are very important in lots of programs,
and we want our notations for string operations to be readable and
expressive.  (Regular expressions are extreme in expressiveness, but
lack readability, which is why they're relegated to an imported module
in Python.)  Substring containment testing is a common operation on
strings, so being able to write it as 's1 in s2' rather than
's2.find(s1) >= 0' is a big win, IMO.
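
The two spellings side by side, as a minimal sketch.  It assumes a
Python version in which 'in' accepts a multi-character left operand for
strings (2.3 and later); the sample strings are only illustrative:

```python
s1, s2 = "thon", "python"

# Proposed spelling: reads as a containment test.
via_in = s1 in s2

# find() spelling: returns the index of the match, or -1 if none.
via_find = s2.find(s1) >= 0

# A substring that is not present gives False both ways.
missing_in = "java" in s2
missing_find = s2.find("java") >= 0

print(via_in, via_find)          # True True
print(missing_in, missing_find)  # False False
```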


PS. Sets are a different case again.  They are containers but neither
sequences nor mappings (though depending on what you want to do they
can resemble either).  We will have to think about which operators
make sense for them.  I'd say that 'elem in set' is an appropriate way
to spell set membership; how to spell subset is a matter of discussion
(maybe 'set1 <= set2' is a good idea; maybe not).
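
For comparison, a sketch of how these two spellings behave with the
built-in set type that later Python versions provide (an assumption
relative to this message, which only floats the idea):

```python
small = {1, 2}
big = {0, 1, 2, 3}

# 'elem in set' spells membership of a single element.
elem_is_member = 1 in big

# 'set1 <= set2' spells the subset test.
small_is_subset = small <= big
big_is_subset = big <= small

print(elem_is_member)   # True
print(small_is_subset)  # True
print(big_is_subset)    # False
```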

--Guido van Rossum (home page: http://www.python.org/~guido/)