Performance of list comprehensions vs. map

Chris Barker chrishbarker at home.net
Wed Sep 5 20:12:44 CEST 2001


Hi all,

I just took another look at:

http://musi-cal.mojam.com/~skip/python/fastpython.html

specifically, the loops section:

"""
Loops

Python supports a couple of looping constructs. The for statement is
most commonly used. It loops over the elements of a sequence, assigning
each to the loop
variable. If the body of your loop is simple, the interpreter overhead
of the for loop itself can be a substantial amount of the overhead. This
is where the map
function is handy. You can think of map as a for moved into C code. The
only restriction is that the "loop body" of map must be a function call. 

Here's a straightforward example. Instead of looping over a list of
words and converting them to upper case: 

newlist = []
for word in list:
    newlist.append(word.upper())

you can use map to push the loop from the interpreter into compiled C
code: 

import string
newlist = map(string.upper, list)

List comprehensions were added to Python in version 2.0 as well. They
provide a syntactically more compact way of writing the above for loop: 

newlist = [s.upper() for s in list]

It's not really any faster than the for loop version, however. 
"""

my question is: Why not? I like list comprehensions a lot, and they
would seem to offer a way to get optimization over a for loop, much like
map does, but apparently not. If map can do the loop in C, why can't a
list comprehension? 

In general, I like list comprehensions because they make Python a more
"sequence oriented" language: When programming, it is very common to
store a sequence of data of some type, and more often than not, do
something with that whole sequence, rather than individual elements of
that sequence. All that looping just to do the same thing to all the
elements really clutters up the logic of the code. map and list
comprehensions really help (as will element-wise operators, if we ever
get them). However, it seems to me that sequence-oriented structures
could really provide a performance benifit as well, but Python has not
gotten there yet. I think a big stumbling block is that Python has no
concept of the "homogenous" sequence. There are modules that provide
such a  thing (Numeric, array), and indeed, the string type is a
homogenous sequence, but Python itself does not understand this concept,
so that if a map or list comprehesion is using a homogenous list, it
still has to check the type of every element as it goes through the
list. For example:

[2*a for a in list]

If the interpreter could just check once what the type of all the
elements of list were, it would only have to figure out what 2*a meant
once, and it could whip through the list very quickly. Indeed, if you
do:

[2*a for a in a_string]

The string is a homogenous sequence by definition (once you check that
it is a string), but the interpreter has no understanding of this
concept.

NOTE: I fully understand that adding a concept like this allows for a
whole set of optimizations that COULD be done, but someone would have to
get around to doing it, which would be a whole lot of work, and might
never happen. That's not a reason not to do it. First of all, the
framework has to be in place for this kind of optimization before anyone
can even start writing the code, and even if there are no optimizations
ever built into the interpreter, the concept of a homogenous sequence
would be very useful for extension writers. A number of times I have
written an extension that needed a list if which all the elements where
the same type. I had to put a type check inside my loop, which clutters
up the code, and slows things down some. If I'm working with numvers, I
use NumPy, but that's not always possible.

As for what might get added, I'm imagining that it should be possible to
automatically maintain a "homogeous" flag on a sequence: for an
imputable sequence it would get set when it was built. For a mutable
sequence, each time an element is added, a type check could be done, and
the homogenous flag turned off if the type didn't match. (turning it
back on when that item was removed would probably not be worth it).
Ideally, there would be a way for the programmer to define what was
meant by homogenous (i.e.: a NumPy array of doubles, or any Numpy array)
This would be trickier, but when in doubt, the homogenous flag would
just be turned off, and we'd be back where we started.

Even if no changes were made to the list type, the concept could be put
in place, and then existing homogenous sequences could be used more
optimally.

I've proposed similar ideas in the past, and not gotten much response. I
imagine my ideas are full of holes, but no one has taken the time to
point them out to me yet. I have come to the conclusion that the really
sharp minds on this list (and the folks that might be qualified to
actually impliment such a thing) are not really interested in something
like this that is a perfomance only improvement. Am I right? or is the
idea just so stupid that it's not worth commenting on?

-Chris


-- 
Christopher Barker,
Ph.D.                                                           
ChrisHBarker at home.net                 ---           ---           ---
http://members.home.net/barkerlohmann ---@@       -----@@       -----@@
                                   ------@@@     ------@@@     ------@@@
Oil Spill Modeling                ------   @    ------   @   ------   @
Water Resources Engineering       -------      ---------     --------    
Coastal and Fluvial Hydrodynamics --------------------------------------
------------------------------------------------------------------------



More information about the Python-list mailing list