[Tutor] List comprehension for dicts?

Steven D'Aprano steve at pearwood.info
Thu Aug 19 19:20:12 CEST 2010


On Fri, 20 Aug 2010 01:40:54 am Wayne Werner wrote:

> > age_dict = dict([(key.upper(), value) for key,value in
> > age_dict.items()])
>
> This is a bad place to use a list comprehension. This will create a
> list of values first and then create a dict from that list, so now
> you have a list floating around that you didn't need.

How do you know that dict() doesn't need the entire list at once?
Unless you're an expert on the implementation, for all you know it 
looks something like this:

class dict:
    def __new__(cls, arg):
        ...
        try:
            n = len(arg)
        except TypeError:
            # no len() -- probably a generator expression
            arg = list(arg)
            n = len(arg)
        allocate_memory_for_items(n)
        ...

(only written in C). I'm not saying it does, or that it doesn't, but 
you're assuming a pattern of behaviour which might not be the case.
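To be clear, the class above is only a sketch of what dict() *might* do, not what it actually does. What you can verify from plain Python (shown here with Python 3's range rather than xrange) is that dict() accepts either form, and that a generator expression has no len() -- so any hypothetical pre-sizing trick would have to build the list anyway:

```python
# dict() consumes a list of pairs and a generator of pairs identically.
pairs_list = [(k, k + 1) for k in range(3)]
pairs_gen = ((k, k + 1) for k in range(3))

assert dict(pairs_list) == dict(pairs_gen) == {0: 1, 1: 2, 2: 3}

# A generator expression doesn't support len() -- it raises TypeError.
try:
    len(x for x in range(3))
except TypeError:
    print("a generator expression has no len()")
```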

Here's a similarly common idiom: which of these is faster?

' '.join(gen_expr)
' '.join(list_comp)


[steve at sylar ~]$ python -m timeit "' '.join(str(n) for n in xrange(300000))"
10 loops, best of 3: 437 msec per loop
[steve at sylar ~]$ python -m timeit "' '.join([str(n) for n in xrange(300000)])"
10 loops, best of 3: 401 msec per loop

The list comprehension is consistently faster, because join() works more 
efficiently if it knows how many items it needs to pre-allocate memory 
for.
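You can run the same comparison from inside Python with the timeit module (a sketch using Python 3's range in place of xrange; the absolute numbers will differ from machine to machine):

```python
import timeit

# Time both spellings of the same join; exact figures vary by machine.
gen_time = timeit.timeit("' '.join(str(n) for n in range(1000))", number=200)
list_time = timeit.timeit("' '.join([str(n) for n in range(1000)])", number=200)
print("genexpr: %.4fs  listcomp: %.4fs" % (gen_time, list_time))
```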


> Generator expressions, OTOH, generate the values on the fly, only as
> they're needed, so there's no extra list left over once the dict is
> created.

And sometimes that's a win, and sometimes it's not. 

Generator expressions are more computationally expensive than lists -- 
they're functions which remember their internal state, so that you can 
pause and resume them at will. That doesn't come for free -- it costs 
memory and time. The reason Python has generator expressions is that, 
for large amounts of data, the saving from not building the entire list 
at once more than makes up for that overhead; for small amounts of 
data, it doesn't always:

[steve at sylar ~]$ python -m timeit "dict((k,k+1) for k in xrange(2))"
100000 loops, best of 3: 5.89 usec per loop
[steve at sylar ~]$ python -m timeit "dict([(k,k+1) for k in xrange(2)])"
100000 loops, best of 3: 4.78 usec per loop


Here, using a generator expression is a pessimization, not an 
optimization.
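(For completeness: from Python 2.7 and 3.0 onwards there is also a dict comprehension, which avoids both the intermediate list and the generator machinery. Applied to the original poster's problem -- the dict contents here are made up for illustration:)

```python
age_dict = {"fred": 23, "wilma": 21}

# Build a new dict with upper-cased keys, no intermediate list needed.
age_dict = {key.upper(): value for key, value in age_dict.items()}
print(age_dict)
```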


> Sure it will eventually be garbage collected, but "waste not, want
> not", as my grandmother used to say.

Does your grandmother have a box labelled "Pieces of string, too short 
to use" as well?



-- 
Steven D'Aprano

