optimizing large dictionaries

Thu Jan 15 17:00:40 EST 2009

On Jan 15, 1:39 pm, Per Freem <perfr... at yahoo.com> wrote:
> hello
>
> i have an optimization questions about python. i am iterating through
> a file and counting the number of repeated elements. the file has on
> the order
> of tens of millions elements...
>
> i create a dictionary that maps elements of the file that i want to
> count
> to their number of occurs. so i iterate through the file and for each
> line
> extract the elements (simple text operation) and see if it has an
> entry in the dict:
>
> for line in file:
>   try:
>     elt = MyClass(line)# extract elt from line...
>     my_dict[elt] += 1
>   except KeyError:
>     my_dict[elt] = 1
>
> i am using try/except since it is supposedly faster (though i am not
> sure
> about this? is this really true in Python 2.5?).
>
> the only 'twist' is that my elt is an instance of a class (MyClass)
> with 3 fields, all numeric. the class is hashable, and so my_dict[elt]
> works well.
> the __repr__ and __hash__ methods of my class simply return str()
> representation
> of self, while __str__ just makes everything numeric field into a
> concatenated string:
>
> class MyClass
>
>   def __str__(self):
>     return "%s-%s-%s" %(self.field1, self.field2, self.field3)
>
>   def __repr__(self):
>     return str(self)
>
>   def __hash__(self):
>     return hash(str(self))
>
> is there anything that can be done to speed up this simply code? right
> now it is taking well over 15 minutes to process, on a 3 Ghz machine
> with lots of RAM (though this is all taking CPU power, not RAM at this
> point.)
>
> any general advice on how to optimize large dicts would be great too
>
> thanks for your help.

You can use a tuple instead of a string, which should be a little
quicker:

    def __hash__(self):
        return self.field1, self.field2, self.field3

You could speed it up even more if you just saved a single attribute
"fields" as a tuple to begin with.

Also, you can use defauldict and get rid of the try/except. I don't
think try/except is slow, but avoiding it will give you a speed up.

from collections import defaultdict
my_dict = defaultdict(int)

for line in file:
  elt = MyClass(line)# extract elt from line...
  my_dict[elt] += 1

You might even consider turning "MyClass" into just a function that
extracts the values from the line and returns a tuple, which should
give you even more of a boost since a tuple is completely implemented
in C.

Matt