how to optimize object creation/reading from file?

perfreem at gmail.com
Wed Jan 28 11:02:04 EST 2009


On Jan 28, 10:06 am, Bruno Desthuilliers <bruno.42.desthuilli... at websiteburo.invalid> wrote:
> perfr... at gmail.com wrote:
>
> > hi,
>
> > I am doing a series of very simple string operations on lines I am
> > reading from a large file (~15 million lines). I store the result of
> > these operations in a simple instance of a class, and then put it
> > inside a hash table. I found that this is unusually slow... for
> > example:
>
> > class myclass(object):
> >     __slots__ = ("a", "b", "c", "d")
> >     def __init__(self, a, b, c, d):
> >         self.a = a
> >         self.b = b
> >         self.c = c
> >         self.d = d
> >     def __str__(self):
> >         return "%s_%s_%s_%s" %(self.a, self.b, self.c, self.d)
> >     def __hash__(self):
> >         return hash((self.a, self.b, self.c, self.d))
> >     def __eq__(self, other):
> >         return (self.a == other.a and
> >                 self.b == other.b and
> >                 self.c == other.c and
> >                 self.d == other.d)
> >     __repr__ = __str__
>
> If your class really looks like that, a tuple would be enough.
>
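
That makes sense - a plain tuple already hashes and compares
element-wise, so it can stand in for myclass as a key. If I understand
you, the change is just this (untested sketch):

    k = 1                                # stand-in for the loop variable
    key = ('a' + str(k), 'b', 'c', 'd')  # tuple of the four fields
    print hash(key)                      # what my __hash__ computed by hand
    print key == ('a1', 'b', 'c', 'd')   # element-wise, like my __eq__
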
> > from collections import defaultdict
> > import time
> >
> > n = 15000000
> > table = defaultdict(int)
> > t1 = time.time()
> > for k in range(1, n):
>
> hint: use xrange instead.
>
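
Good point - with n = 15000000, range() materializes the whole list of
ints up front, while xrange() yields them one at a time. So the loop
header becomes (sketch):

    n = 15000000
    for k in xrange(1, n):   # lazy: no 15-million-element list in memory
        pass                 # create the key/object here as before
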
> >     myobj = myclass('a' + str(k), 'b', 'c', 'd')
> >     table[myobj] = 1
>
> hint: if all you want is to ensure uniqueness, use a set instead.
>
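
True - the values in my defaultdict are always 1, so all I really
record is which keys exist. A set version would look like this (sketch,
using tuple keys):

    seen = set()
    for k in xrange(1, 1000):   # small n just for illustration
        seen.add(('a' + str(k), 'b', 'c', 'd'))
    print ('a1', 'b', 'c', 'd') in seen   # membership test replaces table[key]
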
> > t2 = time.time()
> > print "time: ", float((t2-t1)/60.0)
>
> hint: use timeit instead.
>
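
I had not used timeit before - if I read the docs right, something
like this (sketch):

    import timeit
    # the setup string runs once, untimed; the statement is what gets timed
    t = timeit.Timer("seen.add(('a' + str(k), 'b', 'c', 'd'))",
                     "seen = set(); k = 1")
    print t.timeit(number=1000000)   # seconds for a million adds
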
> > This takes a very long time to run: 11 minutes! For the sake of the
> > example I am not reading anything from a file here, but in my real
> > code I do. Also, I do 'a' + str(k), but in my real code this is some
> > simple string operation on the line I read from the file. However, I
> > found that the above code shows the real bottleneck, since reading my
> > file into memory (using readlines()) takes only about 4 seconds. I
> > then have to iterate over these lines, but I still think that is more
> > efficient than the 'for line in file' approach, which is even slower.
>
> Iterating over the file, while indeed a bit slower on a per-line basis,
> avoids useless memory consumption, which can lead to disk swapping - so
> for "huge" files, it might still be better w.r.t. overall performance.
>
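
Fair enough - with ~15 million lines, the list readlines() builds is
itself a lot of memory. The lazy version would be (sketch; 'data.txt'
is a made-up filename):

    f = open('data.txt')            # 'data.txt' is a placeholder
    for line in f:                  # buffered, one line at a time -
        line = line.rstrip('\n')    # the whole file never sits in memory
        # ... simple string operations on line go here ...
    f.close()
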
> > In the above code, is there a way to optimize the creation of the
> > class instances? I am using defaultdicts instead of ordinary dicts,
> > so I don't know how else to optimize that part of the code. Is there
> > a way to perhaps optimize the way the class is written? If it takes
> > only 3 seconds to read 15 million lines into memory, it doesn't make
> > sense to me that turning them into simple objects along the way
> > would take that much more...
>
> Did you bench the creation of a list of 15,000,000 ints ?-)
>
> But anyway, creating 15,000,000 instances (which is not a small number)
> of your class takes many seconds - about 23.5 seconds on my (already
> heavily loaded) machine. Building the same number of tuples takes only
> about 2.5 seconds - that is, almost 10 times less. FWIW, tuples have
> all the useful characteristics of your above class (w.r.t. hashing and
> comparison).
>
> My 2 cents...
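
For the record, that comparison can be reproduced with timeit as
suggested (sketch; assumes myclass from above is defined in the module
being run, so that importing it from __main__ works):

    import timeit
    setup = "from __main__ import myclass"   # assumes myclass lives in __main__
    print timeit.Timer("myclass('a1', 'b', 'c', 'd')", setup).timeit(1000000)
    print timeit.Timer("('a' + str(1), 'b', 'c', 'd')").timeit(1000000)
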

Thanks for your insightful reply - changing to tuples made a big
difference!


