Slowdown while creating a big list and iterating over it

Aliaksandr Abushkevich a.abushkevich at gmail.com
Sun Jan 31 03:17:21 EST 2010


Maybe it is a good idea to use Disco (http://discoproject.org/) to
process your data.
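
For what it's worth, here is a minimal sketch of what a Disco job
looks like, loosely adapted from the word-count example in the Disco
tutorial; the input tag and the key you group on are placeholders,
and the exact API depends on the Disco version you run:

    from disco.core import Job, result_iterator

    def fun_map(line, params):
        # Emit each non-empty log line under some grouping key; here,
        # hypothetically, the first whitespace-separated field.
        if line.strip():
            yield line.split()[0], 1

    def fun_reduce(iter, params):
        from disco.util import kvgroup
        # Sum the counts per key.
        for key, counts in kvgroup(sorted(iter)):
            yield key, sum(counts)

    if __name__ == '__main__':
        job = Job().run(input=["tag:data:logs"],  # hypothetical DDFS tag
                        map=fun_map,
                        reduce=fun_reduce)
        for key, count in result_iterator(job.wait(show=True)):
            print key, count

The distance-based clustering itself would still live inside your map
and reduce functions; what Disco mainly buys you is spreading the
dist() calls over several machines.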

Yours faithfully,
Alexander Abushkevich



On Sat, Jan 30, 2010 at 10:36 PM, marc magrans de abril
<marcmagransdeabril at gmail.com> wrote:
> Dear colleagues,
>
> I was writing a small program to classify log files for a cluster of
> PCs; I just wanted to simplify a rather repetitive task of finding
> errors and the like.
>
> My first naive implementation was something like:
>    patterns = []
>    while logs:
>        pattern = logs[0]
>        # keep only the logs farther than THRESHOLD from the pattern
>        new_logs = [l for l in logs if dist(pattern, l) > THRESHOLD]
>        # record how many logs matched this pattern
>        entry = (len(logs) - len(new_logs), pattern)
>        patterns.append(entry)
>        logs = new_logs
>
> where dist(...) is the Levenshtein distance (i.e., edit distance) and
> logs holds about 1.5M log lines (a 700 MB file). I thought Python
> would be an easy choice, although not a really fast one.
>
> I was not surprised when the first iteration of the while loop took
> ~10 min. I thought "not bad, let's see how long it takes". However,
> the second iteration seemingly never finished.
>
> My surprise was big when I replaced the list comprehension with a
> loop that prints a counter:
> new_logs = []
> for count, l in enumerate(logs):
>     print count
>     if dist(pattern, l) > THRESHOLD:
>         new_logs.append(l)
>
> The surprise was that the displayed counter ran ~10 times slower on
> the second iteration of the while loop.
>
> I am a little lost. Does anyone know the reason for this behavior?
> How should I write a program that deals with large data sets in
> Python?
>
> Thanks a lot!
> marc magrans de abril
> --
> http://mail.python.org/mailman/listinfo/python-list
>
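
For readers who want to experiment, here is a self-contained sketch of
the greedy clustering the quoted code describes, with a textbook
dynamic-programming edit distance standing in for the poster's dist()
(the function names are mine):

    def levenshtein(a, b):
        # Textbook dynamic-programming edit distance; O(len(a) * len(b)).
        prev = range(len(b) + 1)
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]

    def classify(logs, threshold):
        # Greedy clustering, as in the quoted loop: take the first
        # remaining log as a pattern, count everything within
        # `threshold` of it, and repeat on whatever is left.
        patterns = []
        while logs:
            pattern = logs[0]
            rest = [l for l in logs if levenshtein(pattern, l) > threshold]
            patterns.append((len(logs) - len(rest), pattern))
            logs = rest
        return patterns

Note that each levenshtein() call costs O(len(pattern) * len(l)), so
if a later iteration happens to pick a much longer line as its
pattern, every comparison in that pass gets proportionally slower,
which alone could explain a large per-iteration difference.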


