Slow down while creating a big list and iterating over it
MRAB
python at mrabarnett.plus.com
Sun Jan 31 17:42:52 EST 2010
marc magrans de abril wrote:
> Hi!
>
> ...I have found a good enough solution, although it only works if the
> number of patterns (clusters) is not very big:
> def classify(f):
>     THRESHOLD = 0.1
>
>     patterns = {}
>     for l in f:
>         found = False
>         for p, c in patterns.items():
>             if dist(l, p) < THRESHOLD:
>                 found = True
>                 patterns[p] = c + 1
>
>         if not found:
>             patterns[l] = 1
>
>     return patterns
>
> This algorithm is O(n*np*m^2), where n is the number of logs, np the
> number of patterns, and m is the log length (i.e. m^2 is the cost of
> one distance computation). That's way better than the original
> O(n^2*m^2), and I can run it for some hours to get back the results.
>
> I wonder if there is a single-threaded/single-process clustering
> algorithm that runs in O(n)?
>
Your original code used the first entry in the remaining logs for each
pattern, but your new code stores the patterns in a dict, which is
unordered, so you might get different results.

That hardly matters, though, because your new code also increments the
count when it finds a match and then keeps looking, so a single log can
match, and be counted against, more than one pattern.
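
For example, with a made-up dist (absolute difference in length,
scaled, purely so the snippet is self-contained; your real metric will
differ), a single log can bump two counts at once:

def dist(a, b):
    # Hypothetical toy metric for demonstration only.
    return abs(len(a) - len(b)) / 10.0

THRESHOLD = 0.1
patterns = {"abc": 1, "abd": 1}
log = "abe"
for p, c in patterns.items():
    if dist(log, p) < THRESHOLD:
        patterns[p] = c + 1   # no break, so *both* counts are bumped
print(patterns)   # one log has been counted twice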

Finally, your original code treated it as a match if distance <=
threshold, but your new code treats it as a match if distance <
threshold.

Something like this fixes all three problems:
patterns = []
for x in logs:
    for index, (pat, count) in enumerate(patterns):
        if dist(pat, x) <= THRESHOLD:
            patterns[index] = pat, count + 1
            break
    else:
        # Didn't break out of the loop, therefore no match.
        patterns.append((x, 1))
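
For what it's worth, here it is slotted back into your classify
function, with a stand-in dist (a trivial length-based metric, only so
the snippet runs; substitute your real distance):

def dist(a, b):
    # Stand-in metric for the sketch; replace with your real distance.
    return abs(len(a) - len(b)) / float(max(len(a), len(b), 1))

def classify(logs, threshold=0.1):
    patterns = []
    for x in logs:
        for index, (pat, count) in enumerate(patterns):
            if dist(pat, x) <= threshold:
                # Count the log against the first matching pattern only.
                patterns[index] = pat, count + 1
                break
        else:
            # Didn't break out of the loop, therefore no match.
            patterns.append((x, 1))
    return patterns

logs = ["error at line 10", "error at line 22", "connection refused"]
print(classify(logs))
# [('error at line 10', 2), ('connection refused', 1)]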