[Tutor] Avoiding repetitive pattern match in re module
Danny Yoo
dyoo at hkn.eecs.berkeley.edu
Thu Jan 5 20:27:30 CET 2006
> Instead of writing a new 'for,if' loop to filter the repetitive tags
> from the list, is there something that I can add in the re itself to
> match the pattern only once?
Hi Intercodes,
As far as I know, no: regular expressions have no way to remember which
tags they've already matched in previous matching runs.
You may find "sets" useful in filtering out duplicates:
http://www.python.org/doc/lib/module-sets.html
For example:
######
>>> from sets import Set as set
>>> set([3, 1, 4, 1, 5, 9, 2, 6])
Set([1, 2, 3, 4, 5, 6, 9])
######
So you should not need to write much extra logic to filter out
duplicates: sets do this for you already.
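To make that concrete, here is a small sketch of combining re.findall()
with a set (the HTML string and tag pattern are just made-up examples;
on a recent Python the built-in set type works directly, without the
sets-module import shown above):

```python
import re

# Hypothetical input: findall() happily returns the same tag twice.
html = "<b>one</b><i>two</i><b>three</b>"
tags = re.findall(r"<(\w+)>", html)   # ['b', 'i', 'b']

# Feeding the list to set() drops the duplicate 'b' for us.
unique_tags = set(tags)
print(sorted(unique_tags))            # prints ['b', 'i']
```

Note that a set doesn't preserve the order the tags appeared in; if
order matters, you'd still keep a list and use the set only to test
membership.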
As an aside, I want to second Alan's recommendation to avoid doing so much
string concatenation: it's actually very expensive to do the following:
#########################
ans = ''
while 1:
    data = file1.readline()
    if data == "":
        break
    ans = ans + data
#########################
From a technical standpoint, it has "quadratic" complexity in terms of
the work the computer is doing.  It's related to the mathematical fact
that 1 + 2 + 3 + 4 + ... + n = n(n+1)/2.
http://en.wikipedia.org/wiki/Triangle_number
The string concatenation that the code does above is analogous because it
rebuilds larger and larger strings in a similar fashion. Alan's approach,
to use file1.read(), sucks up the string in one shot, so it's much less
computationally expensive.
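Here's a sketch of that one-shot approach (I'm using io.StringIO as a
stand-in for your real open file, so the snippet is self-contained):

```python
import io

# Stand-in for file1 = open('yourfile.txt'); the content is made up.
file1 = io.StringIO("line one\nline two\n")

# A single read() pulls in the whole file in one pass, instead of
# rebuilding ever-larger strings one readline() at a time.
ans = file1.read()
print(repr(ans))
```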
If you do end up having to concatenate a lot of strings, accumulate
them in a list and use the string join() method instead (e.g.
''.join(pieces)).  We can talk about this in more detail if you'd like.
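For example, a minimal sketch of the join() idiom (again with
io.StringIO standing in for a real file):

```python
import io

# Stand-in for a real open file; the content is made up.
file1 = io.StringIO("alpha\nbeta\ngamma\n")

# Collect the pieces in a list -- appending to a list is cheap --
# and stitch them together once at the end with the separator's
# join() method.  This does the total work once, not quadratically.
pieces = []
for line in file1:
    pieces.append(line)
ans = ''.join(pieces)
print(repr(ans))
```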