Overlapping Regular Expression Matches With findall()
Bengt Richter
bokr at oz.net
Thu Dec 15 16:31:58 EST 2005
On Thu, 15 Dec 2005 20:33:42 +0000, Simon Brunning <simon at brunningonline.net> wrote:
>On 15 Dec 2005 12:26:07 -0800, Mystilleef <mystilleef at gmail.com> wrote:
>> I want a pattern that scans the entire string but avoids
>> returning duplicate matches. For example "cat", "cate",
>> "cater" may all well be valid matches, but I don't want
>> duplicate matches of any of them. I know I can filter the
>> list containing found matches myself, but that is somewhat
>> expensive for a list containing thousands of matches.
>
>Probably the cheapest way of de-duping the list would be to dump it
>straight into a set, provided that you aren't concerned about the
>order.
>
Or, if the order does matter, maybe try a combination like:
>>> s = """\
... I want a pattern that scans the entire string but avoids
... returning duplicate matches. For example "cat", "cate",
... "cater" may all well be valid matches, but I don't want
... duplicate matches of any of them. I know I can filter the
... list containing found matches myself, but that is somewhat
... expensive for a list containing thousands of matches.
... """
>>> import re
>>> rxo = re.compile(r'cat(?:er|e)?')
>>> rxo.findall(s)
['cate', 'cat', 'cate', 'cater', 'cate']
>>> seen = set()
>>> [w for w in (m.group(0) for m in rxo.finditer(s)) if w not in seen and not seen.add(w)]
['cate', 'cat', 'cater']
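The same order-preserving de-dup can also be sketched without the seen-set trick, assuming Python 3.7 or later, where dicts keep insertion order. The helper name unique_in_order and the shortened sample text are made up for illustration:

```python
import re

def unique_in_order(pattern, text):
    """Return each distinct match once, in order of first appearance."""
    # dict.fromkeys keeps only the first occurrence of each key, and
    # dicts preserve insertion order on Python 3.7+ (assumed here).
    return list(dict.fromkeys(m.group(0) for m in re.finditer(pattern, text)))

# Hypothetical sample text, shortened from the session above.
text = 'duplicate "cat", "cate", "cater" duplicate'
print(unique_in_order(r'cat(?:er|e)?', text))  # ['cate', 'cat', 'cater']
```

Either way the list is walked only once, so the filtering cost stays linear in the number of matches.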
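BTW, note that when one alternative is a prefix of another, the longer one has to come first in the re, i.e., r'cat(?:er|e)?' and not r'cat(?:e|er)?' for the above. Alternation takes the first alternative that matches, so with 'e' first, "cater" would only ever match as 'cate'.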
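A quick way to see the effect of the alternative order, as a small sketch (the sample string words is made up for illustration):

```python
import re

# Hypothetical sample input containing all three candidate words.
words = 'cat cate cater'

# Shorter alternative first: 'e' matches and the engine stops there,
# so 'cater' is reported as 'cate'.
shorter_first = re.findall(r'cat(?:e|er)?', words)

# Longer alternative first: 'er' is tried before 'e'.
longer_first = re.findall(r'cat(?:er|e)?', words)

print(shorter_first)  # ['cat', 'cate', 'cate']
print(longer_first)   # ['cat', 'cate', 'cater']
```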
Regards,
Bengt Richter