catastrophic regexp, help!

alfasub000 at gmail.com alfasub000 at gmail.com
Thu Jun 12 07:41:16 CEST 2008


On Jun 11, 11:07 pm, cirfu <circularf... at yahoo.se> wrote:
> On 11 Juni, 10:25, Chris <cwi... at gmail.com> wrote:
>
>
>
> > On Jun 11, 6:20 am, cirfu <circularf... at yahoo.se> wrote:
>
> > > pat = re.compile("(\w* *)*")
> > > this matches all sentences.
> > > if fed the string "are you crazy? i am" it will return "are you
> > > crazy".
>
> > > i want to find a in a big string a sentence containing Zlatan
> > > Ibrahimovic and some other text.
> > > ie return the first sentence containing the name Zlatan Ibrahimovic.
>
> > > patzln = re.compile("(\w* *)* zlatan ibrahimovic (\w* *)*")
> > > should do this according to regexcoach but it seems to send my
> > > computer into 100%CPU-power and not closable.
>
> > Maybe something like this would be of use...
>
> > def sentence_locator(s, sub):
> >     cnt = s.upper().count(sub.upper())
> >     if not cnt:
> >         return None
> >     tmp = []
> >     idx = -1
> >     while cnt:
> >         idx = s.upper().find(sub.upper(), (idx+1))
> >         a = -1
> >         while True:
> >             b = s.find('.', (a+1), idx)
> >             if b == -1:
> >                 b = s.find('.', idx)
> >                 if b == -1:
> >                     tmp.append(s[a+1:])
> >                     break
> >                 tmp.append(s[a+1:b+1])
> >                 break
> >             a = b
> >         cnt -= 1
> >     return tmp
>
> yes, seems very unpythonic though :)
> must be a simpler way that isnt slow as hell.

Why wouldn't you use character classes instead of groups? E.g.:

    import re
    pat = re.compile(r'([ \w]*Zlatan Ibrahimovic[ \w]*)')
    # search() scans the whole text; it returns None if the name is absent
    sentence = pat.search(text).group(1)
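
A slightly fuller sketch of the same idea, in case it helps. The helper
name first_sentence_with and the sample text are made up for
illustration; re.IGNORECASE is my assumption, added because the original
pattern was written in lower case. Note that a "sentence" here is just a
run of word characters and spaces, so commas or apostrophes will cut it
short.

```python
import re

def first_sentence_with(name, text):
    """Return the first run of word characters and spaces containing name.

    Hypothetical helper: returns None when the name does not occur.
    """
    # One character class on each side of the name, no nested quantifiers.
    pat = re.compile(r'[ \w]*' + re.escape(name) + r'[ \w]*', re.IGNORECASE)
    m = pat.search(text)
    return m.group(0).strip() if m else None

text = "He scored twice. Last night Zlatan Ibrahimovic was brilliant! Who else?"
print(first_sentence_with("zlatan ibrahimovic", text))
# prints: Last night Zlatan Ibrahimovic was brilliant
```

re.escape() is there only so that a name containing regex metacharacters
is matched literally.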

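As has been mentioned earlier, certain evil combinations of quantifiers
and groups, such as the nested (\w* *)* above, will cause Python's
regular expression engine to go (righteously) crazy: when the match
fails, the backtracking matcher has to try every way of splitting the
text between the inner and the outer repetition, and the number of
splits grows exponentially with the length of the input.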
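
If you want to see the blow-up for yourself, here is a rough timing
sketch. The helper time_search and the test strings are my own
invention, and the input lengths are kept small on purpose so the slow
pattern still finishes; actual timings will depend on your machine and
Python version.

```python
import re
import time

# Nested quantifiers, as in the quoted post.
slow = re.compile(r"(\w* *)* zlatan ibrahimovic (\w* *)*")
# Same intent, but with a single character class on each side.
fast = re.compile(r"[ \w]* zlatan ibrahimovic [ \w]*")

def time_search(pat, text):
    """Return (match_or_None, elapsed_seconds) for pat.search(text)."""
    start = time.perf_counter()
    result = pat.search(text)
    return result, time.perf_counter() - start

# A run of letters with no match forces the engine to exhaust every split.
for n in (8, 12, 16):
    text = "a" * n
    _, slow_t = time_search(slow, text)
    _, fast_t = time_search(fast, text)
    print(f"n={n:2d}  nested groups: {slow_t:.5f}s  character class: {fast_t:.5f}s")
```

Each extra character should roughly double the time for the nested
pattern, which is why a whole paragraph of text never comes back.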


