[Tutor] Why doesn't this regex match???

Sat, 09 Feb 2002 01:25:25 -0500

[Sheila King]
> OK, I'm having some trouble with using the re module for regular
> expression matching. (I'm very new to using regular expressions, so I
> suppose I could be doing something really stupid?)

Congratulations:  you win, but only partly <wink>.

> Here is a session with the interactive interpreter:
>
> Python 2.2 (#28, Dec 21 2001, 12:21:22) [MSC 32 bit (Intel)] on win32
> Type "copyright", "credits" or "license" for more information.
> IDLE 0.8 -- press F1 for help
> >>> import re
> >>> searchstring = 'ADV: FREE FREE OFFERZ!!!!'
> >>> pattern = 'adv:'
> >>> p = re.compile(r'\b%s\b' % pattern)
> >>> result = p.search(searchstring, re.IGNORECASE)
> >>> result
> >>> print result
> None
>
> I would have expected to get a match on the above situation.

This is a pretty amazing failure:  the second argument to p.search is the
position in the string at which to begin the search (read the docs for
this); flags are passed to re.compile(), not to a compiled pattern's
.search().  re.IGNORECASE doesn't make any real sense there, but it just
so happens that

>>> re.IGNORECASE
2
>>>

So you're really asking to search 'V: FREE FREE OFFERZ!!!!' for 'adv:'
(case-sensitively, at that), and of course it isn't found.  If you compile
the pattern like this instead:

    p = re.compile(r'\b%s\b' % re.escape(pattern), re.IGNORECASE)

and search via plain

    p.search(searchstring)

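you'll be in much better shape, but it *still* won't match.  That's because
of the ":\b" in your pattern:  \b matches only at a word boundary, which
means a word character (a letter, digit or underscore) must be on one side
and a non-word character, or an end of the string, on the other.  In your
subject the colon is followed by a space, and neither is a word character,
so there's no boundary there.  In ":\b", the colon is a non-word character,
so this can only match things like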

   :A
   :a

etc.
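
Here's a small sketch of that point, in case seeing it run helps.  It's
written for a current Python 3 rather than the 2.2 in your session, and the
variable names are made up for illustration:

```python
import re

searchstring = 'ADV: FREE FREE OFFERZ!!!!'

# Flags go to re.compile(); re.escape() protects any special characters.
with_trailing_b = re.compile(r'\b%s\b' % re.escape('adv:'), re.IGNORECASE)
without_trailing_b = re.compile(r'\b%s' % re.escape('adv:'), re.IGNORECASE)

# ':' followed by ' ' is two non-word characters: no boundary, no match.
print(with_trailing_b.search(searchstring))               # None

# Dropping the trailing \b lets the same subject match.
print(without_trailing_b.search(searchstring).group(0))   # ADV:

# ':' followed by a word character does form a boundary.
print(with_trailing_b.search('ADV:FREE').group(0))        # ADV:
```

Whether dropping the trailing \b is what you want depends on your phrase
list; it's only one way out.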

Backing off to something simpler is always a good idea until you're
absolutely certain how regexps work:

>>> p = re.compile('adv:', re.IGNORECASE)
>>> p.search(searchstring)
<_sre.SRE_Match object at 0x00794D10>
>>> _.group(0)
'ADV:'
>>>

*Now* you can try making the pattern fancier again.

> Now when I try this:
>
> >>> searchstring = 'Viagra without a prescription!'
> >>> pattern = 'viagra'
> >>> p = re.compile(r'\b%s\b' % pattern)
> >>> result = p.search(searchstring, re.IGNORECASE)
> >>> result
> >>> print result
> None
> >>> searchstring = 'get viagra without a prescription!'
> >>> pattern = 'viagra'
> >>> p = re.compile(r'\b%s\b' % pattern)
> >>> result = p.search(searchstring, re.IGNORECASE)
> >>> result
> <_sre.SRE_Match object at 0x00AF4010>
> >>>
>
> If 'viagra' comes at the beginning, it doesn't match, but if it comes in
> the middle it does.

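This is again because re.IGNORECASE is being used in a context where it
doesn't make sense (it's telling .search() to ignore the first two
characters of the string).  In the second string 'viagra' starts after
those two characters and is already lowercase, so it's found; in the first,
the search starts in the middle of 'Viagra' and the case doesn't match
anyway.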
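
A rough sketch of what those two calls were really doing, again assuming a
current Python 3 (s1, s2, p and fixed are just names picked for the
example).  Slicing is only an approximation of what the start position
does, but it shows which text is left to search:

```python
import re

p = re.compile(r'\bviagra\b')      # no flags reached compile: case-sensitive

# The flag is just the integer 2, so search(s, re.IGNORECASE) means pos=2.
print(int(re.IGNORECASE))          # 2

s1 = 'Viagra without a prescription!'
s2 = 'get viagra without a prescription!'

print(s1[2:])                      # agra without a prescription!
print(s2[2:])                      # t viagra without a prescription!

print(p.search(s1, 2))             # None
print(p.search(s2, 2).group(0))    # viagra

# With the flag passed to compile instead, the first subject matches too.
fixed = re.compile(r'\bviagra\b', re.IGNORECASE)
print(fixed.search(s1).group(0))   # Viagra
```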

> So, one starts to think that \b, the word boundary, won't match at the
> beginning of a string (which is totally contrary to what I would expect).

It does:  \b matches at the start of a string, provided the string begins
with a word character.
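
If you'd like to convince yourself, something along these lines should do
(a sketch for a current Python 3; the test strings are invented):

```python
import re

word_b = re.compile(r'\bviagra', re.IGNORECASE)

# The match begins at index 0, so \b succeeded at the very start.
m = word_b.search('Viagra without a prescription!')
print(m.start(), m.group(0))            # 0 Viagra

# With no word characters at all, there is no boundary anywhere.
print(re.search(r'\b', '!!!'))          # None

# Here the first boundary sits between '!' and 'a', at index 1.
print(re.search(r'\b', '!a').start())   # 1
```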

> ...
> The task I am trying to accomplish right now is this:
>
> I have a list of strings (common words and phrases one might expect to
> find in a Spam email, if that wasn't obvious from the above examples)
> and I want to do a regular expression search against the subject of an
> email and see if I get a match or not (after which I handle the email).

Regexps are a wonderful tool at times, but (a) are difficult to use
correctly, and (b) get used for all sorts of things they're not good at.
One of the best quotes on the topic comes from Jamie Zawinski:

    Some people, when confronted with a problem, think "I know, I'll use
    regular expressions."  Now they have two problems.

Here's another approach to your problem:

1. Convert all your phrases to (say) lowercase first.

2. Say the list is spamphrases, and the subject line is "subject".

   Then

       s = subject.lower()
       isjunk = 0
       for phrase in spamphrases:
           if s.find(phrase) >= 0:
               isjunk = 1
               break

   is worth considering.  It won't tie your head in knots, anyway.
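
   For what it's worth, more recent Pythons let you spell the same idea a
   bit more compactly.  This is only a sketch: is_junk and the sample
   phrase list are invented for illustration, and it assumes Python 3.

```python
def is_junk(subject, spamphrases):
    """Return True if any phrase occurs in the subject, ignoring case."""
    s = subject.lower()
    # Lowercase each phrase here too, so the caller need not pre-convert.
    return any(phrase.lower() in s for phrase in spamphrases)


spamphrases = ['adv:', 'viagra', 'free offer']    # made-up sample list

print(is_junk('ADV: FREE FREE OFFERZ!!!!', spamphrases))         # True
print(is_junk('Viagra without a prescription!', spamphrases))    # True
print(is_junk('Minutes from the Tuesday meeting', spamphrases))  # False
```

   Like the loop above, this is plain substring matching, so a phrase is
   also found inside a longer word; whether that matters depends on the
   phrases you pick.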