[Tutor] Regular expressions question

Thu Dec 6 03:08:51 CET 2012

On Wed, Dec 5, 2012 at 7:13 PM, Ed Owens <eowens0124 at gmx.com> wrote:
>>>> str(string)
> '[<div class="wx-timestamp">\n<div class="wx-subtitle wx-timestamp">Updated:
> Dec 5, 2012, 5:08pm EST</div>\n</div>]'
>>>> m = re.search('":\b(\w+\s+\d+,\s+\d+,\s+\d+:\d+.m\s+\w+)<', str(string))
>>>> print m
> None

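You need a raw string for the boundary marker \b (i.e., the boundary
between \w and \W); otherwise Python turns it into a backspace control
character before the re module ever sees the pattern. Also, I don't see
why you have ": at the start of the expression, since that sequence
doesn't occur anywhere in your string. This works: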

    >>> s = 'Updated: Dec 5, 2012, 5:08pm EST</div>'
    >>> m = re.search(r'\b(\w+\s+\d+,\s+\d+,\s+\d+:\d+.m\s+\w+)<', s)
    >>> m.group(1)
    'Dec 5, 2012, 5:08pm EST'
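
To see the raw-string difference in isolation, here is a minimal sketch
that reuses the sample string above. The names no_match and match are
only illustrative, and the pattern is cut down to \bDec so the escape
handling is the only thing that varies:

```python
import re

s = 'Updated: Dec 5, 2012, 5:08pm EST</div>'

# Without the r prefix, Python converts \b to a single backspace
# character (0x08) while parsing the string literal.
assert '\b' == '\x08' and len('\b') == 1

# With the r prefix the backslash survives as its own character, so the
# re module receives the two characters \ and b and reads them as a
# word boundary.
assert len(r'\b') == 2

no_match = re.search('\bDec', s)   # searches for backspace + "Dec"
match = re.search(r'\bDec', s)     # searches for "Dec" at a word boundary

print(no_match)        # None, since s contains no backspace
print(match.group(0))  # Dec
```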

But wouldn't it be simpler and more reliable to use an HTML parser?
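
As a rough sketch of the parser route, assuming the markup always looks
like your sample and using only the standard library: the class
TimestampParser and its attributes are made up for this illustration,
and the import is the Python 3 spelling (on Python 2 the module is
named HTMLParser).

```python
from html.parser import HTMLParser


class TimestampParser(HTMLParser):
    """Collect the text of each <div> whose class list has wx-subtitle."""

    def __init__(self):
        super().__init__()
        self.in_target = False   # True while inside a matching <div>
        self.found = []          # text collected from matching <div>s

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'div' and 'wx-subtitle' in classes:
            self.in_target = True

    def handle_endtag(self, tag):
        # Assumes the target <div> has no nested <div> children.
        if tag == 'div':
            self.in_target = False

    def handle_data(self, data):
        if self.in_target:
            self.found.append(data.strip())


html = ('<div class="wx-timestamp">\n'
        '<div class="wx-subtitle wx-timestamp">Updated: '
        'Dec 5, 2012, 5:08pm EST</div>\n</div>')

parser = TimestampParser()
parser.feed(html)
parser.close()

# Drop the "Updated: " label; splitting on ': ' leaves 5:08pm intact.
timestamp = parser.found[0].split(': ', 1)[1]
print(timestamp)
```

This keys on the tag and its class rather than on the exact characters
around the date, so whitespace or attribute changes in the page are less
likely to break it than they would the regex.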