Regular Expression question

Thu Jun 8 04:34:14 EDT 2006

Paul McGuire wrote:

>> import re
>> r=re.compile('<img\s+src="(?P<image>[^"]+)"[^>]*>',re.IGNORECASE)
>> for m in r.finditer(html):
>>     print m.group('image')
>>
> 
> Ouch - this fails to match any <img> tag that has some other
> attribute, such as "height" or "width", before the "src" attribute. 
> www.yahoo.com has several such tags.

It also fails to match any image tag where the src attribute is quoted 
using single quotes, or where the src attribute is not enclosed in quotes 
at all.
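
A quick check makes these failure cases concrete. This is only a sketch: the
sample tags are made up for illustration, and the pattern is the quoted one,
rewritten as a raw string.

```python
import re

# The pattern from the quoted post, written as a raw string.
r = re.compile(r'<img\s+src="(?P<image>[^"]+)"[^>]*>', re.IGNORECASE)

# Hypothetical sample tags, one per case discussed above.
samples = {
    'double quotes': '<img src="a.jpg">',
    'attribute before src': '<img width=42 src="b.jpg">',
    'single quotes': "<img src='c.jpg'>",
    'no quotes': '<img src=d.jpg>',
}

results = {}
for label, tag in samples.items():
    m = r.search(tag)
    # None means the pattern did not match the tag at all.
    results[label] = m.group('image') if m else None
    print(label, '->', results[label])
```

Only the first sample matches; the other three come back as None.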

Handle all of that correctly in the regex and the Beautiful Soup or 
pyparsing options look even more attractive. In fact, if anyone can write a 
regex which matches the src attribute in a single named group, and 
correctly handles double-quoted, single-quoted and unquoted attributes, 
I'll admit to being impressed (and probably also slightly queasy when 
looking at it).
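
For comparison, here is roughly what the parser route looks like. This sketch
uses the standard library's html.parser as a stand-in, not Beautiful Soup or
pyparsing, and the ImgSrcCollector class and the sample page are hypothetical
names invented for the illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src value of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names and strips the
        # quotes from attribute values, whichever quoting style was used.
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value is not None:
                    self.sources.append(value)


# Made-up page with unquoted, double-quoted and single-quoted src values.
page = ('''<html><body><img width=42 src=fred.jpg><img src="freda.jpg"> '''
        '''<img title='the src="silly" title' src='another'></body></html>''')

collector = ImgSrcCollector()
collector.feed(page)
collector.close()
print(collector.sources)
```

The quoting rules live in the parser, so the calling code never has to care
which style a given tag used.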

Here's my best attempt at a regex that gets it right, but it still gets 
confused by other attributes if they contain spaces.

>>> import re
>>> ATTR = r'''[^\s=>]+(?:=(?:"[^">]*"|'[^'>]*'|[^"'\s>][^\s>]*))?'''
>>> NOTSRC = '(?!src=)' + ATTR
>>> # Skip any attributes that are not src, then capture the src value.
>>> PAT = (r'''<img\s(?:''' + NOTSRC + r'''\s*)*'''
           r'''src=(?:["']?)(?P<image>(?<=")[^">]*|(?<=')[^'>]*|[^ >]*)''')
>>> htmlPage = ('''<html><body><img width=42 src=fred.jpg><img src="freda.jpg"> '''
                '''<img title='the src="silly" title' src='another'></body></html>''')
>>> for m in re.finditer(PAT, htmlPage):
    print(m.group('image'))

fred.jpg
freda.jpg
another
>>>
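
To see where the pattern above still breaks down, here is a small sketch. The
sample tags and the first_src helper are hypothetical, and reading "contain
spaces" as whitespace around the "=" is only one interpretation. A ">" inside
a quoted value is a second case that, as far as I can tell, also defeats it.

```python
import re

# Same pattern as above, written with raw strings.
ATTR = r'''[^\s=>]+(?:=(?:"[^">]*"|'[^'>]*'|[^"'\s>][^\s>]*))?'''
NOTSRC = '(?!src=)' + ATTR
PAT = (r'''<img\s(?:''' + NOTSRC + r'''\s*)*'''
       r'''src=(?:["']?)(?P<image>(?<=")[^">]*|(?<=')[^'>]*|[^ >]*)''')


def first_src(tag):
    """Return the first captured src value in tag, or None (hypothetical helper)."""
    m = re.search(PAT, tag)
    return m.group('image') if m else None


# Baseline: an ordinary unquoted attribute before src is skipped correctly.
print(first_src('<img width=42 src=fred.jpg>'))       # fred.jpg

# Whitespace around "=" in an earlier attribute: ATTR cannot get past the "=".
print(first_src('<img width = 42 src=fred.jpg>'))     # None

# A ">" inside a quoted value: ATTR refuses it, so the match fails.
print(first_src('<img alt="a > b" src=fred.jpg>'))    # None
```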