Regular Expression question
Duncan Booth
duncan.booth at invalid.invalid
Thu Jun 8 04:34:14 EDT 2006
Paul McGuire wrote:
>> import re
>> r=re.compile('<img\s+src="(?P<image>[^"]+)"[^>]*>',re.IGNORECASE)
>> for m in r.finditer(html):
>>     print m.group('image')
>>
>
> Ouch - this fails to match any <img> tag that has some other
> attribute, such as "height" or "width", before the "src" attribute.
> www.yahoo.com has several such tags.
It also fails to match any image tag where the src attribute is quoted
using single quotes, or where the src attribute is not enclosed in quotes
at all.
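To make those failure cases concrete, here is a small sketch (written for Python 3) that runs the quoted pattern against one tag per quoting style. The names ORIGINAL, samples and results, and the sample tags themselves, are made up for illustration.

```python
import re

# The pattern from the quoted message, written as a raw string.
ORIGINAL = re.compile(r'<img\s+src="(?P<image>[^"]+)"[^>]*>', re.IGNORECASE)

# Hypothetical sample tags, one per situation discussed in the thread.
samples = {
    "double": '<img src="a.png">',
    "single": "<img src='b.png'>",
    "unquoted": "<img src=c.png>",
    "attr_first": '<img width="10" src="d.png">',
}

# Record the captured src, or None when the pattern does not match at all.
results = {}
for name, tag in samples.items():
    m = ORIGINAL.search(tag)
    results[name] = m.group("image") if m else None

for name, value in results.items():
    print(name, value)
```

Only the double-quoted tag with src as the first attribute should produce a match; the other three should come back as None.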
Handle all of that correctly in the regex and the Beautiful Soup or
pyparsing options look even more attractive. In fact, if anyone can write a
regex which matches the source attribute in a single named group, and
correctly handles double, single and unquoted attributes, I'll admit to
being impressed (and probably also slightly queasy when looking at it).
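For comparison, here is roughly what the parser route looks like. This is only a sketch, and it uses the standard library's html.parser (Python 3) rather than Beautiful Soup or pyparsing, so that it runs without installing anything. ImgSrcCollector and img_sources are hypothetical names of my own.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; values arrive with
        # any surrounding quotes already removed.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


def img_sources(html):
    """Return the src values of all <img> tags found in html."""
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


page = ("<html><body><img width=42 src=fred.jpg><img\n"
        "src=\"freda.jpg\"> <img title='the src=\"silly\" title'\n"
        "src='another'></body></html>")
print(img_sources(page))
```

The parser deals with quoting and attribute order itself, so the code never has to distinguish double, single and unquoted values.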
Here's my best attempt at a regex that gets it right, but it still gets
confused by other attributes if they contain spaces.
>>> import re
>>> ATTR = r'''[^\s=>]+(?:=(?:"[^">]*"|'[^'>]*'|[^"'\s>][^\s>]*))?'''
>>> NOTSRC = '(?!src=)' + ATTR
>>> PAT = (r'''<img\s(?:''' + NOTSRC +
...     r'''\s*)*src=(?:["']?)(?P<image>(?<=")[^">]*|(?<=')[^'>]*|[^ >]*)''')
>>> htmlPage = '''<html><body><img width=42 src=fred.jpg><img
... src="freda.jpg"> <img title='the src="silly" title'
... src='another'></body></html>'''
>>> r = re.compile(PAT)
>>> for m in r.finditer(htmlPage):
...     print m.group('image')
...
fred.jpg
freda.jpg
another
>>>
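For anyone who wants to try this outside the interactive prompt, here is the same set of patterns as a minimal Python 3 sketch. IMG_SRC, find_image_sources and html_page are names I have introduced; the patterns themselves are copied from the session above, just written as raw strings. The last call is an assumed example of the kind of input that still trips the pattern up, namely whitespace around the '=' of an earlier attribute.

```python
import re

# Patterns copied from the session above, written as raw strings.
ATTR = r'''[^\s=>]+(?:=(?:"[^">]*"|'[^'>]*'|[^"'\s>][^\s>]*))?'''
NOTSRC = '(?!src=)' + ATTR
PAT = (r'''<img\s(?:''' + NOTSRC +
       r'''\s*)*src=(?:["']?)(?P<image>(?<=")[^">]*|(?<=')[^'>]*|[^ >]*)''')

IMG_SRC = re.compile(PAT)  # hypothetical name, not from the original post


def find_image_sources(html):
    """Return whatever the 'image' group captured for each match."""
    return [m.group('image') for m in IMG_SRC.finditer(html)]


html_page = '''<html><body><img width=42 src=fred.jpg><img
src="freda.jpg"> <img title='the src="silly" title'
src='another'></body></html>'''

print(find_image_sources(html_page))

# Assumed failure case: spaces around '=' in an attribute that precedes src.
print(find_image_sources('<img alt = "x" src=late.png>'))
```

The second call should print an empty list, because ATTR expects the '=' to follow the attribute name directly.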