Regular Expression question

Frank Potter could.net at gmail.com
Wed Jun 7 21:39:09 EDT 2006


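pyparsing is cool, but using only re is also OK:

# -*- coding: UTF-8 -*-
import re
import urllib2

# fetch the page source
html = urllib2.urlopen("http://www.yahoo.com/").read()

# capture the src value of each <img> tag; note that this pattern only
# matches tags where src is the first attribute after "img"
r = re.compile(r'<img\s+src="(?P<image>[^"]+)"[^>]*>', re.IGNORECASE)
for m in r.finditer(html):
    print m.group('image')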

I got these results:
http://us.i1.yimg.com/us.yimg.com/i/ww/beta/edit_plink.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/125.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/13441.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/136.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/ml.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/my.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/msgn.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/v5_mail_t2.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/news/2006/06/07/0607notorious_big.jpg
http://us.i1.yimg.com/us.yimg.com/i/ww/beta/wthr.gif
http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/04q2/camera.gif
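
This list is shorter than the pyparsing output quoted below, most likely because the pattern requires src to come directly after "img". Here is a rough sketch of a more tolerant pattern that allows other attributes before src and accepts single or double quotes. The sample markup and the name IMG_SRC are made up for illustration, and the snippet is written for Python 3 (print as a function), so treat it as a starting point rather than a tested replacement:

```python
import re

# Hypothetical sample markup; not taken from the Yahoo page.
sample = '''
<img src="a.gif" alt="first">
<IMG width="10" SRC='b.gif'>
<img
   height="5" src="c.jpg" />
<a href="d.html">no image here</a>
'''

# [^>]*? lets other attributes come before src without leaving the tag;
# the backreference (?P=q) makes the closing quote match the opening one.
IMG_SRC = re.compile(
    r'<img\b[^>]*?\bsrc\s*=\s*(?P<q>["\'])(?P<image>.*?)(?P=q)',
    re.IGNORECASE | re.DOTALL,
)

images = [m.group('image') for m in IMG_SRC.finditer(sample)]
print(images)
```

It still ignores unquoted attribute values and would also pick up an attribute such as data-src, so a real HTML parser remains the safer choice for messy pages.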

On 6/8/06, Paul McGuire <ptmcg at austin.rr._bogus_.com> wrote:
> <ken.carlino at gmail.com> wrote in message
> news:1149714949.542234.148800 at y43g2000cwc.googlegroups.com...
> > Hi,
> > I am new to Python regular expressions. I would like to use one to get an
> > attribute of an HTML element from an HTML file.
> >
> > for example, I was able to read the html file using this:
> >     req = urllib2.Request(url=acaURL)
> >     f = urllib2.urlopen(req)
> >
> >     data = f.read()
> >
> > my question is how can I just get the src attribute value of an img
> > tag?
> > something like this:
> > (.*)<img src="href of the image source">(.*)
> >
> > I need to get the href of the image source.
> >
> > Thanks.
> >
>
> As Fredrik pointed out, re's are not the only tool out there.  Here's a
> pyparsing solution.
>
> -- Paul
>
>
> import pyparsing
> import urllib
>
> # define HTML tag format using makeHTMLTags helper
> # (we don't really care about the ending </img> tag,
> # even though makeHTMLTags returns definitions for both
> # starting and ending tag patterns)
> imgStartTag, dummy = pyparsing.makeHTMLTags("img")
>
> # get HTML source from some web site
> htmlPage = urllib.urlopen("http://www.yahoo.com")
> htmlSource = htmlPage.read()
> htmlPage.close()
>
> # scan HTML source, printing SRC attribute from each <img> tag
> for tokens,start,end in imgStartTag.scanString(htmlSource):
>     print tokens.src
>
>
> Prints:
>
> http://us.i1.yimg.com/us.yimg.com/i/ww/beta/edit_plink.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/125.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/13441.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/136.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/ml.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/my.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/msgn.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/v5_mail_t2.gif
> http://us.i1.yimg.com/us.yimg.com/i/mntl/aut/06q2/hea_0411.gif
> http://us.i1.yimg.com/us.yimg.com/i/mntl/aut/06q2/img_0607.jpg
> http://us.i1.yimg.com/us.yimg.com/i/ww/news/2006/06/07/0607notorious_big.jpg
> http://us.i1.yimg.com/us.yimg.com/i/ww/beta/news/video.gif
> http://us.i1.yimg.com/us.yimg.com/i/buzz/2006/06/wholefoodssmall.jpg
> http://us.i1.yimg.com/us.yimg.com/i/mntl/msg/06q2/img_im.jpg
> http://us.i1.yimg.com/us.yimg.com/i/ww/trfc_bckt.gif
> http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/04q2/camera.gif
>
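Since re is not the only tool, the standard library's own HTML parser can also do this with no extra packages. A minimal sketch, assuming Python 3, where the module is html.parser (it was called HTMLParser in Python 2). The class name ImgSrcCollector and the sample string are hypothetical:

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling us.
        # Self-closing tags such as <img ... /> also end up here, because
        # the default handle_startendtag calls handle_starttag.
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value is not None:
                    self.sources.append(value)


# Hypothetical sample markup; not taken from the Yahoo page.
sample = ('<p><IMG SRC="a.gif" alt="x">'
          '<img width="10" src=\'b.gif\' />'
          '<img alt="no source"></p>')

collector = ImgSrcCollector()
collector.feed(sample)
collector.close()
print(collector.sources)
```

To run it on a live page, the string returned by read() would be decoded to text and passed to feed() in place of the sample.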


