[Tutor] Re: extracting html source
pan <prog@runsun.info>
Mon Apr 14 16:28:14 2003
> What's the simplest way to extract the value from a 'span' tag on
> a specified web page (or any tag for that matter)? (i already
> used urllib to get the html) and there is only one span tag on
> the page.
>
> Regards,
> Wayne
> wkoorts@mweb.co.za
Hi Wayne,
Try this:
>>> htmlSource = 'aaa <span> bbb </span> ccc'
>>> import re
>>> re.findall('<span>.*</span>', htmlSource) # <== [A]
['<span> bbb </span>']
>>> re.findall('<span>(.*)</span>', htmlSource) # <== [B]
[' bbb ']
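Note the difference between [A] and [B]: without parentheses, findall returns the whole match, tags included; with a group in parentheses, it returns only the text that the group captured.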
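Since there is only one span on the page, the same distinction can be shown with re.search, which returns a single match object instead of a list. This is only a minimal sketch in Python 3 syntax; the variable names are made up for illustration.

```python
import re

html_source = 'aaa <span> bbb </span> ccc'

# re.search returns the first match, or None when nothing matches.
match = re.search(r'<span>(.*)</span>', html_source)

if match is not None:
    whole = match.group(0)   # the entire match, tags included, like [A]
    inner = match.group(1)   # only the parenthesised group, like [B]
    print(repr(whole))
    print(repr(inner.strip()))   # strip() drops the surrounding spaces
```

Checking for None first avoids an AttributeError when the page has no span at all.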
If there's a '\n' in between <span> and </span>, '.' won't match it by default, so use a character class instead:
>>> b = '''aaa <span> bb
... bbb </span> ccc'''
>>> b
'aaa <span> bb\n bbb </span> ccc'
>>> re.findall(r'<span>[\w\s]*</span>',b)
['<span> bb\n bbb </span>']
>>> re.findall(r'<span>([\w\s]*)</span>',b)
[' bb\n bbb ']
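One caveat with [\w\s]* is that it only accepts word characters and whitespace, so punctuation inside the span makes the match fail. A possible alternative, sketched here under the assumption that the span holds arbitrary text, is a non-greedy '.*?' combined with the re.DOTALL flag, which lets '.' match '\n' as well. The sample string is invented for the example.

```python
import re

b = 'aaa <span> bb,\n bbb! </span> ccc'

# [\w\s]* cannot get past the comma, so this finds nothing.
strict = re.findall(r'<span>([\w\s]*)</span>', b)

# With re.DOTALL, '.' also matches '\n'; '.*?' stops at the first </span>.
loose = re.findall(r'<span>(.*?)</span>', b, re.DOTALL)

print(strict)
print(loose)
```

The non-greedy form also keeps the match from running on to a later </span> if the page ever gains a second span.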
More:
>>> c = ''' aaa <span> bbb1 
... bbb2
... bbb3
... </span>'''
>>> c
' aaa <span> bbb1 \nbbb2\nbbb3\n</span>'
>>> re.findall(r'<span>([\w\s]*)</span>',c)
[' bbb1 \nbbb2\nbbb3\n']
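The patterns above expect a bare <span> with no attributes. If the real page has something like <span class="...">, a parser-based approach may be more forgiving than a regex. The following is only a sketch using the standard library's html.parser module as found in Python 3; the class name SpanText, the helper span_text, and the class="price" attribute are hypothetical names chosen for the example.

```python
from html.parser import HTMLParser


class SpanText(HTMLParser):
    """Collect the text found inside <span> ... </span> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of <span> tags currently open
        self.chunks = []    # text pieces seen while inside a span

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes are ignored here.
        if tag == 'span':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'span' and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def span_text(html_source):
    """Return the concatenated text of all span elements in html_source."""
    parser = SpanText()
    parser.feed(html_source)
    parser.close()
    return ''.join(parser.chunks)


c = ' aaa <span class="price"> bbb1 \nbbb2\nbbb3\n</span>'
print(repr(span_text(c)))
```

For a page with exactly one plain span, the regex versions are shorter; the parser mainly pays off when the markup is less predictable.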
hth
pan