[Tutor] Re: extracting html source

pan pan <prog@runsun.info>
Mon Apr 14 16:28:14 2003


   >  What's the simplest way to extract the value from a 'span' tag on
   >  a specified web page (or any tag for that matter)? (i already
   >  used urllib to get the html) and there is only one span tag on
   >  the page.
   >
   >  Regards,
   >  Wayne
   >  wkoorts@mweb.co.za

Hi Wayne,

Try this:

>>> htmlSource = 'aaa <span> bbb </span> ccc'

>>> import re

>>> re.findall('<span>.*</span>', htmlSource)  # <== [A]
['<span> bbb </span>']

>>> re.findall('<span>(.*)</span>', htmlSource)  # <== [B]
[' bbb ']


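Note the difference between [A] and [B]: [A] has no group, so findall
returns the whole match, tags included. In [B] the parentheses form a
group, so findall returns only the text captured inside it.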
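
Since the page has only one span tag, re.search may be a convenient alternative to findall. The following is a minimal sketch, not part of the session above; the names html_source, whole and inner are just illustrative. It shows that a single match object gives access to both results:

```python
import re

html_source = 'aaa <span> bbb </span> ccc'

# re.search returns the first match, or None if nothing matches.
match = re.search(r'<span>(.*)</span>', html_source)
if match:
    whole = match.group(0)   # entire match, tags included, like [A]
    inner = match.group(1)   # only the parenthesised group, like [B]
    print(repr(whole))
    print(repr(inner.strip()))   # strip() drops the surrounding spaces
```

Checking for None first matters if the page might not contain a span at all.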

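If there's a '\n' in between <span> and </span>, '.*' no longer works,
because by default '.' matches any character except a newline. A
character class that includes whitespace gets around this: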

>>> b = '''aaa <span> bb
...      bbb </span> ccc'''

>>> b
'aaa <span> bb\n     bbb </span> ccc'

>>> re.findall(r'<span>[\w\s]*</span>',b)
['<span> bb\n     bbb </span>']

>>> re.findall(r'<span>([\w\s]*)</span>',b)
[' bb\n     bbb ']
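
One caveat with [\w\s]* is that it stops at punctuation. A possible alternative, sketched below under the assumption that the span text may contain commas and the like, is '.' with the re.DOTALL flag plus the non-greedy '*?'. The test string (with a second span added) is made up for illustration:

```python
import re

# Hypothetical input: punctuation inside the first span, plus a second span.
b = 'aaa <span> bb, \n     bbb! </span> ccc <span>second</span>'

# [\w\s]* cannot cross ',' or '!', so the first span is silently skipped.
narrow = re.findall(r'<span>([\w\s]*)</span>', b)

# re.DOTALL lets '.' match '\n'; '*?' stops at the first '</span>'.
lazy = re.findall(r'<span>(.*?)</span>', b, re.DOTALL)

# Greedy '.*' with DOTALL runs on to the last '</span>' instead.
greedy = re.findall(r'<span>(.*)</span>', b, re.DOTALL)

print(narrow)
print(lazy)
print(greedy)
```

With a single span on the page the greedy and non-greedy forms give the same answer; the difference only shows once a second span appears.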


More:

>>> c=''' aaa <span> bbb1 
... bbb2
... bbb3
... </span>'''

>>> c
' aaa <span> bbb1 \nbbb2\nbbb3\n</span>'

>>> re.findall(r'<span>([\w\s]*)</span>',c)
[' bbb1 \nbbb2\nbbb3\n']
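
The patterns above assume the tag is written exactly as <span>. If the real page has attributes on the tag (for example class="x", which is an assumption here, not something from the original question), a regex for the literal '<span>' will miss it. A rough sketch using the standard library's html.parser instead; SpanTextParser is a hypothetical helper name:

```python
from html.parser import HTMLParser


class SpanTextParser(HTMLParser):
    """Collect the text inside each outermost <span> (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # number of <span> tags currently open
        self.spans = []      # finished text of each outermost span
        self._chunks = []    # text pieces of the span being read

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'span' and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Outermost span closed: store its accumulated text.
                self.spans.append(''.join(self._chunks))
                self._chunks = []

    def handle_data(self, data):
        if self.depth > 0:
            self._chunks.append(data)


c = ' aaa <span class="x"> bbb1 \nbbb2\nbbb3\n</span>'
parser = SpanTextParser()
parser.feed(c)
parser.close()
print(parser.spans)
```

This is more code than a one-line regex, but it does not depend on how the tag happens to be written.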


hth
pan
