[Tutor] Re: extracting html source
pan@uchicago.edu
Tue Apr 15 07:14:02 2003
Hi all,
My apologies to the group for the redundant and unreadable emails
I sent in the previous Tutor digest, Vol 1 #2361. They were caused by the
webmail client of a web hosting company called icdsoft. It has checkboxes for
[ ] don't use MIME format
[ ] don't use HTML format
I thought I could send a pure ASCII email by unchecking both of them, but
obviously I was wrong. I have no idea why the emails I sent from their
webmail only cause problems when they go to this list, but I've given up
using their webmail client.
Below is a (hopefully) readable reply to Wayne's question:
Hi Wayne,
Try this:
>>> import re
>>> htmlSource = 'aaa <span> bbb </span> ccc'
>>> re.findall('<span>.*</span>', htmlSource)    # <== [A]
['<span> bbb </span>']
>>> re.findall('<span>(.*)</span>', htmlSource)  # <== [B]
[' bbb ']
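Note the difference between [A] and [B]: without parentheses, findall returns
the whole match; with a capturing group, it returns only the group's contents.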
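One caveat, offered as a sketch rather than as part of the original session: '.*' is greedy, so if a line holds more than one <span> pair, the pattern captures everything from the first <span> to the last </span>. A non-greedy '.*?' keeps each pair separate. The string two_spans below is a made-up sample, not Wayne's data.

```python
import re

# Hypothetical sample with two <span> pairs on one line (not from the original post).
two_spans = 'aaa <span> bbb </span> ccc <span> ddd </span> eee'

# Greedy: runs from the first <span> to the last </span>.
greedy = re.findall('<span>(.*)</span>', two_spans)

# Non-greedy: stops at the nearest </span>, giving one result per pair.
lazy = re.findall('<span>(.*?)</span>', two_spans)

print(greedy)
print(lazy)
```

If the page can contain several spans, the non-greedy form is probably the one you want.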
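If there's a '\n' between <span> and </span>, '.' will not match across it
(by default '.' does not match a newline), so use a character class instead:
>>> b = '''aaa <span> bb
...  bbb </span> ccc'''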
>>> b
'aaa <span> bb\n bbb </span> ccc'
>>> re.findall(r'<span>[\w\s]*</span>',b)
['<span> bb\n bbb </span>']
>>> re.findall(r'<span>([\w\s]*)</span>',b)
[' bb\n bbb ']
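A possible alternative, assuming the span text may contain more than letters, digits and whitespace: the class [\w\s] stops matching at punctuation, whereas the re.DOTALL flag lets '.' match newlines as well. The string d below is an invented example for illustration only.

```python
import re

# Hypothetical input: span text with punctuation and a newline.
d = 'aaa <span> b-b,\n bbb! </span> ccc'

# [\w\s]* cannot cross '-', ',' or '!', so nothing is found.
by_class = re.findall(r'<span>([\w\s]*)</span>', d)

# With re.DOTALL, '.' also matches '\n'; '.*?' stops at the nearest </span>.
by_dotall = re.findall(r'<span>(.*?)</span>', d, re.DOTALL)

print(by_class)
print(by_dotall)
```

Which form fits better depends on what the real HTML looks like; the character class is stricter, the DOTALL version more forgiving.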
More:
>>> c = ''' aaa <span> bbb1 
... bbb2
... bbb3
... </span>'''
>>> c
' aaa <span> bbb1 \nbbb2\nbbb3\n</span>'
>>> re.findall(r'<span>([\w\s]*)</span>',c)
[' bbb1 \nbbb2\nbbb3\n']
hth
pan