[Tutor] Re: extracting html source

Magnus Lyckå magnus@thinkware.se
Thu Apr 17 02:55:02 2003


At Tue, 15 Apr 2003 06:11:55 -0500, pan@uchicago.edu wrote:
>Try this:
>
>htmlSource = 'aaa <span> bbb </span> ccc'
>
> >>> import re
>
> >>> re.findall('<span>.*</span>', htmlSource)  # <== [A]
>['<span> bbb </span>']
>
> >>> re.findall('<span>(.*)</span>', htmlSource)  # <== [B]
>[' bbb ']

No, please don't! ;)

It's been said many times that REs are inadequate for HTML parsing,
and the Python standard library contains two HTML parsers already,
so why build your own? There is one in htmllib, and another in the
HTMLParser module, which also copes better with XHTML.

Here's an example.

from HTMLParser import HTMLParser
import sys

class SpanExtractor(HTMLParser):
     show = False
     def handle_starttag(self, tag, attr):
         if tag == 'span':
             self.show = True
         if self.show:
             sys.stdout.write(self.get_starttag_text())
     def handle_endtag(self, tag):
         if self.show:
             # get_starttag_text() would repeat the last *start* tag here
             sys.stdout.write('</%s>' % tag)
         if tag == 'span':
             self.show = False
     def handle_data(self, data):
         if self.show:
             sys.stdout.write(data)

html='''<html><body>
bla bla bla
<span>in the span
in the span
<b>in the span</b>
in the span</SPAN>
bla bla
</body></html>'''

p = SpanExtractor()
p.feed(html)
p.close()

If you don't want to include the <span> tags themselves, just
swap the order of the two if blocks in both handle_starttag
and handle_endtag.
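
The swap could look roughly like the sketch below. It is only an
illustration, with a few assumptions of mine: it uses the Python 3
module name (html.parser instead of HTMLParser), it collects the output
in a list called parts rather than writing to sys.stdout, and the
names SpanContents and span_contents are made up for this example.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class SpanContents(HTMLParser):
    """Collect what is inside <span>...</span>, without the span tags."""

    def __init__(self):
        super().__init__()
        self.show = False
        self.parts = []  # hypothetical buffer, used instead of sys.stdout

    def handle_starttag(self, tag, attrs):
        # Echo first, then switch on: the opening <span> itself is skipped.
        if self.show:
            self.parts.append(self.get_starttag_text())
        if tag == 'span':
            self.show = True

    def handle_endtag(self, tag):
        # Switch off first, then echo: the closing </span> is skipped.
        if tag == 'span':
            self.show = False
        if self.show:
            self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        if self.show:
            self.parts.append(data)


def span_contents(html):
    """Hypothetical helper: feed the HTML and return the collected text."""
    p = SpanContents()
    p.feed(html)
    p.close()
    return ''.join(p.parts)
```

Like the original class, this keeps a single on/off flag, so it does
not try to handle a <span> nested inside another <span>.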

Sure, this is a bit more code than the RE, but you will soon
notice that you need to do more, and then this is a much better
platform. And one day you might have a '</span>' inside a
comment, which will trip up the RE but not the parser.
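
To make the comment problem concrete, here is a small sketch that runs
a non-greedy RE and a parser over the same input. The sample string
and the names SpanText, regex_result and parser_result are mine, and
the code assumes Python 3 (html.parser), so treat it as an
illustration rather than the one right way to do it.

```python
import re
from html.parser import HTMLParser

# Sample input (made up): a '</span>' hidden inside an HTML comment.
html = '<span>real text <!-- a stray </span> in a comment --> more text</span>'

# Even a non-greedy RE stops at the first '</span>' it sees,
# which here is the one inside the comment.
regex_result = re.findall(r'<span>(.*?)</span>', html, re.DOTALL)


class SpanText(HTMLParser):
    """Collect only the text data found inside <span> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many spans are currently open
        self.chunks = []  # text pieces seen while inside a span

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'span' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


parser = SpanText()
parser.feed(html)
parser.close()

# The comment goes to handle_comment (a no-op here), so its '</span>'
# never reaches handle_endtag and the text after it is still collected.
parser_result = ''.join(parser.chunks)
```

The RE result is cut off in the middle of the comment, while the
parser result contains the text on both sides of the comment and
nothing from inside it.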

A really nifty third-party module is PyMeld. It is great if you
have control over the HTML and can put id attributes in it.
See below:

 >>> import PyMeld
 >>> html = '''<html>bla bla
... bla <span id='mySpan'>this
... is in
... the span</span>bla bla </html>'''
 >>> p = PyMeld.Meld(html)
 >>> p.mySpan
<PyMeld.Meld instance at 0x0081A7C8>
 >>> print p.mySpan
<span id='mySpan'>this
is in
the span</span>
 >>> p.mySpan = "Here is some text.\n"
 >>> p.mySpan += "Let's add some more..."
 >>> print p
<html>bla bla
bla <span id='mySpan'>Here is some text.
Let's add some more...</span>bla bla </html>

You see? You can easily modify parts...

 >>> print p.mySpan
<span id='mySpan'>Here is some text.
Let's add some more...</span>
 >>> print p.mySpan._content
Here is some text.
Let's add some more...

Or extract parts, with or without the surrounding tags.

 >>> p.mySpan.x = "y"
 >>> print p
<html>bla bla
bla <span x="y" id='mySpan'>Here is some text.
Let's add some more...</span>bla bla </html>

You can also add / change attributes...

 >>> del p.mySpan.x
 >>> print p
<html>bla bla
bla <span id='mySpan'>Here is some text.
Let's add some more...</span>bla bla </html>

Or remove them. There are more features available.
See http://www.entrian.com/PyMeld/


--
Magnus Lycka, magnus@thinkware.se
Thinkware AB, www.thinkware.se