store tag content with SGMLParser ... ?

Paul McGuire bogus at bogus.net
Sat Mar 27 10:54:13 EST 2004


"Lukas Holcik" <xholcik1 at fi.muni.cz> wrote in message
news:Hv6rKp.LoL at news.muni.cz...
> Hi Python people!
>
> I'd like to ask you a question about parsing html with SGMLParser class
> from module sgmllib. Is there a way I could get contents of a tag with
> certain properties? For example: I'd like to get a list of all contents
> and hrefs of <a> tags, which have certain value of "class" property:
>
> <a class="section" href="http://example.com">BLABLABLA</a>
>
> I'd like to get stored the "http://example.com" and the "BLABLABLA",
> while class="section". Probably with the use of SGMLParser method
> handle_data. Or better use different approach? Thanks for answers!:)
>
> #Luk
Lukas -

If all your section anchors are as fixed in format as this one, then a regexp
will probably do.  If you have to deal with additional anchor attributes,
you could define a small subset of SGML using pyparsing, matching only the tag
pattern you are looking for.  With pyparsing, you don't have to define the
complete SGML syntax, just the desired pattern - then extract the results
using scanString.  Here is an example that tolerates additional attributes
in the '<a class="section"...' tag, as long as class="section" comes first.

-- Paul

===========
# get pyparsing at http://pyparsing.sourceforge.net
from pyparsing import (Literal, quotedString, CharsNotIn, OneOrMore, Word,
                       alphas)

someSGML = """
<SGML>
<tag>some stuff</tag>
<a class="section" href="http://example1.com">BLABLABLA</a>
more blah blah blah...
<a class="section" color="RED" href="http://example2.com">BLEBLEBLE</a>
<SGML_TAG>sldkjflsdkjflsdkjf</SGML_TAG>
<a class="section" href="http://example3.com" size="venti">BLIBLIBLI</a>
</SGML>
"""

tagBody = CharsNotIn('<').setResultsName("body")
href = Literal('href') + '=' + quotedString.setResultsName("href")
otherAttrDef = Word(alphas) + "=" + quotedString
sectionDef = ( Literal('<a class="section"') +
               OneOrMore( href | otherAttrDef ) + '>' +
               tagBody + "</a>" )

# scanString yields (tokens, start, end) for each match found in the text
for match,start,stop in sectionDef.scanString( someSGML ):
    print("href=", match.href)
    print("body=", match.body)
    print()

============
>pythonw -u getSGMLrefs.py
href= "http://example1.com"
body= BLABLABLA

href= "http://example2.com"
body= BLEBLEBLE

href= "http://example3.com"
body= BLIBLIBLI
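
For the fixed-format case mentioned above, a regexp may be enough. The following is only a minimal sketch, not part of the original example: the names SECTION_RE and find_sections are made up for illustration, and the pattern assumes that class="section" is the first attribute, that attribute values are double-quoted, and that no attribute value contains '>'.

```python
import re

sample = """
<a class="section" href="http://example1.com">BLABLABLA</a>
<a class="section" color="RED" href="http://example2.com">BLEBLEBLE</a>
<a class="other" href="http://example4.com">SKIPPED</a>
<a class="section" href="http://example3.com" size="venti">BLIBLIBLI</a>
"""

# Hypothetical pattern: group 1 is the href value, group 2 is the tag body.
# [^>]* skips any other attributes before or after href inside the same tag.
SECTION_RE = re.compile(
    r'<a\s+class="section"[^>]*?\bhref="([^"]*)"[^>]*>([^<]*)</a>'
)

def find_sections(text):
    """Return a list of (href, body) tuples for each matching anchor."""
    return SECTION_RE.findall(text)

for href, body in find_sections(sample):
    print("href=", href)
    print("body=", body)
```

Unlike quotedString in the pyparsing version, the capture group here leaves out the surrounding quotes of the href value.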

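The original question asked about doing this with handler methods such as handle_data. sgmllib was removed in Python 3, so the sketch below uses html.parser.HTMLParser from the standard library instead, which offers the same handler style. This is one possible approach under that assumption, not the method from the post; the class name SectionLinkParser and the helper section_links are hypothetical.

```python
from html.parser import HTMLParser

class SectionLinkParser(HTMLParser):
    """Collect (href, text) pairs for <a class="section"> tags."""

    def __init__(self):
        super().__init__()
        self.links = []           # finished (href, text) pairs
        self._in_section = False  # True while inside a matching <a> tag
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "section":
            self._in_section = True
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a matching anchor
        if self._in_section:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_section:
            self.links.append((self._href, "".join(self._text)))
            self._in_section = False

def section_links(html):
    parser = SectionLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links

print(section_links(
    '<a class="section" href="http://example.com">BLABLABLA</a>'
    '<a class="other" href="http://example.org">skipped</a>'
))
```

Because the attributes are read into a dict, this version does not depend on the order in which class and href appear in the tag.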





