store tag content with SGMLParser ... ?
Paul McGuire
bogus at bogus.net
Sat Mar 27 10:54:13 EST 2004
"Lukas Holcik" <xholcik1 at fi.muni.cz> wrote in message
news:Hv6rKp.LoL at news.muni.cz...
> Hi Python people!
>
> I'd like to ask you a question about parsing html with SGMLParser class
> from module sgmllib. Is there a way I could get contents of a tag with
> certain properties? For example: I'd like to get a list of all contents
> and hrefs of <a> tags, which have certain value of "class" property:
>
> <a class="section" href="http://example.com">BLABLABLA</a>
>
> I'd like to get stored the "http://example.com" and the "BLABLABLA",
> while class="section". Probably with the use of SGMLParser method
> handle_data. Or better use different approach? Thanks for answers!:)
>
> #Luk
Lukas -
If all your section tags are as rigidly formatted as this one, then a regexp
will probably do. If you have to deal with additional anchor attributes,
you could define a small subset of SGML using pyparsing, matching just the tag
pattern you are looking for. With pyparsing, you don't have to define the
complete SGML syntax, only the desired pattern, and then you extract the
results using scanString. Here is an example that tolerates additional
attributes in the '<a class="section"...' tag.
-- Paul
===========
# get pyparsing at http://pyparsing.sourceforge.net
from pyparsing import Literal, quotedString, CharsNotIn, OneOrMore, Word, alphas
someSGML = """
<SGML>
<tag>some stuff</tag>
<a class="section" href="http://example1.com">BLABLABLA</a>
more blah blah blah...
<a class="section" color="RED" href="http://example2.com">BLEBLEBLE</a>
<SGML_TAG>sldkjflsdkjflsdkjf</SGML_TAG>
<a class="section" href="http://example3.com" size="venti">BLIBLIBLI</a>
</SGML>
"""
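# text between the opening and closing tags: everything up to the next '<'
tagBody = CharsNotIn('<').setResultsName("body")
# the href attribute, with its quoted value saved under the name "href"
href = Literal('href') + '=' + quotedString.setResultsName("href")
# any other name="value" attribute; matched, but not saved under a name
otherAttrDef = Word(alphas) + "=" + quotedString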
sectionDef = ( Literal('<a class="section"') +
               OneOrMore( href | otherAttrDef ) + '>' +
               tagBody + "</a>" )
for match,start,stop in sectionDef.scanString( someSGML ):
    print("href=", match.href)
    print("body=", match.body)
    print()
============
>pythonw -u getSGMLrefs.py
href= "http://example1.com"
body= BLABLABLA
href= "http://example2.com"
body= BLEBLEBLE
href= "http://example3.com"
body= BLIBLIBLI
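
For comparison, here is a rough sketch of the two other routes mentioned in
this thread, run against the same sample text: a regexp that assumes the
fixed format, and a handler-based parser that collects the tag contents in
handle_data. Some assumptions on my part: this targets Python 3, where
sgmllib is no longer available, so html.parser.HTMLParser stands in for
SGMLParser. The names SECTION_RE and SectionLinkParser are made up for
illustration.

```python
import re
from html.parser import HTMLParser

some_sgml = """
<SGML>
<tag>some stuff</tag>
<a class="section" href="http://example1.com">BLABLABLA</a>
more blah blah blah...
<a class="section" color="RED" href="http://example2.com">BLEBLEBLE</a>
<SGML_TAG>sldkjflsdkjflsdkjf</SGML_TAG>
<a class="section" href="http://example3.com" size="venti">BLIBLIBLI</a>
</SGML>
"""

# Route 1: a regexp that assumes the exact fixed format, with class first,
# href second, and no other attributes.
SECTION_RE = re.compile(r'<a class="section" href="([^"]*)">([^<]*)</a>')
regex_links = SECTION_RE.findall(some_sgml)
print(regex_links)  # only the first anchor fits the fixed format


# Route 2: a handler-based parser (HTMLParser here, as a stand-in for
# sgmllib's SGMLParser) that does not care about attribute order.
class SectionLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []          # collected (href, body) pairs
        self._in_section = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "section":
            self._in_section = True
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # only keep text that appears inside a matching <a> tag
        if self._in_section:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_section:
            self.links.append((self._href, "".join(self._text)))
            self._in_section = False


parser = SectionLinkParser()
parser.feed(some_sgml)
parser.close()
for href, body in parser.links:
    print("href=", href)
    print("body=", body)
```

On this sample the regexp finds only the example1 anchor, since the other
two carry extra attributes, while the handler-based parser picks up all
three. Unlike quotedString in the pyparsing version, HTMLParser hands back
the href values without the surrounding quotes.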