Beautiful Soup Looping Extraction Question

Paul McGuire ptmcg at austin.rr.com
Mon Mar 24 19:55:41 EDT 2008


On Mar 24, 6:32 pm, Tess <test... at gmail.com> wrote:
> Hello All,
>
> I have a Beautiful Soup question and I'd appreciate any guidance the
> forum can provide.
>

I *know* you're using Beautiful Soup, and I *know* that BS is the de
facto HTML parser/processor library.  Buuuuuut, I just couldn't help
myself in trying a pyparsing scanning approach to your problem.  See
the program below for a pyparsing treatment of your question.

-- Paul


"""
My goal is to extract all elements where the following is true: <p
align="left"> and <div align="center">.
"""
from pyparsing import makeHTMLTags, withAttribute, keepOriginalText, SkipTo

p,pEnd = makeHTMLTags("P")
p.setParseAction( withAttribute(align="left") )
div,divEnd = makeHTMLTags("DIV")
div.setParseAction( withAttribute(align="center") )

# NOTE: `html` used below is the HTML source text from the original
# question; it is not reproduced in this message.

# basic scanner for matching either <p> or <div> with desired attrib value
patt = ( p + SkipTo(pEnd) + pEnd ) | ( div + SkipTo(divEnd) + divEnd )
patt.setParseAction( keepOriginalText )

print "\nBasic scanning"
for match in patt.searchString(html):
    print match[0]

# simplified data access, by adding some results names
patt = ( p + SkipTo(pEnd)("body") + pEnd )("P") | \
        ( div + SkipTo(divEnd)("body") + divEnd )("DIV")
patt.setParseAction( keepOriginalText )

print "\nSimplified field access using results names"
for match in patt.searchString(html):
    if match.P:
        print "P -", match.body
    if match.DIV:
        print "DIV -", match.body

Prints:

Basic scanning
<p align="left">P1</p>
<div align="center">div2a</div>
<div align="center">div2b</div>
<p align="left">P3</p>
<div align="center">div3b</div>
<p align="left">P4</p>
<div align="center">div4b</div>

Simplified field access using results names
P - P1
DIV - div2a
DIV - div2b
P - P3
DIV - div3b
P - P4
DIV - div4b
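
For comparison, here is a rough sketch of the same filtering idea using only the standard library's html.parser module (Python 3), for readers who have neither pyparsing nor Beautiful Soup installed. It is not part of the program above. The SAMPLE_HTML string, the AlignedTextCollector class, and the extract_aligned helper are all made up for illustration, and the sample input is not the HTML from the original question. It also assumes the matching tags are not nested inside one another.

```python
from html.parser import HTMLParser

# Hypothetical sample input, invented for this sketch.
SAMPLE_HTML = """
<p align="left">P1</p>
<p align="right">skipped</p>
<div align="center">div2a</div>
<div align="left">skipped too</div>
"""

# Tag name -> required value of its align attribute.
WANTED = {"p": "left", "div": "center"}


class AlignedTextCollector(HTMLParser):
    """Collect (tag, text) pairs for <p align="left"> and <div align="center">."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._open_tag = None  # tag currently being captured, if any
        self._chunks = []      # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if self._open_tag is None and tag in WANTED:
            if dict(attrs).get("align") == WANTED[tag]:
                self._open_tag = tag
                self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears inside a tag we are capturing.
        if self._open_tag is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self.results.append((tag, "".join(self._chunks)))
            self._open_tag = None
            self._chunks = []


def extract_aligned(html):
    """Return a list of (tag, body_text) for the matching elements."""
    collector = AlignedTextCollector()
    collector.feed(html)
    collector.close()
    return collector.results


for tag, body in extract_aligned(SAMPLE_HTML):
    print(tag.upper(), "-", body)
```

On the sample input this prints one line per matching element, in the same "P - ..." / "DIV - ..." shape as the results-names output above. Unlike the pyparsing version, it keeps only the text inside each element, not the original markup.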


