Processing XML that's embedded in HTML

Paul McGuire ptmcg at
Wed Jan 23 00:31:19 CET 2008

On Jan 22, 10:57 am, Mike Driscoll <kyoso... at> wrote:
> Hi,
> I need to parse a fairly complex HTML page that has XML embedded in
> it. I've done parsing before with the xml.dom.minidom module on just
> plain XML, but I cannot get it to work with this HTML page.
> The XML looks like this:

Once again (this IS HTML Day!), instead of parsing the HTML, pyparsing
can help lift the interesting bits and leave the rest alone.  Try this
program out:

from pyparsing import *

htmlWithEmbeddedXml = """
<b>Hey! this is really bold!</b>

<Row status="o">
    <Name>Doe, John</Name>
    <Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
  </Row>
  <Row status="o">
    <Name>Doe, Jane</Name>
    <Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
  </Row>
<tr><Td>this is in a table, woo-hoo!</td>
more HTML
blah blah blah...
"""

# define pyparsing expressions for XML tags
rowStart,rowEnd                   = makeXMLTags("Row")
relationshipStart,relationshipEnd = makeXMLTags("Relationship")
priorityStart,priorityEnd         = makeXMLTags("Priority")
startDateStart,startDateEnd       = makeXMLTags("StartDate")
stopsExistStart,stopsExistEnd     = makeXMLTags("StopsExist")
nameStart,nameEnd                 = makeXMLTags("Name")
addressStart,addressEnd           = makeXMLTags("Address")

# define some useful expressions for data of specific types
integer = Word(nums)
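# date pattern assumed to be NN/NN/NNNN (e.g. MM/DD/YYYY)
date = Combine(Word(nums,exact=2)+"/"+Word(nums,exact=2)+"/"+Word(nums,exact=4))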
yesOrNo = oneOf("Yes No")

# conversion parse actions
integer.setParseAction(lambda t: int(t[0]))
yesOrNo.setParseAction(lambda t: t[0]=='Yes')
# could also define a conversion for date if you really wanted to

# define format of a <Row>, plus assign results names for each data
rowRec = (rowStart +
    relationshipStart + SkipTo(relationshipEnd)("relationship") + relationshipEnd +
    priorityStart + integer("priority") + priorityEnd +
    startDateStart + date("startdate") + startDateEnd +
    stopsExistStart + yesOrNo("stopsexist") + stopsExistEnd +
    nameStart + SkipTo(nameEnd)("name") + nameEnd +
    addressStart + SkipTo(addressEnd)("address") + addressEnd +
    rowEnd)

# set filtering parse action
rowRec.setParseAction(withAttribute(relationship="Owner",priority=1))

# find all matching rows, matching grammar and filtering parse action
rows = rowRec.searchString(htmlWithEmbeddedXml)

# print the results (uncomment r.dump() statement to see full
# result for each row)
for r in rows:
    # print(r.dump())
    print(r.relationship)
    print(r.priority)
    print(r.startdate)
    print(r.stopsexist)
    print(r.name)
    print(r.address)

This prints:
Doe, John
1905 S 3rd Ave , Hicksville IA 99999

In addition to parsing this data, some conversions were done at parse
time, too - "1" was converted to the integer 1, and "No" was converted
to False.  These were done by the conversion parse actions.  The
filtering for just those Rows containing Relationship="Owner" and
Priority=1 was done in a more global parse action, created with
withAttribute.  If you comment this line out, you will see that both
rows get retrieved.
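
The date conversion mentioned in the program comments, and the effect
of switching the filter off, can be sketched roughly as below.  This is
only a sketch under assumptions: the sample rows, the MM/DD/YYYY date
format, and the names sample and make_row_parser are made up for
illustration and are not part of the program above.

```python
from datetime import datetime
from pyparsing import (Combine, SkipTo, Word, makeXMLTags, nums,
                       withAttribute)

# hypothetical sample data, invented for this sketch
sample = """
<p>some HTML</p>
<Row status="o">
  <Relationship>Owner</Relationship>
  <Priority>1</Priority>
  <StartDate>01/15/2008</StartDate>
</Row>
<Row status="o">
  <Relationship>Owner</Relationship>
  <Priority>2</Priority>
  <StartDate>02/20/2008</StartDate>
</Row>
<p>more HTML</p>
"""

def make_row_parser(filtered):
    """Build a fresh <Row> grammar; attach the filter only if asked to."""
    rowStart, rowEnd = makeXMLTags("Row")
    relStart, relEnd = makeXMLTags("Relationship")
    priStart, priEnd = makeXMLTags("Priority")
    dateStart, dateEnd = makeXMLTags("StartDate")

    # conversion parse action: digit string -> int
    integer = Word(nums).setParseAction(lambda t: int(t[0]))

    # conversion parse action: date string -> datetime.date
    # (assumes MM/DD/YYYY; adjust the format string for other layouts)
    date = Combine(Word(nums, exact=2) + "/" +
                   Word(nums, exact=2) + "/" +
                   Word(nums, exact=4))
    date.setParseAction(
        lambda t: datetime.strptime(t[0], "%m/%d/%Y").date())

    row = (rowStart
           + relStart + SkipTo(relEnd)("relationship") + relEnd
           + priStart + integer("priority") + priEnd
           + dateStart + date("startdate") + dateEnd
           + rowEnd)

    if filtered:
        # reject any row whose named results do not have these values
        row.setParseAction(withAttribute(relationship="Owner", priority=1))
    return row

owner_rows = make_row_parser(filtered=True).searchString(sample)
all_rows = make_row_parser(filtered=False).searchString(sample)

print(len(owner_rows), len(all_rows))
for r in owner_rows:
    print(r.relationship, r.priority, r.startdate)
```

With this made-up data, owner_rows should hold only the Priority 1 row,
with startdate already converted to a datetime.date, while all_rows
holds both rows.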

-- Paul
(Find out more about pyparsing at
