PyParsing module or HTMLParser

Paul McGuire ptmcg at austin.rr.com
Wed Mar 30 21:06:17 CEST 2005


Lad -

Well, here's what I've got so far.  I'll leave the extraction of the
description to you as an exercise, but as a clue, it looks like it is
delimited by "<b>View Detail</b></a></td></tr></tbody></table> <br>" at
the beginning, and "Quantity: 500<br>" at the end, where 500 could be
any number.  This program will print out:

['Title:', 'Sell 2.4GHz Wireless Mini Color Camera With Audio Function Manufacturers Hong Kong - Exporters, Suppliers, Factories, Seller']
['Contact:', 'Mr. Simon Cheung']
['Company:', 'Lanjin Electronics Co., Ltd.']
['Address:', 'Rm 602, 6/F., Tung Ning Bldg., 2 Hillier Street, Sheung Wan     , Hong Kong\n                , HK\n                ( Hong Kong )']
['Phone:', '852    35763877']
['Fax:', '852    31056238']
['Mobile:', '852-96439737']
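
As a nudge on the description: here is a rough sketch, assuming the
delimiters really are exactly the strings quoted above. The sample
fragment and the names DESC_START, DESC_END and extract_descriptions are
made up for illustration and are not taken from the page.

```python
from pyparsing import Literal, SkipTo, Word, nums

# Delimiters copied from the description above; the real page may differ.
DESC_START = Literal("<b>View Detail</b></a></td></tr></tbody></table> <br>")
# "Quantity: 500<br>", where 500 could be any number
DESC_END = Literal("Quantity:") + Word(nums) + Literal("<br>")

description = DESC_START.suppress() + SkipTo(DESC_END).setResultsName("description")

def extract_descriptions(html):
    """Return the text found between the start and end delimiters."""
    return [tokens["description"].strip()
            for tokens, start, end in description.scanString(html)]

# Made-up fragment, only to exercise the expression.
sample = (
    "<b>View Detail</b></a></td></tr></tbody></table> <br>"
    "Compact camera with audio.<br>"
    "Quantity: 500<br>"
)
print(extract_descriptions(sample))
```

Any markup inside the description (such as the <br> in the sample) is
returned untouched, so it would still need the same kind of clean-up as
the hyperlinks.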

So I think pyparsing will get you pretty far along the way.  Code
attached below (unfortunately, I am posting thru Google Groups, which
strips leading whitespace, so I have inserted '.'s to preserve code
indentation; just strip the leading '.' characters).
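
If stripping by hand is tedious, a small helper along these lines should
do it. The name restore_indentation is made up, and the sketch assumes
that no genuine line of code starts with a '.' character.

```python
import re

def restore_indentation(code):
    """Turn each run of leading '.' characters into the same number of spaces."""
    return "\n".join(
        re.sub(r"^\.+", lambda m: " " * len(m.group()), line)
        for line in code.splitlines()
    )

# Two lines in the dotted form used in this post.
posted = "def endTag( tagname ):\n....return tagname"
print(restore_indentation(posted))
```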

-- Paul

===================================
from pyparsing import *
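try:
....from urllib.request import urlopen  # Python 3
except ImportError:
....from urllib import urlopen          # Python 2

# get input data
url = "http://www.ourglobalmarket.com/Test.htm"
page = urlopen( url )
# assumption: the page is UTF-8 encoded; adjust the codec if it is not
pageHTML = page.read().decode( "utf-8", "replace" )
page.close()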

#~ I would like to extract the title ( it is below Lanjin Electronics
#~ Co., Ltd. )
#~ (Sell 2.4GHz Wireless Mini Color Camera With Audio Function )

#~ description - below the title next to the picture
#~ Contact person
#~ Company name
#~ Address
#~ fax
#~ phone
#~ Website Address

LANGBRK = Literal("<")
RANGBRK = Literal(">")
SLASH = Literal("/")
tagAttr = Word(alphanums) + "=" + dblQuotedString

# helpers for defining HTML tag expressions
def startTag( tagname ):
....return ( LANGBRK + CaselessLiteral(tagname) + \
...............ZeroOrMore(tagAttr) + RANGBRK ).suppress()
def endTag( tagname ):
....return ( LANGBRK + SLASH + CaselessLiteral(tagname) + RANGBRK ).suppress()
def makeHTMLtags( tagname ):
....return startTag(tagname), endTag(tagname)
def strong( expr ):
....return strongStartTag + expr + strongEndTag

strongStartTag, strongEndTag = makeHTMLtags("strong")
titleStart, titleEnd = makeHTMLtags("title")
tdStart, tdEnd = makeHTMLtags("td")
h1Start, h1End = makeHTMLtags("h1")

title = titleStart + SkipTo( titleEnd ).setResultsName("title") + titleEnd
contactPerson = tdStart + h1Start + \
...............SkipTo( h1End ).setResultsName("contact")
company   = ( tdStart + strong("Company:") + tdEnd + tdStart ) + \
...............SkipTo( tdEnd ).setResultsName("company")
address   = ( tdStart + strong("Address:") + tdEnd + tdStart ) + \
...............SkipTo( tdEnd ).setResultsName("address")
phoneNum  = ( tdStart + strong("Phone:") + tdEnd + tdStart ) + \
...............SkipTo( tdEnd ).setResultsName("phoneNum")
faxNum    = ( tdStart + strong("Fax:") + tdEnd + tdStart ) + \
...............SkipTo( tdEnd ).setResultsName("faxNum")
mobileNum = ( tdStart + strong("Mobile:") + tdEnd + tdStart ) + \
...............SkipTo( tdEnd ).setResultsName("mobileNum")
webSite   = ( tdStart + strong("Website Address:") + tdEnd + tdStart ) + \
...............SkipTo( tdEnd ).setResultsName("webSite")
scrapes = title | contactPerson | company | address | phoneNum | \
...............faxNum | mobileNum | webSite

# use parse actions to remove hyperlinks
linkStart, linkEnd = makeHTMLtags("a")
linkExpr = linkStart + SkipTo( linkEnd ) + linkEnd
def stripHyperLink(s,l,t):
....return [ t[0], linkExpr.transformString( t[1] ) ]
company.setParseAction( stripHyperLink )

# use parse actions to add labels for data elements that don't
# have labels in the HTML
def prependLabel(pre):
....def prependAction(s,l,t):
........return [pre] + t[:]
....return prependAction
title.setParseAction( prependLabel("Title:") )
contactPerson.setParseAction( prependLabel("Contact:") )

for tokens,start,end in scrapes.scanString( pageHTML ):
....print(tokens)
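
In case the test page is no longer online, here is a cut-down,
self-contained sketch of the same approach that runs against an inline
sample. SAMPLE_HTML is invented for illustration, and the helper names
(make_tags, labeled_cell, scrape) are mine; only the overall technique
(suppressed tag expressions, SkipTo, parse actions, scanString) follows
the program above.

```python
from pyparsing import (CaselessLiteral, Literal, SkipTo, Word, ZeroOrMore,
                       alphanums, dblQuotedString)

# Hypothetical stand-in for the real page.
SAMPLE_HTML = """
<html><head><title>Sell Widgets</title></head><body><table>
<tr><td><strong>Company:</strong></td>
<td><a href="http://example.com">Example Co., Ltd.</a></td></tr>
<tr><td><strong>Phone:</strong></td><td>852 12345678</td></tr>
</table></body></html>
"""

tag_attr = Word(alphanums) + "=" + dblQuotedString

def make_tags(name):
    """Return suppressed (start, end) expressions for one HTML tag."""
    start = (Literal("<") + CaselessLiteral(name)
             + ZeroOrMore(tag_attr) + Literal(">")).suppress()
    end = (Literal("<") + Literal("/") + CaselessLiteral(name)
           + Literal(">")).suppress()
    return start, end

td_start, td_end = make_tags("td")
strong_start, strong_end = make_tags("strong")
title_start, title_end = make_tags("title")
link_start, link_end = make_tags("a")

def labeled_cell(label):
    """Match <td><strong>label</strong></td><td>value, stopping before </td>."""
    return (td_start + strong_start + Literal(label) + strong_end + td_end
            + td_start + SkipTo(td_end))

title = title_start + SkipTo(title_end) + title_end
company = labeled_cell("Company:")
phone = labeled_cell("Phone:")

# Parse action: drop <a ...> and </a> but keep the link text.
link_expr = link_start + SkipTo(link_end) + link_end
def strip_hyperlink(s, loc, toks):
    return [toks[0], link_expr.transformString(toks[1])]
company.setParseAction(strip_hyperlink)

# Parse action factory: add a label to data that has none in the HTML.
def prepend_label(label):
    def action(s, loc, toks):
        return [label] + list(toks)
    return action
title.setParseAction(prepend_label("Title:"))

scrapes = title | company | phone

def scrape(html):
    """Collect every match as a plain [label, value] list."""
    return [list(tokens) for tokens, start, end in scrapes.scanString(html)]

for row in scrape(SAMPLE_HTML):
    print(row)
```

Each row comes out as a two-item list of label and value, in the same
shape as the output listed at the top of this post.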



