Help with parsing web page
Paul McGuire
ptmcg at austin.rr._bogus_.com
Tue Jun 15 10:07:50 EDT 2004
"RiGGa" <rigga at hasnomail.com> wrote in message
news:aFkzc.15232$NK4.2386151 at stones.force9.net...
> Hi,
>
> I want to parse a web page in Python and have it write certain values out
to
> a mysql database. I really dont know where to start with parsing the html
> code ( I can work out the database part ). I have had a look at htmllib
> but I need more info. Can anyone point me in the right direction , a
> tutorial or something would be great.
>
> Many thanks
>
> RiGga
>
RiGGa -
The following program is included in the examples shipped with pyparsing.
This uses a slightly different technique than working with a complete HTML
parser - instead, it scans the input HTML for an expected pattern, and
extracts it (and several named subfields). You can accomplish this same
behavior using regular expressions, but you might find pyparsing a bit
easier to read.
This program uses urllib to capture the HTML from NIST's time server web
site, then scans the HTML for NTP servers. The expected pattern is:
<td>ip-address</td><td>arbitrary text giving server location</td>
For example:
<td>132.163.4.101</td>
<td>NIST, Boulder, Colorado</td>
(pyparsing ignores whitespace, so the line breaks and tabs are not a
concern. If you convert to regexp's, you need to add re fields for the
whitespace.)
The output from running this program gives:
129.6.15.28 - NIST, Gaithersburg, Maryland
129.6.15.29 - NIST, Gaithersburg, Maryland
132.163.4.101 - NIST, Boulder, Colorado
132.163.4.102 - NIST, Boulder, Colorado
132.163.4.103 - NIST, Boulder, Colorado
128.138.140.44 - University of Colorado, Boulder
192.43.244.18 - NCAR, Boulder, Colorado
131.107.1.10 - Microsoft, Redmond, Washington
69.25.96.13 - Symmetricom, San Jose, California
216.200.93.8 - Abovenet, Virginia
208.184.49.9 - Abovenet, New York City
207.126.98.204 - Abovenet, San Jose, California
207.200.81.113 - TrueTime, AOL facility, Sunnyvale, California
64.236.96.53 - TrueTime, AOL facility, Virginia
Download pyparsing at http://pyparsing.sourceforge.net .
-- Paul
# getNTPservers.py
#
# Demonstration of the parsing module, implementing a HTML page scanner,
# to extract a list of NTP time servers from the NIST web site.
#
# Copyright 2004, by Paul McGuire
#
from pyparsing import Word, Combine, Suppress, CharsNotIn, nums
import urllib
integer = Word(nums)
ipAddress = Combine( integer + "." + integer + "." + integer + "." +
integer )
tdStart = Suppress("<td>")
tdEnd = Suppress("</td>")
timeServerPattern = tdStart + ipAddress.setResultsName("ipAddr") + tdEnd +
\
tdStart + CharsNotIn("<").setResultsName("loc") + tdEnd
# get list of time servers
nistTimeServerURL =
"http://www.boulder.nist.gov/timefreq/service/time-servers.html"
serverListPage = urllib.urlopen( nistTimeServerURL )
serverListHTML = serverListPage.read()
serverListPage.close()
addrs = {}
for srvr,startloc,endloc in timeServerPattern.scanString( serverListHTML ):
print srvr.ipAddr, "-", srvr.loc
addrs[srvr.ipAddr] = srvr.loc
# or do this:
#~ addr,loc = srvr
#~ print addr, "-", loc
More information about the Python-list
mailing list