[Tutor] HTML scraping

Bob Gailer bgailer at sbcglobal.net
Thu Jun 30 04:49:18 CEST 2005


At 10:36 AM 6/26/2005, Nathan Hughes wrote:
>I've been looking for a way to scrape the data from an HTML table, but I
>don't know where to start or how to do it.
>
>An example of the table can be found here
>(http://www.dragon256.plus.com/timer.html). I'd like to extract all the
>data except for the delete column and then just print each row.

Use module urllib2 for obtaining the page source:

import urllib2
page = urllib2.urlopen("http://www.dragon256.plus.com/timer.html")
html = page.readlines()

You now have a list of lines.
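
For readers on Python 3, where urllib2 no longer exists, a roughly
equivalent sketch uses urllib.request. The helper name fetch_lines and the
utf-8 default are assumptions made for illustration; the page may well use a
different encoding.

```python
from urllib.request import urlopen


def fetch_lines(url, encoding="utf-8"):
    """Return the page at `url` as a list of text lines (hypothetical helper)."""
    # urlopen yields bytes in Python 3, so each line is decoded to str.
    # The encoding is an assumption; undecodable bytes are replaced.
    with urlopen(url) as page:
        return [raw.decode(encoding, errors="replace") for raw in page.readlines()]
```

Calling fetch_lines("http://www.dragon256.plus.com/timer.html") should then
give the same kind of list of lines as the urllib2 version.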

Now you can use any number of string parsing tools to locate the lines that
start with <tr> (a new row), then the <td> tags (one per cell), and then
search past the tag(s) to find the cell text.
There are 3 kinds of cell to deal with: text wrapped in a link, plain text,
and the delete checkbox, which you want to skip:

<td class='normal' align='left'><a href='javascript:OnTimer (1)'>Glastonbury 2005</a></td>

<td class='normal' align='left'>BBC THREE</td>

<td class='normal' align='middle'><input type='checkbox' onclick='OnDelete (1)'></td>
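
One possible way to handle all three cases, sketched here for Python 3, is
the standard library's html.parser instead of hand-written string searches.
The class name TableRows, the sample string, and the enclosing <tr> tags are
assumptions for illustration. The sketch also assumes the delete column is
the only cell that contains an <input> tag.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # cells of the row being read, or None
        self._cell = None       # text pieces of the cell being read, or None
        self._skip = False      # True if the current cell holds an <input>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []
            self._skip = False
        elif tag == "input" and self._cell is not None:
            # Assumption: only the delete column contains an <input>.
            self._skip = True

    def handle_data(self, data):
        # Text inside nested tags such as <a> arrives here as well.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            if not self._skip:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Sample row built from the three cells quoted above (the <tr> is assumed).
sample = """
<tr>
<td class='normal' align='left'><a href='javascript:OnTimer (1)'>Glastonbury 2005</a></td>
<td class='normal' align='left'>BBC THREE</td>
<td class='normal' align='middle'><input type='checkbox' onclick='OnDelete (1)'></td>
</tr>
"""

parser = TableRows()
parser.feed(sample)
parser.close()
rows = parser.rows
for row in rows:
    print(row)
```

To run this on the real page, the fetched lines could be joined with
"".join(html) and passed to parser.feed in place of the sample string.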

Is that enough to get you started?

Bob Gailer
mailto:bgailer at alum.rpi.edu
510 558 3275 home
720 938 2625 cell  