[Tutor] filtering a webpage for plucking to a Palm

Kent Johnson kent37 at tds.net
Sun Jun 26 15:31:25 CEST 2005

Brian van den Broek wrote:
> Hi all,
> I have a Palm handheld, and use the excellent (and written in Python) 
> Plucker <http://www.plkr.org/> to spider webpages and format the 
> results for viewing on the Palm.
> One site I 'pluck' is the Daily Python URL 
> <http://www.pythonware.com/daily/>. From the point of view of a daily 
> custom 'newspaper' everything but the last day or two of URLs is so 
> much cruft. (The cruft would be the total history of the last 
> seven'ish days, the navigation links for www.pythonware.com, etc.)
> Today, I wrote a script to parse the Daily URL, and create a minimal 
> local html page including nothing but the last n items, n links, or 
> last n days' worth of links. (Which one is used is a user option.) 
> Then, I pluck that, rather than the actual Daily URL site. Works 
> great. :-)  (If anyone on the list is a fellow plucker'er and would be 
> interested in my script, I'm happy to share.)
> In anticipation of wanting to do the same thing to other sites, I've 
> spent a bit of time abstracting it. I've made some real progress. But, 
> before I finish up, I've a voice in the back of my head asking if 
> maybe I'm re-inventing the wheel.
> To my shame, I've not spent very much time at all exploring available 
> frameworks and modules for any domain, and almost none for web-related 
> tasks. So, does anyone know of any modules or frameworks which would 
> make the sort of task I am describing easier?
> The difficulty in making my routine general is that pretty much each 
> site will need its own code for identifying what counts as a distinct 
> item (such as a URL and its description in the Daily URL) and what 
> counts as a distinct block of items (such as a day's worth of Daily URL 
> items). I can't imagine there's a way around that, but if someone else 
> has done much of the work in setting up the general structure to be 
> tweaked for each site, that'd be good to know. (Doesn't feel like one 
> that would be googleable.)

Beautiful Soup can help with parsing and accessing the web page. You could certainly write your plucker on top of it.

Alternatively, ElementTidy might help. It can parse web pages and has limited XPath support. XPath might be a good language for expressing your plucking rules.
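
To make the XPath idea concrete, here is a minimal sketch of per-site plucking rules written as XPath expressions. It uses the standard library's xml.etree.ElementTree, which also has only limited XPath support, as a stand-in for ElementTidy. The page markup, the class names, and the names PAGE, RULES, pluck and to_html are all made up for illustration; a real page would first need tidying into well-formed markup, and its blocks and items would need their own expressions.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical, already well-formed page (an assumption, not a real site).
# Real-world HTML would have to be tidied before ElementTree could parse it.
PAGE = """
<html><body>
  <div class="nav"><a href="/home">Home</a></div>
  <div class="day"><h2>2005-06-26</h2>
    <p class="item"><a href="http://example.com/a">Item A</a></p>
    <p class="item"><a href="http://example.com/b">Item B</a></p>
  </div>
  <div class="day"><h2>2005-06-25</h2>
    <p class="item"><a href="http://example.com/c">Item C</a></p>
  </div>
  <div class="day"><h2>2005-06-24</h2>
    <p class="item"><a href="http://example.com/d">Item D</a></p>
  </div>
</body></html>
"""

# Per-site rules (hypothetical): one XPath for a block of items,
# one XPath for the links inside a block.
RULES = {
    "block": ".//div[@class='day']",
    "item": ".//p[@class='item']/a",
}


def pluck(page, rules, n_blocks):
    """Return (href, text) pairs from the first n_blocks blocks of the page."""
    root = ET.fromstring(page)
    links = []
    # findall returns matches in document order; this assumes newest first.
    for block in root.findall(rules["block"])[:n_blocks]:
        for a in block.findall(rules["item"]):
            links.append((a.get("href"), a.text))
    return links


def to_html(links):
    """Build a minimal local page containing only the plucked links."""
    rows = ['<li><a href="%s">%s</a></li>' % (escape(href), escape(text))
            for href, text in links]
    return "<html><body><ul>\n%s\n</ul></body></html>" % "\n".join(rows)
```

With this layout, pluck(PAGE, RULES, 2) keeps only the links from the first two day blocks and drops the navigation links, and to_html turns them into a small page to pluck. Supporting another site would then mostly mean writing a different RULES dict.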

An ideal package would parse real-world HTML and offer full XPath support, but I don't know of one... maybe amara or lxml?

