[Tutor] filtering a webpage for plucking to a Palm

Sun Jun 26 15:31:25 CEST 2005

Brian van den Broek wrote:
> Hi all,
> 
> I have a Palm handheld, and use the excellent (and written in Python) 
> Plucker <http://www.plkr.org/> to spider webpages and format the 
> results for viewing on the Palm.
> 
> One site I 'pluck' is the Daily Python URL 
> <http://www.pythonware.com/daily/>. From the point of view of a daily 
> custom 'newspaper' everything but the last day or two of URLs is so 
> much cruft. (The cruft would be the total history of the last 
> seven-ish days, the navigation links for www.pythonware.com, etc.)
> 
> Today, I wrote a script to parse the Daily URL, and create a minimal 
> local HTML page including nothing but the last n items, n links, or 
> last n days' worth of links. (Which one is used is a user option.) 
> Then, I pluck that, rather than the actual Daily URL site. Works 
> great. :-)  (If anyone on the list is a fellow plucker'er and would be 
> interested in my script, I'm happy to share.)
> 
> In anticipation of wanting to do the same thing to other sites, I've 
> spent a bit of time abstracting it. I've made some real progress. But, 
> before I finish up, I've a voice in the back of my head asking if 
> maybe I'm re-inventing the wheel.
> 
> To my shame, I've not spent very much time at all exploring available 
> frameworks and modules for any domain, and almost none for web-related 
> tasks. So, does anyone know of any modules or frameworks which would 
> make the sort of task I am describing easier?
> 
> The difficulty in making my routine general is that pretty much each 
> site will need its own code for identifying what counts as a distinct 
> item (such as a URL and its description in the Daily URL) and what 
> counts as a distinct block of items (such as a day's worth of Daily URL 
> items). I can't imagine there's a way around that, but if someone else 
> has done much of the work in setting up the general structure to be 
> tweaked for each site, that'd be good to know. (Doesn't feel like one 
> that would be googleable.)

Beautiful Soup can help with parsing the web page and getting at its contents. You could certainly build your filtering script on top of it.
http://www.crummy.com/software/BeautifulSoup/
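
To make the "parse the page, keep only a few items, write a minimal page" step concrete, here is a rough sketch. It deliberately uses the standard library's html.parser instead of Beautiful Soup so that it runs with no extra installs; Beautiful Soup would replace the LinkCollector class with a few lines. The names LinkCollector, first_n_links, minimal_page and the SAMPLE markup are all made up for illustration, and it assumes the newest items come first on the page.

```python
from html import escape
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> in a page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs, in page order
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def first_n_links(html, n):
    """Return the first n links in page order (assumed newest first)."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links[:n]


def minimal_page(links, title="Daily digest"):
    """Build a bare HTML page containing only the given links."""
    items = "\n".join(
        '<li><a href="%s">%s</a></li>' % (escape(href, quote=True), escape(text))
        for href, text in links
    )
    return (
        "<html><head><title>%s</title></head><body><ul>\n%s\n</ul></body></html>"
        % (escape(title), items)
    )


# Hypothetical, deliberately sloppy markup: unclosed <h2>, <p> and <b> tags.
SAMPLE = """
<html><body>
<h2>2005-06-26
<p><a href="http://example.com/one">First item</a> blurb
<p><a href="http://example.com/two">Second <b>item</a>
<h2>2005-06-25
<p><a href="http://example.com/three">Third item</a>
</body>
"""
```

Calling minimal_page(first_n_links(SAMPLE, 2)) gives a page with only the first two links, which is the file you would then point Plucker at. The per-site part is deciding what counts as an item; here it is simply "every link", which is almost certainly too crude for a real site.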

Alternatively, ElementTidy might help. It can parse web pages, and it has limited XPath support. XPath might be a good language for expressing your plucking rules.
http://effbot.org/zone/element-tidylib.htm
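
As a sketch of what "XPath as the rule language" could look like: the per-site code shrinks to a pair of path expressions, one selecting a block (a day) and one selecting the items inside it. This example uses the standard library's xml.etree.ElementTree, which also supports only a limited XPath subset, and it assumes the page has already been cleaned up into well-formed markup with no XML namespace (the job ElementTidy would do). The RULES table, the pluck function and the class="day" markup are hypothetical, not the real Daily URL structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site rules. "block" selects one day's container,
# "item" selects the links inside a block.
RULES = {
    "daily-url": {
        "block": ".//div[@class='day']",
        "item": ".//a[@href]",
    },
}

# Hypothetical, already well-formed sample page.
SAMPLE = """<html><body>
<div class="nav"><a href="/about">About</a></div>
<div class="day"><h2>2005-06-26</h2>
  <p><a href="http://example.com/one">One</a></p>
  <p><a href="http://example.com/two">Two</a></p></div>
<div class="day"><h2>2005-06-25</h2>
  <p><a href="http://example.com/three">Three</a></p></div>
</body></html>"""


def pluck(xhtml, rule, max_blocks):
    """Return a list of blocks, each a list of (href, text) pairs.

    Only the first max_blocks blocks are kept, in document order.
    """
    root = ET.fromstring(xhtml)
    blocks = []
    for block in root.findall(rule["block"])[:max_blocks]:
        links = [
            (a.get("href"), "".join(a.itertext()).strip())
            for a in block.findall(rule["item"])
        ]
        blocks.append(links)
    return blocks
```

With this layout, pluck(SAMPLE, RULES["daily-url"], max_blocks=1) keeps only the links from the first day and drops the navigation link, and supporting another site means adding another entry to RULES instead of writing new parsing code.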

An ideal package would parse real-world HTML and have full XPath support, but I don't know of such a thing. Maybe Amara or lxml?

Kent