[Tutor] filtering a webpage for plucking to a Palm
Kent Johnson
kent37 at tds.net
Sun Jun 26 15:31:25 CEST 2005
Brian van den Broek wrote:
> Hi all,
>
> I have a Palm handheld, and use the excellent (and written in Python)
> Plucker <http://www.plkr.org/> to spider webpages and format the
> results for viewing on the Palm.
>
> One site I 'pluck' is the Daily Python URL
> <http://www.pythonware.com/daily/>. From the point of view of a daily
> custom 'newspaper' everything but the last day or two of URLs is so
> much cruft. (The cruft would be the total history of the last
> seven'ish days, the navigation links for www.pythonware.com, etc.)
>
> Today, I wrote a script to parse the Daily URL, and create a minimal
> local html page including nothing but the last n items, n links, or
> last n days' worth of links. (Which is employed is a user option.)
> Then, I pluck that, rather than the actual Daily URL site. Works
> great. :-) (If anyone on the list is a fellow plucker'er and would be
> interested in my script, I'm happy to share.)
>
> In anticipation of wanting to do the same thing to other sites, I've
> spent a bit of time abstracting it. I've made some real progress. But,
> before I finish up, I've a voice in the back of my head asking if
> maybe I'm re-inventing the wheel.
>
> To my shame, I've not spent very much time at all exploring available
> frameworks and modules for any domain, and almost none for web-related
> tasks. So, does anyone know of any modules or frameworks which would
> make the sort of task I am describing easier?
>
> The difficulty in making my routine general is that pretty much each
> site will need its own code for identifying what counts as a distinct
> item (such as a URL and its description in the Daily URL) and what
> counts as a distinct block of items (such as a day's worth of Daily URL
> items). I can't imagine there's a way around that, but if someone else
> has done much of the work in setting up the general structure to be
> tweaked for each site, that'd be good to know. (Doesn't feel like one
> that would be googleable.)
Beautiful Soup can help with parsing and accessing the web page. You could certainly write your plucker on top of it.
http://www.crummy.com/software/BeautifulSoup/
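To make the idea concrete, here is a rough sketch of the "keep only the first n links" step. It uses the standard library's html.parser as a stand-in for Beautiful Soup so that it runs without installing anything; Beautiful Soup would likely make it shorter and more tolerant of bad markup. The sample page, the names LinkCollector and first_n_links, and the "keep only absolute http links" rule are all made up for illustration, and the sketch assumes the newest items come first on the page.

```python
from html.parser import HTMLParser

# Made-up sample page: one navigation link (cruft) and three dated items.
SAMPLE = """<html><body>
<div class="nav"><a href="/index.htm">home</a></div>
<h2>2005-06-26</h2>
<p><a href="http://example.com/a">Item A</a> - first item</p>
<p><a href="http://example.com/b">Item B</a> - second item</p>
<h2>2005-06-25</h2>
<p><a href="http://example.com/c">Item C</a></p>
</body></html>"""


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a href=...>, in page order."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> we are currently inside, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def first_n_links(html, n, keep=lambda href: href.startswith("http://")):
    """Return the first n links whose href passes the site-specific `keep` test."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [link for link in parser.links if keep(link[0])][:n]


if __name__ == "__main__":
    for href, text in first_n_links(SAMPLE, 2):
        print(href, text)
```

The `keep` argument is where the per-site knowledge would go; everything else stays the same from site to site.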
Alternatively, ElementTidy might help. It can parse web pages and has limited XPath support. XPath might be a good language for expressing your plucking rules.
http://effbot.org/zone/element-tidylib.htm
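Here is a rough sketch of what XPath-style plucking rules could look like. It uses the limited XPath subset in the standard library's xml.etree.ElementTree rather than ElementTidy itself, and it assumes the page is already well-formed; a real page would first need a tidying parser to get to that point. The sample page, the RULES dictionary and the pluck function are hypothetical names invented for this example.

```python
import xml.etree.ElementTree as ET

# Made-up, already well-formed sample page with two "day" blocks.
PAGE = """<html><body>
<div class="nav"><a href="/index.htm">home</a></div>
<div class="day"><h2>2005-06-26</h2>
<p class="item"><a href="http://example.com/a">Item A</a></p>
<p class="item"><a href="http://example.com/b">Item B</a></p>
</div>
<div class="day"><h2>2005-06-25</h2>
<p class="item"><a href="http://example.com/c">Item C</a></p>
</div>
</body></html>"""

# Per-site rules: one path that finds a block of items (here, a day)
# and one path, relative to a block, that finds the items inside it.
RULES = {
    "block": ".//div[@class='day']",
    "item": "./p[@class='item']/a",
}


def pluck(xml_text, rules, max_blocks):
    """Return up to max_blocks blocks, each a list of (href, text) pairs."""
    root = ET.fromstring(xml_text)
    result = []
    for block in root.findall(rules["block"])[:max_blocks]:
        items = [(a.get("href"), a.text) for a in block.findall(rules["item"])]
        result.append(items)
    return result


if __name__ == "__main__":
    for day in pluck(PAGE, RULES, 1):
        for href, text in day:
            print(href, text)
```

With this layout, supporting another site would mostly mean writing a new RULES dictionary, not new code.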
An ideal package would parse real-world HTML and have full XPath support, but I don't know of such a thing... maybe Amara or lxml?
Kent