Readability (html purifier) in Python
Дамјан Георгиевски
gdamjan at gmail.com
Wed Jun 16 15:51:21 EDT 2010
>> http://lab.arc90.com/experiments/readability/
>>
>> Readability is a javascript bookmarklet that "makes reading on the
>> Web more enjoyable by removing the clutter around what you're
>> reading."
>>
>> Does anyone know of something similar in Python?
>
> Well, that sounds like a browser tool.
Yes, it's a bookmarklet: a small piece of JavaScript that, when clicked,
runs on the current document in the browser.
> Could you be a bit more specific about what kind of "similar"
> functionality you would expect from a "similar" Python tool?
> How would you tell it "what you're reading", for example?
I'm not sure I understand your question correctly, but anyway.
What I need is a package that, given an arbitrary HTML document (a page
from any website), would extract the meaningful content and filter out
the junk (advertisements, non-content elements, any other UI, etc.).
Readability seems to do some heuristic manipulation of the DOM, but
I'm not that good at reading/understanding its source code. Of course
it can't be 100% correct, but it's good enough in many cases.
http://code.google.com/p/arc90labs-readability/source/browse/trunk/js/readability.js
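
To make "extract the meaningful content, filter the junk" a bit more
concrete, here is a rough sketch of that kind of heuristic using only the
standard library. It is not Readability's algorithm, just an illustration
under simple assumptions: content lives in <p> elements, anything inside
nav/header/footer/aside/form/script/style is junk, and paragraphs that are
very short or mostly link text are dropped. The names ParagraphCollector
and extract_text and the two thresholds are made up for this example.

```python
from html.parser import HTMLParser

# Elements whose contents are assumed to be non-content (an assumption,
# not a rule taken from Readability).
SKIP = {"script", "style", "nav", "header", "footer", "aside", "form"}


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> and how much of it is link text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0    # > 0 while inside a SKIP element
        self.in_p = False      # currently inside a <p> we want to keep
        self.in_a = 0          # nesting depth of <a> elements
        self.text = []         # text chunks of the current paragraph
        self.link_chars = 0    # characters of the current paragraph inside <a>
        self.paragraphs = []   # list of (text, link_density)

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag == "p" and not self.skip_depth:
            self._flush()      # an unclosed <p> ends where the next begins
            self.in_p = True
        elif tag == "a":
            self.in_a += 1

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "p":
            self._flush()
        elif tag == "a":
            self.in_a = max(0, self.in_a - 1)

    def handle_data(self, data):
        if self.in_p and not self.skip_depth:
            self.text.append(data)
            if self.in_a:
                self.link_chars += len(data)

    def _flush(self):
        """Store the current paragraph (if any) and reset the buffers."""
        if self.in_p:
            text = " ".join("".join(self.text).split())
            if text:
                density = min(1.0, self.link_chars / len(text))
                self.paragraphs.append((text, density))
        self.in_p = False
        self.text = []
        self.link_chars = 0


def extract_text(html, min_chars=40, max_link_density=0.3):
    """Return paragraphs that look like content rather than navigation or ads."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser._flush()            # keep a trailing paragraph that was never closed
    return [text for text, density in parser.paragraphs
            if len(text) >= min_chars and density <= max_link_density]
```

Calling extract_text(page_source) on the HTML of a page returns a list of
paragraph strings. A real tool would need to do much more (score the
containers around the paragraphs, look at class/id names, handle pages
that don't use <p> at all), but the basic shape is the same.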
--
дамјан ((( http://damjan.softver.org.mk/ )))
war is peace
freedom is slavery
restrictions are enablement