[Web-SIG] HTML parsing - get text position and font size
noah.gift at gmail.com
Mon Jan 12 12:29:11 CET 2009
2009/1/13 Girish Redekar <girish.redekar at gmail.com>:
> I'm trying to build a search engine in Python and am stuck at the place where I
> parse HTML to get useful text. One should ideally be able to parse the text
> (out of HTML tags) along with its position (for phrase searches) and
> font-size (to weigh words appropriately).
> However, this part gets very tedious (especially with bad HTML and CSS) and
> my code is already unwieldy. It seems to me that this task should've been a
> part of any Python-based semi-sophisticated screen scraper and that it would
> be a commonly solved problem. Yet, no amount of googling has returned
> anything useful.
> Any ideas?
I wrote this article a while back:
I didn't fully explore it, but it seems like thread pools and
Beautiful Soup could work...
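
For the text/position/weight part of the question, here is a minimal
sketch of one possible approach. It is not taken from the article; it
uses the standard library's html.parser as a stand-in for Beautiful
Soup so that it runs without extra installs. The names TAG_WEIGHTS,
SKIP_TAGS, WeightedTextParser and extract_words are made up for
illustration, and the per-tag weights are an assumed proxy for font
size: real CSS font-size handling is not attempted here.

```python
from html.parser import HTMLParser

# Assumed weights: a rough proxy for font size, keyed by tag name.
TAG_WEIGHTS = {"h1": 3.0, "h2": 2.5, "h3": 2.0, "b": 1.5, "strong": 1.5}
# Tags whose contents are not visible text.
SKIP_TAGS = {"script", "style"}
TRACKED = set(TAG_WEIGHTS) | SKIP_TAGS


class WeightedTextParser(HTMLParser):
    """Collect (position, word, weight) tuples from visible text."""

    def __init__(self):
        super().__init__()
        self.words = []       # list of (position, word, weight)
        self._stack = []      # currently open tracked tags
        self._skip_depth = 0  # > 0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in TRACKED:
            self._stack.append(tag)
            if tag in SKIP_TAGS:
                self._skip_depth += 1

    def handle_endtag(self, tag):
        # Tolerate sloppy markup: ignore stray closing tags, and pop
        # back to the most recent matching open tag otherwise.
        if tag not in self._stack:
            return
        while self._stack:
            popped = self._stack.pop()
            if popped in SKIP_TAGS:
                self._skip_depth -= 1
            if popped == tag:
                break

    def handle_data(self, data):
        if self._skip_depth:
            return
        # The heaviest enclosing tag decides the weight; plain text is 1.0.
        weight = max((TAG_WEIGHTS[t] for t in self._stack
                      if t in TAG_WEIGHTS), default=1.0)
        for word in data.split():
            # Position is the running word index, usable for phrase search.
            self.words.append((len(self.words), word.lower(), weight))


def extract_words(html):
    """Return a list of (position, word, weight) for an HTML string."""
    parser = WeightedTextParser()
    parser.feed(html)
    parser.close()
    return parser.words
```

Calling extract_words() on a page's HTML gives one tuple per word, in
document order, which could feed an inverted index. Since each page is
parsed independently, the calls could also be spread over a thread pool.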