[Web-SIG] HTML parsing - get text position and font size
girish.redekar at gmail.com
Mon Jan 12 13:07:37 CET 2009
Thanks Noah - Beautiful Soup does give a tree that can be used - however,
getting from the tree to the result I desire is still a long way.
I'm using lxml (for speed concerns), and it also returns a tree similar to
BS. I have even got as far as parsing the CSS and getting the attributes for
each text element. However, getting from here to a simple list of the form:
[ (word1, fontsize1, position1), (word2, fontsize2, position2), (word3,
fontsize3, position3) ... ]
is still tedious, as font sizes in HTML/CSS can be expressed in multiple
ways (<FONT> tags, sizes in pixels, relative sizes, larger default sizes
for headers, etc.). One can get down and code each of these cases, but I
was hoping someone has already (and reliably) worked on the same problem.
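
For reference, here is a minimal sketch of the case-by-case approach
described above. It is not a complete or reliable solution. It uses the
standard library's html.parser rather than lxml, purely to stay
dependency-free. The base size of 16px, the <FONT size> table, and the
heading multipliers are assumptions loosely modelled on common browser
defaults. The names WordSizeParser and extract_words are hypothetical.
External stylesheets, keyword sizes such as "larger", and units other than
px, em, and % are not handled.

```python
import re
from html.parser import HTMLParser

BASE_PX = 16.0  # assumed default body font size
# Assumed pixel sizes for <font size="1".."7">.
FONT_TAG_PX = {1: 10.0, 2: 13.0, 3: 16.0, 4: 18.0, 5: 24.0, 6: 32.0, 7: 48.0}
# Assumed default heading sizes, relative to the parent element.
HEADING_EM = {"h1": 2.0, "h2": 1.5, "h3": 1.17, "h4": 1.0, "h5": 0.83, "h6": 0.67}
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
SIZE_RE = re.compile(r"font-size\s*:\s*(\d+(?:\.\d+)?)\s*(px|em|%)", re.I)


class WordSizeParser(HTMLParser):
    """Collects (word, font size in px, word index) tuples."""

    def __init__(self):
        super().__init__()
        self.stack = [("#root", BASE_PX)]  # open elements and their sizes
        self.words = []

    def _resolve(self, tag, attrs, parent_px):
        size = parent_px * HEADING_EM.get(tag, 1.0)
        size_attr = (attrs.get("size") or "").strip()
        if tag == "font" and size_attr:
            try:
                # "+2" / "-1" are relative to the default level 3.
                relative = size_attr[0] in "+-"
                level = 3 + int(size_attr) if relative else int(size_attr)
                size = FONT_TAG_PX[min(7, max(1, level))]
            except ValueError:
                pass  # malformed size attribute: keep the inherited size
        match = SIZE_RE.search(attrs.get("style") or "")
        if match:
            value, unit = float(match.group(1)), match.group(2).lower()
            if unit == "px":
                size = value
            elif unit == "em":
                size = parent_px * value
            else:  # percentage of the parent size
                size = parent_px * value / 100
        return size

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never contain text
        parent_px = self.stack[-1][1]
        self.stack.append((tag, self._resolve(tag, dict(attrs), parent_px)))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        tag, size = self.stack[-1]
        if tag in ("script", "style"):
            return  # not visible text
        for word in data.split():
            self.words.append((word, size, len(self.words)))


def extract_words(html):
    parser = WordSizeParser()
    parser.feed(html)
    parser.close()
    return parser.words
```

Calling extract_words(page_source) gives the flat list in the form shown
earlier, with position being the running word index. Tracking a stack of
open elements is what lets relative sizes (em, %) resolve against the
parent, and it tolerates unclosed tags to some extent.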
On Mon, Jan 12, 2009 at 4:59 PM, Noah Gift <noah.gift at gmail.com> wrote:
> 2009/1/13 Girish Redekar <girish.redekar at gmail.com>:
> > I'm trying to build a search engine in python and am stuck at the place
> > where I parse HTML to get useful text. One should ideally be able to
> > parse the text (out of HTML tags) along with its position (for phrase
> > searches) and
> > font-size (to weigh words appropriately).
> > However, this part gets very tedious (especially with bad html and css)
> > and my code is already unwieldy. It seems to me that this task should've
> > been part of any python based semi-sophisticated screen scraper and that
> > it must be a commonly solved problem. Yet, no amount of googling has
> > returned
> > anything useful.
> > Any ideas?
> I wrote this article a way back:
> I didn't fully explore it, but it seems like thread pools and
> Beautiful Soup could work...