[Web-SIG] HTML parsing - get text position and font size

Manlio Perillo manlio_perillo at libero.it
Mon Jan 12 15:11:04 CET 2009


Girish Redekar wrote:
> I'm trying to build a search engine in Python and am stuck at the place
> where I parse HTML to get useful text. One should ideally be able to 
> parse the text (out of HTML tags) along with its position (for phrase 
> searches) and font-size (to weigh words appropriately).
> 

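Word weighting should be based on semantics, not style: that is, on which
element the text appears in (title, headings, emphasis), not on its
rendered font size.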
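
As a rough illustration of that idea, here is a minimal sketch using only
the standard library's html.parser. It records each word together with its
position in the text (for phrase searches) and a weight taken from the
enclosing tags. The names TAG_WEIGHTS, WeightedTextParser and
extract_weighted_words, and the weight values themselves, are assumptions
made up for this example, not part of any existing package.

```python
from html.parser import HTMLParser

# Hypothetical weights per enclosing tag; the values are arbitrary examples.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "h3": 2.0,
               "strong": 1.5, "b": 1.5, "em": 1.2}
# Elements whose content is not visible text.
SKIPPED_TAGS = {"script", "style"}
# Elements that never have an end tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class WeightedTextParser(HTMLParser):
    """Collect (word, position, weight) tuples from an HTML document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []  # stack of currently open elements
        self.tokens = []     # list of (word, position, weight)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; stray end tags
        # that were never opened are ignored.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if any(tag in SKIPPED_TAGS for tag in self.open_tags):
            return
        # A word gets the highest weight among its enclosing elements.
        weight = max((TAG_WEIGHTS.get(tag, 1.0) for tag in self.open_tags),
                     default=1.0)
        for word in data.split():
            # Position is the word's index in the visible text.
            self.tokens.append((word.lower(), len(self.tokens), weight))


def extract_weighted_words(html):
    parser = WeightedTextParser()
    parser.feed(html)
    parser.close()
    return parser.tokens


if __name__ == "__main__":
    page = "<h1>Parsing HTML</h1><p>Plain <b>bold</b> text</p>"
    for token in extract_weighted_words(page):
        print(token)
```

The position here is a plain word index, which is enough to check that two
words are adjacent; punctuation handling and real tokenization are left out
on purpose.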

However, if you really need it, there is the cssutils package for CSS parsing.
I'm writing a CSS parser, too:
http://hg.mperillo.ath.cx/pdfimg/file/tip/pdfimg/style/css/

It uses PLY, so it should be easy to read and modify.
It is still at a very early stage.



 > [...]


Regards  Manlio Perillo

