[Web-SIG] HTML parsing - get text position and font size
Manlio Perillo
manlio_perillo at libero.it
Mon Jan 12 15:11:04 CET 2009
Girish Redekar wrote:
> I'm trying to build a search engine in Python and am stuck at the place
> where I parse HTML to get useful text. One should ideally be able to
> parse the text (out of HTML tags) along with its position (for phrase
> searches) and font-size (to weigh words appropriately).
>
Word weighting should be based on semantics, not style.
However, if you really need it, there is the cssutils package for CSS parsing.
I'm writing a CSS parser, too:
http://hg.mperillo.ath.cx/pdfimg/file/tip/pdfimg/style/css/
It uses PLY, so it should be easy to read and modify.
It is still at a very early stage.
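
To illustrate weighting by semantics rather than style, here is a minimal sketch using only the standard library's html.parser module (Python 3 naming). It records each word with its position in the text, which is what a phrase search needs, and a weight taken from the enclosing tags. The TAG_WEIGHTS values and the names WeightedTextParser and extract_weighted_words are assumptions made up for this example, not part of any package mentioned above.

```python
from html.parser import HTMLParser

# Hypothetical weights per semantic tag; tune these for your own ranking.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "h3": 2.0,
               "strong": 1.5, "b": 1.5, "em": 1.2}
# Text inside these tags is not visible content and is ignored.
SKIPPED_TAGS = {"script", "style"}


class WeightedTextParser(HTMLParser):
    """Collect (position, word, weight) tuples from an HTML document."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of open tags that affect weighting
        self.words = []      # (position, word, weight)

    def handle_starttag(self, tag, attrs):
        # Track only the tags we care about, so void tags such as
        # <br> or <img> never pile up on the stack.
        if tag in TAG_WEIGHTS or tag in SKIPPED_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching tag; stray end tags are ignored.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in SKIPPED_TAGS for t in self.open_tags):
            return
        # A word takes the highest weight among its enclosing tags.
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags), default=1.0)
        for word in data.split():
            self.words.append((len(self.words), word.lower(), weight))


def extract_weighted_words(html):
    """Return a list of (position, word, weight) for the given HTML."""
    parser = WeightedTextParser()
    parser.feed(html)
    parser.close()
    return parser.words


if __name__ == "__main__":
    sample = "<h1>Big news</h1><p>plain <b>bold</b> text</p>"
    for position, word, weight in extract_weighted_words(sample):
        print(position, word, weight)
```

The position is a simple running word index, so two words are adjacent in the text when their positions differ by one. Real-world HTML is often malformed, so a more tolerant parser may be needed in practice.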
> [...]
Regards, Manlio Perillo