Thanks Noah - Beautiful Soup does give a usable tree; however, getting from that tree to the result I want is still a long way.<br><br>I'm using lxml (for speed concerns), and it also returns a tree similar to Beautiful Soup's. I have even got as far as parsing the CSS and getting the attributes for each text element. However, getting from here to a simple list of the form: <br>
[ (word1, fontsize1, position1), (word2, fontsize2, position2), (word3, fontsize3, position3) ... ]<br>is still tedious, as font sizes in HTML/CSS can be expressed in multiple ways (<FONT> tags, sizes in pixels, relative sizes, larger default sizes for headers, etc.). One can get down and code each of these cases, but I was hoping someone has already (and reliably) solved this problem.<br>
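<br>To make the target concrete, here is a minimal sketch of the case-by-case approach, not a finished solution. It uses the standard library's html.parser instead of lxml so that it runs without extra dependencies. The names (WordSizeParser, resolve_size, extract_words) are hypothetical, and BASE_PX, HEADING_SCALE and FONT_TAG_PX are assumed values modelled on typical browser defaults. It only looks at FONT tags, inline style attributes and heading defaults; external stylesheets and the full CSS cascade are not handled.<br>

```python
import re
from html.parser import HTMLParser

BASE_PX = 16.0  # assumption: typical browser default font size
# assumption: typical default heading scale factors relative to the parent size
HEADING_SCALE = {"h1": 2.0, "h2": 1.5, "h3": 1.17, "h4": 1.0, "h5": 0.83, "h6": 0.67}
# assumption: rough pixel equivalents of <font size="1..7">
FONT_TAG_PX = {1: 10, 2: 13, 3: 16, 4: 18, 5: 24, 6: 32, 7: 48}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}
SIZE_RE = re.compile(r"font-size\s*:\s*([\d.]+)\s*(px|pt|em|%)", re.I)


def resolve_size(tag, attrs, parent_px):
    """Return the font size in pixels for an element, given its parent's size."""
    size = parent_px * HEADING_SCALE.get(tag, 1.0)
    raw = (attrs.get("size") or "").strip()
    if tag == "font" and raw:
        try:
            # "+2" / "-1" are relative to the default size 3; plain digits are absolute
            n = 3 + int(raw) if raw[0] in "+-" else int(raw)
            size = float(FONT_TAG_PX[max(1, min(7, n))])
        except ValueError:
            pass  # malformed size attribute: keep the inherited size
    match = SIZE_RE.search(attrs.get("style") or "")
    if match:
        value, unit = float(match.group(1)), match.group(2).lower()
        if unit == "px":
            size = value
        elif unit == "pt":
            size = value * 96.0 / 72.0
        elif unit == "em":
            size = value * parent_px
        else:  # percentage of the parent size
            size = parent_px * value / 100.0
    return size


class WordSizeParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = [("", BASE_PX)]  # (tag, resolved size) of open elements
        self.words = []               # (word, fontsize, position) tuples

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never contain text
        self.stack.append((tag, resolve_size(tag, dict(attrs), self.stack[-1][1])))

    def handle_endtag(self, tag):
        # Tolerate bad HTML: close back to the nearest matching open tag,
        # and ignore end tags that were never opened.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if any(t in SKIP_TAGS for t, _ in self.stack):
            return  # text inside <script> or <style> is not page content
        for word in re.findall(r"\w+", data):
            self.words.append((word, self.stack[-1][1], len(self.words)))


def extract_words(html):
    parser = WordSizeParser()
    parser.feed(html)
    parser.close()
    return parser.words


if __name__ == "__main__":
    sample = '<h1>Big Title</h1><p style="font-size: 12px">small <font size="5">loud</font> text</p>'
    for entry in extract_words(sample):
        print(entry)
```

The position here is simply the running word index, which is enough for phrase searches; the same stack-of-sizes idea should carry over to walking an lxml tree.<br>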
<br>Thanks,<br>Girish<br><br><br><div class="gmail_quote">On Mon, Jan 12, 2009 at 4:59 PM, Noah Gift <span dir="ltr"><<a href="mailto:noah.gift@gmail.com">noah.gift@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
2009/1/13 Girish Redekar <<a href="mailto:girish.redekar@gmail.com">girish.redekar@gmail.com</a>>:<br>
<div><div></div><div class="Wj3C7c">> I'm trying to build a search engine in Python and am stuck at the place where I<br>
> parse HTML to get useful text. One should ideally be able to parse the text<br>
> (out of HTML tags) along with its position (for phrase searches) and<br>
> font-size (to weigh words appropriately).<br>
><br>
> However, this part gets very tedious (especially with bad html and css) and<br>
> my code is already unwieldy. It seems to me that this task should've been a<br>
> part of any Python-based, semi-sophisticated screen scraper and that it would<br>
> be a commonly solved problem. Yet, no amount of googling has returned<br>
> anything useful.<br>
><br>
> Any ideas?<br>
<br>
</div></div>I wrote this article a while back:<br>
<br>
<a href="http://www.ibm.com/developerworks/aix/library/au-threadingpython/" target="_blank">http://www.ibm.com/developerworks/aix/library/au-threadingpython/</a><br>
<br>
I didn't fully explore it, but it seems like thread pools and<br>
Beautiful Soup could work...<br>
<br>
<br>
</blockquote></div><br>