BeautifulSoup vs. Microsoft

John Nagle nagle at
Thu Mar 29 08:50:49 CEST 2007

Here's a construct with which BeautifulSoup has problems.  It's
from "".

This is the original:

<a href=""
     onclick="return MS_HandleClick(this,'C_32179', true);">
     Help us improve our products

And this is what comes back after parsing with BeautifulSoup
and using "prettify":

                     <a href="" 
                      <br clear="all" style="line-height: 1px; overflow: hidden" />
                      <table id="msviFooter" width="100%" cellpadding="0" 
                       <tr valign="bottom">

                        <td id="msviFooter2" 
endColorStr='#3F8CDA', gradientType='1')">
                         <div id="msviLocalFooter">

All that other stuff is in the neighborhood, but not in that <a> tag.

Strictly speaking, it's Microsoft's fault.


is supposed to be an HTML comment.  But it's improperly terminated.
It should end with "-->".  So the parser treats everything up to the
next "-->" as part of the comment, and all that stuff shown above is
what follows that terminator.

It's so Microsoft.

Unfortunately, even Firefox accepts bad comments like that.

Anyway, a BeautifulSoup question.  "findAll(text=True)" collects comments,
processing instructions, etc. as well as real text.  What's the right way
to collect ordinary text only?
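
To make the distinction concrete, here is a minimal sketch of "ordinary
text only".  It is not BeautifulSoup; it swaps in the standard library's
html.parser (Python 3), which reports character data, comments and
processing instructions through separate callbacks.  The names TextOnly,
ordinary_text and sample are made up for illustration.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Hypothetical helper: keep character data, drop everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Character data arrives here; comments and processing
        # instructions go to handle_comment / handle_pi instead.
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)

    # handle_comment, handle_pi and handle_decl are left as the
    # base-class no-ops, so their content is simply discarded.


def ordinary_text(html):
    """Return the non-blank text chunks found in an HTML string."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = "<p>Help us <!-- a comment --> improve<?pi data?> our products</p>"
print(ordinary_text(sample))
```

Note that handle_data also receives the contents of <script> and <style>
elements, so a real page would need those filtered out as well.  With
BeautifulSoup itself, the analogous idea would presumably be to filter the
"findAll(text=True)" results by node class, since comments and the like
are distinct subclasses of the ordinary text node class.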

					John Nagle
