BeautifulSoup vs. Microsoft

John Nagle nagle at animats.com
Thu Mar 29 08:50:49 CEST 2007


Here's a construct with which BeautifulSoup has problems.  It's
from "http://support.microsoft.com/contactussupport/?ws=support".

This is the original:


<a href="http://www.microsoft.com/usability/enroll.mspx"
     id="L_75998"
     title="<!--http://www.microsoft.com/usability/information.mspx->"
     onclick="return MS_HandleClick(this,'C_32179', true);">
     Help us improve our products
</a>


And this is what comes back after parsing with BeautifulSoup
and using "prettify":


                     <a href="http://www.microsoft.com/usability/enroll.mspx" 
id="L_75998" 
title="&lt;!--http://www.microsoft.com/usability/information.mspx-&gt;">
                      <br clear="all" style="line-height: 1px; overflow: hidden" />
                      <table id="msviFooter" width="100%" cellpadding="0" 
cellspacing="0">
                       <tr valign="bottom">

                        <td id="msviFooter2" 
style="filter:progid:DXImageTransform.Microsoft.Gradient(startColorStr='#FFFFFF', 
endColorStr='#3F8CDA', gradientType='1')">
                         <div id="msviLocalFooter">
                          <nobr>
                          </nobr>
                         </div>
                        </td>
                       </tr>
                      </table>
                     </a>

All that other stuff is in the neighborhood, but not in that <a> tag.

Strictly speaking, it's Microsoft's fault.

     title="<!--http://www.microsoft.com/usability/information.mspx->"

is supposed to be an HTML comment, but it's improperly terminated:
a comment must end with "-->", not "->".  So the parser keeps scanning
for the next "-->" somewhere later in the page, and everything up to
that point gets swallowed into the <a> tag.

It's so Microsoft.

Unfortunately, even Firefox accepts bad comments like that.
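One defensive workaround (my own sketch, not anything BeautifulSoup
provides) is to repair comment-like text before parsing, rewriting a
bad "->" terminator into "-->".  A rough regex heuristic:

```python
import re

def fix_bad_comments(html):
    """Rewrite comments terminated with "->" so they end with "-->".
    Heuristic only: a legitimate "->" inside a comment body would
    also get rewritten, so apply with care."""
    return re.sub(r'<!--(.*?)(?<!-)->', r'<!--\1-->', html, flags=re.S)

broken = 'title="<!--http://www.microsoft.com/usability/information.mspx->"'
print(fix_bad_comments(broken))
# The "->" at the end of the attribute value becomes "-->";
# properly terminated comments are left alone.
```

The negative lookbehind keeps a correct "-->" from being mangled into
"--->".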

Anyway, a BeautifulSoup question.  "findAll(text=True)" collects comments,
processing instructions, etc. as well as real text.  What's the right way
to collect ordinary text only?
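The usual move (an assumption on my part, not something stated in this
post) is to filter out BeautifulSoup's Comment nodes, which are a
subclass of NavigableString, from the findAll(text=True) result.  The
underlying distinction can be sketched with the standard library's
html.parser (Python 3 here), which delivers character data and comments
through separate callbacks:

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect ordinary character data only.  Comments, processing
    instructions, and declarations arrive via their own callbacks
    (handle_comment, handle_pi, handle_decl), which we leave as the
    default no-ops, so that content never lands in self.chunks."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

p = TextOnly()
p.feed('<a title="t"><!-- a comment -->Help us improve our products</a>')
print("".join(p.chunks).strip())
# -> Help us improve our products
```

In BeautifulSoup terms the equivalent would be keeping only the nodes
that are not isinstance(node, Comment).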

					John Nagle






More information about the Python-list mailing list