BeautifulSoup vs. Microsoft
Paul McGuire
ptmcg at austin.rr.com
Thu Mar 29 10:06:17 EDT 2007
On Mar 29, 1:50 am, John Nagle <n... at animats.com> wrote:
> Here's a construct with which BeautifulSoup has problems. It's
> from "http://support.microsoft.com/contactussupport/?ws=support".
>
> This is the original:
>
> <a href="http://www.microsoft.com/usability/enroll.mspx"
> id="L_75998"
> title="<!--http://www.microsoft.com/usability/information.mspx->"
> onclick="return MS_HandleClick(this,'C_32179', true);">
> Help us improve our products
> </a>
>
<snip>
>
> Strictly speaking, it's Microsoft's fault.
>
> title="<!--http://www.microsoft.com/usability/information.mspx->"
>
> is supposed to be an HTML comment. But it's improperly terminated.
> It should end with "-->". So all that following stuff is from what
> follows the next "-->" which terminates a comment.
>
No, that comment-like text is inside a quoted attribute value, so a
parser should treat it as plain text, not as a comment.
If you are just trying to extract <a href=...> tags, this pyparsing
scraper gets them, including this problematic one:
import urllib
from pyparsing import makeHTMLTags
pg = urllib.urlopen("http://support.microsoft.com/contactussupport/?ws=support")
htmlSrc = pg.read()
pg.close()
# only take first tag returned from makeHTMLTags, not interested in
# closing </a> tags
anchorTag = makeHTMLTags("A")[0]
for a in anchorTag.searchString(htmlSrc):
    if "title" in a:
        print "Title:", a.title
        print "HREF:", a.href
        # or use this statement to dump the complete tag contents
        # print a.dump()
        print
Prints:
Title: <!--http://www.microsoft.com/usability/information.mspx->
HREF: http://www.microsoft.com/usability/enroll.mspx
Title: Print this page
HREF: /gp/noscript/
Title: Print this page
HREF: /gp/noscript/
Title: E-mail this page
HREF: mailto:?subject=Help%20and%20Support&body=http%3a%2f%2fsupport.microsoft.com%2fdefault.aspx%2fcontactussupport%2f%3fws%3dsupport
Title: E-mail this page
HREF: mailto:?subject=Help%20and%20Support&body=http%3a%2f%2fsupport.microsoft.com%2fdefault.aspx%2fcontactussupport%2f%3fws%3dsupport
Title: Microsoft Worldwide
HREF: /common/international.aspx?rdPath=0
Title: Microsoft Worldwide
HREF: /common/international.aspx?rdPath=0
Title: Save to My Support Favorites
HREF: /gp/noscript/
Title: Save to My Support Favorites
HREF: /gp/noscript/
Title: Go to My Support Favorites
HREF: /gp/noscript/
Title: Go to My Support Favorites
HREF: /gp/noscript/
Title: Send Feedback
HREF: /gp/noscript/
Title: Send Feedback
HREF: /gp/noscript/
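The point that the comment-like text sits inside a quoted attribute value can also be checked offline, without pyparsing or a network fetch. The following is only a sketch, written for Python 3 and using the standard library's html.parser rather than the pyparsing approach shown above; SAMPLE is the anchor quoted in the original message, and AnchorCollector and collect are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser

# The anchor quoted in the original message, copied verbatim.
SAMPLE = '''<a href="http://www.microsoft.com/usability/enroll.mspx"
id="L_75998"
title="<!--http://www.microsoft.com/usability/information.mspx->"
onclick="return MS_HandleClick(this,'C_32179', true);">
Help us improve our products
</a>'''


class AnchorCollector(HTMLParser):
    """Hypothetical helper: records <a> attributes and any real comments."""

    def __init__(self):
        super().__init__()
        self.anchors = []   # one dict of attributes per <a> start tag
        self.comments = []  # text of every HTML comment the parser reports

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchors.append(dict(attrs))

    def handle_comment(self, data):
        self.comments.append(data)


def collect(html):
    """Parse html and return (anchor attribute dicts, comment texts)."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.anchors, parser.comments


anchors, comments = collect(SAMPLE)
for a in anchors:
    print("Title:", a.get("title"))
    print("HREF:", a.get("href"))
print("Comments seen:", len(comments))
```

If the parser honors the quoting, the title comes back as an ordinary attribute string and no comment is reported, which is the behavior the pyparsing scraper shows for the same tag.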
-- Paul