[ python-Bugs-761452 ] HTMLParser chokes on my.yahoo.com output
SourceForge.net
noreply at sourceforge.net
Fri Sep 2 02:04:54 CEST 2005
Bugs item #761452, was opened at 2003-06-26 21:11
Message generated for change (Comment added) made by tmick
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=761452&group_id=5470
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.2.3
Status: Closed
Resolution: Accepted
Priority: 5
Submitted By: Robert Walsh (rjwalsh)
Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: HTMLParser chokes on my.yahoo.com output
Initial Comment:
The HTML parser chokes on the output produced by
http://my.yahoo.com/. The problem appears to be that
the HTML Yahoo is producing contains stuff like this:
<option foo bar=>
The bar= without any value causes HTMLParser to get
confused. I made the following patch to HTMLParser.py
and everything is now happy. This may be illegal HTML,
but it appears to be popular. Basically, this patch
tells it that the part after the = is optional.
--- HTMLParser.py.orig 2003-06-26 14:05:07.670049324 -0700
+++ HTMLParser.py 2003-06-26 14:05:14.440298260 -0700
@@ -36,7 +36,7 @@
(?:'[^']*' # LITA-enclosed value
|\"[^\"]*\" # LIT-enclosed value
|[^'\">\s]+ # bare value
- )
+ )?
)?
)
)*
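
To see what the one-character change does, here is a minimal sketch in Python. Only the three value alternatives (LITA-enclosed, LIT-enclosed, bare) are taken from the patch above. The tag-name, attribute-name, and trailing-whitespace parts of the pattern, and the helper names build_start_tag_pattern and unmatched_tail, are assumptions made for illustration and are not claimed to be the exact text of HTMLParser.py.

```python
import re


def build_start_tag_pattern(value_optional):
    """Build a start-tag regex modeled on the patched fragment.

    value_optional=False mimics the unpatched form (a value must follow '=').
    value_optional=True mimics the patched form (the value may be missing).
    Everything outside the three value alternatives is an assumption.
    """
    suffix = "?" if value_optional else ""
    return re.compile(r"""
      <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name (assumed)
      (?:\s+                             # whitespace before attribute name
        (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name (assumed)
          (?:\s*=\s*                     # value indicator
            (?:'[^']*'                   # LITA-enclosed value
              |"[^"]*"                   # LIT-enclosed value
              |[^'">\s]+                 # bare value
            )%s
          )?
        )
      )*
      \s*                                # trailing whitespace (assumed)
    """ % suffix, re.VERBOSE)


def unmatched_tail(html, pattern):
    """Return whatever is left of `html` after the pattern's match."""
    match = pattern.match(html)
    return html[match.end():]


strict = build_start_tag_pattern(value_optional=False)
lenient = build_start_tag_pattern(value_optional=True)

tag = "<option foo bar=>"
strict_tail = unmatched_tail(tag, strict)    # stops before the dangling '='
lenient_tail = unmatched_tail(tag, lenient)  # consumes 'bar=' with no value
print(repr(strict_tail), repr(lenient_tail))
```

With the unpatched form, the match stops in front of the dangling "=", so a parser expecting ">" at that point sees junk. With the patched form, the "=" is consumed and only ">" remains. Well-formed tags match the same way under both forms.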
----------------------------------------------------------------------
>Comment By: Trent Mick (tmick)
Date: 2005-09-02 00:04
Message:
Logged In: YES
user_id=34892
...and subsequently backed out in r1.15.2.2 and r1.17.
Reverting previous checkin. This breaks too much of
HTMLParser to be applied without thought. Anyway, such
malformed HTML is better handled by something
like BeautifulSoup.
Apologies, Reinhold, if you were getting to this. I just
happened to notice this while reading python-checkins. Cheers.
----------------------------------------------------------------------
Comment By: Reinhold Birkenfeld (birkenfeld)
Date: 2005-08-31 22:09
Message:
Logged In: YES
user_id=1188172
Checked in as Lib/HTMLParser.py r1.16, 1.15.2.1.
----------------------------------------------------------------------
Comment By: Robert Walsh (rjwalsh)
Date: 2005-06-01 20:52
Message:
Logged In: YES
user_id=608672
Crap. Stupid SourceForge bug tracker puts the latest stuff
on top - I was replying to the wrong one. The change can be
applied, in my opinion.
----------------------------------------------------------------------
Comment By: Robert Walsh (rjwalsh)
Date: 2005-06-01 20:51
Message:
Logged In: YES
user_id=608672
It's been so long since I looked at this, I don't believe I
even have the code any more. It's just a one-character
change, though. Can you recreate it yourself by just adding
the ? character to the end of line 39 in HTMLParser.py?
That is, unless it's moved in the meantime, of course.
----------------------------------------------------------------------
Comment By: Reinhold Birkenfeld (birkenfeld)
Date: 2005-06-01 12:24
Message:
Logged In: YES
user_id=1188172
Should it be applied, then?
----------------------------------------------------------------------
Comment By: Guido van Rossum (gvanrossum)
Date: 2003-06-30 15:49
Message:
Logged In: YES
user_id=6380
Here it is (a one-char change). Looks harmless to me.
----------------------------------------------------------------------
Comment By: Neal Norwitz (nnorwitz)
Date: 2003-06-27 02:55
Message:
Logged In: YES
user_id=33168
It's difficult to read the patch as posted since whitespace
is lost. Please attach the patch as a file. Thanks.
----------------------------------------------------------------------