[ python-Bugs-1055864 ] HTMLParser not compliant to XHTML spec

Thu Oct 28 21:41:48 CEST 2004

Bugs item #1055864, was opened at 2004-10-28 06:59
Message generated for change (Comment added) made by loewis
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1055864&group_id=5470

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Luke Bradley (neptune235)
Assigned to: Nobody/Anonymous (nobody)
Summary: HTMLParser not compliant to XHTML spec

Initial Comment:
HTMLParser has a problem: it doesn't seem to comply with
the spec for XHTML. What I am referring to can be read
about here:
http://www.w3.org/TR/xhtml1/#h-4.8
In a nutshell, HTMLParser doesn't treat data inside
'script' or 'style' elements as #PCDATA, but rather
behaves like an HTML 4 parser even for XHTML documents,
parsing only end tags. As a result, entity references
in javascript are not converted as they should be.
XHTML authors writing to spec can expect entities in
script sections of XHTML documents to be converted if
the script is not explicitly escaped as a CDATA
section. Which brings up problem two: sections
explicitly escaped as CDATA are also parsed as HTML 4
'script' and 'style' sections, so end tags are parsed.
My understanding is that this is bad as well:
http://www.w3.org/TR/2004/REC-xml-20040204/#dt-cdsection
because CDEnd is the only thing that's supposed to be
recognized in a CDATA section in any XML document?
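
For illustration only, here is a minimal sketch of the first
problem with a PASS/FAIL condition. It is not the reporter's
code. It assumes Python 3's html.parser module (the successor of
the HTMLParser module discussed here); the class name, helper
name and sample document are made up for the example. PASS means
the entity inside the script element was converted, as the
report expects for XHTML.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Hypothetical helper: collects the data reported inside <script>."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.script_chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # The parser may deliver script text in several pieces.
        if self.in_script:
            self.script_chunks.append(data)


def script_text(document):
    parser = ScriptCollector()
    parser.feed(document)
    parser.close()
    return "".join(parser.script_chunks)


# Made-up XHTML fragment: the script is NOT wrapped in a CDATA section,
# so an XML parser would hand the script "if (a < b) { go(); }".
XHTML = ('<html><body><script type="text/javascript">'
         'if (a &lt; b) { go(); }'
         '</script></body></html>')

observed = script_text(XHTML)
verdict = "PASS" if observed == "if (a < b) { go(); }" else "FAIL"
print(verdict, repr(observed))
```

On the Python 3 parser the script text comes back with the
entity reference left untouched, so the check reports FAIL,
which matches the HTML 4 style behaviour described above.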

----------------------------------------------------------------------

>Comment By: Martin v. Löwis (loewis)
Date: 2004-10-28 21:41

Message:
Logged In: YES 
user_id=21627

Can you give an example demonstrating this problem, please?
A Python script with a small embedded HTML file, and a
PASS/FAIL condition would be best.

----------------------------------------------------------------------

Comment By: Luke Bradley (neptune235)
Date: 2004-10-28 10:31

Message:
Logged In: YES 
user_id=178561

I also reported bug 1051840. I discovered this when I was
looking for a universal way to handle all the weird things
people do with their script sections on HTML/XHTML pages on
the net. I've ended up modifying HTMLParser.py so that the
HTMLParser class has an extra attribute called last_match,
which is the exact string of HTML that the current handler
event is being called for, so that putting:
sys.stdout.write(self.last_match) 
or
sys.stdout.write(self.get_last_match())
for every handler event (except handle_data, whose data can
be output directly) will reproduce the page exactly as it was
input. This allows me to handle all oddities in people's
code at the level of my handler, without changing HTMLParser
in any other way...
Here's the code, attached. Not that you care, but on the off
chance that you guys might want to think about doing
something like this....:)
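
The attached code is not reproduced in this message. As a rough
sketch of the same idea without patching the module, the
following assumes Python 3's html.parser and uses only its
documented get_starttag_text() method, which returns the exact
source text of the most recent start tag. The class name
EchoParser and the helper echo() are hypothetical. Unlike
last_match, this only approximates the input: end tags,
comments, declarations and references are rebuilt from the
handler arguments rather than copied from the source.

```python
from html.parser import HTMLParser


class EchoParser(HTMLParser):
    """Hypothetical stand-in for the last_match idea, not the attached patch."""

    def __init__(self):
        # convert_charrefs=False keeps entity and character references as
        # separate events so they can be written back unconverted.
        super().__init__(convert_charrefs=False)
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        # Exact source text of the start tag, as seen by the parser.
        self.pieces.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Avoid the default behaviour of emitting a start and an end event.
        self.pieces.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Rebuilt, not copied: case and inner whitespace are normalized.
        self.pieces.append("</%s>" % tag)

    def handle_data(self, data):
        self.pieces.append(data)

    def handle_entityref(self, name):
        self.pieces.append("&%s;" % name)

    def handle_charref(self, name):
        self.pieces.append("&#%s;" % name)

    def handle_comment(self, data):
        self.pieces.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.pieces.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.pieces.append("<?%s>" % data)


def echo(document):
    parser = EchoParser()
    parser.feed(document)
    parser.close()
    return "".join(parser.pieces)


SAMPLE = '<p class="x">a &amp; b</p><!-- c -->'
print(echo(SAMPLE) == SAMPLE)
```

A handler built this way could write each piece to sys.stdout
instead of collecting it, which is close to the usage shown
above, though only start tags are guaranteed to be verbatim.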

----------------------------------------------------------------------
