[XML-SIG] HtmlBuilder

Jeff.Johnson@icn.siemens.com Jeff.Johnson@icn.siemens.com
Thu, 4 Mar 1999 18:32:13 -0500


--0__=sXY2JC4AkSF7bAz8mYHxK6g8BPRmGdzByc5nbgIpNdDhSRqCCg1doDbB
Content-type: text/plain; charset=us-ascii
Content-Disposition: inline



Hi all,

I convert RTF to HTML with a program for which I do not have the source
code.  I then use xml.dom to reformat the output, add navigation bars, fix
links, etc.  Very rarely, the RTF to HTML converter will throw a </A> into
a document without a preceding <A>.  This causes HtmlBuilder to start
popping elements off its stack while looking for the starting <A>,
including <BODY> and <HTML>.  When it runs out of stack it happily
continues and the DOM is created.  Unfortunately, each following element
that should have been a child of <BODY> then becomes a sibling of the
<HTML> document element.  This produces an invalid DOM document, but no
exception is raised at that point.  Eventually I do something with the DOM
that calls the method Node.get_documentElement, which raises a
HierarchyRequestException because there is more than one root element.
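
To make the failure concrete, here is a minimal stand-alone sketch of the
unguarded pop loop.  It is a toy, not the real HtmlBuilder: the names
ToyBuilder, start, end and roots are invented for illustration, and only
the stack-popping behaviour follows the description above.

```python
class ToyBuilder:
    """Toy stand-in for HtmlBuilder: tracks open tags and top-level nodes."""

    def __init__(self):
        self.stack = []   # currently open elements, innermost last
        self.roots = []   # elements created while no element was open

    def start(self, tag):
        tag = tag.upper()
        if not self.stack:
            # No open parent, so the element lands at document level.
            self.roots.append(tag)
        self.stack.append(tag)

    def end(self, tag):
        tag = tag.upper()
        # Unguarded loop: pops until it finds `tag` or the stack is empty.
        while self.stack:
            if self.stack.pop() == tag:
                break


b = ToyBuilder()
for t in ("html", "body", "p"):
    b.start(t)
b.end("a")        # stray </A> with no matching <A> empties the stack
b.start("p")      # should have been a child of <BODY>
print(b.stack)    # ['P']
print(b.roots)    # ['HTML', 'P']
```

The second entry in roots is the element that ends up as a sibling of
<HTML>, which is the condition that later trips get_documentElement.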

Since I don't have the source code for the RTF to HTML converter, I can't
fix it.  I did, however, add three lines of code to HtmlBuilder that allow
these bogus end tags to be ignored.  I hope that the new code can be added
to the CVS tree.

If this were an XML document I would rather raise an exception and reject
the document, since XML must be well formed.  Since it is an HTML
document, I am more inclined to fix what can logically be fixed, because
there are already so many invalid HTML files in the world.  I often run
into this same problem when processing handmade HTML files.  This
modification might allow the XML-SIG DOM implementation to be used to clean
up the existing HTML mess.

I added the following three lines:

                        if tag not in self.stack:
                                #print "ignoring end tag with no start", tag
                                break

to the following method of HtmlBuilder:

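        def unknown_endtag(self, tag):
                tag = string.upper(tag)
                #print 'ending', tag

                while self.stack:
                        # New: an end tag with no matching open element
                        # is ignored instead of emptying the stack.
                        if tag not in self.stack:
                                #print "ignoring end tag with no start", tag
                                break
                        if tag in self.empties:
                                continue
                        # Close open elements, innermost first, until the
                        # matching start tag has been closed.
                        start_tag = self.stack[-1]
                        del self.stack[-1]
                        Builder.endElement(self, start_tag)
                        if start_tag == tag:
                                break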
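
For anyone who wants to try the guard without xml.dom installed, here is a
small stand-alone sketch of the same idea.  The function name close_tag
and its arguments are hypothetical and not part of HtmlBuilder.  It also
rearranges the logic: the membership test is done once before the loop
rather than on every pass, and end tags for empty elements are simply
ignored, which is my assumption about the intent of the empties check.

```python
def close_tag(stack, tag, empties=()):
    """Close open elements up to `tag`; return the tags closed, innermost first.

    `stack` is mutated in place.  End tags that name an empty element, or
    that have no matching open element, are ignored and leave `stack` as is.
    """
    tag = tag.upper()
    if tag in empties or tag not in stack:
        return []
    closed = []
    while stack:
        top = stack.pop()
        closed.append(top)
        if top == tag:
            break
    return closed


stack = ["HTML", "BODY", "P"]
print(close_tag(stack, "a"), stack)     # [] ['HTML', 'BODY', 'P']
print(close_tag(stack, "body"), stack)  # ['P', 'BODY'] ['HTML']
```

Testing once up front gives the same result as testing on every pass,
because the tag stays on the stack until the pass that pops it, and that
pass ends the loop.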

The entire file is attached (html_builder.py).

Cheers,
Jeff

--0__=sXY2JC4AkSF7bAz8mYHxK6g8BPRmGdzByc5nbgIpNdDhSRqCCg1doDbB
Content-type: application/octet-stream; 
	name="html_builder.py"
Content-Disposition: attachment; filename="html_builder.py"
Content-transfer-encoding: base64

JycnSFRNTCBwYXJzZXIsIGJ1aWx0IGZyb20gc3RhbmRhcmQgbGliJ3Mgc2dtbGxpYi4NCg0KVGFn
IG5hbWVzIGFyZSBub3JtYWxpc2VkIHRvIHVwcGVyIGNhc2UsIHRoZSB1c3VhbCBIVE1MIGZhc2hp
b24uDQonJycNCg0KZnJvbSBzZ21sbGliIGltcG9ydCBTR01MUGFyc2VyDQpmcm9tIHhtbC5kb20g
aW1wb3J0IGNvcmUNCmZyb20geG1sLmRvbS5idWlsZGVyIGltcG9ydCBCdWlsZGVyDQppbXBvcnQg
c3RyaW5nDQoNCmNsYXNzIEh0bWxCdWlsZGVyKFNHTUxQYXJzZXIsIEJ1aWxkZXIpOg0KICAgICAg
ICBmcm9tIGh0bWxlbnRpdHlkZWZzIGltcG9ydCBlbnRpdHlkZWZzDQogICAgICAgIA0KICAgICAg
ICBkZWYgX19pbml0X18oc2VsZik6DQogICAgICAgICAgICAgICAgU0dNTFBhcnNlci5fX2luaXRf
XyhzZWxmKQ0KICAgICAgICAgICAgICAgIEJ1aWxkZXIuX19pbml0X18oc2VsZikNCg0KICAgICAg
ICAgICAgICAgIHNlbGYuZW1wdGllcyA9IFsNCiAgICAgICAgICAgICAgICAgICAgICAgICdNRVRB
JywgJ0JBU0UnLCAnTElOSycsIA0KICAgICAgICAgICAgICAgICAgICAgICAgJ0hSJywgJ0JSJywN
CiAgICAgICAgICAgICAgICAgICAgICAgICdJTUcnLCAnUEFSQU0nLA0KICAgICAgICAgICAgICAg
ICAgICAgICAgJ0lOUFVUJywgJ09QVElPTicsICdJU0lOREVYJw0KICAgICAgICAgICAgICAgIF0N
CiAgICAgICAgICAgICAgICBsaXN0ID0gKCdPTCcsICdVTCcsICdETCcpDQogICAgICAgICAgICAg
ICAgaGVhZGluZyA9ICgnSDEnLCAnSDInLCAnSDMnLCAnSDQnLCAnSDUnLCAnSDYnKQ0KICAgICAg
ICAgICAgICAgIGJsb2NrcyA9ICgnUCcsICdBRERSRVNTJywgJ0JMT0NLUVVPVEUnLCAnRk9STScs
ICdUQUJMRScsICdQUkUnKSArIFwNCiAgICAgICAgICAgICAgICAgICAgICAgIGhlYWRpbmcgIyAr
IGxpc3QNCiAgICAgICAgICAgICAgICBzZWxmLmluZmVyX2VuZHMgPSB7DQogICAgICAgICAgICAg
ICAgICAgICAgICAnUCc6IGJsb2NrcywNCg0KICAgICAgICAgICAgICAgICAgICAgICAgJ0xJJzog
KCdMSScsKSwNCiAgICAgICAgICAgICAgICAgICAgICAgICdEVCc6ICgnRFQnLCksDQogICAgICAg
ICAgICAgICAgICAgICAgICAnREQnOiAoJ0RUJywgJ0REJyksDQoNCiAgICAgICAgICAgICAgICAg
ICAgICAgICdUUic6ICgnVFInLCksIA0KICAgICAgICAgICAgICAgICAgICAgICAgJ1RIJzogKCdU
SCcsICdURCcsICdUUicpLA0KICAgICAgICAgICAgICAgICAgICAgICAgJ1REJzogKCdUSCcsICdU
RCcsICdUUicpLA0KICAgICAgICAgICAgICAgIH0NCg0KICAgICAgICANCiAgICAgICAgZGVmIHVu
a25vd25fc3RhcnR0YWcoc2VsZiwgdGFnLCBhdHRycyk6DQogICAgICAgICAgICAgICAgdGFnID0g
c3RyaW5nLnVwcGVyKHRhZykNCiAgICAgICAgICAgICAgICAjcHJpbnQgJ3N0YXJ0aW5nJywgdGFn
DQogICAgICAgICAgICAgICAgYXR0cmlidXRlcyA9IHt9DQogICAgICAgICAgICAgICAgZm9yIGss
IHYgaW4gYXR0cnM6DQogICAgICAgICAgICAgICAgICAgICAgICBhdHRyaWJ1dGVzW3N0cmluZy51
cHBlcihrKV0gPSB2DQoNCiAgICAgICAgICAgICAgICAjcHJpbnQgc2VsZi5zdGFjaw0KICAgICAg
ICAgICAgICAgIHdoaWxlIHNlbGYuc3RhY2s6DQogICAgICAgICAgICAgICAgICAgICAgICBpZiBz
ZWxmLmluZmVyX2VuZHMuaGFzX2tleShzZWxmLnN0YWNrWy0xXSk6IA0KICAgICAgICAgICAgICAg
ICAgICAgICAgICAgICAgICBpZiB0YWcgaW4gc2VsZi5pbmZlcl9lbmRzW3NlbGYuc3RhY2tbLTFd
XToNCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAjcHJpbnQgdGFnLCAn
ZW5kaW5nJywgc2VsZi5zdGFja1stMV0NCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAg
ICAgICAgICBCdWlsZGVyLmVuZEVsZW1lbnQoc2VsZiwgc2VsZi5zdGFja1stMV0pDQogICAgICAg
ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGVsIHNlbGYuc3RhY2tbLTFdDQogICAg
ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgI3ByaW50IHNlbGYuc3RhY2sNCiAg
ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZWxzZToNCiAgICAgICAgICAgICAgICAgICAg
ICAgICAgICAgICAgICAgICAgICBicmVhaw0KICAgICAgICAgICAgICAgICAgICAgICAgZWxzZToN
CiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgYnJlYWsNCiMgICAgICAgICAgICAgICBw
cmludCBzZWxmLnN0YWNrLCB0YWcsIGF0dHJpYnV0ZXMNCiAgICAgICAgICAgICAgICANCiAgICAg
ICAgICAgICAgICBCdWlsZGVyLnN0YXJ0RWxlbWVudChzZWxmLCB0YWcsIGF0dHJpYnV0ZXMpDQog
ICAgICAgICAgICAgICAgaWYgbm90IHRhZyBpbiBzZWxmLmVtcHRpZXM6DQogICAgICAgICAgICAg
ICAgICAgICAgICBzZWxmLnN0YWNrLmFwcGVuZCh0YWcpDQogICAgICAgICAgICAgICAgZWxzZToN
CiAgICAgICAgICAgICAgICAgICAgICAgIEJ1aWxkZXIuZW5kRWxlbWVudChzZWxmLCB0YWcpDQoN
Cg0KICAgICAgICBkZWYgdW5rbm93bl9lbmR0YWcoc2VsZiwgdGFnKToNCiAgICAgICAgICAgICAg
ICB0YWcgPSBzdHJpbmcudXBwZXIodGFnKQ0KICAgICAgICAgICAgICAgICNwcmludCAnZW5kaW5n
JywgdGFnDQoNCiAgICAgICAgICAgICAgICB3aGlsZSBzZWxmLnN0YWNrOg0KICAgICAgICAgICAg
ICAgICAgICAgICAgaWYgdGFnIG5vdCBpbiBzZWxmLnN0YWNrOg0KICAgICAgICAgICAgICAgICAg
ICAgICAgICAgICAgICAjcHJpbnQgImlnbm9yaW5nIGVuZCB0YWcgd2l0aCBubyBzdGFydCIsIHRh
Zw0KICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBicmVhaw0KICAgICAgICAgICAgICAg
ICAgICAgICAgaWYgdGFnIGluIHNlbGYuZW1wdGllczoNCiAgICAgICAgICAgICAgICAgICAgICAg
ICAgICAgICAgY29udGludWUNCiAgICAgICAgICAgICAgICAgICAgICAgIHN0YXJ0X3RhZyA9IHNl
bGYuc3RhY2tbLTFdDQogICAgICAgICAgICAgICAgICAgICAgICBkZWwgc2VsZi5zdGFja1stMV0N
CiAgICAgICAgICAgICAgICAgICAgICAgIEJ1aWxkZXIuZW5kRWxlbWVudChzZWxmLCBzdGFydF90
YWcpDQogICAgICAgICAgICAgICAgICAgICAgICBpZiBzdGFydF90YWcgPT0gdGFnOg0KICAgICAg
ICAgICAgICAgICAgICAgICAgICAgICAgICBicmVhaw0KDQogICAgICAgIGRlZiBoYW5kbGVfZGF0
YShzZWxmLCBzKToNCiAgICAgICAgICAgICAgICAjcHJpbnQgYHNgDQogICAgICAgICAgICAgICAg
QnVpbGRlci50ZXh0KHNlbGYsIHMpDQoNCiAgICAgICAgZGVmIGhhbmRsZV9jb21tZW50KHNlbGYs
IHMpOg0KICAgICAgICAgICAgICAgIEJ1aWxkZXIuY29tbWVudChzZWxmLCBzKQ0KDQoNCiMgVGVz
dC4NCmlmIF9fbmFtZV9fID09ICdfX21haW5fXyc6DQogICAgICAgIGltcG9ydCBzeXMNCiAgICAg
ICAgYiA9IEh0bWxCdWlsZGVyKCkNCiAgICAgICAgYi5mZWVkKG9wZW4oc3lzLmFyZ3ZbMV0pLnJl
YWQoKSkNCiAgICAgICAgYi5jbG9zZSgpDQojICAgICAgIHByaW50IGIuZG9jdW1lbnQNCiMgICAg
ICAgcHJpbnQgYi5kb2N1bWVudC5kb2N1bWVudEVsZW1lbnQNCg0KICAgICAgICBmcm9tIHdyaXRl
ciBpbXBvcnQgSHRtbExpbmVhcmlzZXINCiAgICAgICAgdyA9IEh0bWxMaW5lYXJpc2VyKCkNCiAg
ICAgICAgcHJpbnQgdy5saW5lYXJpc2UoYi5kb2N1bWVudCkNCg0K

--0__=sXY2JC4AkSF7bAz8mYHxK6g8BPRmGdzByc5nbgIpNdDhSRqCCg1doDbB--