[XML-SIG] unicode entity refs

A.M. Kuchling akuchlin@cnri.reston.va.us
Wed, 5 May 1999 21:09:03 -0400

Jeff.Johnson@icn.siemens.com writes:
 >Sorry to be a pest but I never got a response on the following email and was
 >hoping someone had an answer as to why unicode entity refs disappear in PyDom.

I finally got around to looking at this tonight while cleaning out my
mailbox.  The HTML parser is actually choking on the character
reference, but the default error handler is, surprise surprise, not
doing anything, so the reference is silently dropped.  The fix is to
override the error handler, as in the patch below.

	However, this doesn't fix your problem, since the new error
handler raises a BadHTMLError exception.  I'd argue for this behaviour,
since the HTML character set is ISO 8859-1 (Latin-1), not Unicode, and
therefore this is illegal HTML; if it's got character references >255,
it's not HTML but XML that looks like HTML.  (Hmm... I may have
written too soon; what's the status of HTML i18n?  Can you declare a
Unicode encoding for an HTML document?)
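
A rough sketch of the behaviour argued for above, for anyone who wants
to experiment with it.  It is written against Python 3's html.parser
rather than the parser class html_builder.py uses, so it is an
illustration and not the patch itself.  The names StrictLatin1Parser
and parse_text are made up for this example, and BadHTMLError here is
a local stand-in for the exception in the patch.

```python
from html.parser import HTMLParser


class BadHTMLError(Exception):
    """Local stand-in for the exception raised in the patch (assumption)."""


class StrictLatin1Parser(HTMLParser):
    """Hypothetical parser that rejects character references above 255."""

    def __init__(self):
        # convert_charrefs=False makes the parser call handle_charref()
        # instead of silently converting references to text.
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_charref(self, name):
        # name is the text between "&#" and ";", e.g. "233" or "xE9".
        if name[:1] in ("x", "X"):
            code = int(name[1:], 16)
        else:
            code = int(name)
        if code > 255:
            # Outside Latin-1: complain loudly instead of dropping it.
            raise BadHTMLError("Unknown character reference: &#" + name + ";")
        self.chunks.append(chr(code))


def parse_text(html):
    """Return the text content of html, or raise BadHTMLError."""
    parser = StrictLatin1Parser()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)
```

With this, parse_text("caf&#233;") returns the accented word, while
parse_text("a &#8212; b") raises BadHTMLError instead of quietly
losing the dash.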

	On a side note, the Unicode discussion seems to be converging
on /F's Unicode type.  That would seem to be a good argument for
dropping MvL's Unicode type, which is currently in the XML tree, and
replacing it with /F's code.  Opinions?

A.M. Kuchling			http://starship.python.net/crew/amk/
Surely where there's smoke there's fire? No, where there's so much smoke
there's smoke.
    -- John A. Wheeler

Index: html_builder.py
RCS file: /home/cvsroot/xml/dom/html_builder.py,v
retrieving revision 1.9
diff -C2 -r1.9 html_builder.py
*** html_builder.py	1999/03/09 00:57:11	1.9
--- html_builder.py	1999/05/06 01:03:33
*** 100,103 ****
--- 100,106 ----
+     def unknown_charref(self, ref):
+ 	raise BadHTMLError, ('Unknown character reference: &#' + ref + ';')
      def handle_data(self, s):
  	#print `s`