[XML-SIG] unicode entity refs

Jeff.Johnson@icn.siemens.com Jeff.Johnson@icn.siemens.com
Thu, 6 May 1999 11:38:53 -0400


A.M. Kuchling writes:
>I finally got around to looking at this tonight while cleaning out my
>mailbox.  The HTML parser is actually choking on the character
>reference, but the error handler is, surprise surprise, not doing
>anything.  The fix is to add an error handler, as in the patch below.

Good timing. I had put this on the back burner until yesterday, when I made a
few subclasses to preserve the charrefs. I have no idea if this is the best way,
but here it is anyway :)  (gLog is a global error-logging object.)

from xml.dom.builder import Builder
from xml.dom.html_builder import HtmlBuilder
from xml.dom.utils import FileReader

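class MyHtmlBuilder(HtmlBuilder):
    def unknown_charref(self, ref):
        # Log the reference, then hand it to the builder as an entity
        # reference so that it is preserved rather than dropped.
        gLog.Warning('unknown_charref %s' % ref)
        Builder.entityref(self, '#' + ref)
    def unknown_entityref(self, ref):
        gLog.Error('unknown_entityref %s' % ref)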

class MyFileReader(FileReader):
    def readHtml(self,stream):
        b = MyHtmlBuilder()
        b.feed(stream.read())
        b.close()
        return b.document
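
For anyone without xml.dom.builder and xml.dom.html_builder at hand, the same
idea (override the character-reference handler so the reference survives) can
be sketched with the standard library's html.parser. This is only an
illustration under that assumption, not the PyDOM code: CharrefPreservingParser,
its parts list and preserve_charrefs are hypothetical names, and the sketch
collects text rather than building a DOM.

```python
from html.parser import HTMLParser


class CharrefPreservingParser(HTMLParser):
    """Collect document text, keeping character references verbatim.

    Hypothetical stand-in for MyHtmlBuilder; it gathers text instead of
    building a DOM tree.
    """

    def __init__(self):
        # convert_charrefs=False makes the parser call handle_charref()
        # and handle_entityref() instead of decoding references itself.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_charref(self, name):
        # name is the text between '&#' and ';', e.g. '8364' or 'x20AC'.
        self.parts.append('&#' + name + ';')

    def handle_entityref(self, name):
        # Named references such as &amp; are kept verbatim as well.
        self.parts.append('&' + name + ';')


def preserve_charrefs(html_text):
    """Return the text content of html_text with references left intact."""
    parser = CharrefPreservingParser()
    parser.feed(html_text)
    parser.close()
    return ''.join(parser.parts)
```

Calling preserve_charrefs('<p>Price: &#8364;5</p>') keeps the '&#8364;'
reference in the returned text instead of losing it.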





"A.M. Kuchling" <amk1@erols.com> on 05/05/99 09:09:03 PM

Please respond to akuchlin@cnri.reston.va.us

To:   Jeff Johnson/Service/ICN
cc:   xml-sig@python.org
Subject:  [XML-SIG] unicode entity refs




Jeff.Johnson@icn.siemens.com writes:
 >Sorry to be a pest but I never got a response on the following email and was
 >hoping someone had an answer as to why unicode entity refs disappear in PyDom.

I finally got around to looking at this tonight while cleaning out my
mailbox.  The HTML parser is actually choking on the character
reference, but the error handler is, surprise surprise, not doing
anything.  The fix is to add an error handler, as in the patch below.

     However, this doesn't fix your problem, since the error
handler raises a BadHTML exception.  I'd argue for this behaviour,
since the HTML character set is ISO-whatever, not Unicode, and
therefore this is illegal HTML; if it's got character references >255,
it's not HTML but XML that looks like HTML.  (Hmm... I may have
written too soon; what's the status of HTML i18n?  Can you declare a
Unicode encoding for an HTML document?)
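
     The strict behaviour argued for above can be sketched with the
standard library's html.parser. This is an illustration, not the
html_builder.py code: StrictLatin1Parser is a hypothetical name, the
BadHTMLError class here is a local stand-in for the one in the patch,
and reading "ISO-whatever" as ISO-8859-1 (code points 0-255) is an
assumption.

```python
from html.parser import HTMLParser


class BadHTMLError(Exception):
    """Local stand-in for the BadHTMLError raised in the patch."""


class StrictLatin1Parser(HTMLParser):
    """Decode character references up to 255; reject anything larger."""

    def __init__(self):
        # Ask the parser to report references through handle_charref().
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_charref(self, name):
        # name is '233' for &#233; or 'xE9' for &#xE9;
        if name[:1] in ('x', 'X'):
            code = int(name[1:], 16)
        else:
            code = int(name)
        if code > 255:
            # Assumed limit: the last code point of ISO-8859-1.
            raise BadHTMLError(
                'Unknown character reference: &#' + name + ';')
        self.parts.append(chr(code))


def parse_strict(html_text):
    """Return the decoded text, or raise BadHTMLError on references > 255."""
    parser = StrictLatin1Parser()
    parser.feed(html_text)
    parser.close()
    return ''.join(parser.parts)
```

With this sketch, parse_strict('&#233;t&#233;') returns the decoded text,
while parse_strict('&#8364;') raises BadHTMLError.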

     On a side note, the Unicode issue seems to be heading for
using /F's Unicode type.  This would seem to be a good argument to
drop MvL's Unicode type, which is currently in the XML tree, and
replace it with /F's code.  Opinions?

--
A.M. Kuchling            http://starship.python.net/crew/amk/
Surely where there's smoke there's fire? No, where there's so much smoke
there's smoke.
    -- John A. Wheeler


Index: html_builder.py
===================================================================
RCS file: /home/cvsroot/xml/dom/html_builder.py,v
retrieving revision 1.9
diff -C2 -r1.9 html_builder.py
*** html_builder.py 1999/03/09 00:57:11  1.9
--- html_builder.py 1999/05/06 01:03:33
***************
*** 100,103 ****
--- 100,106 ----
          break

+     def unknown_charref(self, ref):
+         raise BadHTMLError, ('Unknown character reference: &#' + ref + ';')
+
      def handle_data(self, s):
          #print `s`


