[XML-SIG] Handling of character entity references
Mike Brown
mike@skew.org
Tue, 27 May 2003 03:40:00 -0600
Tamito KAJIYAMA wrote:
>Mike Brown <mike@skew.org> writes:
>|
>| You said you're using SAX to produce HTML from XML, so I assume the XML
>| parser is calling the event handler methods in your ContentHandler. When
>| ContentHandler.characters() is called by the parser to notify your
>| application about character data, a Unicode string is passed as the
>| content argument (as long as expat is your underlying parser). This is
>| probably not how it worked when your application was originally written,
>| prior to the omnipresence of Unicode in Python.
>|
>| Whatever mechanism you are using to produce HTML (I'm not going to guess
>| how you're doing that) will be running the Unicode string through an
>| encoder, perhaps just using the built-in encode() method on the Unicode
>| string object, to produce EUC-JP or ISO-2022-JP byte strings for output.
>|
>| Of course this isn't automatic, but my point is that (hopefully) your
>| HTML-producing SAX application will be written (by you) such that it
>| does do the encoding (at the last step before output, preferably), and
>| will be smart enough (because you wrote it that way) to write character
>| references when the codec doesn't handle a particular Unicode character.
>
>Hmm, I've still missed the point. Do you mean that there is a
>codec with an error handling scheme that translates undefined
>characters into appropriate character references?
>
No.
> The SAX
>application of mine does not have (of course) such a mechanism
>that would "write character references when the codec doesn't
>handle a particular Unicode character," since it does not rely
>on Unicode support at all.
>
>
And that's a problem now, as your script below demonstrates.
>Let me make the discussion a bit more concrete. The following
>script effectively reproduces the problem that I encountered.
>(I rewrote the same conversion logic in SAX2.)
>
>--------------------------------------------------------------
>import StringIO, string
>
>from xml.sax import saxutils, sax2exts
>
>class MyHandler(saxutils.DefaultHandler):
>    def characters(self, content):
>        if string.strip(content):
>            print "DEBUG:", repr(content)
>            print content
>
>DOC = """\
><?xml version="1.0" encoding="EUC-JP" ?>
><!DOCTYPE doc [
><!ENTITY eacute "&#233;">
>]>
><doc>
><p>Isto &eacute; uma caneta.</p>
><p>\244\263\244\354\244\317\245\332\245\363\244\307\244\271\241\243</p>
></doc>
>"""
>
>parser = sax2exts.make_parser(["xml.sax.drivers2.drv_xmlproc"])
>parser.setContentHandler(MyHandler())
>parser.parse(StringIO.StringIO(DOC))
>--------------------------------------------------------------
>
>In Python 1.5.2, the method ContentHandler.characters() ends up
>receiving byte strings in both EUC-JP and Latin-1 encodings.
>That's why I had to reinvent the wheel (namely, the "char" tag).
>
>
Right.
>On the other hand, in Python 2.x, the script will raise an error
>like this:
>
>UnicodeEncodeError: 'ascii' codec can't encode character '\ue9' in position 0: ordinal not in range(128)
>
>AFAIK, the standard "ascii" codec does not have such a nifty
>feature that would automatically translate unknown characters
>into appropriate character references. So, I can't see what you
>meant in the last paragraph quoted above. Could you please
>elaborate your assumption in the last paragraph? (What I'm
>afraid is that I might miss something new and important in
>recent versions of Python and PyXML.)
>
>
I am trying to say that your application does not have to rely on your
'char' tag hack under Python 2.x because you are now *able* to write it
in such a way that it doesn't do something foolish like "print content"
when content is a Unicode string and sys.stdout is an ASCII console. :)
For example, if you change that print to

    print ''.join([c.encode('ascii', 'ignore') or "&#%d;" % ord(c)
                   for c in content])

then you will at least be able to see it on your terminal, serialized
with all non-ASCII characters represented by NCRs.
If you were writing to a file rather than sys.stdout, you would want to
change the 'ascii' in the line to 'EUC-JP' or whatever.
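
A rough sketch of the same idea in current Python 3 syntax (not the 2.x
script quoted above): since Python 2.3 (PEP 293) the codecs accept an
'xmlcharrefreplace' error handler, which writes a numeric character
reference for any character the target encoding cannot represent. The
names encode_with_ncrs and EncodingHandler and the small UTF-8 sample
document are made up for illustration, and the handler assumes it is
given a binary file-like object. Encoding happens only at the last step
before output.

```python
import io
import xml.sax


def encode_with_ncrs(text, encoding="ascii"):
    """Encode text; characters the codec lacks become &#NNN; references."""
    return text.encode(encoding, "xmlcharrefreplace")


class EncodingHandler(xml.sax.ContentHandler):
    """Hypothetical handler that encodes character data as it is written."""

    def __init__(self, out, encoding):
        super().__init__()
        self.out = out            # binary file-like object (assumption)
        self.encoding = encoding  # target encoding of the output

    def characters(self, content):
        # content arrives as a Unicode string, possibly in several chunks
        self.out.write(encode_with_ncrs(content, self.encoding))


# Illustrative document with an internal entity, as in the quoted script
DOC = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<!DOCTYPE doc [<!ENTITY eacute "&#233;">]>\n'
    b"<doc><p>Isto &eacute; uma caneta.</p></doc>"
)

out = io.BytesIO()
xml.sax.parseString(DOC, EncodingHandler(out, "ascii"))
print(out.getvalue().decode("ascii"))
```

To write EUC-JP to a file instead, pass 'euc_jp' as the encoding and a
file opened in binary mode as out.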