[XML-SIG] Handling of character entity references
Tamito KAJIYAMA
kajiyama@grad.sccs.chukyo-u.ac.jp
Tue, 27 May 2003 17:52:39 +0900
Mike Brown <mike@skew.org> writes:
|
| You said you're using SAX to produce HTML from XML, so I assume the XML
| parser is calling the event handler methods in your ContentHandler. When
| ContentHandler.characters() is called by the parser to notify your
| application about character data, a Unicode string is passed as the
| content argument (as long as expat is your underlying parser). This is
| probably not how it worked when your application was originally written,
| prior to the omnipresence of Unicode in Python.
|
| Whatever mechanism you are using to produce HTML (I'm not going to guess
| how you're doing that) will be running the Unicode string through an
| encoder, perhaps just using the built-in encode() method on the Unicode
| string object, to produce EUC-JP or ISO-2022-JP byte strings for output.
|
| Of course this isn't automatic, but my point is that (hopefully) your
| HTML-producing SAX application will be written (by you) such that it
| does do the encoding (at the last step before output, preferably), and
| will be smart enough (because you wrote it that way) to write character
| references when the codec doesn't handle a particular Unicode character.
Hmm, I'm still missing the point. Do you mean that there is a
codec with an error handling scheme that translates undefined
characters into appropriate character references? My SAX
application does not (of course) have a mechanism
that would "write character references when the codec doesn't
handle a particular Unicode character," since it does not rely
on Unicode support at all.
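
One reading of the quoted paragraph is that the application itself would do the fallback by hand. A minimal sketch of that idea follows: encode the text one character at a time and write a numeric character reference whenever the codec rejects a character. The name encode_with_charrefs is made up for illustration and is not part of the application discussed here or of PyXML. The sketch assumes a Python with Unicode strings (it should run on 2.6+ and 3.3+), and encoding character by character is deliberately naive.

```python
def encode_with_charrefs(text, encoding="ascii"):
    """Encode text, writing &#N; for characters the codec rejects.

    Hypothetical helper, for illustration only.
    """
    chunks = []
    for ch in text:
        try:
            chunks.append(ch.encode(encoding))
        except UnicodeError:
            # The codec cannot represent this character, so emit a
            # numeric character reference (plain ASCII) instead.
            chunks.append(("&#%d;" % ord(ch)).encode("ascii"))
    return b"".join(chunks)


sample = u"Isto \xe9 uma caneta."
print(repr(encode_with_charrefs(sample)))
```

Any codec name known to the interpreter could be passed as the second argument. Characters the codec can represent are passed through unchanged, and only the rejected ones become references.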
Let me make the discussion a bit more concrete. The following
script effectively reproduces the problem that I encountered.
(I rewrote the same conversion logic in SAX2.)
--------------------------------------------------------------
import StringIO, string
from xml.sax import saxutils, sax2exts

class MyHandler(saxutils.DefaultHandler):
    def characters(self, content):
        # Ignore whitespace-only chunks; show the raw value, then print it.
        if string.strip(content):
            print "DEBUG:", repr(content)
            print content

DOC = """\
<?xml version="1.0" encoding="EUC-JP" ?>
<!DOCTYPE doc [
<!ENTITY eacute "&#233;">
]>
<doc>
<p>Isto &eacute; uma caneta.</p>
<p>\244\263\244\354\244\317\245\332\245\363\244\307\244\271\241\243</p>
</doc>
"""
parser = sax2exts.make_parser(["xml.sax.drivers2.drv_xmlproc"])
parser.setContentHandler(MyHandler())
parser.parse(StringIO.StringIO(DOC))
--------------------------------------------------------------
In Python 1.5.2, the method ContentHandler.characters() ends up
receiving byte strings in both EUC-JP and Latin-1 encodings.
That's why I had to reinvent the wheel (namely, the "char" tag).
On the other hand, in Python 2.x, the script will raise an error
like this:
UnicodeEncodeError: 'ascii' codec can't encode character '\ue9' in position 0: ordinal not in range(128)
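
The parser is probably not what raises this exception. Presumably it comes from the implicit conversion of the Unicode string to bytes with the default "ascii" codec when the string is printed, although that depends on the environment. A small sketch, independent of the parser, that shows the same kind of failure with an explicit encode() call (the helper name ascii_encode_error is made up for illustration):

```python
def ascii_encode_error(text):
    """Return the error raised by encoding text as ASCII, or None."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        # exc.start is the index of the first character that failed.
        return exc
    return None


err = ascii_encode_error(u"\xe9")
print(err)
```
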
AFAIK, the standard "ascii" codec does not have such a nifty
feature that would automatically translate unknown characters
into appropriate character references. So, I can't see what you
meant in the last paragraph quoted above. Could you please
elaborate on your assumption in that paragraph? (What I'm
afraid of is that I might be missing something new and important
in recent versions of Python and PyXML.)
Thanks,
--
KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>