[XML-SIG] Handling of character entity references

Tamito KAJIYAMA kajiyama@grad.sccs.chukyo-u.ac.jp
Mon, 26 May 2003 16:50:24 +0900


Mike Brown <mike@skew.org> writes:
|
| > I've used a SAX-based Python script for years to convert a set
| > of XML files into an HTML file.  The file encodings of the input
| > and output files are EUC-JP and ISO-2022-JP, respectively.
| > I also had a need to use Latin-1 characters in the input and
| > output files.  However, because of the Japanese file encodings,
| > raw character codes (say, 0xe9 in ISO-8859-1 for &eacute;) were
| > not acceptable.  Therefore, I needed a way to represent Latin-1
| > characters in the input XML files and to produce character
| > references in the output HTML file.
| 
| This wouldn't be needed today, since Python is now Unicode friendly. You have
| Unicode strings being passed to your SAX methods, and on the output side, the
| EUC-JP or ISO-2022-JP codec used by the XML serializer will convert to bytes
| all the characters supported by those encodings. The non-ASCII range of
| ISO-8859-1 (\u00A0-\u00FF) would not be handled by the codecs, but the XML
| serializer will simply deal with that by emitting numeric character references
| automatically.
| 
| > So, I decided to use a special markup to represent Latin-1
| > characters in the input XML files, as illustrated below:
| > 
| > <char name="eacute" />

Thank you for the tip, but I have not been able to figure out how
to put your suggestion into practice.  To the best of my knowledge,
it is the ContentHandler.characters() method that determines how
characters are handled on the output side, so nothing
encoding-related is done _automatically_.  Have I missed something
important?
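
To make the question concrete, here is a minimal sketch of one way
the conversion could be done inside characters() itself.  It assumes
a current Python 3 with the standard-library iso2022_jp codec and
the "xmlcharrefreplace" error handler; the names TextWriter and
convert are hypothetical and only illustrate the idea, they are not
taken from any existing script or serializer.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import escape


class TextWriter(ContentHandler):
    """Hypothetical handler that writes character data as encoded bytes."""

    def __init__(self, out, encoding="iso2022_jp"):
        super().__init__()
        self.out = out            # binary file-like object
        self.encoding = encoding  # output encoding (assumed ISO-2022-JP)

    def characters(self, content):
        # content arrives as a Unicode string, whatever the input encoding.
        text = escape(content)  # escape &, < and > for HTML output
        # Characters the codec cannot represent become &#NNN; references.
        self.out.write(text.encode(self.encoding, "xmlcharrefreplace"))


def convert(xml_bytes):
    """Parse xml_bytes and return the encoded text content."""
    out = io.BytesIO()
    xml.sax.parseString(xml_bytes, TextWriter(out))
    return out.getvalue()


# U+00E9 (e with acute) is not in ISO-2022-JP; the two kanji are.
result = convert("<p>caf\u00e9 \u65e5\u672c</p>".encode("utf-8"))
print(result.decode("iso2022_jp"))
```

In this sketch the handler still decides how text is written, but
the fallback to numeric character references is delegated to the
codec's error handler rather than coded by hand.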

Thanks,

-- 
KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>