[XML-SIG] Handling of character entity references

Tamito KAJIYAMA kajiyama@grad.sccs.chukyo-u.ac.jp
Mon, 26 May 2003 04:56:34 +0900

"Thomas B. Passin" <tpassin@comcast.net> writes:
| [<pyxml@wonderclown.com>]
| > I am trying to produce XHTML files from input XML files which contain
| > a mixture of XHTML and custom markup.
| >...
| > I'm
| > having a problem, though, getting character entity references in the
| > source document to pass through to the output. Things like &amp;,
| > &lt;, and &gt; work fine, but &eacute; does not.
| >
| This sounds like a fine job for XSLT, rather than custom code...

I had a problem similar to Randall's a few years ago, so I'd
like to describe it and the solution I used (just FYI: I totally
agree with the suggestion about XSLT).

I've used a SAX-based Python script for years to convert a set
of XML files into an HTML file.  The file encodings of the input
and output files are EUC-JP and ISO-2022-JP, respectively.
I also needed to use Latin-1 characters in the input and output
files.  However, because of the Japanese file encodings, raw
8-bit character codes (say, 0xe9 in ISO-8859-1 for &eacute;)
were not acceptable.  Therefore, I needed a way to represent
Latin-1 characters in the input XML files and to produce
character references in the output HTML file.
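
To illustrate the constraint, here is a minimal sketch in
present-day Python (not the code I used at the time).  It assumes
Python's standard "iso2022_jp" codec and the standard
"xmlcharrefreplace" error handler; the helper names can_encode
and encode_for_output are made up for this example.

```python
def can_encode(text, encoding="iso2022_jp"):
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)  # strict mode raises on unencodable characters
    except UnicodeEncodeError:
        return False
    return True


def encode_for_output(text, encoding="iso2022_jp"):
    """Encode text, replacing unencodable characters with numeric
    character references such as &#233;."""
    return text.encode(encoding, errors="xmlcharrefreplace")
```

With these helpers, can_encode("caf\u00e9") is false, while
encode_for_output("caf\u00e9") falls back to a numeric character
reference for the e-acute.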

So, I decided to use a special markup to represent Latin-1
characters in the input XML files, as illustrated below:

<char name="eacute" />

I also changed the Python script so that it generated the
appropriate character references in the output HTML file from
the special markup.  (Numeric character references can also be
produced with the form <char code="..." />.)
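
For illustration, the idea can be sketched with the standard
xml.sax module roughly as follows.  This is not the original
script: the names CharRefWriter and convert are hypothetical,
all elements other than <char> are simply dropped, and the
name and code values are copied through without validation.

```python
import io
import xml.sax
from xml.sax.saxutils import escape


class CharRefWriter(xml.sax.ContentHandler):
    """Copy text through and turn <char .../> into character references."""

    def __init__(self, out):
        super().__init__()
        self.out = out

    def startElement(self, name, attrs):
        if name != "char":
            return  # other elements are ignored in this sketch
        if "name" in attrs:
            # <char name="eacute" />  ->  &eacute;
            self.out.write("&%s;" % attrs["name"])
        elif "code" in attrs:
            # <char code="233" />  ->  &#233;
            self.out.write("&#%s;" % attrs["code"])

    def characters(self, content):
        # Re-escape &, < and > so ordinary text stays valid HTML.
        self.out.write(escape(content))


def convert(xml_text):
    """Parse an XML string and return the text with character references."""
    out = io.StringIO()
    xml.sax.parseString(xml_text.encode("utf-8"), CharRefWriter(out))
    return out.getvalue()
```

For example, convert('<p>caf<char name="eacute" /></p>') yields
the string caf&eacute;, which contains only ASCII characters and
so survives any of the Japanese encodings.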

I'm not sure this is a good solution, and in fact the special
markup looks somewhat clumsy, but it works fine.  When I wrote
the Python script, XSLT was still a working draft and there was
no usable Python implementation of it.  I'd use XSLT if I were
rewriting the Python script now.


KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>