[XML-SIG] Handling of character entity references

Mike Brown mike@skew.org
Sun, 25 May 2003 17:42:11 -0600 (MDT)


Tamito KAJIYAMA wrote:
> "Thomas B. Passin" <tpassin@comcast.net> writes:
> |
> | [<pyxml@wonderclown.com>]
> | 
> | > I am trying to produce XHTML files from input XML files which contain
> | > a mixture of XHTML and custom markup.
> | >...
> | > I'm
> | > having a problem, though, getting character entity references in the
> | > source document to pass through to the output. Things like &amp;,
> | > &lt;, and &gt; work fine, but &eacute; does not.
> | >
> | 
> | This sounds like a fine job for XSLT, rather than custom code...
> 
> I had a similar problem with Randall's one a few years ago, so
> I'd like to describe my problem and a solution to it (just FYI:
> I totally agree with the suggestion about XSLT).
> 
> I've used a SAX-based Python script for years to convert a set
> of XML files into an HTML file.  The file encodings of the input
> and output files are EUC-JP and ISO-2022-JP, respectively.
> I also had a need to use Latin-1 characters in the input and
> output files.  However, because of the Japanese file encodings,
> raw character codes (say, 0xe9 in ISO-8859-1 for &eacute;) were
> not acceptable.  Therefore, I needed a way to represent Latin-1
> characters in the input XML files and to produce character
> references in the output HTML file.

This wouldn't be needed today, since Python is now Unicode-friendly. Unicode
strings are passed to your SAX methods, and on the output side, the EUC-JP or
ISO-2022-JP codec used by the XML serializer converts to bytes every character
those encodings support. Any character in the non-ASCII range of ISO-8859-1
(\u00A0-\u00FF) that the codec cannot encode is not a problem either: the XML
serializer simply emits a numeric character reference for it automatically.
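
To illustrate the fallback, here is a minimal sketch using only Python's
built-in codecs and the standard "xmlcharrefreplace" error handler. It is not
necessarily how any particular XML serializer does it internally; the helper
name to_bytes_with_charrefs and the sample strings are made up for the
example, and it assumes a current Python 3.

```python
def to_bytes_with_charrefs(text, encoding):
    """Encode text, replacing unencodable characters with &#NNN; references."""
    # "xmlcharrefreplace" is a standard codec error handler: any character
    # the target encoding cannot represent becomes a numeric character
    # reference instead of raising UnicodeEncodeError.
    return text.encode(encoding, "xmlcharrefreplace")


# U+00E9 (e-acute) followed by three Japanese characters.
sample = "caf\u00e9 \u65e5\u672c\u8a9e"

# ASCII cannot represent U+00E9, so it comes out as a reference.
ascii_out = to_bytes_with_charrefs("caf\u00e9", "ascii")

# ISO-2022-JP encodes the Japanese text natively and falls back to a
# reference for the accented letter.
jp_out = to_bytes_with_charrefs(sample, "iso2022_jp")

print(ascii_out)
print(jp_out)
```

Which characters fall back depends on the codec, so the same input can come
out differently under EUC-JP than under ISO-2022-JP.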

> So, I decided to use a special markup to represent Latin-1
> characters in the input XML files, as illustrated below:
> 
> <char name="eacute" />
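
For anyone who does want that kind of markup, a rough sketch of resolving it
in a SAX handler might look like the following. The CharExpander class and
the expand_chars helper are hypothetical names for this example, and the
lookup assumes the "name" attribute uses the HTML entity names found in
Python 3's html.entities module.

```python
import xml.sax
from html.entities import name2codepoint


class CharExpander(xml.sax.ContentHandler):
    """Collect text content, turning <char name="..."/> into the character."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def startElement(self, name, attrs):
        # Assumption: every <char> element carries a known HTML entity name.
        if name == "char":
            self.parts.append(chr(name2codepoint[attrs["name"]]))

    def characters(self, content):
        self.parts.append(content)


def expand_chars(xml_bytes):
    """Parse a small XML document and return its text with <char> resolved."""
    handler = CharExpander()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.parts)


text = expand_chars(b'<p>caf<char name="eacute" /></p>')
print(text)
```

Once the text is a Unicode string, turning it into character references on
output is the codec's or serializer's job, not the handler's.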

Similar project:
http://xmlchar.sourceforge.net/

Seems like it's past the point of diminishing returns, to me.