[XML-SIG] Handling of character entity references

Mike Brown mike@skew.org
Mon, 26 May 2003 22:16:16 -0600


Tamito KAJIYAMA wrote:

>Mike Brown <mike@skew.org> writes:
>|
>| > I've used a SAX-based Python script for years to convert a set
>| > of XML files into an HTML file.  The file encodings of the input
>| > and output files are EUC-JP and ISO-2022-JP, respectively.
>| > I also had a need to use Latin-1 characters in the input and
>| > output files.  However, because of the Japanese file encodings,
>| > raw character codes (say, 0xe9 in ISO-8859-1 for &eacute;) were
>| > not acceptable.  Therefore, I needed a way to represent Latin-1
>| > characters in the input XML files and to produce character
>| > references in the output HTML file.
>|
>| This wouldn't be needed today, since python is now Unicode friendly. You have
>| Unicode strings being passed to your SAX methods, and on the output side, the
>| EUC-JP or ISO-2022-JP codec used by the XML serializer will convert to bytes
>| all the characters supported by those encodings. The non-ASCII range of
>| ISO-8859-1 (\u00A0-\u00FF) would not be handled by the codecs, but the XML
>| serializer will simply deal with that by emitting numeric character references
>| automatically.
>|
>| > So, I decided to use a special markup to represent Latin-1
>| > characters in the input XML files, as illustrated below:
>| >
>| > <char name="eacute" />
>
>Thank you for the tip, but I was not able to figure out how to
>realize what you suggested.  To the best of my knowledge, the
>method ContentHandler.characters() determines how characters are
>to be handled on the output side.  So, encoding-related things
>won't be _automatically_ done.  I've missed something important?
>
>Thanks,
>
>
>
You said you're using SAX to produce HTML from XML, so I assume the XML
parser is calling the event handler methods in your ContentHandler. When
ContentHandler.characters() is called by the parser to notify your
application about character data, a Unicode string is passed as the
content argument (as long as expat is your underlying parser). This is
probably not how it worked when your application was originally written,
prior to the omnipresence of Unicode in Python.

Whatever mechanism you are using to produce HTML (I'm not going to guess
how you're doing that) will be running the Unicode string through an
encoder, perhaps just using the built-in encode() method on the Unicode
string object, to produce EUC-JP or ISO-2022-JP byte strings for output.
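
To make the flow above concrete, here is a minimal sketch in present-day
Python 3, where every string is Unicode. The TextCollector class and the
sample input are made up for illustration; I'm not claiming this is how
your script is structured. It only shows that characters() receives
decoded text and that encoding can wait until the last step.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Hypothetical handler: gathers character data as Unicode strings."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser passes decoded text (str), never raw bytes.
        self.chunks.append(content)


# Sample input; the two character references stand for U+65E5 and U+672C.
source = b"<p>&#26085;&#26412;</p>"

handler = TextCollector()
xml.sax.parseString(source, handler)
text = "".join(handler.chunks)

# Encode once, at the last step before output.
output = text.encode("iso2022_jp")
```

The parser may deliver character data in several chunks, which is why the
sketch joins them before encoding.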

Of course this isn't automatic, but my point is that (hopefully) your
HTML-producing SAX application will be written (by you) such that it
does do the encoding (at the last step before output, preferably), and
will be smart enough (because you wrote it that way) to write character
references when the codec doesn't handle a particular Unicode character.
Someone (on this list, I think) once suggested this approach:

''.join([c.encode('ascii', 'ignore') or "&#%d;" % ord(c) for c in u'\u8314äöüß?abc'])

...although you could use some of the character data translation
functions in PyXML's xml.dom.html or xml.dom.ext.Printer modules. I'm
stepping beyond my own personal experience in this matter, though. :)
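
For comparison, a sketch of the same idea in Python 3, where the one-liner
above no longer runs as written because encode() returns bytes. The
standard 'xmlcharrefreplace' error handler writes a numeric character
reference for any character the target codec cannot represent. The helper
name encode_with_charrefs is my own invention, not part of any library.

```python
def encode_with_charrefs(text, encoding):
    """Encode text, writing &#NNN; for characters the codec cannot represent."""
    return text.encode(encoding, "xmlcharrefreplace")


# The sample string from the one-liner, written with escapes.
sample = "\u8314\u00e4\u00f6\u00fc\u00df?abc"

# Restricting output to ASCII mirrors what the one-liner produces.
ascii_bytes = encode_with_charrefs(sample, "ascii")
print(ascii_bytes)  # b'&#33556;&#228;&#246;&#252;&#223;?abc'

# With a Japanese codec, supported characters become bytes and
# anything else (here U+00E9) becomes a character reference.
jis_bytes = encode_with_charrefs("\u65e5\u672c caf\u00e9", "iso2022_jp")
print(jis_bytes)
```

Passing the real output encoding instead of 'ascii' keeps the Japanese
text as ordinary encoded bytes and falls back to references only where the
codec has no mapping.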