[XML-SIG] Handling of character entity references

Thomas B. Passin tpassin@comcast.net
Mon, 26 May 2003 19:00:52 -0400


[Randall Nortman]

> I have done similar projects using XSLT in the past, since that is
> clearly the "right" tool for this job. However, I find XSLT's syntax
> to be so abhorrent to good taste as to be nausea-inducing.

Strangely enough, I have come to be very fond of xslt, and I am fine with
its syntax.  This has come as a surprise to me.

> I was already specifying the encoding as utf-8 on the output, but I
> customized PrintVisitor to remove the <?xml version='1.0'
> encoding='utf-8'?> prolog because it was confusing some older
> browsers.

Yes, HTML does not know about the xml declaration.

> So, per your suggestion, I added a <meta> in the <head>
> section to specify the character set, which seems to work in most
> browsers. However, I found that some browsers (notably w3m, a
> text-only browser) do not support utf-8 without patches, and so I
> switched to iso-8859-1, which seems to work just about anywhere.
>

Pretty safe most of the time.
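
The case where iso-8859-1 stops being safe is a character that Latin-1 cannot represent. A minimal sketch, assuming the output is serialized from a Python string (the sample text is made up for illustration): the standard `xmlcharrefreplace` error handler turns such characters into numeric character references instead of failing.

```python
# Hypothetical sample text: e-acute fits in Latin-1, the euro sign does not.
text = "caf\u00e9 costs 3\u20ac"

# Characters outside iso-8859-1 become decimal character references.
encoded = text.encode("iso-8859-1", errors="xmlcharrefreplace")

print(encoded)  # b'caf\xe9 costs 3&#8364;'
```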

> Or at least, that works for "&eacute;". "&nbsp;" is still not
> working. When I use it in my source, the output just has nothing where
> there should be a "&#160;". This is true whether I use iso-8859-1 or
> utf-8. Any ideas on that?
>

No, except to just use the numeric character reference &#160; directly.
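
A rough sketch of that workaround in Python, rewriting named entity references as numeric ones before the text reaches the parser. The helper name `named_to_numeric` is hypothetical, and the standard library table `html.entities.name2codepoint` stands in here for the entity declarations in your DTD.

```python
import re
from html.entities import name2codepoint

# The five entities predefined by XML itself are left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(text):
    """Replace HTML named entity references with decimal character references."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # predefined or unknown: keep as written
        return "&#%d;" % name2codepoint[name]
    return _ENTITY_RE.sub(repl, text)


print(named_to_numeric("caf&eacute;&nbsp;au&nbsp;lait &amp; tea"))
# caf&#233;&#160;au&#160;lait &amp; tea
```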
>
> > This question comes up a lot.  Look at the various xslt FAQs and try
> > Google for more discussion.  Look in the archive of this list, too.
> [...]
>
> I hate asking FAQ's on mailing lists, so I'm sorry that I apparently
> ended up doing just that.

Sorry, I meant to say the Mulberry xslt list, and maybe xml-dev as well.

> >
> > Well, that is interesting because it is the utf-8 encoding of the value
> > E9, which is the latin-1 encoding of eacute.  However, the unicode
> > character for &eacute; (you can see this in your DTD) is U+00C9, which
> > would be encoded in utf-8 as C3 89.  Therefore your code is not decoding
> > and encoding the input correctly.  You seem to be taking the sequence of
> > bytes of a latin-1 source and encoding it into utf-8 as if the source
> > were really in unicode instead of latin-1.
> [...]
>
> Are you sure that it should be C3 89 instead of C3 A9? The latter
> seems to work, so long as I direct the browser to expect utf-8.
>

You are right, C3 A9 is correct.  I misread the legend of one column in the
reference table I was looking at and ended up thinking the code point was
something else.  In fact the code point for eacute is U+00E9, the same as
its Latin-1 byte value E9, and that encodes in utf-8 as C3 A9.  It took me
some time testing and looking for better character charts before I got my
head squared away.  Sorry about that.
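
For the record, the byte values are easy to check in Python. This is only a quick sketch; the variable names are made up for illustration.

```python
e_acute = "\u00e9"        # what &eacute; stands for
capital_e_acute = "\u00c9"  # the code point quoted by mistake earlier

latin1_bytes = e_acute.encode("iso-8859-1")
utf8_bytes = e_acute.encode("utf-8")
wrong_bytes = capital_e_acute.encode("utf-8")

print(latin1_bytes.hex())  # e9
print(utf8_bytes.hex())    # c3a9
print(wrong_bytes.hex())   # c389

# The mistake originally suspected: treating utf-8 bytes as Latin-1 and
# encoding them again yields four bytes, not two.
double_encoded = utf8_bytes.decode("iso-8859-1").encode("utf-8")
print(double_encoded.hex())  # c383c2a9
```

Since the output in question was C3 A9 and not C3 83 C2 A9, the input was being decoded and encoded correctly.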

Cheers,

Tom P