[XML-SIG] XML and Unicode

Thu, 24 May 2001 00:37:03 +0200

> I'm not sure I understand your previous message - no one has suggested
> that it's Windows CP 1252 (although I may have missed messages), and
> I'm not sure what you mean by 'consider the document as ISO-8859-1';
> I'm feeding a document into an XML parser with encoding="ISO-8859-1",
> and getting unicode strings out of it. 

There simply is no em dash in ISO-8859-1; assigning it to a single
byte in the 0x80-0x9F range is a Microsoft extension.  Microsoft
organizes character sets in code pages (an idea taken from IBM). For
Code Page 1252, we have the character assignments

<-N>                   /x96   <U2013> EN DASH
<-M>                   /x97   <U2014> EM DASH

So the characters '\x96' and '\x97', when interpreted as CP 1252,
identify EN DASH and EM DASH, respectively.

In ISO 8859-1, these characters have the meanings

<SG>                   /x96   <U0096> START OF GUARDED AREA (SPA)
<EG>                   /x97   <U0097> END OF GUARDED AREA (EPA)

As you can see, they are considered control characters in ISO-8859-1.
So if you want the character to be treated as EM DASH, you should
identify the character set as CP 1252, not ISO-8859-1.
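
To make the difference concrete, here is a minimal Python sketch that
decodes the same two bytes under both character sets. It assumes a
Python 3 interpreter, whose standard codecs are named "cp1252" and
"iso-8859-1"; the variable names are only for illustration.

```python
import unicodedata

raw = b"\x96\x97"  # the two bytes under discussion

as_cp1252 = raw.decode("cp1252")      # interpreted as Code Page 1252
as_latin1 = raw.decode("iso-8859-1")  # interpreted as ISO-8859-1

print([hex(ord(c)) for c in as_cp1252])  # ['0x2013', '0x2014']
print([hex(ord(c)) for c in as_latin1])  # ['0x96', '0x97']

# General categories: "Pd" is dash punctuation, "Cc" is a control character.
cp1252_categories = [unicodedata.category(c) for c in as_cp1252]
latin1_categories = [unicodedata.category(c) for c in as_latin1]
print(cp1252_categories)  # ['Pd', 'Pd']
print(latin1_categories)  # ['Cc', 'Cc']
```

The bytes are identical in both cases; only the declared character set
decides whether they become dashes or control characters.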

Doing so, in turn, will result in the Unicode characters U+2013 and
U+2014 being used, instead of the Unicode characters U+0096 and U+0097
(which identify control characters).
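
The same effect can be observed through an XML parser. The sketch
below assumes Python 3's xml.etree.ElementTree on top of expat, and
assumes that this particular parser accepts "windows-1252" as an
encoding name; other parsers may not. The helper parse_with and the
sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Document body containing the raw bytes 0x96 and 0x97.
body = b"<doc>\x96\x97</doc>"

def parse_with(declared):
    """Prepend an XML declaration naming `declared` and return the text."""
    decl = b'<?xml version="1.0" encoding="' + declared.encode("ascii") + b'"?>'
    return ET.fromstring(decl + body).text

cp1252_text = parse_with("windows-1252")
latin1_text = parse_with("ISO-8859-1")

print([hex(ord(c)) for c in cp1252_text])
print([hex(ord(c)) for c in latin1_text])
```

With the first declaration the application should see the dash
characters; with the second it sees the control characters, even
though the bytes in the document are the same.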

Now, assuming that you correctly identify your character set, XML
parsers may refuse your document if they don't know what CP 1252
(usually declared as encoding="windows-1252") is. Even if parsing
succeeds, converting the resulting Unicode strings to ISO-8859-1 will
fail, as EM DASH has no representation in that character set. Of
course, conversion into UTF-8 will succeed in any case - all Unicode
characters are representable in UTF-8.
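
As a small illustration of the conversion step, the following sketch
tries to encode EM DASH both ways. It assumes Python 3 strings and the
standard "iso-8859-1" and "utf-8" codecs; the variable names are
illustrative only.

```python
em_dash = "\u2014"  # EM DASH

# ISO-8859-1 has no code for this character, so encoding must fail.
try:
    em_dash.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# UTF-8 can represent every Unicode character.
utf8_bytes = em_dash.encode("utf-8")

print(latin1_ok)   # False
print(utf8_bytes)  # b'\xe2\x80\x94'
```

If the output really has to be ISO-8859-1, the application must decide
what to substitute for the dash; the codec cannot make that choice by
itself.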

> What mechanism do I have to consider it as having a particular
> encoding, beyond the XML declaration?

Sorry, I cannot understand this question; please rephrase.

> I've been given the impression that unicode strings are
> encoding-neutral.

That impression is correct. Unfortunately, byte-oriented files are not
encoding-neutral, so when you read or write from/to a byte stream, you
have to know its encoding.
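
A short sketch of what that means in practice, assuming Python 3's
built-in open() with an explicit encoding argument. The temporary file
and the sample text are made up for the example.

```python
import os
import tempfile

text = "a \u2014 b"  # a Unicode string containing EM DASH

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Writing: the encoding decides which bytes end up in the file.
    with open(path, "w", encoding="cp1252") as f:
        f.write(text)

    # The byte stream itself carries no record of its encoding.
    with open(path, "rb") as f:
        raw = f.read()

    # Reading with the right encoding restores the original string.
    with open(path, encoding="cp1252") as f:
        right = f.read()

    # Reading with the wrong one silently yields a control character.
    with open(path, encoding="iso-8859-1") as f:
        wrong = f.read()
finally:
    os.remove(path)

print(raw)            # b'a \x97 b'
print(right == text)  # True
print(wrong == text)  # False
```

Note that the wrong guess does not raise an error here; the mistake
only shows up later, when somebody looks at the characters.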

Regards,
Martin

P.S. If you have a browser that displays '\x96' as EN DASH even if the
encoding is ISO-8859-1, this browser is broken - it should treat the
character as START OF GUARDED AREA. I could not figure out what the
exact meaning of this character is; it is something along the lines
of: text between SPA and EPA is "guarded", i.e. it cannot be edited
or cleared.
I doubt any browser implements that.