[XML-SIG] XML and Unicode
Martin v. Loewis
Thu, 24 May 2001 00:37:03 +0200
> I'm not sure I understand your previous message - no one has suggested
> that it's Windows CP 1252 (although I may have missed messages), and
> I'm not sure what you mean by 'consider the document as ISO-8859-1';
> I'm feeding a document into an XML parser with encoding="ISO-8859-1",
> and getting unicode strings out of it.
There simply is no em-dash in ISO-8859-1; assigning one to the byte
'\x97' is a Microsoft invention. Microsoft organizes character sets in
code pages (an idea taken from IBM). For Code Page 1252, we have the
character assignments:
<-N> /x96 <U2013> EN DASH
<-M> /x97 <U2014> EM DASH
So the characters '\x96' and '\x97', when interpreted as CP 1252,
identify EN DASH and EM DASH, respectively.
In ISO-8859-1, these characters have the meanings:
<SG> /x96 <U0096> START OF GUARDED AREA (SPA)
<EG> /x97 <U0097> END OF GUARDED AREA (EPA)
As you can see, they are considered control characters in ISO-8859-1.
So if you want the character to be treated as EM DASH, you should
identify the character set as CP 1252, not ISO-8859-1.
Doing so, in turn, will result in the Unicode characters U+2013 and
U+2014 being used, instead of the Unicode characters U+0096 and U+0097
(which identify control characters).
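The difference can be checked directly. The following is only a minimal
sketch, assuming Python 3 and its standard "cp1252" and "iso-8859-1"
codecs; the variable names are made up for illustration.

```python
import unicodedata

raw = b"\x96\x97"  # the two bytes under discussion

# The same bytes, decoded under the two candidate character sets.
as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("iso-8859-1")

for label, text in (("cp1252", as_cp1252), ("iso-8859-1", as_latin1)):
    for ch in text:
        # Control characters carry no name in Python's Unicode database,
        # so a placeholder is supplied as the default.
        name = unicodedata.name(ch, "<control>")
        category = unicodedata.category(ch)
        print(f"{label}: U+{ord(ch):04X} {name} (category {category})")
```

Under CP 1252 the bytes come out as U+2013 and U+2014 (dash
punctuation); under ISO-8859-1 they come out as U+0096 and U+0097
(category Cc, i.e. control characters).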
Now, assuming that you correctly identify your character set, XML
parsers may refuse your document if they don't know what CP 1252
is. Even if that succeeds, converting the resulting Unicode strings to
ISO-8859-1 will fail, as EM DASH has no representation in that
character set. Of course, conversion into UTF-8 will succeed in any
case - all Unicode characters are representable in UTF-8.
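The conversion step can be sketched as follows, again assuming Python 3
and its standard codecs (the names `text`, `latin1_ok` and `utf8_bytes`
are illustrative only):

```python
text = "\u2014"  # EM DASH, as obtained from a correctly labelled document

# ISO-8859-1 has no code point for EM DASH, so a strict encode fails.
try:
    text.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# UTF-8 can represent every Unicode character, so this always works.
utf8_bytes = text.encode("utf-8")

print("ISO-8859-1 conversion succeeded:", latin1_ok)
print("UTF-8 bytes:", utf8_bytes)
```

If output in ISO-8859-1 is required anyway, the dash would have to be
replaced by something that character set can express, e.g. a plain
hyphen, before encoding.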
> What mechanism do I have to consider it as having a particular
> encoding, beyond the XML declaration?
Sorry, I cannot understand this question; please rephrase.
> I've been given the impression that unicode strings are
That impression is correct. Unfortunately, byte-oriented files are not
encoding-neutral, so when you read or write from/to a byte stream, you
have to know its encoding.
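As a rough illustration of that point, here is a sketch that writes one
Unicode string to in-memory byte streams under different encodings and
reads the bytes back. It assumes Python 3 and the standard `io` module;
the helper names `write_as` and `read_as` are hypothetical.

```python
import io

text = "a\u2014b"  # 'a', EM DASH, 'b'

def write_as(text, encoding):
    """Write text to an in-memory byte stream using the given encoding."""
    buffer = io.BytesIO()
    writer = io.TextIOWrapper(buffer, encoding=encoding, newline="")
    writer.write(text)
    writer.flush()
    return buffer.getvalue()

def read_as(data, encoding):
    """Read text from an in-memory byte stream using the given encoding."""
    reader = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    return reader.read()

cp1252_bytes = write_as(text, "cp1252")
utf8_bytes = write_as(text, "utf-8")

# The same string yields different bytes depending on the encoding.
print(cp1252_bytes, utf8_bytes)

# Reading with the right encoding recovers the string; reading the
# CP 1252 bytes as ISO-8859-1 silently yields a control character.
right = read_as(cp1252_bytes, "cp1252")
wrong = read_as(cp1252_bytes, "iso-8859-1")
print(right == text, repr(wrong))
```

Note that the mislabelled read does not raise an error; it simply
produces different characters, which is why the problem often goes
unnoticed until the text is displayed or converted.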
P.S. If you have a browser that displays '\x96' as EN DASH even if the
encoding is ISO-8859-1, this browser is broken - it should treat the
character as START OF GUARDED AREA. I could not figure out the exact
meaning of this character; it is something along the lines of: text
between SPA and EPA is "guarded", i.e. it cannot be edited or cleared.
I doubt any browser implements that.