[XML-SIG] Processing xml files with ISO 8859-1 chars
Martin v. Loewis
martin@v.loewis.de
Thu, 8 Nov 2001 10:59:32 +0100
> Sorry I don't know about all encodings. I don't think that there is a
> principal problem to define encodings that use x00 for 'a'.
Well, there is: it wouldn't be an ASCII superset. Most real-life
encodings are ASCII supersets, unless they are EBCDIC
supersets. Anything else would not survive long (except in special
markets, such as GSM short messages).
*This* specific encoding would have another problem: it wouldn't be
C-compatible, since a \0 byte terminates a C string regardless of the
encoding.
> I would speak of an encoding error if the content of a xml text is
> erroneous with respect to the provided encoding info.
> So
> <?xml version="1.0" encoding="iso-8859-1"?>
> <bla>\129</bla>
> (where \... stands for the byte with decimal number ...)
> is incorrect, since \129 is not defined in iso-8859-1.
[Assuming you mean the character decimal 129 here; \129 is not
a valid octal escape]
It certainly is. It represents the control character HOP (high octet
preset), #x0081; see
http://208.56.196.240/misc/ISO-8859-1.HTML
*All* bytes are valid characters in ISO-8859-1 (it is a common
misconception about Latin-1 that 128-159 are not defined).
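A quick sketch in (modern) Python 3 — which postdates this post, so
take it as an illustration, not the tool under discussion — confirms
that every byte value decodes under ISO-8859-1:

```python
# All 256 byte values are defined in ISO-8859-1; decoding the full
# byte range never raises an error.
all_bytes = bytes(range(256))
text = all_bytes.decode('iso-8859-1')
assert len(text) == 256
# Bytes 128-159 map to the C1 control block, not to "undefined":
assert text[0x81] == '\u0081'
```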
Furthermore, this character (HOP) is even valid in XML character data:
Char ::= #x9 | #xA | #xD |
[#x20-#xD7FF] |
[#xE000-#xFFFD] |
[#x10000-#x10FFFF]
As you can see, only the low control block (C0) is partially excluded;
the high control block (C1, #x0080-#x009F) is completely valid in XML
character data.
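The Char production can be checked mechanically; here is a minimal
Python 3 sketch (the function name is my own, not from any XML
library):

```python
def is_xml_char(cp):
    """Test a code point against the XML 1.0 Char production."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

assert is_xml_char(0x81)       # HOP: valid in character data
assert not is_xml_char(0x1B)   # ESC: an excluded C0 control
```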
> Of course you cannot tell if a text in iso-latin1 is said to be
> encoded in iso-latin2 since they are formally equivalent (and you will
> output garbage if you convert that to unicode).
I cannot understand this statement. Latin-1 and Latin-2 are *not*
formally equivalent: they use the same bytes (namely, all
of them), but the bytes denote different characters.
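A concrete Python 3 example (again, modern Python used purely for
illustration): the very same byte decodes to different characters
under the two encodings.

```python
b = b'\xb1'
assert b.decode('iso-8859-1') == '\u00b1'  # PLUS-MINUS SIGN
assert b.decode('iso-8859-2') == '\u0105'  # LATIN SMALL LETTER A WITH OGONEK
```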
> But that does not mean that you cannot check anything.
It seemed to me that you suggested that you can formally check whether
an input really is Latin-1; you cannot. You cannot formally check any
of the ISO-8859 encodings, unless you are presented with an EBCDIC
file, in which case even markup is encoded differently. You cannot
check UTF-16, either, unless you happen to run into a character that
has been excluded (such as an unpaired surrogate).
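The UTF-16 case can be demonstrated with a Python 3 strict decoder
(the helper name is mine); an unpaired surrogate is exactly the kind
of malformation it can catch:

```python
def looks_like_utf16_be(data):
    """Return True if data decodes as well-formed UTF-16-BE."""
    try:
        data.decode('utf-16-be')
        return True
    except UnicodeDecodeError:
        return False

assert looks_like_utf16_be(b'\xd8\x01\xdc\x37')      # valid surrogate pair
assert not looks_like_utf16_be(b'\xd8\x00\x00\x41')  # unpaired high surrogate
```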
So I come back to my original claim: the only thing you can check in
practice is whether something could be UTF-8. Of course, it is even
possible to come up with a Latin-1 text that decodes as UTF-8
successfully.
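For instance (a contrived two-character Latin-1 text, sketched in
Python 3): the Latin-1 bytes of 'Ã©' happen to form a perfectly valid
UTF-8 sequence — just for a different string.

```python
data = 'Ã©'.encode('iso-8859-1')    # b'\xc3\xa9'
assert data.decode('utf-8') == 'é'  # decodes fine, as another character
```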
Regards,
Martin