[XML-SIG] Processing xml files with ISO 8859-1 chars

Martin v. Loewis martin@v.loewis.de
Thu, 8 Nov 2001 10:59:32 +0100


> Sorry I don't know about all encodings. I don't think that there is a 
> principal problem to define encodings that use x00 for 'a'.

Well, there is: It wouldn't be an ASCII superset. Most real-life
encodings are ASCII supersets, unless they are EBCDIC
supersets. Anything else would not survive long (except for special
markets, such as GSM short messages).

*This* specific encoding would have another problem: it wouldn't be C
compatible, since the \0 byte terminates a string, independent of the
encoding.
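To illustrate the C problem, here is a minimal sketch in Python. The
"encoding" is purely hypothetical (ASCII with 'a' remapped to byte 0),
and ctypes stands in for any C API that takes a NUL-terminated string:

```python
import ctypes

# Hypothetical encoding: plain ASCII, except that 'a' is stored as byte 0x00.
encoded = "banana".encode("ascii").replace(b"a", b"\x00")

# A C string buffer holds all the bytes, but anything that reads it as a
# NUL-terminated string stops at the first 0x00.
buf = ctypes.create_string_buffer(encoded)
as_c_string = buf.value   # what strlen()/strcpy()-style code would see

print(encoded)       # all six bytes
print(as_c_string)   # only the part before the first NUL
```

Everything after the first 'a' is invisible to C string functions.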

> I would speak of an encoding error if the content of a xml text is 
> erroneous with respect to the provided encoding info.
> So
> <?xml version="1.0" encoding="iso-8859-1"?>
> <bla>\129</bla>
> (where \... stands for the byte with decimal number ...)
> is incorrect, since \129 is not defined in iso-8859-1.
[Assuming you mean the character decimal 129 here; \129 is not
 a valid octal escape]

It certainly is defined. It represents the control character HOP (high
octet preset), #x0081; see

http://208.56.196.240/misc/ISO-8859-1.HTML

*All* bytes are valid characters in ISO-8859-1 (it is a common
misconception about Latin-1 that 128-159 are not defined).
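A quick way to convince yourself, sketched with Python's built-in
"iso-8859-1" codec (the variable names are just for illustration):

```python
import unicodedata

# Decode every possible byte value as ISO-8859-1; no byte is rejected.
all_bytes = bytes(range(256))
text = all_bytes.decode("iso-8859-1")

# Byte n always becomes the character with code point n.
code_points = [ord(ch) for ch in text]

# Byte 129 (0x81) is the C1 control character HOP.
hop = b"\x81".decode("iso-8859-1")
hop_category = unicodedata.category(hop)   # "Cc" means control character

print(len(text), hex(ord(hop)), hop_category)
```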

Furthermore, this character (HOP) is even valid in XML character data:

Char    ::=    #x9 | #xA | #xD | 
               [#x20-#xD7FF] | 
               [#xE000-#xFFFD] | 
               [#x10000-#x10FFFF]

As you can see, only the low control block (C0) is partially excluded;
the high control block (C1, #x0080-#x009F) is completely valid in XML
character data.
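The production translates directly into a range check. The following is
only a sketch of the XML 1.0 Char production quoted above; is_xml_char
is a name made up for this example, not part of any XML library:

```python
def is_xml_char(ch: str) -> bool:
    """Return True if ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # includes the whole C1 block
        or 0xE000 <= cp <= 0xFFFD      # skips the surrogate range
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


print(is_xml_char("\x81"))   # HOP, a C1 control
print(is_xml_char("\x0b"))   # vertical tab, a C0 control
```

HOP is accepted, while a C0 control such as vertical tab is not.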

> Of course you cannot tell if a text in iso-latin1 is said to be
> encoded in iso-latin2 since they are formally equivalent (and you will
> output garbage if you convert that to unicode).

I cannot understand this statement. Latin-1 and Latin-2 are *not*
formally equivalent: even though they use the same bytes (namely, all
of them), the bytes denote different characters.
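For example, using Python's built-in codecs (the sample bytes are
picked arbitrarily from the range where the two character sets differ):

```python
data = b"\xa1\xb1\xe6"

as_latin1 = data.decode("iso-8859-1")
as_latin2 = data.decode("iso-8859-2")

# Same bytes, different characters.
print([hex(ord(ch)) for ch in as_latin1])
print([hex(ord(ch)) for ch in as_latin2])

# Both codecs accept every byte value, so neither decode can fail.
every_byte = bytes(range(256))
latin1_all = every_byte.decode("iso-8859-1")
latin2_all = every_byte.decode("iso-8859-2")
```

Under Latin-1 the three bytes are U+00A1, U+00B1 and U+00E6; under
Latin-2 they are U+0104, U+0105 and U+0107.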

> But that does not mean that you cannot check anything.

It seemed to me that you suggested that you can formally check whether
an input really is Latin-1; you cannot. You cannot formally check any
of the ISO-8859 encodings, unless you are presented with an EBCDIC
file, in which case even markup is encoded differently. You cannot
check UTF-16, either, unless you happen to run into a character that
has been excluded (such as an unpaired surrogate).
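A small sketch of the UTF-16 case, assuming Python's "utf-16-le" codec
(decodes_as is a helper invented for this example):

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data decodes without error in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Two arbitrary bytes form a perfectly legal UTF-16 code unit.
print(decodes_as(b"ab", "utf-16-le"))

# A high surrogate (D800) followed by 'A' instead of a low surrogate
# is one of the few things the decoder can reject.
print(decodes_as(b"\x00\xd8A\x00", "utf-16-le"))

# A properly paired surrogate (D800 DC00) is fine again.
print(decodes_as(b"\x00\xd8\x00\xdc", "utf-16-le"))
```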

So I come back to my original claim: the only thing you can check in
practice is whether something could be UTF-8. Of course, it is even
possible to come up with a Latin-1 text that decodes as UTF-8
successfully.
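Both halves of that claim can be sketched in a few lines of Python
(could_be_utf8 is a made-up helper name; the strings are written with
escapes so the example does not depend on the source file's encoding):

```python
def could_be_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Typical Latin-1 text: a lone 0xE9 byte (e with acute) is not valid UTF-8.
typical = "caf\u00e9".encode("iso-8859-1")
print(could_be_utf8(typical))

# Contrived Latin-1 text: A-tilde followed by the copyright sign is the
# byte pair C3 A9, which happens to be the UTF-8 encoding of U+00E9.
contrived = "\u00c3\u00a9".encode("iso-8859-1")
print(could_be_utf8(contrived), contrived.decode("utf-8") == "\u00e9")
```

So a failed UTF-8 decode proves the input is not UTF-8, but a successful
one only says that it could be.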

Regards,
Martin