REALLY simple xml reader
Steven D'Aprano
steve at REMOVE-THIS-cybersource.com.au
Sat Feb 2 05:44:39 EST 2008
On Sat, 02 Feb 2008 07:24:36 +0100, Stefan Behnel wrote:
> Steven D'Aprano wrote:
>> The same way it knows that "<?xml" is "<?xml" before it sees the
>> encoding. If the parser knows that the hex bytes
>>
>> 3c 3f 78 6d 6c
>>
>> (or 3c 00 3f 00 78 00 6d 00 6c 00 if you prefer UTF-16, and feel free
>> to swap the byte order)
>>
>> mean "<?xml"
>>
>> then it can equally know that bytes
>>
>> 20 09 0a
>>
>> are whitespace. According to the XML standard, what else could they be?
>
> So, what about all the other unicode whitespace characters?
What about them? They aren't part of the XML spec, which defines
whitespace as the code points #x20, #x9, #xD and #xA. (Okay, I forgot
carriage return. Oops.) You don't have to support arbitrary whitespace,
only those four characters.
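
To show how small that set is, here is a minimal sketch in Python. The names XML_WHITESPACE, is_xml_whitespace and strip_leading_xml_whitespace are made up for illustration; they are not part of any library.

```python
# The four code points the XML spec defines as whitespace:
# space, tab, carriage return, line feed.
XML_WHITESPACE = frozenset("\x20\x09\x0d\x0a")


def is_xml_whitespace(ch):
    """Return True if the single character ch is XML whitespace."""
    return ch in XML_WHITESPACE


def strip_leading_xml_whitespace(text):
    """Drop leading XML whitespace only; other Unicode spaces are left alone."""
    i = 0
    while i < len(text) and text[i] in XML_WHITESPACE:
        i += 1
    return text[i:]
```

Note that str.isspace() would be the wrong test here, since it also accepts characters such as U+00A0 and U+2000, which XML does not treat as whitespace.
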
> And what
> about different encodings and byte orders that move the bytes around?
What about them? The Byte Order Mark is optional in the case of UTF-8,
and compulsory in the case of UTF-16. I quote:
"Entities encoded in UTF-16 must and entities encoded in UTF-8 may begin
with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000],
section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH
NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not
part of either the markup or the character data of the XML document. XML
processors must be able to use this character to differentiate between
UTF-8 and UTF-16 encoded documents."
So if your XML document is written in UTF-8, you don't need a BOM
(although you can use one if you wish) and if it is in UTF-16 you *must*
have one, even before the '<?xml'. If you don't, how will the parser
recognise the characters '<?xml', not to mention the characters
'encoding' and 'utf-16'?
> Is
> it ok for a byte stream to start with "00 20" or does it have to start
> with "20 00"?
If you're using UTF-16, the byte stream MUST start with the BOM, so
neither of the above is legal at the start of the stream. Once the BOM
has been seen, it tells the XML parser which byte order is in use,
depending on whether the BOM was FF FE or FE FF.
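
A rough sketch of that BOM check in Python, using the constants from the standard codecs module. The function name sniff_bom is hypothetical, and the sketch deliberately ignores UTF-32 (whose little-endian BOM also begins with FF FE), since only UTF-8 and UTF-16 are under discussion here.

```python
import codecs


def sniff_bom(data):
    """Return (encoding, bom_length) for a leading BOM, or (None, 0)."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8", 3
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le", 2
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be", 2
    # No BOM: UTF-8 is still permitted, UTF-16 is not.
    return None, 0
```

After this call the parser knows both the byte order and how many bytes to skip before the document proper begins.
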
If you're using UTF-8, the byte streams "00 20" and "20 00" would both be
illegal: in UTF-8, the null byte is the Unicode code point #x0, which is
illegal in XML.
Support for any other encoding is entirely optional. A parser may choose
to support other encodings, or not, and deal with them appropriately. But
whatever encodings you support, the same issue comes up: if you can
recognise '<?xml' before seeing the encoding, why can't you recognise
whitespace?
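
To make the question concrete, here is a sketch of the idea, not a description of what any conforming parser does. It assumes only UTF-8 and the two UTF-16 byte orders as candidates and assumes any BOM has already been stripped; the name find_declaration is invented for the example.

```python
CANDIDATES = ("utf-8", "utf-16-le", "utf-16-be")
XML_WHITESPACE = " \t\r\n"


def find_declaration(data):
    """Return (encoding, offset) of '<?xml' after optional whitespace, else None."""
    for enc in CANDIDATES:
        target = "<?xml".encode(enc)
        spaces = [ch.encode(enc) for ch in XML_WHITESPACE]
        width = len(spaces[0])  # 1 byte for UTF-8, 2 bytes for UTF-16
        pos = 0
        # Skip whitespace using the same byte patterns that make
        # '<?xml' recognisable in this candidate encoding.
        while data[pos:pos + width] in spaces:
            pos += width
        if data.startswith(target, pos):
            return enc, pos
    return None
```

For example, find_declaration(b"  \n<?xml version='1.0'?>") reports UTF-8 at offset 3, and the same text encoded as big-endian UTF-16 is matched just as readily under the third candidate.
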
> What about "00 20 00 00" and "00 00 00 20"? Are you sure
> that means 0x20 encoded in 4 bytes, or is it actually the unicode
> character 0x2000? What complexity do you want to put into the parser
> here?
I'm not putting any complexity into the parser that the XML standard
doesn't already demand. Perhaps you should read it yourself:
http://www.w3.org/TR/xml/
In particular, note that a parser must be prepared to accept leading
whitespace at the start of a document, and only reject it if it comes
across an XML declaration.
> "In the face of ambiguity, refuse the temptation to guess"
What ambiguity, and what guess?
My earlier question wasn't rhetorical. I asked "According to the XML
standard, what else could they [whitespace] be?". Just implying that they
are ambiguous doesn't actually make them ambiguous.
I don't believe there is an ambiguity at all. That's what makes the
prohibition on leading whitespace before the '<?xml' declaration all the more
puzzling: there doesn't seem to be any good reason for it.
If I am wrong, then will somebody please put me out of my misery and tell
me what leading whitespace could be mistaken for, in what circumstances?
--
Steven