REALLY simple xml reader
Steven D'Aprano
steve at REMOVE-THIS-cybersource.com.au
Fri Feb 1 20:44:50 EST 2008
On Thu, 31 Jan 2008 18:35:17 +0100, Stefan Behnel wrote:
> Hi,
>
> Steven D'Aprano wrote:
>> On Fri, 01 Feb 2008 00:40:01 +1100, Ben Finney wrote:
>>
>>> Quite apart from a human thinking it's pretty or not pretty, it's *not
>>> valid XML* if the XML declaration isn't immediately at the start of
>>> the document <URL:http://www.w3.org/TR/xml/#sec-prolog-dtd>. Many XML
>>> parsers will (correctly) reject such a document.
>>
>> You know, I'd really like to know what the designers were thinking when
>> they made this decision.
> [had a good laugh here]
>> This is legal XML:
>>
>> """<?xml version="1.0"?>
>> <greeting>Hello, world!</greeting>"""
>>
>> and so is this:
>>
>> """
>> <greeting >Hello, world!</greeting >"""
>>
>>
>> but not this:
>>
>> """ <?xml version="1.0"?>
>> <greeting>Hello, world!</greeting>"""
>
> It's actually not that stupid. When you leave out the declaration, then
> the XML is UTF-8 encoded (by spec), so normal ASCII whitespace doesn't
> matter. It's just like the declaration had come *before* the whitespace,
> at the very beginning of the byte stream.
>
> But if you add a declaration, then the encoding can change for the whole
> document (including the declaration!), so you have to give the parser a
> chance to actually parse the declaration. How is it supposed to know
> that the whitespace before the declaration *is* whitespace before it
> knows the encoding?

The same way it knows that "<?xml" is "<?xml" before it sees the
encoding. If the parser knows that the hex bytes

3c 3f 78 6d 6c

(or 3c 00 3f 00 78 00 6d 00 6c 00 if you prefer UTF-16, and feel free to
swap the byte order)

mean "<?xml"

then it can equally know that bytes

20 09 0a

are whitespace. According to the XML standard, what else could they be?
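
To make the point concrete, here is a rough Python sketch of the idea. It is
illustrative only: find_declaration() and the CANDIDATES tuple are names made
up for this example, not part of any XML library, and the sketch ignores byte
order marks and every encoding family other than the three listed. It only
shows that leading whitespace is as recognisable at the byte level as "<?xml"
itself.

```python
# Hypothetical helper for illustration; not part of any XML library.
WHITESPACE = " \t\n\r"   # the whitespace characters XML 1.0 recognises
MARKER = "<?xml"

# Assumption: only these three encoding families are probed.
CANDIDATES = ("utf-8", "utf-16-le", "utf-16-be")


def find_declaration(data):
    """Return (encoding_family, byte_offset) of "<?xml" found after
    optional leading whitespace, or None if no candidate matches."""
    for enc in CANDIDATES:
        marker = MARKER.encode(enc)
        spaces = [ch.encode(enc) for ch in WHITESPACE]
        width = len(spaces[0])          # 1 byte for UTF-8, 2 for UTF-16
        pos = 0
        # Skip whitespace exactly as this encoding family would spell it.
        while data[pos:pos + width] in spaces:
            pos += width
        if data.startswith(marker, pos):
            return enc, pos
    return None


doc = ' \t\n<?xml version="1.0"?><greeting>Hello, world!</greeting>'
print(find_declaration(doc.encode("utf-8")))
print(find_declaration(doc.encode("utf-16-le")))
print(find_declaration(doc.encode("utf-16-be")))
```

None of this changes what the spec says: a conforming parser still has to
reject whitespace before the declaration. The sketch only shows that the
bytes themselves would not be ambiguous for these encodings.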
--
Steven