[Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

Thu Mar 21 13:23:55 EDT 2019

On Thu, 21 Mar 2019 at 17:05, Steve Holden <steve at holdenweb.com> wrote:
>
> On Thu, Mar 21, 2019 at 11:33 AM Antoine Pitrou <solipsis at pitrou.net> wrote:
>>
>> [...]
>>
>> Most users and applications should /never/ care about the order of XML
>> attributes.
>>
>> Regards
>>
>> Antoine
>
>
> Especially as the standards specifically say that ordering has no semantic impact.
>
> Byte-by-byte comparison of XML is almost always inappropriate.

Conversely, if ordering has no semantic impact, there's no real
justification for asking for the current order to be changed.

In practice, allowing the user to control the ordering (by preserving
input order) gives users a way of handling consumers that are broken
(according to the standard) because they ascribe semantic meaning to
the attribute order.
So there's a small benefit for real-world users having to deal with
non-compliant software. But that benefit is by definition small, as
standards-compliant software won't be affected.
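
To make the "no semantic impact" point concrete, here is a minimal
sketch using xml.etree.ElementTree. The two sample documents are made
up for illustration. Once parsed, the attributes end up in a plain
dict, and dict equality ignores ordering, so the two elements compare
equal even though the raw bytes differ.

```python
import xml.etree.ElementTree as ET

# Two illustrative documents that differ only in attribute order.
doc_a = b'<item id="1" name="x"/>'
doc_b = b'<item name="x" id="1"/>'

# A byte-for-byte comparison treats them as different documents.
bytes_equal = (doc_a == doc_b)

# After parsing, attributes live in a dict, and dict equality
# does not depend on insertion order.
elem_a = ET.fromstring(doc_a)
elem_b = ET.fromstring(doc_b)
attrs_equal = (elem_a.attrib == elem_b.attrib)

print("bytes equal:", bytes_equal)
print("attributes equal:", attrs_equal)
```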

The cost of making the change to projects that rely on the current
output is significant, and that should be considered. But there's also
the question of setting a precedent. If we do reject this change
because of the cost to 3rd parties, are we then committing Python to
guaranteeing sorted attribute order (and worse, byte-for-byte
reproducible output) for ever - a far stronger commitment than the
standards require of us? That seems to me to be an extremely bad
precedent to set.

There's no good answer here - maybe a possible compromise would be for
us to document explicitly in 3.8 that output is only guaranteed
identical to the level the standards require (i.e., no particular
attribute order is guaranteed) and then make this change in 3.9. But
in practice, that's not really any better for projects like coverage -
it just delays the point when they have to bite the bullet (and it's
not like 3.8 is imminent - there's plenty of time between now and 3.8
without adding an additional delay).

Reluctantly, I'd have to say that I don't think we should reject
this change simply because existing users rely on the exact output
currently being produced.

To mitigate the impact on 3rd parties, it would be very helpful if we
could add to the stdlib some form of "compare two XML documents for
semantic equality up to the level that the standards require". 3rd
party code could then use that if it's present, and fall back to
byte-equality if it's not. If we could get something like that for
3.9, but not for 3.8, then that would seem to me to be a good reason
to defer this change until 3.9 (because we don't want to have 3.8
being an exception where there's no semantic comparison function, but
the byte-equality fallback doesn't work - that's just needlessly
annoying).
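
As a rough idea of what such a comparison might look like, here is a
sketch built on xml.etree.ElementTree. The names xml_equal and
_elements_equal are hypothetical, not an existing or proposed stdlib
API. It assumes that child order and text (including whitespace) are
significant and that attribute order is not. Comments and processing
instructions are dropped by the default parser, so they are ignored.

```python
import xml.etree.ElementTree as ET


def _elements_equal(x, y):
    """Recursively compare two elements, ignoring attribute order."""
    # Tags must match; attrib is a dict, so its equality is order-independent.
    if x.tag != y.tag or x.attrib != y.attrib:
        return False
    # Treat missing text/tail (None) the same as an empty string.
    if (x.text or "") != (y.text or ""):
        return False
    if (x.tail or "") != (y.tail or ""):
        return False
    # Child elements are compared pairwise, so their order matters.
    if len(x) != len(y):
        return False
    return all(_elements_equal(cx, cy) for cx, cy in zip(x, y))


def xml_equal(doc_a, doc_b):
    """Hypothetical helper: parse both documents and compare the trees."""
    return _elements_equal(ET.fromstring(doc_a), ET.fromstring(doc_b))


# Same content, attributes written in a different order.
same = xml_equal('<r a="1" b="2"><c>t</c></r>',
                 '<r b="2" a="1"><c>t</c></r>')
# A differing attribute value is a real difference.
different = xml_equal('<r a="1"/>', '<r a="2"/>')
print(same, different)
```

A test suite could call something like this in place of comparing the
serialized bytes directly. Whether whitespace-only text should count
as a difference is a design choice that this sketch does not try to
settle.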

Paul