[Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

Nathaniel Smith njs at pobox.com
Mon Mar 18 19:15:30 EDT 2019


I noticed that your list doesn't include "add a DOM equality operator".
That seems potentially simpler to implement than canonical XML
serialization, and like a useful thing to have in any case. Would it make
sense as an option?

On Mon, Mar 18, 2019, 15:46 Raymond Hettinger <raymond.hettinger at gmail.com>
wrote:

> We're having a super interesting discussion on
> https://bugs.python.org/issue34160 .  It is now marked as a release
> blocker and warrants a broader discussion.
>
> Our problem is that at least two distinct and important users have written
> tests that depend on exact byte-by-byte comparisons of the final
> serialization.  So any changes to the XML modules will break those tests
> (not the applications themselves, just the test cases that assume the
> output will be forever, byte-by-byte identical).
>
> In theory, the tests are incorrectly designed and should not treat the
> module output as a canonical normal form.  In practice, doing an equality
> test on the output is the simplest, most obvious approach, and likely is
> being done in other packages we don't know about yet.
>
> With pickle, json, and __repr__, the usual way to write a test is to
> verify a roundtrip:  assert pickle.loads(pickle.dumps(data)) == data.  With
> XML, the problem is that the DOM doesn't have an equality operator.  The
> user is left with either testing specific fragments with
> element.find(xpath) or with using a standards compliant canonicalization
> package (not available from us). Neither option is pleasant.
>
> The code in the current 3.8 alpha differs from 3.7 in that it removes
> attribute sorting and instead preserves the order the user specified when
> creating an element.  As far as I can tell, there is no objection to this
> as a feature.  The problem is what to do about the existing tests in
> third-party code, what guarantees we want to make going forward, and what
> do we recommend as a best practice for testing XML generation.
>
> Things we can do:
>
> 1) Revert back to the 3.7 behavior. This of course, makes all the test
> pass :-)  The downside is that it perpetuates the practice of bytewise
> equality tests and locks in all implementation quirks forever.  I don't
> know of anyone advocating this option, but it is the simplest thing to do.
>
> 2). Go into every XML module and add attribute sorting options to each
> function that generate xml.  This gives users a way to make their tests
> pass for now. There are several downsides. a) It grows the API in a way
> that is inconsistent with all the other XML packages I've seen. b) We'll
> have to test, maintain, and document the API forever -- the API is already
> large and time consuming to teach. c) It perpetuates the notion that
> bytewise equality tests are the right thing to do, so we'll have this
> problem again if substitute in another code generator or alter any of the
> other implementation quirks (i.e. how CDATA sections are serialized).
>
> 3) Add a standards compliant canonicalization tool (see
> https://en.wikipedia.org/wiki/Canonical_XML ).  This is likely to be the
> right-way-to-do-it but takes time and energy.
>
> 4) Fix the tests in the third-party modules to be more focused on their
> actual test objectives, the semantics of the generated XML rather than the
> exact serialization.  This option would seem like the right-thing-to-do but
> it isn't trivial because the entire premise of the existing test is
> invalid.  For every case, we'll actually have to think through what the
> test objective really is.
>
> Of these, option 2 is my least preferred.  Ideally, we don't guarantee
> bytewise identical output across releases, and ideally we don't grow a new
> API that perpetuates the issue. That said, I'm not wedded to any of these
> options and just want us to do what is best for the users in the long run.
>
> Regardless of option chosen, we should make explicit whether on not the
> Python standard library modules guarantee cross-release bytewise identical
> output for XML. That is really the core issue here.  Had we had an explicit
> notice one way or the other, there wouldn't be an issue now.
>
> Any thoughts?
>
>
>
> Raymond Hettinger
>
>
> P.S.   Stefan Behnel is planning to remove attribute sorting from lxml.
> On the bug tracker, he has clearly articulated his reasons.
>
>
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
> https://mail.python.org/mailman/options/python-dev/njs%40pobox.com
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20190318/2736d92b/attachment-0001.html>


More information about the Python-Dev mailing list