[Python-Dev] please back out changeset f903cf864191 before alpha-2

Sat Aug 24 00:57:48 CEST 2013

Hi,

ticket 17741 has introduced a new feature in the xml.etree.ElementTree
module that was added without any major review.

http://bugs.python.org/issue17741

http://hg.python.org/cpython/rev/f903cf864191

I only recently learned about this addition and after taking a couple of
closer looks, I found that it represents a rather seriously degradation of
the ET module API. Since I am not a core developer, it is rather easy for
the original committer to close the ticket and to sit and wait for the next
releases to come in order to get the implementation out. So I'm sorry for
having to ask publicly on this list for the code to be removed, before it
does any harm in the wild.

Let me explain why the addition is such a disaster. I'm personally involved
in this because I will eventually have to implement features that occur in
ElementTree also in the external lxml.etree package. And this is not an API
that I can implement with a clear conscience.

There are two parts to parsing in the ET API. The first is the parser, and
the second is the target object (similar to a SAX handler). The parser has
an incremental parsing API consisting of the functions .feed() and
.close(). When it receives data through the .feed() method, it parses it
and passes events on to the target. The target is commonly a TreeBuilder
that builds an in-memory tree, but is not limited to that. Calling the
.close() method tells the parser that the parsing is done and that it
should finish it up.

The class that was now added is called "IncrementalParser". It has two
methods for passing data in: "data_received()" and "eof_received()". So the
first thing to note is that this addition is only a copy of the existing
API and functionality, but under different names. It is hard to understand
for me how anyone could consider this a consistent design.

Then, the purpose of this addition was to provide a way to collect parse
events. That is the obvious role of the target object. In the new
implementation, the target object is being instantiated, but not actually
meant to collect the events. Instead, it's the parser collecting the
events, based on what the target object returns (which it doesn't currently
have to do at all). This is totally backwards. Instead, it should be up to
the target object to decide which events to collect, how to process them
and how to present them to the user. This is clearly how the API was
originally designed.

Also, the IncrementalParser doesn't directly collect the events itself but
gets them through a sort of backdoor in the underlying parser. That parser
object is actually being passed into the IncrementalParser as a parameter,
which means that user provided parser objects will also have to implement
that backdoor now, even though they may not actually be able to provide
that functionality.

My proposal for fixing these obvious design problems is to let each part of
the parsing chain do what it's there for. Use the existing XMLParser (or an
HTMLParser, as in lxml.etree) to feed in data incrementally, and let the
target object process and collect the events. So, instead of replacing the
parser interface with a new one, there should be a dedicated target object
(or maybe just a wrapper for a TreeBuilder) that collects the parse events
in this specific way.

Since the time is running in favour of the already added implementation,
I'm asking to back it out for the time being. I specifically do not want to
rush in a replacement. Once there is an implementation that matches the
established API, I'm all in favour of adding it, because the feature itself
is a good idea. But keeping a wrong design in, just because "it's there",
even before anyone has actually started using it, is just asking for future
deprecation hassle. It's not too late for removal now, but it will be in a
couple of weeks if it is not done now.

Stefan