please back out changeset f903cf864191 before alpha-2
Hi, ticket 17741 has introduced a new feature in the xml.etree.ElementTree module that was added without any major review. http://bugs.python.org/issue17741 http://hg.python.org/cpython/rev/f903cf864191 I only recently learned about this addition and after taking a couple of closer looks, I found that it represents a rather seriously degradation of the ET module API. Since I am not a core developer, it is rather easy for the original committer to close the ticket and to sit and wait for the next releases to come in order to get the implementation out. So I'm sorry for having to ask publicly on this list for the code to be removed, before it does any harm in the wild. Let me explain why the addition is such a disaster. I'm personally involved in this because I will eventually have to implement features that occur in ElementTree also in the external lxml.etree package. And this is not an API that I can implement with a clear conscience. There are two parts to parsing in the ET API. The first is the parser, and the second is the target object (similar to a SAX handler). The parser has an incremental parsing API consisting of the functions .feed() and .close(). When it receives data through the .feed() method, it parses it and passes events on to the target. The target is commonly a TreeBuilder that builds an in-memory tree, but is not limited to that. Calling the .close() method tells the parser that the parsing is done and that it should finish it up. The class that was now added is called "IncrementalParser". It has two methods for passing data in: "data_received()" and "eof_received()". So the first thing to note is that this addition is only a copy of the existing API and functionality, but under different names. It is hard to understand for me how anyone could consider this a consistent design. Then, the purpose of this addition was to provide a way to collect parse events. That is the obvious role of the target object. In the new implementation, the target object is being instantiated, but not actually meant to collect the events. Instead, it's the parser collecting the events, based on what the target object returns (which it doesn't currently have to do at all). This is totally backwards. Instead, it should be up to the target object to decide which events to collect, how to process them and how to present them to the user. This is clearly how the API was originally designed. Also, the IncrementalParser doesn't directly collect the events itself but gets them through a sort of backdoor in the underlying parser. That parser object is actually being passed into the IncrementalParser as a parameter, which means that user provided parser objects will also have to implement that backdoor now, even though they may not actually be able to provide that functionality. My proposal for fixing these obvious design problems is to let each part of the parsing chain do what it's there for. Use the existing XMLParser (or an HTMLParser, as in lxml.etree) to feed in data incrementally, and let the target object process and collect the events. So, instead of replacing the parser interface with a new one, there should be a dedicated target object (or maybe just a wrapper for a TreeBuilder) that collects the parse events in this specific way. Since the time is running in favour of the already added implementation, I'm asking to back it out for the time being. I specifically do not want to rush in a replacement. Once there is an implementation that matches the established API, I'm all in favour of adding it, because the feature itself is a good idea. But keeping a wrong design in, just because "it's there", even before anyone has actually started using it, is just asking for future deprecation hassle. It's not too late for removal now, but it will be in a couple of weeks if it is not done now. Stefan
On Sat, 24 Aug 2013 00:57:48 +0200
Stefan Behnel
Hi,
ticket 17741 has introduced a new feature in the xml.etree.ElementTree module that was added without any major review.
As I've already indicated on the tracker, I'm completely saturated with Stefan's qualms about a minor API addition and I'm not willing to process anymore of them. Hence I won't respond to the bulk of his e-mail. But I still want to clarify that claiming that the feature was "added without any major review" is outrageous and manipulative. Perhaps Stefan thinks that an ElementTree code review not by him (but, for example, by Eli, who currently maintains ElementTree) is not "major". Well, good for him. His best chance to influence the development process, though, is to contribute more, not harass active developers. Regards Antoine.
Antoine Pitrou, 24.08.2013 01:26:
On Sat, 24 Aug 2013 00:57:48 +0200 Stefan Behnel wrote:
ticket 17741 has introduced a new feature in the xml.etree.ElementTree module that was added without any major review.
As I've already indicated on the tracker, I'm completely saturated with Stefan's qualms about a minor API addition and I'm not willing to process anymore of them. Hence I won't respond to the bulk of his e-mail.
But I still want to clarify that claiming that the feature was "added without any major review" is outrageous and manipulative. Perhaps Stefan thinks that an ElementTree code review not by him (but, for example, by Eli, who currently maintains ElementTree) is not "major".
The reason why I'm saying this is that the way the change came in is rather - unorthodox. As Antoine noted in the ticket, he proposed the change on the tulip mailing list. That is a completely wrong place to discuss a new XML API, as can be seen from the replies. http://thread.gmane.org/gmane.comp.python.tulip/171 Specifically, no-one noticed the major overlap with the existing API and functionality at that point, nor the contradictions between the existing API and the new one. In the ticket, Eli stated that he didn't have time for the review ATM, and then, two days later, commented that the patch looks good. To me (and I'm really only interpreting here), this indicates that the review was mostly at the patch level. Note that he didn't comment on the API overlap either, so my guess is that he just didn't notice it. In my experience, reviewing design and thinking about alternatives takes more time than that, especially when you're "swamped", as he put it. I mean, it took *me* almost a day to dig into the implications and into the patch (as can be seen by my incremental comments), and I have the background of having written a complete implementation of that library myself. So, to put it more nicely, I think this feature was added without the amount of review that it needs, and now that I've given it that review, I'm asking for removal of the feature and a proper redesign that fits into the existing library. Stefan
On 24 August 2013 15:32, Stefan Behnel
So, to put it more nicely, I think this feature was added without the amount of review that it needs, and now that I've given it that review, I'm asking for removal of the feature and a proper redesign that fits into the existing library.
FWIW, it seems to me that this is something that could live in *tulip* as an adapter between the tulip data processing APIs and the existing ElementTree incremental parsing APIs, without needing to be added directly to xml.etree at all. It certainly seems premature to be adding tulip-inspired APIs to other parts of the standard library, when tulip itself hasn't been deemed ready for inclusion. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Sat, 24 Aug 2013 15:57:50 +1000
Nick Coghlan
On 24 August 2013 15:32, Stefan Behnel
wrote: So, to put it more nicely, I think this feature was added without the amount of review that it needs, and now that I've given it that review, I'm asking for removal of the feature and a proper redesign that fits into the existing library.
FWIW, it seems to me that this is something that could live in *tulip* as an adapter between the tulip data processing APIs and the existing ElementTree incremental parsing APIs, without needing to be added directly to xml.etree at all.
This is not an adapter, it's a new feature that ElementTree wasn't providing before. The need to process data in a non-blocking way (not merely incremental) didn't appear with Tulip, and the fact that the API is *inspired* by Tulip doesn't make it Tulip-specific. (I'm also curious why Tulip would start providing data-processing APIs: until now, I thought it's supposed to be a networking library :-)) Furthermore, such a feature has to access implementation details of ElementTree, so it's only natural that it be provided in ElementTree. By the way, just know that Stefan tried to provide a patch that would better suit his API desires, and failed because ElementTree's current implementation makes it difficult to do so. Someone can take the whole thing over if they want to, change the API and make it more shiny or different, tweak the implementation to suit it better to their own aesthetic sensibilities, but please don't revert an useful feature unless it's based on concrete, serious issues rather than a platonic disagreement about design. Regards Antoine.
Antoine Pitrou, 24.08.2013 12:58:
By the way, just know that Stefan tried to provide a patch that would better suit his API desires, and failed because ElementTree's current implementation makes it difficult to do so.
Absolutely. I agree that your current implementation is a hack that works around these issues. That doesn't mean that they go away, though. And yes, I even provided a half-finished implementation, even though I'm certainly not interested enough in this feature to make much use of it. Why don't you just take a look at my patch and finish it up? Given that you are apparently the most ambitious supporter of this feature, I would expect you to provide an appropriate implementation and have it reviewed. And with "reviewed" I also mean "accept criticism". Just because you managed to sneak in a hack doesn't mean it has to stay there once it's uncovered.
Someone can take the whole thing over if they want to, change the API and make it more shiny or different, tweak the implementation to suit it better to their own aesthetic sensibilities, but please don't revert an useful feature unless it's based on concrete, serious issues rather than a platonic disagreement about design.
As I said, the only reason why the current implementation is there is "because it's there". The problems of the current iterparse implementation should not be taken as a reason for a design decision of a new feature. Instead, they should be fixed and the feature should be based on these fixes. Yes, that's more work than adding a hack. But I'm sure that cleaning up first will pay off quite quickly. I already gave lots of reasons for that. Stefan
On Sat, 24 Aug 2013 14:46:32 +0200
Stefan Behnel
As I said, the only reason why the current implementation is there is "because it's there".
No. It works, it's functional, it fills an use case, and it doesn't seem to have any concrete issues. Get over it, Stefan, and stop trolling us.
On Sat, 24 Aug 2013 14:53:13 +0200, Antoine Pitrou
On Sat, 24 Aug 2013 14:46:32 +0200 Stefan Behnel
wrote: As I said, the only reason why the current implementation is there is "because it's there".
No. It works, it's functional, it fills an use case, and it doesn't seem to have any concrete issues.
Get over it, Stefan, and stop trolling us.
Stefan is not trolling. He's raising objections that you disagree with. It costs nothing to keep the discussion civil. --David
On 24 August 2013 20:58, Antoine Pitrou
Someone can take the whole thing over if they want to, change the API and make it more shiny or different, tweak the implementation to suit it better to their own aesthetic sensibilities, but please don't revert an useful feature unless it's based on concrete, serious issues rather than a platonic disagreement about design.
While "It's a useful feature" is a necessary criterion for adding something to the standard library, it has never been a *sufficient* criterion. There's a lot more to take into account when judging the latter, and one of the big ones if "There should be one obvious way to do it". Looking at the current documentation of ElementTree sets of alarm bells on that front, as it contains the following method descriptions for XMLParser: close() Finishes feeding data to the parser. Returns an element structure. feed(data) Feeds data to the parser. data is encoded data. And these for IncrementalParser: data_received(data) Feed the given bytes data to the incremental parser. eof_received() Signal the incremental parser that the data stream is terminated. events() Iterate over the events which have been encountered in the data fed to the parser. This method yields (event, elem) pairs, where event is a string representing the type of event (e.g. "end") and elem is the encountered Element object. Events provided in a previous call to events() will not be yielded again. It is thoroughly unclear to me as a user of the module how and why one would use the new IncrementalParser API over the existing incremental XMLParser API. If there is some defect in the XMLParser API that prevents it from interoperating correctly with asynchronous code, then *that is a bug to be fixed*, rather than avoided by adding a whole new parallel API. If Stefan's "please revert this" as lxml.etree maintainer isn't enough, then I'm happy to add a "please revert this" as a core committer that is confused about how and when the new tulip-inspired incremental parsing API should be used in preference to the existing incremental parsing API, and believes this needs to be clearly resolved before adding a second way to do it (especially if there's a possibility of using a different implementation strategy that avoids adding the second way). Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Sun, 25 Aug 2013 00:03:01 +1000
Nick Coghlan
If Stefan's "please revert this" as lxml.etree maintainer isn't enough, then I'm happy to add a "please revert this" as a core committer that is confused about how and when the new tulip-inspired incremental parsing API should be used in preference to the existing incremental parsing API, and believes this needs to be clearly resolved before adding a second way to do it (especially if there's a possibility of using a different implementation strategy that avoids adding the second way).
To be clear, again: anyone who wants to "see it resolved" can take over the issue and handle it by themselves. I'm done with it. Regards Antoine.
On 25 August 2013 00:13, Antoine Pitrou
On Sun, 25 Aug 2013 00:03:01 +1000 Nick Coghlan
wrote: If Stefan's "please revert this" as lxml.etree maintainer isn't enough, then I'm happy to add a "please revert this" as a core committer that is confused about how and when the new tulip-inspired incremental parsing API should be used in preference to the existing incremental parsing API, and believes this needs to be clearly resolved before adding a second way to do it (especially if there's a possibility of using a different implementation strategy that avoids adding the second way).
To be clear, again: anyone who wants to "see it resolved" can take over the issue and handle it by themselves. I'm done with it.
OK, I'll revert it for now, then. If someone else steps up to resolve the API duplication problem, cool, otherwise we can continue to live without this as a standard library feature. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 25 August 2013 00:26, Nick Coghlan
On 25 August 2013 00:13, Antoine Pitrou
wrote: On Sun, 25 Aug 2013 00:03:01 +1000 Nick Coghlan
wrote: If Stefan's "please revert this" as lxml.etree maintainer isn't enough, then I'm happy to add a "please revert this" as a core committer that is confused about how and when the new tulip-inspired incremental parsing API should be used in preference to the existing incremental parsing API, and believes this needs to be clearly resolved before adding a second way to do it (especially if there's a possibility of using a different implementation strategy that avoids adding the second way).
To be clear, again: anyone who wants to "see it resolved" can take over the issue and handle it by themselves. I'm done with it.
OK, I'll revert it for now, then. If someone else steps up to resolve the API duplication problem, cool, otherwise we can continue to live without this as a standard library feature.
On the other hand... because other changes have been made to the module since the original commit, a simple "hg backout" is no longer possible :( Stefan - if you'd like this reverted, you're going to have to either make the alternative solution work correctly, or else craft the commit to undo the API addition. However, I have at least reopened http://bugs.python.org/issue17741 Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Sat, Aug 24, 2013 at 7:33 AM, Nick Coghlan
On 25 August 2013 00:26, Nick Coghlan
wrote: On 25 August 2013 00:13, Antoine Pitrou
wrote: On Sun, 25 Aug 2013 00:03:01 +1000 Nick Coghlan
wrote: If Stefan's "please revert this" as lxml.etree maintainer isn't enough, then I'm happy to add a "please revert this" as a core committer that is confused about how and when the new tulip-inspired incremental parsing API should be used in preference to the existing incremental parsing API, and believes this needs to be clearly resolved before adding a second way to do it (especially if there's a possibility of using a different implementation strategy that avoids adding the second way).
To be clear, again: anyone who wants to "see it resolved" can take over the issue and handle it by themselves. I'm done with it.
OK, I'll revert it for now, then. If someone else steps up to resolve the API duplication problem, cool, otherwise we can continue to live without this as a standard library feature.
On the other hand... because other changes have been made to the module since the original commit, a simple "hg backout" is no longer possible :(
Stefan - if you'd like this reverted, you're going to have to either make the alternative solution work correctly, or else craft the commit to undo the API addition.
However, I have at least reopened http://bugs.python.org/issue17741
Let's please keep the discussion calm and civil, everyone, and keep things in proportion. This is precisely what alpha releases are for - we have time before beta (Nov 24) to tweak the API. It's a fairly minor feature that *does* appear useful. I agree it would be nice to find an API that's acceptable for more developers. I'll try to find time to review this again, and others are free to do so too. Eli
On Sat, Aug 24, 2013 at 7:33 AM, Nick Coghlan
On 25 August 2013 00:26, Nick Coghlan
wrote: On 25 August 2013 00:13, Antoine Pitrou
wrote: On Sun, 25 Aug 2013 00:03:01 +1000 Nick Coghlan
wrote: If Stefan's "please revert this" as lxml.etree maintainer isn't enough, then I'm happy to add a "please revert this" as a core committer that is confused about how and when the new tulip-inspired incremental parsing API should be used in preference to the existing incremental parsing API, and believes this needs to be clearly resolved before adding a second way to do it (especially if there's a possibility of using a different implementation strategy that avoids adding the second way).
To be clear, again: anyone who wants to "see it resolved" can take over the issue and handle it by themselves. I'm done with it.
OK, I'll revert it for now, then. If someone else steps up to resolve the API duplication problem, cool, otherwise we can continue to live without this as a standard library feature.
On the other hand... because other changes have been made to the module since the original commit, a simple "hg backout" is no longer possible :(
Stefan - if you'd like this reverted, you're going to have to either make the alternative solution work correctly, or else craft the commit to undo the API addition.
I'm strongly opposed to reverting because it cleaned up messy code duplication and actually make the code size smaller. While I agree that the API of incremental parsing should be given another look, IncrementalParser can also be seen as an implementation detail of iterparse(). Thus, it's probably OK to revert the documentation part of the commit to not mention IncrementalParser at all, making it an undocumented internal implementation detail (one of many in this module and elsewhere). However, since we're still in alpha I don't see much point to doing this change now. Let's keep discussing this in the issue. Anyone interested - please make yourself nosy and any feedback on the API is welcome. Eli
Eli Bendersky writes:
I'm strongly opposed to reverting [the change to ElementTree] because it cleaned up messy code duplication and actually make the code size smaller. While I agree that the API of incremental parsing should be given another look, IncrementalParser can also be seen as an implementation detail of iterparse().
Except that its API is familiar and cleaner. Does any current application depend on *not* doing whatever it is that the new API does that IncrementalParser *does* do? If not, why not keep the API of IncrementalParser and shim the new code in under that?
Thus, it's probably OK to revert the documentation part of the commit to not mention IncrementalParser at all,
FWIW, as somebody who can recall using ET exactly one, IncrementalParser is what I used.
On Sat, Aug 24, 2013 at 5:55 PM, Stephen J. Turnbull
Eli Bendersky writes:
I'm strongly opposed to reverting [the change to ElementTree] because it cleaned up messy code duplication and actually make the code size smaller. While I agree that the API of incremental parsing should be given another look, IncrementalParser can also be seen as an implementation detail of iterparse().
Except that its API is familiar and cleaner. Does any current application depend on *not* doing whatever it is that the new API does that IncrementalParser *does* do? If not, why not keep the API of IncrementalParser and shim the new code in under that?
I'm having a difficulty parsing the above. Could you please re-phrase your suggestion?
Thus, it's probably OK to revert the documentation part of the commit to not mention IncrementalParser at all,
FWIW, as somebody who can recall using ET exactly one, IncrementalParser is what I used.
Just to be on the safe side, I want to make sure that you indeed mean IncrementalParser, which was committed 4 months ago into the Mercurial default branch (3.4) and has only seen an alpha release? Eli
Eli Bendersky writes:
On Sat, Aug 24, 2013 at 5:55 PM, Stephen J. Turnbull
wrote:
FWIW, as somebody who can recall using ET exactly once, IncrementalParser is what I used.
Just to be on the safe side, I want to make sure that you indeed mean IncrementalParser, which was committed 4 months ago into the Mercurial default branch (3.4) and has only seen an alpha release? Eli
Oops, and thank you for your courtesy. No, actually looking at the code this time, I meant xml.sax.xmlreader.IncrementalParser, which has the same API as the new etree.ElementTree.IncrementalParser. No wonder it seems familiar. As for the suggestion, AIUI, you proposed keeping the current layering of iterparse on top of IncrementalParser, and then removing Incrementalparser from the documentation. My suggestion is to rename the current "IncrementalParser" class, and then use the IncrementalParser interface for what is currently named "iterparse". Assuming that, as Stefan claims, data_received == feed, and so on. Steve
On 8/24/2013 10:03 AM, Nick Coghlan wrote: I have not used ET or equivalent, but I do have opinions on function names.
Looking at the current documentation of ElementTree sets of alarm bells on that front, as it contains the following method descriptions for XMLParser:
close() Finishes feeding data to the parser. Returns an element structure.
feed(data) Feeds data to the parser. data is encoded data.
These are short active verbs, reused from other Python contexts.
And these for IncrementalParser:
data_received(data) Feed the given bytes data to the incremental parser.
Longer, awkward, and to me ugly in comparison to 'feed'. Since it seems to mean more or less the same thing, why not reuse 'feed' and continue to build on people prior knowledge of Python? Is this the 'tulip inspired' part? If so, I hope the names are not set in stone yet.
eof_received() Signal the incremental parser that the data stream is terminated.
What is the incremental parser supposed to do with the information? Close ;-?
events() Iterate over the events which have been encountered in the data fed to the parser. This method yields (event, elem) pairs, where event is a string representing the type of event (e.g. "end") and elem is the encountered Element object. Events provided in a previous call to events() will not be yielded again.
Plural nouns work well as iterator names: 'for event in events:'. -- Terry Jan Reedy
Le Sat, 24 Aug 2013 14:42:24 -0400,
Terry Reedy
And these for IncrementalParser:
data_received(data) Feed the given bytes data to the incremental parser.
Longer, awkward, and to me ugly in comparison to 'feed'. Since it seems to mean more or less the same thing, why not reuse 'feed' and continue to build on people prior knowledge of Python?
Just because *your* prior knowledge of Python doesn't include event-driven processing using network libraries, doesn't mean it's a completely new and unknown thing to other people. There are reasons why "data_received" is better (less ambiguous) than "feed". If you want to influence tulip's design, however, this is the wrong mailing-list to do so.
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 08/26/2013 04:36 AM, Antoine Pitrou wrote:
event-driven processing using network librarie
Maybe I missed something: why should considerations from that topic influence the design of an API for XML processing? 'feed' and 'close' make much more sense for a parser API, as well has having the benefit of long usage. Tres. - -- =================================================================== Tres Seaver +1 540-429-0999 tseaver@palladion.com Palladion Software "Excellence by Design" http://palladion.com -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with undefined - http://www.enigmail.net/ iEYEARECAAYFAlIbSRoACgkQ+gerLs4ltQ5lhwCgnG7TLgSkVf+gXSOxO1KP2kLC eLwAn1QbqbHUqJ7bKV6us/nDQ79AYUgk =aN8S -----END PGP SIGNATURE-----
Le Mon, 26 Aug 2013 08:24:58 -0400,
Tres Seaver
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
On 08/26/2013 04:36 AM, Antoine Pitrou wrote:
event-driven processing using network librarie
Maybe I missed something: why should considerations from that topic influence the design of an API for XML processing?
Because this API is mostly useful when the data is received (*) at a slow enough speed - which usually means from the network, not from a hard drive. ((*) "data" ... "received"; does it ring a bell? ;-)) If you want iterative processing from a fast data source, you can already use iterparse(): it's blocking, but it's not a problem with disk I/O (not to mention that non-blocking disk I/O doesn't really exist under Linux, AFAIK: I haven't been able to get EAGAIN with os.read() on a non-blocking file, even when reading from a huge uncached file). The whole *point* of adding IncrementalParser was to parse incoming XML data in a way that is friendly with event-driven network programming, other use cases being *already* covered by existing APIs. This is why it's far from nonsensical to re-use an existing terminology from that world. If you don't do any non-blocking network I/O, then fine - you won't even need the API, and can safely ignore its existence.
On Mon, Aug 26, 2013 at 2:51 PM, Antoine Pitrou
Because this API is mostly useful when the data is received (*) at a slow enough speed - which usually means from the network, not from a hard drive.
It looks like all the events have to be ready before one can start iterating over .events() in the new API? That doesn't seem that useful from an asynchronous programming perspective and .data_received() and .eof_received() appear to be thin wrappers over .feed() and .close()? Am I misunderstanding something?
Le Mon, 26 Aug 2013 17:44:41 +0200,
Simon Cross
On Mon, Aug 26, 2013 at 2:51 PM, Antoine Pitrou
wrote: Because this API is mostly useful when the data is received (*) at a slow enough speed - which usually means from the network, not from a hard drive.
It looks like all the events have to be ready before one can start iterating over .events() in the new API? That doesn't seem that useful from an asynchronous programming perspective and .data_received() and .eof_received() appear to be thin wrappers over .feed() and .close()?
What do you mean, "all events have to be ready"? If you look at the unit tests, the events are generated on-the-fly, not at the end of the document. (exactly the same as iterparse(), except that iterparse() is blocking) Implementation-wise, data_received() and eof_received() are not thin wrappers over feed() and close(), they rely on an internal API to get at the generated events (which justifies putting the functionality inside the etree module, by the way). Regards Antoine.
On Mon, Aug 26, 2013 at 8:57 AM, Antoine Pitrou
Le Mon, 26 Aug 2013 17:44:41 +0200, Simon Cross
a écrit : On Mon, Aug 26, 2013 at 2:51 PM, Antoine Pitrou
wrote: Because this API is mostly useful when the data is received (*) at a slow enough speed - which usually means from the network, not from a hard drive.
It looks like all the events have to be ready before one can start iterating over .events() in the new API? That doesn't seem that useful from an asynchronous programming perspective and .data_received() and .eof_received() appear to be thin wrappers over .feed() and .close()?
What do you mean, "all events have to be ready"? If you look at the unit tests, the events are generated on-the-fly, not at the end of the document. (exactly the same as iterparse(), except that iterparse() is blocking)
Implementation-wise, data_received() and eof_received() are not thin wrappers over feed() and close(), they rely on an internal API to get at the generated events (which justifies putting the functionality inside the etree module, by the way).
Antoine, you opted out of the tracker issue but I feel it's fair to let you know that after a lot of discussion with Nick and Stefan (*), we've settled on renaming the input methods to feed & close, and the output method to read_events. We are also considering a different name for the class. I've posted with more detail and rationale in http://bugs.python.org/issue17741, but to summarize: The input-side of IncrementalParser is the same as the plain XMLParser. The latter can also be given data incrementally by means of "feed". By default it would collect the whole tree and return it in close(), but in reality you can rig a custom target that does something more fluid (though not to the full extent of IncrementalParser). Therefore it was deemed confusing to have different names for this. Another reason is consistency with xml.sax.xmlreader.IncrementalParser, which also has feed() and close(). As for the output method name, Nick suggested that read_events conveys the destructive nature of the method better (by analogy to file/stream APIs), and others agreed. As for the class name, IncrementalParser is ambiguous because it's not immediately clear which side is incremental. Input or output? For the input, it's no more incremental than XMLParser itself, as stated above. The output is what's different here, so we're considering a few candidates for a better name that conveys the meaning more precisely. And to reiterate, I realize that it's unpleasant for you to have this dug up after it has already been committed. I assume the blame for not reviewing it in more detail originally. However, I feel it would still be better to revise this now than just leave it be. APIs added to stdlib are cooked in there for a *long time*. Alternatively, Nick suggested granting this API a "provisional" status (PEP 411), and that's an option if we don't manage to reach some sort of consensus. Eli (*) Well, to be completely precise, Stefan is still opposed to the whole idea.
Le Mon, 26 Aug 2013 09:14:38 -0700,
Eli Bendersky
Antoine, you opted out of the tracker issue but I feel it's fair to let you know that after a lot of discussion with Nick and Stefan (*), we've settled on renaming the input methods to feed & close, and the output method to read_events. We are also considering a different name for the class.
Fair enough.
As for the class name, IncrementalParser is ambiguous because it's not immediately clear which side is incremental. Input or output?
Both are :-) (which makes sense, really: an incremental input without output will only yield a slight memory consumption benefit - only slight, since the object tree representation should be much more costly than its bytes serialization -; an incremental output without input doesn't seem to have any point at all) Regards Antoine.
On Mon, Aug 26, 2013 at 9:21 AM, Antoine Pitrou
Le Mon, 26 Aug 2013 09:14:38 -0700, Eli Bendersky
a écrit : Antoine, you opted out of the tracker issue but I feel it's fair to let you know that after a lot of discussion with Nick and Stefan (*), we've settled on renaming the input methods to feed & close, and the output method to read_events. We are also considering a different name for the class.
Fair enough.
As for the class name, IncrementalParser is ambiguous because it's not immediately clear which side is incremental. Input or output?
Both are :-)
Yes, exactly :-) "Incremental", though, seems to support the conjecture that it's the input. Which is true, but, since XMLParser is also "incremental" in this sense, slightly confusing. As a more anecdotal piece of evidence: when the issue was reopened, I myself at first got confused by exactly this point because I forgot all about this in the months that passed since the commit. And I recalled that when I initially reviewed your patch, I got confused too :-) That would suggest one of two things: (1) The name is indeed confusing or (2) I'm stupid. The fact that Nick also got confused when trying a cursory understanding of the documentation cams me down w.r.t. (2). Back to the discussion, my new favorite is NonblockingParser. Because its input side is exactly similar to XMLParser, the "Nonblocking" in the name points to the difference in output, which is correct. As the popular quote says, "There are only two hard problems in Computer Science: cache invalidation and naming things." Eli
On 26 August 2013 17:40, Eli Bendersky
Yes, exactly :-) "Incremental", though, seems to support the conjecture that it's the input. Which is true, but, since XMLParser is also "incremental" in this sense, slightly confusing.
As a data point, until you explained the difference between the two classes earlier in this thread, I too had been completely confused as both the existing and the new classes are "incremental" (on the input side - that's what I interpret "incremental" as meaning). It never even occurred to me that the difference was in the *output* side. Maybe "NonBlocking" would imply that to me. Or maybe "Generator". But regardless, I think the changes you've made sound good, and I'm certainly less concerned with the new version(as someone who will likely never use the new API, and therefore doesn't really have a vote). Paul
On Mon, Aug 26, 2013 at 10:40 AM, Paul Moore
On 26 August 2013 17:40, Eli Bendersky
wrote: Yes, exactly :-) "Incremental", though, seems to support the conjecture that it's the input. Which is true, but, since XMLParser is also "incremental" in this sense, slightly confusing.
As a data point, until you explained the difference between the two classes earlier in this thread, I too had been completely confused as both the existing and the new classes are "incremental" (on the input side - that's what I interpret "incremental" as meaning). It never even occurred to me that the difference was in the *output* side. Maybe "NonBlocking" would imply that to me. Or maybe "Generator". But regardless, I think the changes you've made sound good, and I'm certainly less concerned with the new version(as someone who will likely never use the new API, and therefore doesn't really have a vote).
Thanks for the data point; it is useful.
How about StreamParser?
The problem with StreamParser is similar to IncrementalParser. "Stream" carries the impression that it refers to the input. But the input of ET parsers is *always* streaming, in a way (the feed/close interface). I want a name that conveys that the *output* is also nonblocking/streaming/yielding/generating/etc. Therefore Nonblocking (I'll let better English experts to decide whether B should be capitalized) sounds better to me, because it helps convey that both sides of the parser are asynchronous. Eli
Nonblocking sounds too Internet-related. How about...flow?
Ah, I'll probably still end up using Expat regardless.
Eli Bendersky
On Mon, Aug 26, 2013 at 10:40 AM, Paul Moore
wrote:
On 26 August 2013 17:40, Eli Bendersky
wrote: Yes, exactly :-) "Incremental", though, seems to support the conjecture that it's the input. Which is true, but, since XMLParser is also "incremental" in this sense, slightly confusing.
As a data point, until you explained the difference between the two classes earlier in this thread, I too had been completely confused as both the existing and the new classes are "incremental" (on the input side
that's what I interpret "incremental" as meaning). It never even occurred to me that the difference was in the *output* side. Maybe "NonBlocking" would imply that to me. Or maybe "Generator". But regardless, I think the changes you've made sound good, and I'm certainly less concerned with the new version(as someone who will likely never use the new API, and therefore doesn't really have a vote).
Thanks for the data point; it is useful.
How about StreamParser?
The problem with StreamParser is similar to IncrementalParser. "Stream" carries the impression that it refers to the input. But the input of ET parsers is *always* streaming, in a way (the feed/close interface). I want a name that conveys that the *output* is also nonblocking/streaming/yielding/generating/etc. Therefore Nonblocking (I'll let better English experts to decide whether B should be capitalized) sounds better to me, because it helps convey that both sides of the parser are asynchronous.
Eli
------------------------------------------------------------------------
_______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/rymg19%40gmail.com
-- Sent from my Android phone with K-9 Mail. Please excuse my brevity.
Paul Moore, 26.08.2013 19:40:
On 26 August 2013 17:40, Eli Bendersky wrote:
Yes, exactly :-) "Incremental", though, seems to support the conjecture that it's the input. Which is true, but, since XMLParser is also "incremental" in this sense, slightly confusing.
As a data point, until you explained the difference between the two classes earlier in this thread, I too had been completely confused as both the existing and the new classes are "incremental" (on the input side - that's what I interpret "incremental" as meaning). It never even occurred to me that the difference was in the *output* side.
The fix I'm proposing is to not make it two separate classes. But those who are interested in the details should really participate in the ticket discussion rather than here. Stefan
How about StreamParser? I mean, even if it isn't quite the same, that name would still make sense.
Eli Bendersky
On Mon, Aug 26, 2013 at 8:57 AM, Antoine Pitrou
wrote: Le Mon, 26 Aug 2013 17:44:41 +0200, Simon Cross
a écrit : On Mon, Aug 26, 2013 at 2:51 PM, Antoine Pitrou
wrote: Because this API is mostly useful when the data is received (*) at a slow enough speed - which usually means from the network, not from a hard drive.
It looks like all the events have to be ready before one can start iterating over .events() in the new API? That doesn't seem that useful from an asynchronous programming perspective and .data_received() and .eof_received() appear to be thin wrappers over .feed() and .close()?
What do you mean, "all events have to be ready"? If you look at the unit tests, the events are generated on-the-fly, not at the end of the document. (exactly the same as iterparse(), except that iterparse() is blocking)
Implementation-wise, data_received() and eof_received() are not thin wrappers over feed() and close(), they rely on an internal API to get at the generated events (which justifies putting the functionality inside the etree module, by the way).
Antoine, you opted out of the tracker issue but I feel it's fair to let you know that after a lot of discussion with Nick and Stefan (*), we've settled on renaming the input methods to feed & close, and the output method to read_events. We are also considering a different name for the class.
I've posted with more detail and rationale in http://bugs.python.org/issue17741, but to summarize:
The input-side of IncrementalParser is the same as the plain XMLParser. The latter can also be given data incrementally by means of "feed". By default it would collect the whole tree and return it in close(), but in reality you can rig a custom target that does something more fluid (though not to the full extent of IncrementalParser). Therefore it was deemed confusing to have different names for this. Another reason is consistency with xml.sax.xmlreader.IncrementalParser, which also has feed() and close().
As for the output method name, Nick suggested that read_events conveys the destructive nature of the method better (by analogy to file/stream APIs), and others agreed.
As for the class name, IncrementalParser is ambiguous because it's not immediately clear which side is incremental. Input or output? For the input, it's no more incremental than XMLParser itself, as stated above. The output is what's different here, so we're considering a few candidates for a better name that conveys the meaning more precisely.
And to reiterate, I realize that it's unpleasant for you to have this dug up after it has already been committed. I assume the blame for not reviewing it in more detail originally. However, I feel it would still be better to revise this now than just leave it be. APIs added to stdlib are cooked in there for a *long time*. Alternatively, Nick suggested granting this API a "provisional" status (PEP 411), and that's an option if we don't manage to reach some sort of consensus.
Eli
(*) Well, to be completely precise, Stefan is still opposed to the whole idea.
------------------------------------------------------------------------
_______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/rymg19%40gmail.com
-- Sent from my Android phone with K-9 Mail. Please excuse my brevity.
On Mon, Aug 26, 2013 at 5:57 PM, Antoine Pitrou
What do you mean, "all events have to be ready"? If you look at the unit tests, the events are generated on-the-fly, not at the end of the document. (exactly the same as iterparse(), except that iterparse() is blocking)
So you have to poll .events()? That also seems unhelpful from an event driven programming perspective. What I'm driving at is that I'd expect to have access to some sort of deferred that fires when the next event is ready to be processed and I don't see that here.
Le Tue, 27 Aug 2013 09:58:57 +0200,
Simon Cross
On Mon, Aug 26, 2013 at 5:57 PM, Antoine Pitrou
wrote: What do you mean, "all events have to be ready"? If you look at the unit tests, the events are generated on-the-fly, not at the end of the document. (exactly the same as iterparse(), except that iterparse() is blocking)
So you have to poll .events()? That also seems unhelpful from an event driven programming perspective.
At most, you're gonna poll events() after calling data_received() (there's no other way some events can be generated, after all). You can also poll it less often, depending on how often you're interested in new events. That is, it is amenable to both push and pull modes.
What I'm driving at is that I'd expect to have access to some sort of deferred that fires when the next event is ready to be processed and I don't see that here.
That would be sensible to do if Deferred was a construct shared amongst major async frameworks, but it isn't ;-) (and it looks like Guido won't include a Deferred-alike in PEP 3156) Letting people "poll" events() lets them plug any async-callback-firing primitive they like, almost trivially. Regards Antoine.
Simon Cross, 27.08.2013 09:58:
On Mon, Aug 26, 2013 at 5:57 PM, Antoine Pitrou wrote:
What do you mean, "all events have to be ready"? If you look at the unit tests, the events are generated on-the-fly, not at the end of the document. (exactly the same as iterparse(), except that iterparse() is blocking)
So you have to poll .events()? That also seems unhelpful from an event driven programming perspective.
What I'm driving at is that I'd expect to have access to some sort of deferred that fires when the next event is ready to be processed and I don't see that here.
The idea is that you pass data into the parser and then ask read_events() for an event iterator. If/When that's empty, you're done. No repeated polling or anything, just all in one shot whenever data is available. It's a really nice interface by design, just badly integrated into the existing API. Stefan
On 8/26/2013 8:51 AM, Antoine Pitrou wrote:
Le Mon, 26 Aug 2013 08:24:58 -0400, Tres Seaver
a écrit : On 08/26/2013 04:36 AM, Antoine Pitrou wrote:
event-driven processing using network libraries
Maybe I missed something: why should considerations from that topic influence the design of an API for XML processing?
Because this API is mostly useful when the data is received (*) at a slow enough speed - which usually means from the network, not from a hard drive. ... The whole *point* of adding IncrementalParser was to parse incoming XML data in a way that is friendly with event-driven network programming, other use cases being *already* covered by existing APIs. This is why it's far from nonsensical to re-use an existing terminology from that world.
Since when is Tulip the OOWTDI? If this was Twisted, it would be "write" and "finish"[1]. Tulip's Protocol ABC isn't even a good match for the application. There is reason that Twisted has a separate Consumer/Producer interface from the network I/O interface. I'm sure there is other existing practice in this specific area too (e.g., XMLParser). [1] http://twistedmatrix.com/documents/13.1.0/api/twisted.protocols.ftp.IFinisha... -- Scott Dial scott@scottdial.com
On Mon, 26 Aug 2013 23:45:55 -0400
Scott Dial
On 8/26/2013 8:51 AM, Antoine Pitrou wrote:
Le Mon, 26 Aug 2013 08:24:58 -0400, Tres Seaver
a écrit : On 08/26/2013 04:36 AM, Antoine Pitrou wrote:
event-driven processing using network libraries
Maybe I missed something: why should considerations from that topic influence the design of an API for XML processing?
Because this API is mostly useful when the data is received (*) at a slow enough speed - which usually means from the network, not from a hard drive. ... The whole *point* of adding IncrementalParser was to parse incoming XML data in a way that is friendly with event-driven network programming, other use cases being *already* covered by existing APIs. This is why it's far from nonsensical to re-use an existing terminology from that world.
Since when is Tulip the OOWTDI? If this was Twisted, it would be "write" and "finish"[1]. Tulip's Protocol ABC isn't even a good match for the application. There is reason that Twisted has a separate Consumer/Producer interface from the network I/O interface. I'm sure there is other existing practice in this specific area too (e.g., XMLParser).
I'm really not convinced further bikeshedding on this issue has any point. If you have any concrete concerns, you can voice them on the issue tracker. Regards Antoine.
participants (13)
-
Antoine Pitrou
-
Eli Bendersky
-
Greg Ewing
-
Nick Coghlan
-
Paul Moore
-
R. David Murray
-
Ryan
-
Scott Dial
-
Simon Cross
-
Stefan Behnel
-
Stephen J. Turnbull
-
Terry Reedy
-
Tres Seaver