Hi,
Mike Meyer wrote:
> On Sat, 14 Jul 2007 13:17:24 +0200 Stefan Behnel <stefan_ml(a)behnel.de> wrote:
>> Mike Meyer wrote:
>>> I think I have a very, very basic bug here:
>>> Note that the original ElementTree implementation only returns
>>> children that are *elements*, whereas the lxml version returns all
>>> the children. This makes life much more interesting, especially as
>>> there isn't an obvious method for checking whether or not the node
>>> value is actually an element.
>> Sure there is, just check for the tag property being a string. lxml.etree is
>> compatible to ElementTree in that it returns the factory functions for
>> everything that does not have a tag (comments, PIs, entities).
>> >>> [el for el in n if isinstance(el.tag, basestring)]
>
> Obvious is relative. I did find that test, but this doesn't say "I'm
> processing only element children". You have to know that those are the
> only children an element can have for which the tag attribute is a
> string. Maybe that's obvious to you, it certainly wasn't to me, and
> probably isn't to the casual reader either. Not when compared to
> something like:
>
> ***** WARNING: CODE FOR EXPOSITION ONLY. THESE DO NOT WORK ******
>
> [el for el in n if isinstance(el, etree.Element)]
> [el for el in n if etree.is_element(el)]
> [el for el in n if el.type == etree.ELEMENT_TYPE]
>
> (or any of a dozen ways that says "n is an element" as opposed to
> saying "some attribute of n is some other type that the two examples
> we're looking at only use for that attribute on the type of interest.")
Well, I admit that it's not the most obvious thing ever and that the lxml docs
do not provide obvious advise here. This is definitely something we can improve.
Still, that's how ElementTree works. So, if you start by comparing
lxml.etree's behaviour to ElementTree and claim that we "have a very, very
basic bug here", you may have to accept that someone tells you that it's not a
bug in lxml.etree.
Note that ElementTree only shows this behaviour because its parser silently
drops comments and processing instructions completely. If you constructed the
same tree through the API, the behaviour of lxml.etree and ElementTree would
be identical for what you tested and you would run into the same problem with
both.
>>> import xml.etree.ElementTree as ET
>>> root = ET.XML("<root><?pi test?><b/></root>")
>>> root.getchildren()
[<Element b at b79e294c>]
>>> root.append(ET.PI("test"))
>>> root.getchildren()
[<Element b at b79e294c>, <Element <function ProcessingInstruction at
0xb79e3cdc> at b79dd04c>]
The difference is that lxml.etree also behaves this way for a tree that was
parsed. I find this rather consistent.
>>> import lxml.etree as et
>>> et.XML("<root><?pi test?><b/></root>")
<Element root at b79d4504>
>>> root = et.XML("<root><?pi test?><b/></root>")
>>> root.getchildren()
[<?pi test?>, <Element b at b79d457c>]
>> should do what you want, in both lxml.etree and ElementTree.
>
> Right. But I don't need the extra filter in ElementTree - it returns
> just the element children and that's what it's documented as doing. I
> quite literally found this when I took working ElementTree code and
> tried to move it to lxml's implementation of ElementTree (and then
> moved back to ElementTree rather than use test on the tag attributes
> type).
Hmm, I see your point. Maybe we should provide an ElemenTree compatible parser
that strips comments and PIs from the document. That would at least help in
porting existing code.
> And this doesn't help at all if I want to distinguish between PIs and
> comments.
Well, read the ElementTree docs (or re-read my last mail). For elements, the
tag property returns the tag name, for everything else, it returns the
respective *factory function*, i.e. etree.Comment or
etree.ProcessingInstruction. So, testing what you have if it's not an Element
is actually not more than a straight "is".
>> I don't see why this should be a bug, it's just an extended tree model.
>> Comments are nothing to frown upon, they are as much part of the XML world as
>> element nodes, so why would you want to ignore them?
>
> You want to ignore them because you're working with an API that
> ignores them. If ElementTree returned them, then I wouldn't consider
> it a bug.
ElementTree *does* return them if they are in the tree. It's just the parser
that does not put them in the tree. So it actually encourages you to write
code that depends on the way the tree was constructed. If, one day, you feed
one of your functions with a tree containing Comments that were added through
the API, it will just stop working and you will have a hard time figuring out
why your perfectly working and not-touched-in-a-long-time function fails for a
certain subset of the input.
> However, the single most common thing to want to do when
> processing the children of a node is to recurse on the element
> children. ElementTree makes that easy by only exposing the children
> that are elements via the list API, as otherwise you wind up having to
> write more code for the most common use case.
>
> Even if you don't want to ignore them, I've never seen a case where
> you wanted to do the same things to all the children of a node. So
> there should be an easy way to figure out what kind of child you're
> dealing with, if nothing else.
>
> Basically, I think this is a bug in the design of the lxml extensions
> to the ElementTree API. Rather than extending the API, lxml changes
> the API by adding comments and PIs to the list interface. This isn't
> mentioned in any of the lxml.etree documentation, other than one
> paragraph on the compatibility page that implies that this might
> happen. As an aside, lxml doesn't add the other children of an element
> to that interface, though it would make as much sense to do so. This
> is presumably because the ElementTree API already has good interfaces
> for dealing with text and attribute children.
... and Elements and PIs and Comments, which all behave mostly the same at the
API level of lxml.etree and ElementTree.
> That same paragraph notes that you can enable ignoring comments by
> tweaking the parser (but doesn't deal with ignoring PIs).
True, as I said, an ET compatibility parser would help here.
> If the goal
> is to be compatible with ElementTree, this is wrong - the default mode
> should be the one that's compatible, and getting the incompatible mode
> should take extra work. If the goal is to be easy-to-use, then it's
> still wrong - the default mode should be the more common use case
> (though what's more common clearly depends on your environment).
Ok, but your proposal is based on the wrong assumption that it's the API that
shows this behaviour in ET. But since it's the parser, being compatible would
mean drop PIs and comments by default, thus changing the document behind the
scenes in an I/O cycle. I think this is the wrong behaviour for a default.
> Even if you don't agree that this is the most common use case, being
> easy-to-use means there should be a way to check for each of the three
> types in the list API that's obvious when you read it and easy to find
> by looking at the class docstrings. Of course, doing this breaks
> compatibility with other ElementTree implementations, but we punted on
> that when we added extra children to the list API.
I assume you read carefully up to this point, so I won't need to comment on this.
> My gut reaction is that it would be better to actually extend the API
> rather than changing it. For example, have get_children, getiterator,
> iterchildren, find, and so on accept extra keyword arguments to
> indicate it will *not* (won't because we're extending the ElementTree
> API, which only provides "will") filter PIs or comments (or even
> elements, thought that one has the opposite default). However, that's
> just a first thought on the issue.
At least for all the iterator functions, you can pass a "tag" argument. If you
pass nothing, you will get all the nodes. If you pass "*", you will get only
Elements (i.e. nodes that have any tag name). If you pass one of the factory
functions, you will get only those. Currently, you cannot pass the Element
function (which is ET compatible behaviour), but maybe that would be a nice
and consistent alternative to passing "*".
Stefan