[lxml-dev] lxml.objectify.deannotate refuses to clean nil nodes
The nil node <Fubar/> is not deannotated as I would expect in the following snippet. I could not find a reference to this behaviour in the archives or documentation. Is this a design feature for which there is a workaround, or a bug? I'm using lxml-2.2-py2.5-linux-i686.

Thanks!

#### CODE ####
import lxml.etree
import lxml.objectify

x = lxml.objectify.fromstring('<root><Bar/></root>')
x.Foo = ''
x.Fubar = None
lxml.objectify.deannotate(x)
lxml.etree.cleanup_namespaces(x)
print(lxml.etree.tostring(x))
#### END CODE ####

<root><Bar/><Foo></Foo><Fubar xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:nil="true"/></root>
Hi,
Design feature. Only py:pytype/xsi:type attributes get removed by deannotate():
Help on built-in function deannotate in module lxml.objectify:

deannotate(...)
    deannotate(element_or_tree, pytype=True, xsi=True)

    Recursively de-annotate the elements of an XML tree by removing 'pytype'
    and/or 'type' attributes.

    If the 'pytype' keyword argument is True (the default), 'pytype' attributes
    will be removed. If the 'xsi' keyword argument is True (the default),
    'xsi:type' attributes will be removed.

IMHO the xsi:nil concept in XML Schema pretty much corresponds to NULL values in databases, i.e. a typed element/column may (or may not) be xsi:nil/NULL, but it does not translate so directly to the distinct Python None object. OTOH I think mapping xsi:nil to None very much captures the meaning of xsi:nil/NULL, because in most use cases you'd test whether a value has been set (!=None) or not (==None).

Of course, you can always easily get rid of xsi:nil if you wish:

for elt in root.iter():
    elt.attrib.pop('{http://www.w3.org/2001/XMLSchema-instance}nil', None)

Holger
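For readers who want to try this, the following is a minimal sketch that combines the original poster's snippet with the loop above. The constant name XSI_NIL and the variable names are illustrative only, and it assumes a reasonably recent lxml in which deannotate() still leaves xsi:nil alone by default.

```python
from lxml import etree, objectify

# Illustrative constant: the fully qualified name of the xsi:nil attribute.
XSI_NIL = '{http://www.w3.org/2001/XMLSchema-instance}nil'

root = objectify.fromstring('<root><Bar/></root>')
root.Foo = ''       # annotated as a string
root.Fubar = None   # objectify marks this element with xsi:nil="true"

# deannotate() drops the py:pytype / xsi:type annotations by default,
# but the xsi:nil marker is still present afterwards.
objectify.deannotate(root)
nil_after_deannotate = root.Fubar.get(XSI_NIL)

# Second pass: remove xsi:nil wherever it occurs.
for elt in root.iter():
    elt.attrib.pop(XSI_NIL, None)

# With no attribute using it any more, the xsi declaration can go too.
etree.cleanup_namespaces(root)
result = etree.tostring(root).decode()
print(result)
```

This is the "walk the tree twice" approach: one pass inside deannotate() and one explicit pass for xsi:nil.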
Hi, Holger wrote:
I'd be a little more careful with such a big word. ;)
Yes, so it's even implicitly documented. :) Anyway, I'm not sure it's always a good idea to leave this special case in instead of cleaning everything up. I think if you remove it, you'd get an empty string result, which may be surprising - but more surprising than not getting it cleaned up? After all, deannotate() means deannotate()... Stefan
Thanks! That answers my questions.

The apparent asymmetry in how nodes are handled was confusing, but the distinction between pytypes and xsi makes some sense. I would naively agree that a seemingly general-purpose function like deannotate should remove everything. Otherwise, I have to walk the tree twice: once with deannotate and once to unlink the remaining nil types. Or write my own deannotate(). Not a big deal either way, though.

On Tue, Jun 2, 2009 at 12:24 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Hi,
Well, it's definitely not a bug :)
But deannotate() cares about type attributes and nil is not exactly a type attribute. We annotate the tree to have help in mapping to proper Python types, but xsi:nil can well show up in any non-annotated document. Of course, we make *use* of it for the type lookup system, both by interpreting it if it's there and by setting it for None assignment, but that still does not make it a type annotation attribute IMHO. Consider this use case:
I wouldn't want deannotate() to remove xsi:nil here. What's the use case for a deannotate() that removes xsi:nil? Why not just assign '' instead of None and deannotate() afterwards?

A compromise may be to add another keyword arg "nil" to deannotate() to allow for xsi:nil removal if needed (defaults to False, of course :)

Holger
Hi, I do see your point that xsi:nil is still a bit different from xsi:type. That's why I had my doubts in the first place. jholg@gmx.de wrote:
A compromise may be to add another keyword arg "nil" to deannotate() to allow for xsi:nil removal if needed (defaults to False, of course :)
I think that should be done, yes. A "nil=False" keyword would nicely solve this. And disabling it by default makes sense for two reasons: backwards compatibility and the fact that xsi:nil may be used in existing documents. Is a plain "nil" enough or should we use "xsi_nil"? Stefan
Hi,
I think xsi_nil is clearer.

What if we add a general deannotation function that lets you strip a tree of arbitrary attributes? Something like

def remove_attributes(element_or_tree, *attrs): ...

which takes either ns-qualified strings or (ns, attrname) tuples and removes these attributes wherever found. objectify.deannotate() would then be a special case of this and share the implementation.

Then again maybe that's overkill...

Holger
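To make the proposal concrete, here is a rough pure-Python sketch of what such a remove_attributes() helper might look like. It is a hypothetical illustration only, not lxml API and not the implementation that was eventually committed; the argument handling simply follows the description above (ns-qualified strings or (ns, attrname) tuples).

```python
from lxml import etree

XSI_NS = 'http://www.w3.org/2001/XMLSchema-instance'


def remove_attributes(element_or_tree, *attrs):
    """Hypothetical helper: remove the named attributes from every element."""
    names = set()
    for attr in attrs:
        if isinstance(attr, tuple):
            # Turn (ns, attrname) into Clark notation: {ns}attrname
            ns, name = attr
            attr = '{%s}%s' % (ns, name) if ns else name
        names.add(attr)

    # Accept either an ElementTree or a plain Element.
    if hasattr(element_or_tree, 'getroot'):
        root = element_or_tree.getroot()
    else:
        root = element_or_tree

    # Passing etree.Element restricts iteration to elements,
    # skipping comments and processing instructions.
    for elt in root.iter(etree.Element):
        for name in names:
            elt.attrib.pop(name, None)


doc = etree.fromstring(
    '<r xmlns:xsi="%s"><a xsi:nil="true" keep="1"/><b drop="2"/></r>' % XSI_NS
)
remove_attributes(doc, (XSI_NS, 'nil'), 'drop')
print(etree.tostring(doc).decode())
```

Under this sketch, deannotate() would amount to calling the helper with the py:pytype and xsi:type names, plus xsi:nil when requested.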
jholg@gmx.de wrote:
Thought so, too.
That sounds like functionality that belongs in lxml.etree, although it's partly available in lxml.html already. What about adding some more, then?

- strip_attributes(tree, *attribute_names)
  remove all named attributes from a tree

- strip_elements(tree, *element_names)
  remove all named elements from a tree, including their subtrees
  (alt: "strip_subtrees")

- strip_tags(tree, *element_names)
  remove all named elements from a tree, merging their children and text
  content into their parents

Since lxml.html provides a drop_tag() Element method, I considered drop_tags() for the last one, but thought that "strip_*" might be slightly better for consistency here. Alternatively, we might use "drop_*" for everything, but "strip" is a common thing in Python, while "drop" isn't. Plus, there are "drop_*()" /methods/ in lxml.html, which make sense on an Element and do not traverse into subtrees. "strip" makes no sense in that context.

I also vote for functions instead of methods here since they work on complete (sub-)trees rather than a single Element object. A function makes this clearer.

Comments?

Stefan
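The three functions proposed above exist under these names in current lxml.etree. A short sketch of how they behave on a toy document (the element and attribute names are made up for illustration):

```python
from lxml import etree

root = etree.fromstring(
    '<root><a x="1" y="2">A<b>B</b>t</a><c>drop me</c>tail</root>'
)

# Remove the attribute "x" from every element in the tree.
etree.strip_attributes(root, 'x')

# Remove every <c> element including its subtree; with_tail=False keeps
# the text that follows the removed element.
etree.strip_elements(root, 'c', with_tail=False)

# Remove the <b> tags but merge their text and children into the parent.
etree.strip_tags(root, 'b')

a = root.find('a')
print(etree.tostring(root).decode())
```

After these calls, <a> keeps only its "y" attribute, the <c> subtree is gone while its trailing text survives, and the text of <b> has been folded into <a>.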
My comments would be: brilliant, useful, wonderful!

However, should the last one read...

strip_tags(tree, *tag_names)

John W. Lovell
Web Applications Engineer
Northwest Educational Service District
1601 R Avenue
Anacortes, WA 98221
(360) 299-4086
jlovell@nwesd.org
www.nwesd.org

-----Original Message-----
From: lxml-dev-bounces@codespeak.net [mailto:lxml-dev-bounces@codespeak.net] On Behalf Of Stefan Behnel
Sent: Thursday, June 04, 2009 6:34 AM
To: jholg@gmx.de
Cc: lxml-dev@codespeak.net
Subject: Re: [lxml-dev] lxml.objectify.deannotate refuses to clean nil nodes

jholg@gmx.de wrote:
Hi, Robert Pierce wrote:
Done: https://codespeak.net/viewvc/?view=rev&revision=65612 https://codespeak.net/viewvc/lxml/trunk/src/lxml/cleanup.pxi?view=markup&pathrev=65612
Since you two seem to be very happy about this feature, what about writing up some docs/doctests for it? A new section here sounds like the right place:

http://codespeak.net/svn/lxml/trunk/doc/api.txt -> http://codespeak.net/lxml/api.html

Maybe the tutorial could also benefit from a short reference.

Holger, could you replace the current deannotate() implementation in lxml.objectify and add the xsi:nil cleanup option as we discussed? I expect it to be a little slower than before due to the more general implementation. If you have some code at hand to benchmark it, please do.

Unless Ian (or someone else) beats me to it, I'll also look through lxml.html next week to check for places where this can be used. For example, clean.py looks like an obvious candidate.

Stefan
Hi,
Done: https://codespeak.net/viewvc/?view=rev&revision=65680

No benchmarking yet, though.

Holger
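A minimal sketch of how the resulting option can be used, assuming an lxml release that ships the keyword under the name xsi_nil discussed in this thread (it is off by default, so existing behaviour is unchanged). The variable names are illustrative.

```python
from lxml import etree, objectify

root = objectify.fromstring('<root><Bar/></root>')
root.Foo = ''
root.Fubar = None

# Serialised size with all annotations still in place.
annotated = etree.tostring(root)

# Ask deannotate() to strip xsi:nil as well as py:pytype / xsi:type.
objectify.deannotate(root, xsi_nil=True)
etree.cleanup_namespaces(root)
cleaned = etree.tostring(root)

print(len(annotated), len(cleaned))
print(cleaned.decode())
```

This removes the need for a second, hand-written pass over the tree when the goal is the smallest possible serialisation.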
Stefan:

Has there been any action on this? I really want to help and I am really swamped.

Sorry,

John W. Lovell
Web Applications Engineer
Northwest Educational Service District
1601 R Avenue
Anacortes, WA 98221
(360) 299-4086
jlovell@nwesd.org
www.nwesd.org

-----Original Message-----
From: Stefan Behnel [mailto:stefan_ml@behnel.de]
Sent: Saturday, June 06, 2009 1:49 AM
To: Robert Pierce; John Lovell
Cc: lxml-dev@codespeak.net
Subject: Re: [lxml-dev] lxml.objectify.deannotate refuses to clean nil nodes

Hi, Robert Pierce wrote:
Hi,
I suspected so but wasn't sure about the lxml.etree policy with regard to extending the elementtree API, apart from obvious libxml2/libxslt superpowers.
+1 for strip_*.
+1 for functions.

Holger
In case it isn't obvious, I'm not an XML guru and haven't been using lxml for long, but truly IMHO:

I stipulate the importance of nil (or null) in schema definitions, as well as in attaching types to the in-memory representation of the tree. But from the standpoint of text representation, <foo xsi:nil='true'/> doesn't seem to carry any additional information over <foo/>.

My use case is passing XML through SQS, which has an upper bound of about 6kB (after HTTP headers are accounted for). When lxml annotates empty elements, it attaches BOTH schema and type to each node, which increases the size of the text representation of the element by a factor of 4 or more. So I really have to deannotate it "all the way".

On 6/2/09, jholg@gmx.de <jholg@gmx.de> wrote:
I think it is impossible to retain input intent once a tree is parsed into memory. Really, in the absence of a schema I shouldn't be able to tell the difference between your input and

root = objectify.fromstring('<root><x/></root>')

or

root = objectify.fromstring('<root/>')
root.x = None

You can only ask for consistency on output. Currently, the output of deannotate is not consistent in this case.

In any event, type constraints are more properly defined in a schema, aren't they? Just because you passed me <root><x xsi:nil='true'/></root> doesn't constrain me from passing you back <root><x><y/></x></root> unless there's a schema that says otherwise.
What's the use case for a deannotate() that removes xsi:nil? Why not just assign '' instead of None and deannotate() afterwards?
As you suggest, I can set the element value to '', so it is a string type and deannotate() removes the type. However, deannotate() followed by tostring() then produces <foo></foo> rather than <foo/>... better, but still not efficient.

Of course, there is a valid argument to say that a space-constrained API shouldn't use a bloated data format like XML at all, but (for my API) it's too late to make that argument.
A compromise may be to add another keyword arg "nil" to deannotate() to allow for xsi:nil removal if needed (defaults to False, of course :)
Works for me!
Robert Pierce wrote:
I think it makes more sense to let an empty leaf element represent an empty string than to represent it as None. It's a matter of use cases, obviously.
My use case is passing XML through SQS
"SQS" is an ambiguous abbreviation.
which has an upper bound of about 6kB (after http headers are accounted for).
That sounds like a rather odd restriction. Doesn't it at least support compression?
Well, it /is/ different, though.

>>> root = objectify.fromstring('<root><x/></root>')
>>> str(root.x)
''
>>> root = objectify.fromstring('<root/>')
>>> root.x = None
>>> str(root.x)
'None'
You can only ask for consistency on output.
No, lxml.objectify is a Python-object-like in-memory tree. Serialisation is only a way out, validation only a way to check what leaves the code that processed the tree. All the rest is about making it easy to use as a tree structure. That's what the annotations are there for. Whether or not you want to keep the necessary information during a serialise-parse cycle is up to you (or should be, so an option to remove everything is just fine).

Stefan
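The point about a serialise-parse cycle can be illustrated with a small sketch. It assumes the xsi_nil keyword discussed in this thread is available in the installed lxml; the variable names are illustrative.

```python
from lxml import etree, objectify

root = objectify.fromstring('<root/>')
root.x = None
before = root.x.pyval            # None: the element carries xsi:nil="true"

# Strip every annotation, including xsi:nil, then serialise and re-parse.
objectify.deannotate(root, xsi_nil=True)
etree.cleanup_namespaces(root)
reparsed = objectify.fromstring(etree.tostring(root))
after = reparsed.x.pyval         # '': a plain empty element reads as a string

print(repr(before), repr(after))
```

Once xsi:nil is removed, the None value does not survive the round trip: the empty element comes back as an empty string, which is why stripping it is left as an explicit choice.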
participants (4)
- jholg@gmx.de
- John Lovell
- Robert Pierce
- Stefan Behnel