suppress namespace prefix in printouts
If you use lxml to extract elements from an XML text associated with a schema, you must give the program the element name with its namespace, and it will print out the element with its namespace. Thus a TEI element

    <w xml:id="someID">Mary</w>

will be written out to the output file as

    <w xmlns="http://www.tei-c.org/ns/1.0" xml:id="someID">Mary</w>

Is there a way of suppressing the namespace so that the output looks like the input, i.e.

    <w xml:id="someID">Mary</w>

Martin Mueller
Professor emeritus of English and Classics
Northwestern University
Martin Mueller, 02.02.2014 13:45:
If you use lxml to extract elements from an XML text associated with a schema, you must give the program the element name with its namespace, and it will print out the element with its name space. Thus a TEI element
<w xml:id="someID">Mary</w>
will be written out to the outputfile as
<w xmlns="http://www.tei-c.org/ns/1.0" xml:id="someID">Mary</w>
Is there a way of suppressing the name space so that the output looks like the input, i.e.
<w xml:id="someID">Mary</w>
Hmm, I'm pretty sure the output *does* look like the input, because the input was most likely not this

    <w xml:id="someID">Mary</w>

but something like this:

    <someroot xmlns="http://www.tei-c.org/ns/1.0">
      ...
      <w xml:id="someID">Mary</w>
      ...
    </someroot>

and when you only serialise one of the elements in that tree, lxml will make sure the output is correct, i.e. it contains all necessary namespace declarations.

What you could do for debugging is something like this (untested!):

    import copy
    from lxml import etree

    def dump_plain(root, **kwargs):
        # work on a copy so the original tree keeps its namespaces
        root = copy.deepcopy(root)
        # '{*}*' matches every element, whatever its namespace
        for el in root.iter('{*}*'):
            el.tag = etree.QName(el.tag).localname
        return etree.tostring(root, **kwargs)

i.e. strip the namespaces off before you serialise it.

Stefan
On 02/02/2014 13:37, Stefan Behnel wrote:
and when you only serialise one of the elements in that tree, lxml will make sure the output is correct, i.e. it contains all necessary namespace declarations.
For what it's worth, I've wanted to do this as well, in the context of incrementally generating 2nd-level tags in a long-running XML stream (like XMPP). So I've already sent:

    <stream xmlns="http://foo/bar" xmlns:baz="http://foo/ban">

...and then at arbitrary intervals I want to send:

    <rpc>
      <command>
        <arg baz:format="xx">val</arg>
      </command>
    </rpc>

...but the lxml objects are namespaced from "http://foo/bar". I could just leave the namespace on the rpc tag as well but it seems cleaner not to (and matches what existing libs do for this particular protocol).
What you could do for debugging is something like this (untested!):
Interesting approach. Do you have any feel for how fast (or slow) this would be on large-ish documents?

Also, this alters the tags, rather than preventing emitting the namespace. So it only works for a subset of cases, and in particular would not let me emit the "baz:format" as above, I think?

Another approach is to wrap the thing in a dummy tag for string-isation and munge the text, but that's really yucky...
Phil Mayers, 02.02.2014 14:48:
For what it's worth, I've wanted to do this as well, in the context of incrementally generating 2nd-level tags in a long-running XML stream (like XMPP).
I assume you're using this? http://lxml.de/api.html#incremental-xml-generation And for the receiver, there's this now: http://lxml.de/parsing.html#incremental-event-parsing (just noticed that the parser docs don't say that it's new in lxml 3.3)
So I've already sent:
<stream xmlns="http://foo/bar" xmlns:baz="http://foo/ban">
...and then at arbitrary intervals I want to send:
<rpc> <command> <arg baz:format="xx">val</arg> </command> </rpc>
...but the lxml objects are namespaced from "http://foo/bar". I could just leave the namespace on the rpc tag as well but it seems cleaner not to (and matches what existing libs do for this particular protocol).
It's XML, though, so keeping the namespace declaration in won't hurt. Incremental serialisation with xmlfile() also won't discard redundant declarations for you, although that could be done to a certain extent, I guess, at least for the simple cases.
Interesting approach. Do you have any feel for how fast (or slow) this would be on large-ish documents?
Pretty quick overall, but linear with the number of elements in the tree.
Also, this alters the tags, rather than preventing emitting the namespace.
That's why there is a deepcopy() in my example. That's pretty quick, too, but obviously needs twice the space and adds another linear time overhead.
So it only works for a sub-set of cases, and in particular would not let me emit the "baz:format" as above, I think?
That should still work. In fact, what I forgot to call in my example was cleanup_namespaces(). Otherwise, it would still keep the unused declarations around. http://lxml.de/api/lxml.etree-module.html#cleanup_namespaces
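Putting the two pieces together, a sketch of the debugging helper with the `cleanup_namespaces()` call added might look like this. The sample documents are assumptions built from the snippets quoted in this thread, and the helper is still a debugging aid: the result is no longer in the original namespace.

```python
import copy
from lxml import etree

def dump_plain(element, **kwargs):
    """Serialise a copy of `element` with the element namespaces stripped."""
    element = copy.deepcopy(element)       # leave the caller's tree untouched
    for el in element.iter('{*}*'):        # elements only, in any namespace
        el.tag = etree.QName(el.tag).localname
    etree.cleanup_namespaces(element)      # drop declarations nothing uses now
    return etree.tostring(element, **kwargs)

# The TEI case from the original question (wrapper elements are made up).
tei = etree.fromstring(
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    '<text><w xml:id="someID">Mary</w></text></TEI>'
)
w = tei.find('.//{http://www.tei-c.org/ns/1.0}w')
plain_w = dump_plain(w, encoding='unicode')
print(plain_w)

# The stream case: the namespaced attribute keeps its namespace, so the
# declaration for it is still in use and survives the cleanup.
stream = etree.fromstring(
    '<stream xmlns="http://foo/bar" xmlns:baz="http://foo/ban">'
    '<rpc><command><arg baz:format="xx">val</arg></command></rpc></stream>'
)
plain_rpc = dump_plain(stream[0], encoding='unicode')
print(plain_rpc)
```

Only element tags are rewritten here, so attributes such as `baz:format` keep their namespace and the declaration they depend on is expected to be emitted with the fragment.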
Another approach is to wrap the thing in a dummy tag for string-isation and munge the text, but that's really yucky...
Both approaches are. Stefan
On 02/02/2014 15:26, Stefan Behnel wrote:
I assume you're using this?
No. This is taking place in an asynchronous, non-blocking context (using Twisted FWIW) so I can't make use of either of those.

For the receiver side I'm just using an event parser and feeding it the data; it builds the tree manually and throws out completed 2nd level XML stanzas to a handler as they're complete. I then remove them from the tree to reclaim memory.

For the send side, I'm cheating and generating the top-level XML by hand, and then generating the 2nd level stanzas as independent documents - hence the desire to omit namespaces.
Phil Mayers, 02.02.2014 17:30:
On 02/02/2014 15:26, Stefan Behnel wrote:
I assume you're using this?
No. This is taking place in an asynchronous, non-blocking context (using Twisted FWIW) so I can't make use of either of those.
Well, actually the whole purpose of the XMLPullParser is to support incremental parsing in non-blocking code. That might be a little less obvious for xmlfile(), but it should generally work there, too. Depends a bit on Twisted's API, though, because it needs something underneath that has a write() method. The last time I used Twisted was a couple of years ago, so can't tell how easy it would be to bring the two together. Stefan
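As an illustration of the receive side discussed above, here is a rough sketch that feeds chunks to `XMLPullParser` as they arrive and hands each completed second-level element to a callback. The `StanzaReceiver` class, its handler signature and the sample chunks are hypothetical; in a Twisted program `feed()` would presumably be called from something like `dataReceived()`.

```python
from lxml import etree

class StanzaReceiver(object):
    """Hypothetical helper: collects second-level elements of an open-ended stream."""

    def __init__(self, handler):
        self._parser = etree.XMLPullParser(events=("start", "end"))
        self._depth = 0
        self._handler = handler

    def feed(self, data):
        # Never blocks: parse what has arrived and report what is complete.
        self._parser.feed(data)
        for event, element in self._parser.read_events():
            if event == "start":
                self._depth += 1
                continue
            self._depth -= 1
            if self._depth == 1:
                # A direct child of the stream root has just been closed.
                self._handler(element)
                # Detach it so the tree does not grow without bound.
                element.getparent().remove(element)

received = []
receiver = StanzaReceiver(
    lambda el: received.append(etree.QName(el).localname)
)

# Chunk boundaries deliberately fall in the middle of a tag.
for chunk in (b'<stream xmlns="http://foo/bar"><rp',
              b'c><command/></rpc>',
              b'<ping/>'):
    receiver.feed(chunk)

print(received)
```

The root element is never closed in this sketch, which mirrors the "eternal" document case mentioned in the thread.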
On 02/02/2014 17:20, Stefan Behnel wrote:
Well, actually the whole purpose of the XMLPullParser is to support incremental parsing in non-blocking code.
Sorry, I didn't read your links closely enough. I guess it is marginally cleaner than a target parser, but the target parser works fine for me and the code is already written. "Eternal" XML documents are relatively easy at the receive side.
That might be a little less obvious for xmlfile(), but it should generally work there, too. Depends a bit on Twisted's API, though, because it needs something underneath that has a write() method. The last time I used Twisted was a couple of years ago, so can't tell how easy it would be to bring the two together.
Blocking .write() is not problematic; it would be trivial to just extract data from a StringIO or similar. Twisted transports do provide a .write(), however.

AFAICT the thing xmlfile() provides is a way to serialise a tag "as a child" of another one, which solves the namespace thing. But from my PoV it's unfortunate that it's used via a context manager - asynchronous code styles differ, but I prefer to avoid keeping stack frames around. Presumably I could drive the context manager myself, but... yuck.

Still, I will take a look; it might be just what I'm looking for (though our production boxes are all on lxml
Phil Mayers, 03.02.2014 23:32:
AFAICT the thing xmlfile() provides is a way to serialise a tag "as a child" of another one, which solves the namespace thing. But from my PoV it's unfortunate that it's used via a context manager - asynchronous code styles differ, but I prefer to avoid keeping stack frames around. Presumably I could drive the context manager myself, but... yuck.
It's pretty easy to map using a generator:

    from lxml.etree import xmlfile

    def writer(out_stream, terminator="DONE"):
        with xmlfile(out_stream) as xf:
            with xf.element('root'):
                try:
                    while True:
                        el = (yield)
                        xf.write(el)
                except GeneratorExit:
                    pass

    w = writer(stream)
    next(w)   # advance to the first yield

And then, whenever you have something to write, you say

    w.send(el)

And when done:

    w.close()

Something along those lines. Completely untested, but if someone gets this to work nicely, I'd take doc patches. :)

Stefan
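For anyone who wants to try the generator idea, the following is a self-contained sketch writing to an in-memory buffer. The `stanza_writer` name, its parameters and the sample elements are assumptions for the demonstration; a real transport would only need a `write()` method. As noted earlier in the thread, `xmlfile()` may still repeat namespace declarations on the children, so this sketch does not claim to remove them.

```python
import io
from lxml import etree

def stanza_writer(out_stream, root_tag, nsmap=None):
    """Generator that keeps the root element open until it is closed."""
    with etree.xmlfile(out_stream) as xf:
        with xf.element(root_tag, nsmap=nsmap):
            try:
                while True:
                    element = yield      # wait for the next stanza
                    xf.write(element)
            except GeneratorExit:
                pass                     # close() was called: end the root

NS = "http://foo/bar"
out = io.BytesIO()

w = stanza_writer(out, "{%s}stream" % NS, nsmap={None: NS})
next(w)                                  # run up to the first yield

rpc = etree.Element("{%s}rpc" % NS, nsmap={None: NS})
etree.SubElement(rpc, "{%s}command" % NS)
w.send(rpc)                              # serialised as a child of <stream>

w.close()                                # writes the closing root tag
result = out.getvalue().decode("utf-8")
print(result)
```

Output may be buffered until the writer is closed, so a long-running stream would probably need an explicit flush after each stanza if the installed lxml version offers one.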
On 04/02/14 10:29, Stefan Behnel wrote:
Phil Mayers, 03.02.2014 23:32:
AFAICT the thing xmlfile() provides is a way to serialise a tag "as a child" of another one, which solves the namespace thing. But from my PoV it's unfortunate that it's used via a context manager - asynchronous code styles differ, but I prefer to avoid keeping stack frames around. Presumably I could drive the context manager myself, but... yuck.
It's pretty easy to map using a generator:
That was my first thought. I might give it a try later; I see no reason it wouldn't work.
participants (3)
- Martin Mueller
- Phil Mayers
- Stefan Behnel