Re: [lxml] xmlfile and namespaces/pretty printing

10 Jun 2015

Frederik Elwert wrote on 09.06.2015 at 12:06:
...
I want to write a very large XML file to disc. Since I ran into memory
issues using the regular ElementTree.write() method, I switched to using
etree.xmlfile. Generally, it works quite well, but I ran into two
issues. Here’s my test code:
----8<----
from lxml import etree
P_DATA = '{http://www.dspin.de/data}'
P_TEXT = '{http://www.dspin.de/data/textcorpus}'
with etree.xmlfile('test.xml', encoding='utf-8') as xf:
    with xf.element(P_DATA + 'D-Spin',
                    nsmap={None: 'http://www.dspin.de/data'}):
        with xf.element(P_TEXT + 'TextCorpus',
                lang='de',
                nsmap={None: 'http://www.dspin.de/data/textcorpus'}):
            element = etree.Element(P_TEXT + 'tokens',
                    nsmap={None: 'http://www.dspin.de/data/textcorpus'})
            element2 = etree.SubElement(element, P_TEXT + 'token')
            xf.write(element, pretty_print=True)
---->8----
And here’s the output:
----8<----
<D-Spin xmlns="http://www.dspin.de/data"><TextCorpus
xmlns="http://www.dspin.de/data/textcorpus" lang="de"><tokens
xmlns="http://www.dspin.de/data/textcorpus">
  <token/>
</tokens>
</TextCorpus></D-Spin>
---->8----
Now my questions are:
1. I had to add an nsmap argument to the creation of "element" in order
to prevent an "ns0:" prefix in the output. But this led to a
duplication of the declaration of the default namespace
'http://www.dspin.de/data/textcorpus' on both <TextCorpus> and <tokens>.
Since the generation of the Elements that I write to the xmlfile happens
somewhere else in the real code, it is a bit cumbersome to add nsmaps
all over the place. And even then, I have the duplicated namespace
declaration. So ideally I’d like xf.write() to be aware of the current
namespace map defined by the xf.element. Is that possible?

Yes, that's a known issue currently. It's not easy to fix because when
serialising subtrees, the serialiser state is essentially blank and doesn't
know about previously written elements. I guess this could be worked around
by faking a new parent element with all parent namespaces for the element
that is being serialised. Not great, but might still work. Pull requests
welcome.
...
2. I can pass "pretty_print=True" to xf.write(), but it naturally only
affects those sub-trees. Is it possible to pretty-print the elements
generated by xf.element() as well? Maybe it would be nice to be able to
pass pretty_print to etree.xmlfile() itself?

You can get a poor-human's slightly better pretty-printing by doing what
you do above and additionally calling xf.write("\n") after each opening and
closing element() block.
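
A minimal sketch of that newline trick, reusing the namespaces from your
test code. The helper name write_corpus and the in-memory BytesIO target
(instead of 'test.xml') are only there for illustration. As far as I can
tell, text can only be written while an element is still open, so the
sketch does not write a newline after the root element has closed.

```python
from io import BytesIO
from lxml import etree

NS_DATA = 'http://www.dspin.de/data'
NS_TEXT = 'http://www.dspin.de/data/textcorpus'


def write_corpus(out):
    """Write the test document, adding line breaks by hand (illustrative helper)."""
    with etree.xmlfile(out, encoding='utf-8') as xf:
        with xf.element('{%s}D-Spin' % NS_DATA, nsmap={None: NS_DATA}):
            xf.write('\n')  # line break after the opening <D-Spin> tag
            with xf.element('{%s}TextCorpus' % NS_TEXT, lang='de',
                            nsmap={None: NS_TEXT}):
                xf.write('\n')  # line break after the opening <TextCorpus> tag
                tokens = etree.Element('{%s}tokens' % NS_TEXT,
                                       nsmap={None: NS_TEXT})
                etree.SubElement(tokens, '{%s}token' % NS_TEXT)
                # pretty_print only affects this subtree
                xf.write(tokens, pretty_print=True)
            # still inside <D-Spin>: line break after the closing </TextCorpus>
            xf.write('\n')


buf = BytesIO()
write_corpus(buf)
result = buf.getvalue().decode('utf-8')
print(result)
```

This only puts each element() tag on its own line; it does not indent
them, and the subtree written by xf.write() is still indented relative to
itself rather than to its parents.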

While I would accept patches that implement a "pretty_print" flag for
xmlfile() itself, as you proposed, I don't think it's going to be easy to
make it work "as expected".

Stefan

Stefan Behnel