xmlfile and namespaces/pretty printing
Hello,

I want to write a very large XML file to disc. Since I ran into memory issues using the regular ElementTree.write() method, I switched to using etree.xmlfile. Generally, it works quite well, but I ran into two issues. Here’s my test code:

----8<----
from lxml import etree

P_DATA = '{http://www.dspin.de/data}'
P_TEXT = '{http://www.dspin.de/data/textcorpus}'

with etree.xmlfile('test.xml', encoding='utf-8') as xf:
    with xf.element(P_DATA + 'D-Spin',
                    nsmap={None: 'http://www.dspin.de/data'}):
        with xf.element(P_TEXT + 'TextCorpus', lang='de',
                        nsmap={None: 'http://www.dspin.de/data/textcorpus'}):
            element = etree.Element(P_TEXT + 'tokens',
                                    nsmap={None: 'http://www.dspin.de/data/textcorpus'})
            element2 = etree.SubElement(element, P_TEXT + 'token')
            xf.write(element, pretty_print=True)
---->8----

And here’s the output:

----8<----
<D-Spin xmlns="http://www.dspin.de/data"><TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de"><tokens xmlns="http://www.dspin.de/data/textcorpus">
  <token/>
</tokens>
</TextCorpus></D-Spin>
---->8----

Now my questions are:

1. I had to add an nsmap argument to the creation of "element" in order to prevent an "ns0:" prefix in the output. But this led to a duplication of the declaration of the default namespace 'http://www.dspin.de/data/textcorpus' on both <TextCorpus> and <tokens>.

Since the generation of the Elements that I write to the xmlfile happens somewhere else in the real code, it is a bit cumbersome to add nsmaps all over the place. And even then, I have the duplicated namespace declaration. So ideally I’d like xf.write() to be aware of the current namespace map defined by the xf.element. Is that possible?

2. I can pass "pretty_print=True" to xf.write(), but it naturally only affects those sub-trees. Is it possible to pretty-print the elements generated by xf.element() as well? Maybe it would be nice to be able to pass pretty_print to etree.xmlfile() itself?

Regards,
Frederik

-- 
Dr. Frederik Elwert
Project Manager/SeNeReKo
Postdoctoral Researcher/KHK
Centre for Religious Studies
Ruhr-University Bochum
Universitätsstr. 150
D-44780 Bochum
Phone +49(0)234 32-23024
On .06.2015 at 12:06, Frederik Elwert <frederik.elwert@rub.de> wrote:
1. I had to add an nsmap argument to the creation of "element" in order to prevent an "ns0:" prefix in the output. But this led to a duplication of the declaration of the default namespace 'http://www.dspin.de/data/textcorpus' on both <TextCorpus> and <tokens>.
If you assign a default namespace then *all* child elements are within this namespace unless otherwise specified. i.e. don't use FQN for child elements.

If you want to pretty print the result then it's hard to beat using the tidy command line tool:

tidy -m --xml file.xml

Charlie

-- 
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Kronenstr. 27a
Düsseldorf
D-40217
Tel: +49-211-600-3657
Mobile: +49-178-782-6226
On 09.06.2015 at 18:22, Charlie Clark wrote:
If you assign a default namespace then *all* child elements are within this namespace unless otherwise specified. i.e. don't use FQN for child elements.
You mean just to create the child elements by not specifying their namespace? The problem is that I don’t always control the namespace declaration on the parent tag. Using etree.xmlfile() is my resort if dealing with very large files, but normally I just add new sub-trees to an existing document. So I feel it would be a bit shaky to assume a default namespace when generating the sub-trees. So ideally, I’d like etree.xmlfile() to be aware of the current namespace declaration and use the default namespace when it is set on the parent element. This is how lxml deals with regular trees, but xmlfile() seems not to be able to do that.
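The behaviour of regular trees mentioned above can be illustrated with a minimal sketch. It assumes the namespace URI from the original test code; the variable names are illustrative only. The default namespace is declared once on the parent, and children created with fully qualified tags reuse it.

```python
from lxml import etree

NS_TEXT = 'http://www.dspin.de/data/textcorpus'
P_TEXT = '{%s}' % NS_TEXT

# The default namespace is declared once, on the parent element.
corpus = etree.Element(P_TEXT + 'TextCorpus', lang='de', nsmap={None: NS_TEXT})

# Children use fully qualified tags and carry no nsmap of their own.
tokens = etree.SubElement(corpus, P_TEXT + 'tokens')
etree.SubElement(tokens, P_TEXT + 'token')

# Serialise the whole tree in one go (this is the in-memory path,
# not the incremental xmlfile path).
serialized = etree.tostring(corpus, encoding='unicode')
print(serialized)
```

In a tree like this the serialiser sees the parent's declaration, so the children need neither a prefix nor a declaration of their own.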
If you want to pretty print the result then it's hard to beat using the tidy command line tool:
tidy -m --xml file.xml
Since I also serve the XML output on the fly through a web service, I would like to avoid having to do that. But I might do that in this special case, thanks for the hint!

Regards,
Frederik

-- 
Dr. Frederik Elwert
Post-doctoral researcher
Project manager SeNeReKo
Center for Religious Studies
Ruhr-University Bochum
Universitätsstr. 150
D-44780 Bochum
Room FNO 01/180
Tel. +49-(0)234 - 32 24794
On .06.2015 at 18:49, Frederik Elwert <frederik.elwert@rub.de> wrote:
You mean just to create the child elements by not specifying their namespace?
Yes, that is what the default namespace means.
The problem is that I don’t always control the namespace declaration on the parent tag. Using etree.xmlfile() is my resort if dealing with very large files, but normally I just add new sub-trees to an existing document. So I feel it would be a bit shaky to assume a default namespace when generating the sub-trees.
You can keep FQNs if you like. As far as I know, a parser sees no difference between a prefixed tag and an unprefixed tag in a default namespace. For large files with a single namespace, using a default namespace makes things much easier to read.
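The equivalence claimed here can be checked with a small sketch. The prefix `tc` is an arbitrary choice for this illustration and does not come from the thread. Both documents are parsed and their tags compared in the `{uri}name` form that lxml exposes.

```python
from lxml import etree

NS_TEXT = 'http://www.dspin.de/data/textcorpus'

# Same content, once with a default namespace and once with a prefix.
unprefixed = etree.fromstring(
    '<tokens xmlns="%s"><token/></tokens>' % NS_TEXT)
prefixed = etree.fromstring(
    '<tc:tokens xmlns:tc="%s"><tc:token/></tc:tokens>' % NS_TEXT)

# After parsing, every tag is in {namespace}localname form.
unprefixed_tags = [el.tag for el in unprefixed.iter()]
prefixed_tags = [el.tag for el in prefixed.iter()]

same_names = unprefixed_tags == prefixed_tags
print(unprefixed_tags)
print(same_names)
```

The prefix only survives as serialisation detail (`el.prefix`), so code that matches on `el.tag` should not need to care which form a document uses.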
So ideally, I’d like etree.xmlfile() to be aware of the current namespace declaration and use the default namespace when it is set on the parent element. This is how lxml deals with regular trees, but xmlfile() seems not to be able to do that.
The default namespace really is a sleight of hand. You might be able to use the namespace registry to manage it.
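One reading of the "namespace registry" suggestion is `etree.register_namespace()`. The sketch below assumes that reading and uses `tc` as an arbitrary prefix. As far as I can tell it only swaps the generated prefix for a stable one on elements created without an nsmap. It does not yield an unprefixed default namespace, and it does not make xf.write() aware of declarations on enclosing elements.

```python
from lxml import etree

NS_TEXT = 'http://www.dspin.de/data/textcorpus'
P_TEXT = '{%s}' % NS_TEXT

# Register a global prefix for the namespace ('tc' is an assumption
# made for this sketch, not a name from the original code).
etree.register_namespace('tc', NS_TEXT)

# Elements created afterwards without an nsmap pick up that prefix.
tokens = etree.Element(P_TEXT + 'tokens')
etree.SubElement(tokens, P_TEXT + 'token')

serialized = etree.tostring(tokens, encoding='unicode')
print(serialized)
```

Because the registry is global to the process, this affects every element created in that namespace afterwards, which may or may not be acceptable in a shared code base.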
If you want to pretty print the result then it's hard to beat using the tidy command line tool:
tidy -m --xml file.xml
Since I also serve the XML output on the fly through a web service, I would like to avoid having to do that. But I might do that in this special case, thanks for the hint!
It's very fast, uses very little memory, works on the whole document, and can read from stdin and write to stdout, so performance should be more than satisfactory.

Charlie
On 09.06.2015 at 12:06, Frederik Elwert wrote:
1. I had to add an nsmap argument to the creation of "element" in order to prevent an "ns0:" prefix in the output. But this led to a duplication of the declaration of the default namespace 'http://www.dspin.de/data/textcorpus' on both <TextCorpus> and <tokens>.
Since the generation of the Elements that I write to the xmlfile happens somewhere else in the real code, it is a bit cumbersome to add nsmaps all over the place. And even then, I have the duplicated namespace declaration. So ideally I’d like xf.write() to be aware of the current namespace map defined by the xf.element. Is that possible?
Yes, that's a known issue currently. It's not easy to fix because when serialising subtrees, the serialiser state is essentially blank and doesn't know about previously written elements. I guess this could be worked around by faking a new parent element with all parent namespaces for the element that is being serialised. Not great, but might still work. Pull requests welcome.
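The blank serialiser state described here can be observed with a reduced version of the original test code. This sketch writes to an in-memory buffer instead of 'test.xml' and drops the nsmap from the subtree. It reflects the behaviour reported in this thread; later lxml versions may differ.

```python
import io
from lxml import etree

NS_TEXT = 'http://www.dspin.de/data/textcorpus'
P_TEXT = '{%s}' % NS_TEXT

out = io.BytesIO()
with etree.xmlfile(out, encoding='utf-8') as xf:
    # The enclosing element declares the default namespace.
    with xf.element(P_TEXT + 'TextCorpus', nsmap={None: NS_TEXT}):
        # The subtree is built elsewhere, without any nsmap.
        tokens = etree.Element(P_TEXT + 'tokens')
        etree.SubElement(tokens, P_TEXT + 'token')
        # Serialised on its own, so it declares the namespace again
        # under a generated prefix such as "ns0".
        xf.write(tokens)

data = out.getvalue()
print(data.decode('utf-8'))
```

The namespace URI shows up twice in the output: once as the default namespace on <TextCorpus> and once under a prefix on <tokens>.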
2. I can pass "pretty_print=True" to xf.write(), but it naturally only affects those sub-trees. Is it possible to pretty-print the elements generated by xf.element() as well? Maybe it would be nice to be able to pass pretty_print to etree.xmlfile() itself?
You can get a poor-human's slightly better pretty-printing by doing what you do above and additionally calling xf.write("\n") after each opening and closing element() block.

While I would accept patches that implement a "pretty_print" flag for xmlfile() itself, as you proposed, I don't think it's going to be easy to make it work "as expected".

Stefan
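A minimal sketch of the xf.write("\n") suggestion, applied to the original test code, might look like this. It writes to an in-memory buffer rather than 'test.xml'. No newline is written after the root element closes, on the assumption that xmlfile only accepts text while an element is open.

```python
import io
from lxml import etree

NS_DATA = 'http://www.dspin.de/data'
NS_TEXT = 'http://www.dspin.de/data/textcorpus'
P_DATA = '{%s}' % NS_DATA
P_TEXT = '{%s}' % NS_TEXT

out = io.BytesIO()
with etree.xmlfile(out, encoding='utf-8') as xf:
    with xf.element(P_DATA + 'D-Spin', nsmap={None: NS_DATA}):
        xf.write('\n')  # line break after the opening <D-Spin> tag
        with xf.element(P_TEXT + 'TextCorpus', lang='de',
                        nsmap={None: NS_TEXT}):
            xf.write('\n')  # line break after the opening <TextCorpus> tag
            tokens = etree.Element(P_TEXT + 'tokens', nsmap={None: NS_TEXT})
            etree.SubElement(tokens, P_TEXT + 'token')
            # pretty_print indents the subtree and ends it with a newline.
            xf.write(tokens, pretty_print=True)
        xf.write('\n')  # line break after </TextCorpus>, still inside <D-Spin>

data = out.getvalue()
root = etree.fromstring(data)  # sanity check: the output is well-formed
print(data.decode('utf-8'))
```

This only puts each streamed tag on its own line. It does not indent the streamed elements, and the subtree keeps the indentation it would have as a standalone document.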
Thanks a lot for the clarifications and suggestions!

Regards,
Frederik