create xml file incrementally

Hello,

I have a large XML file that I need to modify and then store as a new XML file. The file has a structure similar to

    <root>
      <header>
        <txt>header txt</txt>
      </header>
      <record>
        <field1>1.0</field1>
        <subrecord>
          <field2>A1</field2>
          <field3>C1</field3>
        </subrecord>
      </record>
      <record>
        <field1>1.0</field1>
        <subrecord>
          <field2>A2</field2>
          <field3>C3</field3>
        </subrecord>
      </record>
      <record>
        <field1>1.0</field1>
        <subrecord>
          <field2>A4</field2>
          <field3>B</field3>
        </subrecord>
      </record>
    </root>

I would like to modify the contents of the field3 tags. Because of the file size, I cannot load the complete document into memory, so I intend to use 'iterparse'. Traversing the document and updating the fields is no problem. What I am not sure about is how to write the modified data to a new XML file. The root element is only complete once I have processed the whole file. What I need to do is write the start of the root tag (<root>), then write the header and the records, and finally the end of the root tag (</root>).

Is there functionality in lxml to do this, or should I use standard Python writes for the initial <root> and final </root>?

James Housden, 30.10.2012 20:35:
It's certainly easiest to just write out the root tag yourself. Take care of encodings in that case - as long as you only use UTF-8, you should be fine. Otherwise, you also have to write out an appropriate XML declaration before the root element and properly get the serialised XML elements into the file. Stefan
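A minimal sketch of the "write the root tag yourself" approach, restricted to UTF-8 as advised above. It uses the standard library's `xml.etree.ElementTree` so that it runs without extra dependencies; the same calls are assumed to behave equivalently in `lxml.etree`. The function name `rewrite_records` and the `transform` callable are illustrative, not part of any library.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree


def rewrite_records(source, target, transform):
    """Copy source to target, applying transform to every <field3> text.

    source and target are binary file objects; only one direct child of
    <root> is kept in memory at a time.
    """
    # The declaration and the root start tag are written by hand.
    target.write(b'<?xml version="1.0" encoding="UTF-8"?>\n<root>\n')
    depth = 0
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            if depth == 1:
                root = elem  # keep a handle on <root> so it can be emptied
            continue
        depth -= 1
        if depth != 1:
            continue  # only act when a direct child of <root> is complete
        for field in elem.iter("field3"):
            field.text = transform(field.text or "")
        elem.tail = None  # drop inter-element whitespace; a newline is written instead
        target.write(ET.tostring(elem, encoding="utf-8") + b"\n")
        root.clear()  # release the children that were already written out
    # The root end tag is written by hand as well.
    target.write(b"</root>\n")
```

Both files would be opened in binary mode, for example `rewrite_records(open("in.xml", "rb"), open("out.xml", "wb"), str.lower)`, where `str.lower` stands in for whatever change field3 actually needs.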

Stefan Behnel, 01.11.2012 22:46:
Actually, I'd love to see someone implement a magic API like this:

    # open an "XMLFile" object that knows about XML serialisation
    with xmlfile("somefile.xml", encoding='utf-8') as xf:
        # generate an element (the root element)
        with xf.Element('root-tag') as root_element:
            # generate content, e.g. through iterparse
            for element in generate_some_elements():
                # serialise generated elements into the XML file
                xf.write(element)

That looks like it should be totally trivial to do, but would make the above use case way simpler and safer.

Stefan

Le 01/11/2012 22:56, Stefan Behnel a écrit :
I’ve seen this kind of thing called a "SAX-like serializer". http://hsivonen.iki.fi/producing-xml/ (This page has tons of advice on producing XML) Cheers, -- Simon Sapin

+-- Stefan Behnel:
| Actually, I'd love to see someone implement a magic API like this:
|
|   # open an "XMLFile" object that knows about XML serialisation
|   with xmlfile("somefile.xml", encoding='utf-8') as xf:
|       # generate an element (the root element)
|       with xf.Element('root-tag') as root_element:
|           # generate content, e.g. through iterparse
|           for element in generate_some_elements():
|               # serialise generated elements into the XML file
|               xf.write(element)
|
| That looks like it should be totally trivial to do, but would make the
| above use case way simpler and safer.
+--

Already wrote it:

    http://www.nmt.edu/tcc/projects/sox/

Normally I use lxml for all XML work in Python. But I wrote this so I could generate large tables in HTML or XSL-FO form on a Web server that doesn't give me enough memory to build them as an ElementTree.

An example:

================================================================
import sys
import sox

s = sox.Sox(sys.stdout)
with s.start("html"):
    with s.start("head"):
        s.leaf("title", "Page title here")
    with s.start("body"):
        s.leaf("h1", "Main title here")
        s.leaf("hr")
        s.leaf("p", {'class': 'note'}, "Some text",
               " and some more text", id='p001')
----------------------------------------------------------------

Best regards,
John Shipman (john@nmt.edu), Applications Specialist
New Mexico Tech Computer Center, Speare 146, Socorro, NM 87801
(575) 835-5735, http://www.nmt.edu/~john
``Let's go outside and commiserate with nature.'' --Dave Farber

On 11/01/12 23:56, Stefan Behnel wrote:
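Hi,

Here's another idea:

    from lxml.etree import Element

    def g(x):
        for i in xrange(x):
            elt = Element("number")
            elt.text = str(i)
            yield elt

    elt = Element('some_root')
    # append_generator() is the proposed method; it does not exist in lxml.
    # The count is an integer (10**10) because xrange() rejects floats.
    elt.append_generator(g(10**10))

and make append_generator consume the generator only when someone iterates over elt, possibly only when serializing.

Best,
Burak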

Stefan Behnel, 01.11.2012 22:56:
Thanks everyone for the comments. However, I really meant it the way I showed above. The use case is to freely (and efficiently) mix in-memory trees with incrementally generated content, and to safely write everything out into a file as it is being generated. In fact, the most important use case is to write out an XML declaration with a root element and then only write out in-memory trees one by one into that root element, e.g. coming from iterparse with some intermediate processing. Most of the code for that is already in serialiser.pxi anyway. Stefan

On 11/10/12 18:34, Stefan Behnel wrote:
Hi Stefan,

Maybe I'm not seeing something obvious, but how would this work in the WSGI case, where you have to yield a bunch of strings instead of writing to a file-like object?

From the WSGI spec (http://www.python.org/dev/peps/pep-0333/#the-write-callable):

"""
New WSGI applications and frameworks *should not* use the write() callable if it is possible to avoid doing so. The write() callable is strictly a hack to support imperative streaming APIs. In general, applications should produce their output via their returned iterable, as this makes it possible for web servers to interleave other tasks in the same Python thread, potentially providing better throughput for the server as a whole.
"""

Emphasis not mine.

Best,
Burak

Burak Arslan, 17.11.2012 09:24:
It's a different use case, although a valid one. I can't see a way to merge the two. The above is a push interface, whereas the WSGI case uses a pull interface.

However, note that I consider this a less important use case. The problem with generating XML as part of the WSGI output phase is two-fold.

One is that certain stages in the WSGI pipeline may still force the iterable to be unfolded before sending everything out. That is not XML related; it is more of a general "sending large data" problem. That's obviously in the hands of the application designers, but if they incrementally generate their content, they have to take care that their whole web stack handles this nicely.

The more important problem is that serialisation errors will only be detected very late, further down in the WSGI pipeline and way outside the application code that might want to handle them. If the data is coming from a non-trivial source (and I would expect most sources of large amounts of data to be non-trivial), this means that you will end up sending a potentially large amount of data to the client before you notice that there is a problem that you have to handle.

Anyway, I think the two use cases are sufficiently different to have two different interfaces. A "yield" based pull approach (potentially using "yield from" for structural chaining) doesn't fold well into a push interface for writing incrementally into a file.

Stefan
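For contrast with the push-style file writer, here is a rough sketch of what the pull interface looks like in a WSGI application. It relies only on the standard library; `iter_xml_chunks`, `numbers` and `app` are made-up names for illustration, and the root tag is assumed to be a plain ASCII name without attributes.

```python
import xml.etree.ElementTree as ET


def iter_xml_chunks(root_tag, elements):
    """Yield a UTF-8 XML document piece by piece (pull interface)."""
    yield b'<?xml version="1.0" encoding="UTF-8"?>\n'
    yield ("<%s>" % root_tag).encode("utf-8")
    for elem in elements:
        # A failure here surfaces only once the consumer has pulled this far,
        # i.e. after earlier chunks may already have gone out to the client.
        yield ET.tostring(elem, encoding="utf-8")
    yield ("</%s>" % root_tag).encode("utf-8")


def numbers(count):
    """Example element source; stands in for a database query or iterparse."""
    for i in range(count):
        elem = ET.Element("number")
        elem.text = str(i)
        yield elem


def app(environ, start_response):
    """Minimal WSGI application that returns the chunks as its iterable."""
    start_response("200 OK", [("Content-Type", "application/xml")])
    return iter_xml_chunks("numbers", numbers(3))
```

The server drives the generator, so the application code has already returned by the time most of the serialisation happens, which is the late error detection problem described above.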

Le 17/11/2012 11:17, Stefan Behnel a écrit :
It’s possible to wrap code that uses a .write() method into an iterator using greenlets: http://werkzeug.pocoo.org/docs/contrib/iterio/ Or, more generally: to convert a push interface into a pull interface using coroutines. Of course, the error handling issues that Stefan mentions still apply. Cheers, -- Simon Sapin

On 11/17/12 12:17, Stefan Behnel wrote:
See: https://github.com/arskom/spyne/issues/187 (if you wonder what spyne is, see http://spyne.io)

So I'm quite familiar with both of the issues you mention. The first point can be addressed rather easily with rigorous testing (see the streaming example in spyne's examples directory for a combination that works).

But with the second point, things are more complicated. The problem stems from the shortcomings of the established application-level protocols -- none of them were designed to communicate mid-stream errors to the client. Unfortunately, I don't think there's a way around this besides designing a new RPC protocol.

That said, most of my data comes from a select query where mid-stream errors are quite infrequent, so it's not as much of a problem when the error handling is done as far upfront as possible. That's why, when consuming a generator, Spyne runs it until the first yield before sending out any headers for protocols that support headers (only Http and Soap as of now).

Simon, I'm also aware of the technique that you point to, but as the WSGI spec also mentions, this comes with its own overhead, so it should be used only as a last resort.

So, does this mean my earlier 'append_generator' suggestion has a green light?

Best,
Burak
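The "run it until the first yield" idea can be sketched generically as follows. This is not Spyne's code, which the thread does not show; `prime` is a hypothetical helper name, and the only claim is that errors raised before the first chunk surface at the call site instead of mid-stream.

```python
import itertools


def prime(chunks):
    """Advance an iterable to its first item so that early errors raise here.

    Returns an iterator that yields the same items as the original one.
    """
    iterator = iter(chunks)
    try:
        first = next(iterator)
    except StopIteration:
        return iter(())  # nothing to send at all
    # Put the item that was already consumed back in front of the rest.
    return itertools.chain([first], iterator)


def failing_query():
    """Example source whose setup fails before any output is produced."""
    raise ValueError("query failed")
    yield b"<row/>"  # never reached
```

A server would call `prime()` before sending response headers, so a `ValueError` from `failing_query()` can still be turned into a proper error response. Errors raised after the first chunk remain a mid-stream problem.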

Burak Arslan, 17.11.2012 21:37:
> So, does this mean my earlier 'append_generator' suggestion has a green light?
Not as part of lxml, especially not as part of the tree API (where it is clearly misplaced). It would substantially complicate the internal implementation without providing a major advantage over other solutions. If you want to use generators for this, write a dedicated serialisation tool. Stefan

On 11/18/12 08:33, Stefan Behnel wrote:
I'm not familiar with the internals of lxml, so if you say it'd unnecessarily complicate things, I'd have to take your word for that. But then, does this mean lxml will never get to support a pull interface via generators?

Also, the reason(s) why this interface does not belong in lxml are not so clear to me, could you elaborate?

Here's another idea:

==============
huge_parent = Element("some_call_response")
huge_elt = SubElement(huge_parent, "some_huge_array_of_objects")

yield serialize_until(huge_elt)
yield from generate_xml_fragments()  # a lot of 'yield etree.tostring(some_elt)'s
yield serialize_after(huge_elt)
==============

That's uglier, but I think it's better than append_generator because its behaviour is explicit (you can iterate only once on an element whose children come from generators). It's also an outside API, not part of the tree API.

What do you think?

Best,
Burak
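One way the serialize_until / serialize_after pair could be approximated outside the tree API is to serialise the small in-memory tree once with a placeholder and cut it in two. The sketch below uses the standard library only; `split_around` and `response_chunks` are hypothetical names, and it assumes the target element is empty and part of the given root.

```python
import uuid
import xml.etree.ElementTree as ET


def split_around(root, target):
    """Return (head, tail): serialised bytes before and after target's content.

    Hypothetical helper; assumes target is an empty descendant of root.
    """
    token = "split-%s" % uuid.uuid4().hex
    marker = ET.Comment(token)  # temporary placeholder child
    target.append(marker)
    try:
        data = ET.tostring(root, encoding="utf-8")
    finally:
        target.remove(marker)  # leave the tree as it was found
    head, tail = data.split(("<!--%s-->" % token).encode("utf-8"))
    return head, tail


def response_chunks(fragments):
    """Yield the opening tags, then each fragment, then the closing tags."""
    huge_parent = ET.Element("some_call_response")
    huge_elt = ET.SubElement(huge_parent, "some_huge_array_of_objects")
    head, tail = split_around(huge_parent, huge_elt)
    yield head
    for fragment in fragments:
        yield ET.tostring(fragment, encoding="utf-8")
    yield tail
```

The fragments never become children of `huge_elt`, so the tree code does not need to know about lazily attached content.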

Burak Arslan, 18.11.2012 12:19:
The best way to convince other people that your idea is a good one is to implement it.
> Also, the reason(s) why this interface does not belong in lxml are not so clear to me, could you elaborate?
The problem is the place where you put this new feature - you attach it to an Element. This means that all code in lxml that deals with tree handling will then have to make sure that any lazily attached subtrees get unfolded when necessary. And since there isn't much code in lxml that does not deal with tree handling in one way or another, we are talking about a *lot* of code here, and certainly a lot of unexpected places from a user's point of view. Apart from that, I wouldn't mind if you can come up with a good way to serialise XML incrementally using 'yield'. Stefan

Stefan Behnel, 10.11.2012 17:34:
One bug in the above code example:
> with xf.Element('root-tag') as root_element:

This should simply read

    with xf.Element('root-tag'):

because it doesn't make sense (and, in fact, would be problematic) to provide access to an element that was already serialised.

I've started to drop the above into an implementation. Looks good so far.

Stefan
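To make the shape of the proposed API concrete, here is a small stand-in written with the standard library only. `XmlFileSketch` is a hypothetical class, not lxml code; it is limited to UTF-8 and does no validation of tag names.

```python
import xml.etree.ElementTree as ET
from contextlib import contextmanager
from xml.sax.saxutils import quoteattr


class XmlFileSketch:
    """Hypothetical push-style incremental XML writer (UTF-8 only)."""

    def __init__(self, stream):
        self._stream = stream  # binary file-like object

    def __enter__(self):
        self._stream.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
        return self

    def __exit__(self, exc_type, exc, tb):
        return False  # never swallow exceptions

    @contextmanager
    def Element(self, tag, **attrib):
        attrs = "".join(" %s=%s" % (k, quoteattr(v)) for k, v in attrib.items())
        self._stream.write(("<%s%s>" % (tag, attrs)).encode("utf-8"))
        yield  # nothing is handed out: the start tag is already serialised
        self._stream.write(("</%s>" % tag).encode("utf-8"))

    def write(self, elem):
        """Serialise one complete in-memory element into the open element."""
        self._stream.write(ET.tostring(elem, encoding="utf-8"))
```

Usage then mirrors the corrected snippet: `with XmlFileSketch(f) as xf:`, `with xf.Element('root-tag'):`, and `xf.write(element)` for each generated element. If the body raises, the end tag is not written and the exception propagates.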


participants (5)
- Burak Arslan
- James Housden
- John W. Shipman
- Simon Sapin
- Stefan Behnel