[lxml-dev] generative building of xml?

I am generating, processing and eventually serializing several XML streams. I was wondering if this was possible to do with lxml?

Here's the setup. I've got several databases generating XML content (which can be quite large). I really want to be able to process the database records progressively, generating XML and sending it out on its own stream. An aggregator/filter (elsewhere) will read the streams, parse them, process similar members and generate a new stream based on the combined streams.

    DB1   DB2   DB3      Core database
    XML   XML   XML      XML generation
    WS    WS    WS       delivery over a stream using a generator
     |     |     |
     +-----+-----+
           |
          AGG            parse and match incoming streams (iterparse)
          XML
          WS             send resulting merge as XML using a generator

So the questions:

1. Does anybody have a recipe to build a recursive generator using Element?
2. Given the above generator, is there any such thing as a generator version of etree.tostring?

--
Kristian Kvilekval  kris@cs.ucsb.edu  http://www.cs.ucsb.edu/~kris
w:805-636-1599 h:504-9756

Hi,

kris wrote:
I am generating, processing and eventually serializing several XML streams. I was wondering if this was possible to do with lxml?
Probably, although lxml is not designed for pipelined XML processing (any better than SAX, that is). It also depends on what your XML looks like. If it's from a database, it's probably something simple like

    <root>
      <row>
        <column>...</column>
        ...
      </row>
      ...
    </root>

That shouldn't cause too many problems: you can use the (SAX-like) target parser to copy it into a simple Python container class, use that inside your program, merge all of those objects into a single stream at some point, and then generate a new XML stream from that.
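As a rough sketch of that target-parser approach, assuming the simple row/column shape above (the class and tag names are illustrative, not lxml API beyond XMLParser(target=...)):

    from lxml import etree

    class RowCollector:
        """Parser target that copies each <row> into a plain dict."""
        def __init__(self):
            self.rows = []
            self._row = None
            self._field = None

        def start(self, tag, attrib):
            if tag == 'row':
                self._row = {}
            elif self._row is not None:
                self._field = tag

        def data(self, text):
            if self._field is not None:
                # accumulate text, as it may arrive in several chunks
                self._row[self._field] = self._row.get(self._field, '') + text

        def end(self, tag):
            if tag == 'row':
                self.rows.append(self._row)
                self._row = None
            elif tag == self._field:
                self._field = None

        def close(self):
            return self.rows

    parser = etree.XMLParser(target=RowCollector())
    rows = etree.fromstring(b'<root><row><column>42</column></row></root>', parser)
    # rows == [{'column': '42'}]

With a target parser, fromstring() returns whatever the target's close() method returns, here the list of plain row dicts.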
Here's the setup. I've got several databases generating XML content (which can be quite large). I really want to be able to process the database records progressively, generating XML and sending it out on its own stream.

An aggregator/filter (elsewhere) will read the streams, parse them, process similar members and generate a new stream based on the combined streams.

    DB1   DB2   DB3      Core database
    XML   XML   XML      XML generation
    WS    WS    WS       delivery over a stream using a generator
A generator? Interesting. Why not just a file-like object? If the interface is a generator (yielding strings, I assume), then you will have to use the feed parser interface to copy the data into the parser. Otherwise, you can just use one thread per DB connection and have it read and parse the data for you.
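A minimal sketch of the feed parser interface, assuming the generator yields byte chunks of one well-formed document:

    from lxml import etree

    def parse_generator(chunks):
        # Push chunks from a string generator into the feed parser;
        # close() returns the root Element once the document is complete.
        parser = etree.XMLParser()
        for chunk in chunks:
            parser.feed(chunk)
        return parser.close()

    root = parse_generator(iter([b'<root><row>', b'</row></root>']))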
2. Given the above generator, is there any such thing as a generator version of etree.tostring?
Nothing keeps you from yielding "<root>", followed by the serialised stream entries (call tostring() on each separately), followed by a "</root>".

Stefan
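In code, that suggestion might look like this rough sketch (the root tag name and the entries iterable are assumptions):

    from lxml import etree

    def serialise_stream(entries):
        # Wrap independently serialised entry Elements in a shared root.
        yield b'<root>'
        for entry in entries:       # entries: an iterable of Elements
            yield etree.tostring(entry)
        yield b'</root>'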

On Thu, 2008-05-08 at 09:22 +0200, Stefan Behnel wrote:
Hi,
Probably, although lxml is not designed for pipelined XML processing (any better than SAX, that is).
It also depends on what your XML looks like. If it's from a database, it's probably something simple like
    <root>
      <row>
        <column>...</column>
        ...
      </row>
      ...
    </root>
That shouldn't cause too many problems: you can use the (SAX-like) target parser to copy it into a simple Python container class, use that inside your program, merge all of those objects into a single stream at some point, and then generate a new XML stream from that.
Here's the setup. I've got several databases generating XML content (which can be quite large). I really want to be able to process the database records progressively, generating XML and sending it out on its own stream.

An aggregator/filter (elsewhere) will read the streams, parse them, process similar members and generate a new stream based on the combined streams.

    DB1   DB2   DB3      Core database
    XML   XML   XML      XML generation
    WS    WS    WS       delivery over a stream using a generator
A generator? Interesting. Why not just a file-like object?
I was thinking of a generator because I am feeding this to a stream that works with/on generators. The databases are returning top-k queries as XML files. Each DB keeps generating its best hits as a stream; the aggregator sorts them and sends them to the client. I would like to propagate the query all the way to the component databases, using generators to minimize the work each one does.
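For the sorting step, something like heapq.merge could combine the per-database streams lazily; the (score, fragment) pair interface below is purely hypothetical:

    import heapq

    def aggregate(db_streams):
        # Each hypothetical stream yields (score, xml_fragment) pairs,
        # best hits first; merge() keeps the combined stream ordered
        # without reading any stream further than needed.
        for _score, fragment in heapq.merge(*db_streams, reverse=True):
            yield fragment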
If the interface is a generator (yielding strings, I assume), then you will have to use the feed parser interface to copy the data into the parser. Otherwise, you can just use one thread per DB connection and have it read and parse the data for you.
2. Given the above generator, is there any such thing as a generator version of etree.tostring?
Nothing keeps you from yielding "<root>", followed by the serialised stream entries (call tostring() on each separately), followed by a "</root>".
Unfortunately it is a tree structure. I would like to visit the tree with something like:

    yield '<root>'
    yield '  <child attr0="a" attr1="b">'
    yield '    <child ...'
    ...
    yield '    </child>'
    yield '  </child>'
    yield '  <child attr0="c" attr1="d">'
    ...
    yield '</root>'
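One possible recipe for such a recursive generator over an Element tree, as a rough sketch (no namespace or pretty-printing support; the escaping helpers from xml.sax.saxutils are an assumption):

    from xml.sax.saxutils import escape, quoteattr

    def iterserialise(element):
        # Recursively walk an Element tree, yielding one serialised
        # fragment per tag, text chunk and tail.
        attrs = ''.join(' %s=%s' % (name, quoteattr(value))
                        for name, value in element.attrib.items())
        yield '<%s%s>' % (element.tag, attrs)
        if element.text:
            yield escape(element.text)
        for child in element:
            for fragment in iterserialise(child):
                yield fragment
            if child.tail:
                yield escape(child.tail)
        yield '</%s>' % element.tag

For simple trees this emits the same markup as etree.tostring(), just one fragment at a time.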
Stefan

--
Kristian Kvilekval  kris@cs.ucsb.edu  http://www.cs.ucsb.edu/~kris
w:805-636-1599 h:504-9756

Hi,

kris wrote:
On Thu, 2008-05-08 at 09:22 +0200, Stefan Behnel wrote:
If the interface is a generator (yielding strings, I assume), then you will have to use the feed parser interface to copy the data into the parser. Otherwise, you can just use one thread per DB connection and have it read and parse the data for you.
2. Given the above generator, is there any such thing as a generator version of etree.tostring?

Nothing keeps you from yielding "<root>", followed by the serialised stream entries (call tostring() on each separately), followed by a "</root>".
Unfortunately it is a tree structure. I would like to visit the tree with something like:

    yield '<root>'
    yield '  <child attr0="a" attr1="b">'
    yield '    <child ...'
    ...
    yield '    </child>'
    yield '  </child>'
    yield '  <child attr0="c" attr1="d">'
    ...
    yield '</root>'
I think that's a bad idea, as you lose semantics that you will need to recover in each generator step. My approach would be: let the databases write file-like streams (a socket or whatever), attach an iterparse() thread to each of them, copy the data of each entry to a container object (or maybe just use iterparse() with lxml.objectify), merge the container objects into a single stream in a thread-safe way, and serialise the resulting stream of entries to an XML stream, maybe even manually, as I suggested.

Stefan
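A rough sketch of that architecture, assuming <row> entries and a queue-based hand-off between the parser threads and the merging generator (both assumptions, not part of lxml):

    import threading
    import queue
    from lxml import etree

    _DONE = object()  # per-stream end-of-input marker

    def copy_entries(stream, out_queue):
        # Parse one incoming file-like stream with iterparse() and hand
        # each serialised <row> to the merger; clear() frees each
        # subtree after use so large streams stay cheap to parse.
        for _event, elem in etree.iterparse(stream, tag='row'):
            out_queue.put(etree.tostring(elem))
            elem.clear()
        out_queue.put(_DONE)

    def merged_stream(streams):
        # Serialise the combined entries of all streams as one document.
        q = queue.Queue()
        for stream in streams:
            threading.Thread(target=copy_entries, args=(stream, q)).start()
        yield b'<root>'
        open_streams = len(streams)
        while open_streams:
            item = q.get()
            if item is _DONE:
                open_streams -= 1
            else:
                yield item
        yield b'</root>'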