Hi, I'm not a programmer, so please be patient; I just need some scripting done, and my choice is Python + lxml. The problem to solve:

1. An XML file with some data is exported from our ERP system using third-party tools. It has an XML schema.
2. It is intended for import into another system (SQL/Firebird).
3. Between the export and import I'd like to validate the XML (using the XML schema).
4. Then I need to import it into the other system:
4a. Check the values of the corresponding data in the XML and the SQL database, compare them, do some action, write to a log, etc.
4b. Put them into the SQL database (update, insert new).

Validating is simple (point 3). Then I need to traverse the XML, record by record, do something with each, and translate it into an SQL query.
From this point of view, IMHO, what is the right way: use lxml.objectify or etree? I don't care about efficiency; instead, it should be as simple as possible (to modify, read, etc.).
Could you give me some hints here? Some programming idioms, patterns, or just basic pseudocode to follow? Regards, Piotr
On Tue, 2012-01-24 at 14:49 +0100, Piotr Oh wrote:
I'm not a programmer, so please be patient; I just need some scripting done, and my choice is Python + lxml. The problem to solve:

1. An XML file with some data is exported from our ERP system using third-party tools. It has an XML schema.
2. It is intended for import into another system (SQL/Firebird).
3. Between the export and import I'd like to validate the XML (using the XML schema).
4. Then I need to import it into the other system:
4a. Check the values of the corresponding data in the XML and the SQL database, compare them, do some action, write to a log, etc.
4b. Put them into the SQL database (update, insert new).

From this point of view, IMHO, what is the right way: use lxml.objectify or etree? I don't care about efficiency; instead, it should be as simple as possible (to modify, read, etc.). Could you give me some hints here? Some programming idioms, patterns, or just basic pseudocode to follow?
What you are describing is a classic use case for a workflow engine; numerous of these exist. Perhaps you should investigate using such a tool. Of course, they are off-topic for an lxml list.

Most allow you to select data from an RDBMS or import from some type of file [flat, delimited, etc...] to XML. Then you can use XSLT, XPath, what-not, and export the data to a file or insert/upsert/update the data into an RDBMS.

For example, using the engine I hack on, I'd perform an sqlSelectAction, pass the output of that to a transformAction, and then pass the output of that to either an sqlInsertAction or an sqlUpsertAction. I just edit the routes [workflow definitions] in BPML [which is itself XML].

--
System & Network Administrator [ LPI & NCLA ] <http://www.whitemiceconsulting.com>
OpenGroupware Developer <http://www.opengroupware.us>
Adam Tauno Williams
2012/1/24 Adam Tauno Williams <awilliam@whitemice.org>
What you are describing is a classic use case for a workflow engine; numerous of these exist. Perhaps you should investigate using such a tool. Of course, they are off-topic for an lxml list.
Ok, I am aware that there are many ways to do this, but why not with lxml and some Python code? Dealing with another high-level technology like BPML and whatever else it requires is just too much for me. It smells of Java, which means big APIs, a lot of docs to dig into, etc.
Most allow you to select data from an RDBMS or import from some type of file [flat, delimited, etc...] to XML. Then you can use XSLT, XPath, what-not, and export the data to a file or insert/upsert/update the data into an RDBMS.
And is what you are talking about here free, open source, etc.? Regards, P.
On Tue, 2012-01-24 at 21:03 +0100, Piotr Oh wrote:
2012/1/24 Adam Tauno Williams <awilliam@whitemice.org>

What you are describing is a classic use case for a workflow engine; numerous of these exist. Perhaps you should investigate using such a tool. Of course, they are off-topic for an lxml list.

Ok, I am aware that there are many ways to do this, but why not with lxml and some Python code?
You can certainly do that; it just seems like a lot of plumbing code for something that is already 'done'. And you end up writing the code over and over again.
Dealing with another high-level technology like BPML and whatever else it requires is just too much for me.

Most allow you to select data from an RDBMS or import from some type of file [flat, delimited, etc...] to XML. Then you can use XSLT, XPath, what-not, and export the data to a file or insert/upsert/update the data into an RDBMS.
And is what you are talking about here free, open source, etc.?
Numerous of them. I used to use BIE, which is Java, but I have ported most of my routes over to OpenGroupware Coils [which contains a workflow engine; I'm the primary developer of this Python-based project]. <https://sourceforge.net/projects/coils/>

It depends on what you want to do. If you just need a script, Python is your answer. I like the ability to submit data for processing, get error reports, and be able to go back and look at how something was processed, how long it took, etc.

--
System & Network Administrator [ LPI & NCLA ] <http://www.whitemiceconsulting.com>
OpenGroupware Developer <http://www.opengroupware.us>
Adam Tauno Williams
2012/1/24 Adam Tauno Williams <awilliam@whitemice.org>
You can certainly do that; it just seems like a lot of plumbing code for something that is already 'done'. And you end up writing the code over and over again.
I have investigated a bit on the internet what a workflow engine really is. Well, it is a big thing ;-) To be honest, I don't even know how it could help with my little task. When I searched for workflow engines for Python, I did not find much. Generally those projects are tightly coupled with other projects like Django, Zope, etc. And workflow == mostly Java.
Most allow you to select data from an RDBMS or import from some type of file [flat, delimited, etc...] to XML. Then you can use XSLT, XPath, what-not, and export the data to a file or insert/upsert/update the data into an RDBMS.
Nice, but the learning curve is IMHO much higher than lxml + custom Python code that gets the job done with no extra features?
Numerous of them. I used to use BIE, which is Java, but I have ported most of my routes over to OpenGroupware Coils [which contains a workflow engine; I'm the primary developer of this Python-based project]. <https://sourceforge.net/projects/coils/>
Is OpenGroupware still alive? That page looks dead (http://www.opengroupware.org/), and I wondered what happened to this project. Wasn't it PHP-based previously?
It depends on what you want to do. If you just need a script, Python is your answer. I like the ability to submit data for processing, get error reports, and be able to go back and look at how something was processed, how long it took, etc.
That would be nice, but I really don't even know where to start digging, not to mention using some workflow engine. P.
On Wed, 2012-01-25 at 11:41 +0100, Piotr Oh wrote:
2012/1/24 Adam Tauno Williams <awilliam@whitemice.org>
You can certainly do that; it just seems like a lot of plumbing code for something that is already 'done'. And you end up writing the code over and over again.
To be honest, I don't even know how it could help with my little task.

Ok. My post was only a suggestion of something worth investigating [little tasks, if they succeed, don't stay little]. And this debate is off-topic for the list. But feel free to raid the OpenGroupware Coils source code [the license is MIT/X11]. It can select data into XML and supports loading data into a Python DB-API connection from XML [it uses a schema called 'StandardXML' to represent columnar data].
Most allow you to select data from an RDBMS or import from some type of file [flat, delimited, etc...] to XML. Then you can use XSLT, XPath, what-not, and export the data to a file or insert/upsert/update the data into an RDBMS.
Nice, but the learning curve is IMHO much higher than lxml + custom Python code that gets the job done with no extra features?
The "no extra features" comment is absurd.
Numerous of them. I used to use BIE, which is Java, but I have ported most of my routes over to OpenGroupware Coils [which contains a workflow engine; I'm the primary developer of this Python-based project]. <https://sourceforge.net/projects/coils/>
Is OpenGroupware still alive? That page looks dead (http://www.opengroupware.org/), and I wondered what happened to this project.
That isn't the link I sent you.
Wasn't it PHP-based previously?
Nope, never used PHP. OGo was/is built in Objective-C.

--
Adam Tauno Williams <http://www.whitemiceconsulting.com>
System Administrator, OpenGroupware Developer, LPI / CNA
Fingerprint 8C08 209A FBE3 C41A DD2F A270 2D17 8FA4 D95E D383
Piotr Oh, 24.01.2012 14:49:
I'm not a programmer, so please be patient; I just need some scripting done, and my choice is Python + lxml.
Excellent choice. :)
The problem to solve:
1. An XML file with some data is exported from our ERP system using third-party tools. It has an XML schema.
2. It is intended for import into another system (SQL/Firebird).
3. Between the export and import I'd like to validate the XML (using the XML schema).
That sounds like the easy part. What amount of data are you talking about? Most importantly: does it fit into memory or not? lxml's parser can validate during parsing, also for iterparse(), in case it won't fit.
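For illustration, a minimal sketch of both variants with lxml (the file names are invented):

    from lxml import etree

    # Hypothetical file names, for illustration only.
    schema = etree.XMLSchema(etree.parse("export.xsd"))

    # Variant 1: parse the whole document into memory, then validate.
    doc = etree.parse("export.xml")
    schema.assertValid(doc)  # raises DocumentInvalid with details on failure

    # Variant 2: validate incrementally while parsing (for large files).
    for event, element in etree.iterparse("export.xml", schema=schema):
        pass  # an error is raised during iteration if validation fails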
4. Then I need to import it into the other system:
4a. Check the values of the corresponding data in the XML and the SQL database, compare them, do some action, write to a log, etc.
4b. Put them into the SQL database (update, insert new).
I don't know how Firebird handles this, but you should try if

a) it has a direct way to import XML data in some way
b) you can get away with generating "INSERT OR UPDATE" statements.

In general, a large amount of database roundtrips will make your program much slower than a direct import of a database dump, sometimes by orders of magnitude. Generating a SQL dump file and letting the DB load that directly is bound to be much faster.

However, if the diff is a real requirement, you may still have to find a way to compare the data manually. What I did in a project once was to dump the database content in SQL format, then line diff that with a dump I had provided myself, after running both through Unix sort.

So, one approach would be to write a script that converts the XML data to SQL statements that match those that your DB dumps itself, and then either import them or dump the current DB content next to it and run a diff.
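A rough sketch of that "generate a SQL dump" road (the table and column names are invented; Firebird's UPDATE OR INSERT ... MATCHING syntax is assumed to be available, which requires Firebird 2.1 or later):

    # Hypothetical converter from one XML record to one SQL statement.
    # The escaping here is deliberately naive and only illustrates the idea.
    def record_to_sql(record_id, name, price):
        escaped_name = name.replace("'", "''")
        return ("UPDATE OR INSERT INTO items (id, name, price) "
                "VALUES (%d, '%s', %s) MATCHING (id);"
                % (record_id, escaped_name, price))

    with open("import.sql", "w") as out:
        # In the real script, these values would come from the parsed XML.
        out.write(record_to_sql(1, "widget", "9.99") + "\n")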
Validating is simple (point 3). Then I need to traverse the XML, record by record, do something with each, and translate it into an SQL query.
The most obvious approach to that is iterparse().
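A sketch of that record-by-record loop (the tag name "record" is an assumption about the export format):

    from lxml import etree

    for event, record in etree.iterparse("export.xml", tag="record"):
        # Collect the record's child elements as tag -> text.
        values = dict((child.tag, child.text) for child in record)
        # ... compare against the database, build the SQL, write to the log ...

        # Standard iterparse cleanup: free the processed subtree so
        # memory use stays flat even for large files.
        record.clear()
        while record.getprevious() is not None:
            del record.getparent()[0]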
From this point of view, IMHO, what is the right way: use lxml.objectify or etree?
Sadly, objectify still doesn't support iterparse() directly, but it should be possible to install objectify's element lookup scheme as the default lookup scheme and then run iterparse().

http://lxml.de/objectify.html#advanced-element-class-lookup
http://lxml.de/api/lxml.etree-module.html#set_element_class_lookup
http://lxml.de/element_classes.html#setting-up-a-class-lookup-scheme
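A sketch of that combination (untested; whether iterparse() picks up the module-level default lookup may depend on the lxml version):

    from lxml import etree, objectify

    # Install objectify's class lookup as the module-wide default, so that
    # elements produced by iterparse() come back as objectify objects.
    etree.set_element_class_lookup(objectify.ObjectifyElementClassLookup())

    for event, record in etree.iterparse("export.xml", tag="record"):
        # record.some_field would now give typed access, objectify-style.
        pass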
I don't care about efficiency; instead, it should be as simple as possible (to modify, read, etc.).
Both etree and objectify can be quite readable. If you go the "generate SQL dump" road, etree may be simpler because objectify's auto data conversion may get in the way when handling only strings, whereas if you take the database roundtrips road, your code may turn out to be more concise with objectify. Apart from that, choose what you like better. Stefan
I don't know how Firebird handles this, but you should try if
a) it has a direct way to import XML data in some way
b) you can get away with generating "INSERT OR UPDATE" statements.
In general, a large amount of database roundtrips will make your program much slower than a direct import of a database dump, sometimes by orders of magnitude. Generating a SQL dump file and letting the DB load that directly is bound to be much faster.
Assuming the use of the kinterbasdb Firebird driver, the most efficient way would be to send a list of tuples as parameters for such an update/insert statement. The statement then only has to be prepared (pseudo-compiled, whatever) _once_ on the server side, and the number of network roundtrips will also be minimal. --dirk
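To illustrate that suggestion, a sketch using kinterbasdb's DB-API interface (the connection details and the table layout are invented; UPDATE OR INSERT ... MATCHING requires Firebird 2.1+):

    import kinterbasdb

    con = kinterbasdb.connect(dsn="localhost:/data/erp.fdb",
                              user="sysdba", password="masterkey")
    cur = con.cursor()

    # One tuple per record, built while traversing the XML.
    rows = [(1, "widget", "9.99"), (2, "gadget", "4.50")]

    # The parameterized statement is prepared once on the server side
    # and executed for every tuple in the list.
    cur.executemany(
        "UPDATE OR INSERT INTO items (id, name, price) "
        "VALUES (?, ?, ?) MATCHING (id)", rows)
    con.commit()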
Dirk Rothe, 25.01.2012 14:16:
I don't know how Firebird handles this, but you should try if
a) it has a direct way to import XML data in some way
b) you can get away with generating "INSERT OR UPDATE" statements.
In general, a large amount of database roundtrips will make your program much slower than a direct import of a database dump, sometimes by orders of magnitude. Generating a SQL dump file and letting the DB load that directly is bound to be much faster.
Assuming the use of the kinterbasdb Firebird driver, the most efficient way would be to send a list of tuples as parameters for such an update/insert statement. The statement then only has to be prepared (pseudo-compiled, whatever) _once_ on the server side, and the number of network roundtrips will also be minimal.
Right, that makes sense for bulk updates. However, I was referring to the intention to do a diff and incremental updates between the DB content and the XML data, in which case there'd likely be a lot of requests sending data back and forth between the database and the program while parsing the XML file and comparing the corresponding DB entries. You can't really do any bulk access here. Stefan
On 25.01.2012, 14:31, Stefan Behnel <stefan_ml@behnel.de> wrote:
Dirk Rothe, 25.01.2012 14:16:
I don't know how Firebird handles this, but you should try if
a) it has a direct way to import XML data in some way
b) you can get away with generating "INSERT OR UPDATE" statements.
In general, a large amount of database roundtrips will make your program much slower than a direct import of a database dump, sometimes by orders of magnitude. Generating a SQL dump file and letting the DB load that directly is bound to be much faster.
Assuming the use of the kinterbasdb Firebird driver, the most efficient way would be to send a list of tuples as parameters for such an update/insert statement. The statement then only has to be prepared (pseudo-compiled, whatever) _once_ on the server side, and the number of network roundtrips will also be minimal.
Right, that makes sense for bulk updates.
However, I was referring to the intention to do a diff and incremental updates between the DB content and the XML data, in which case there'd likely be a lot of requests sending data back and forth between the database and the program while parsing the XML file and comparing the corresponding DB entries. You can't really do any bulk access here.
In my experience that normally leads to some form of caching and "incremental" bulk updates. But I guess that's not really lxml-related anymore ;)
Assuming the use of the kinterbasdb Firebird driver, the most efficient way would be to send a list of tuples as parameters for such an update/insert statement. The statement then only has to be prepared (pseudo-compiled, whatever) _once_ on the server side, and the number of network roundtrips will also be minimal.
And what about the ODBC driver? Which one of them would be more straightforward, simpler, etc.? P.
Piotr Oh, 25.01.2012 15:27:
Assuming the use of the kinterbasdb Firebird driver, the most efficient way would be to send a list of tuples as parameters for such an update/insert statement. The statement then only has to be prepared (pseudo-compiled, whatever) _once_ on the server side, and the number of network roundtrips will also be minimal.
And what about the ODBC driver? Which one of them would be more straightforward, simpler, etc.?
You won't use the driver directly in your code. Python has a generic API for that (DB-API 2) which you will use instead. Stefan
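In other words, only the import and the arguments to connect() are driver-specific; the rest follows the same DB-API 2 pattern regardless of the driver. A sketch with invented connection details:

    import kinterbasdb  # swapping the driver mainly changes this import
                        # and the arguments to connect()

    con = kinterbasdb.connect(dsn="localhost:/data/erp.fdb",
                              user="sysdba", password="masterkey")
    cur = con.cursor()
    cur.execute("SELECT name, price FROM items WHERE id = ?", (1,))
    print(cur.fetchone())
    con.close()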
On 25.01.2012, 15:27, Piotr Oh <piotrlg@gmail.com> wrote:
Assuming the use of the kinterbasdb Firebird driver, the most efficient way would be to send a list of tuples as parameters for such an update/insert statement. The statement then only has to be prepared (pseudo-compiled, whatever) _once_ on the server side, and the number of network roundtrips will also be minimal.
And what about the ODBC driver? Which one of them would be more straightforward, simpler, etc.?
Kinterbasdb exposes the DB-API 2 that Stefan mentioned, and as far as I know, the ODBC driver does the same. But I would surely go for the native driver, as you get additional access to the Firebird-specific service API (backups, restores, statistics, monitoring and such). And if one of the two is more efficient, it's pretty much guaranteed to be the native one. --dirk
That sounds like the easy part. What amount of data are you talking about? Most importantly: does it fit into memory or not?
Amount of data is not a problem.
I don't know how Firebird handles this, but you should try if
a) it has a direct way to import XML data in some way
But here I lose the in-between step of data manipulation.
b) you can get away with generating "INSERT OR UPDATE" statements.
Hm, that sounds like I should avoid my first naive approach.
However, if the diff is a real requirement, you may still have to find a way to compare the data manually. What I did in a project once was to dump the database content in SQL format, then line diff that with a dump I had provided myself, after running both through Unix sort.
So basically you compared the values outside of the database? I wonder why that was faster than using the database engine itself to search, compare, etc. Aren't databases supposed to do this in an optimal way? Regards, P.
Piotr Oh, 25.01.2012 15:25:
That sounds like the easy part. What amount of data are you talking about? Most importantly: does it fit into memory or not?
Amount of data is not a problem.
I don't know how Firebird handles this, but you should try if
a) it has a direct way to import XML data in some way
But here I lose the in-between step of data manipulation.
b) you can get away with generating "INSERT OR UPDATE" statements.
Hm, that sounds like I should avoid my first naive approach.
Yes, databases can usually process dumps quite quickly and "INSERT OR UPDATE" is a good way to take advantage of that.
However, if the diff is a real requirement, you may still have to find a way to compare the data manually. What I did in a project once was to dump the database content in SQL format, then line diff that with a dump I had provided myself, after running both through Unix sort.
So basically you compared the values outside of the database? I wonder why that was faster than using the database engine itself to search, compare, etc. Aren't databases supposed to do this in an optimal way?
I would expect a line diff to be much faster because you avoid having to query the database for each entry. There's a mostly constant little overhead for each database query which can quickly add up. However, it's also likely harder to process, because if you want more than a diff, you have to parse the data back from a SQL dump diff text file.

Anyway, you mentioned that performance isn't crucial. If your processing only requires a couple of queries per entry, you may not notice the difference, and it may turn out to be simpler to run the queries than to trigger a database dump and process that.

Regarding database connectivity in Python, you should post your questions on comp.lang.python rather than here. Many people use OR-mapping tools for database interaction these days, e.g. SQLAlchemy or Django's OR mapper. Depending on the complexity of your queries, you may want to take a look at those, especially since you mentioned that code readability is of high value to you. Stefan
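For completeness, a tiny sketch of the sorted-dump line diff mentioned above (assuming the two dump files already exist):

    import difflib
    import sys

    # Sort both dumps so that matching records line up.
    with open("db_dump.sql") as f:
        db_lines = sorted(f)
    with open("xml_dump.sql") as f:
        xml_lines = sorted(f)

    # Lines appearing in only one of the two dumps mark the differing records.
    for line in difflib.unified_diff(db_lines, xml_lines,
                                     fromfile="database", tofile="xml"):
        sys.stdout.write(line)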