
Piotr Oh, 24.01.2012 14:49:
I'm not a programmer, so please be patient. I just need some scripting done, and my choice is Python + lxml.
Excellent choice. :)
The problem to solve is:
1. An XML file with some data is exported from our ERP system using third-party tools. It has an XML schema.
2. It is intended to be imported into another system (SQL/Firebird).
3. Between the export and the import I'd like to validate the XML (using the XML schema).
That sounds like the easy part. What amount of data are you talking about? Most importantly: does it fit into memory or not? lxml's parser can validate during parsing, also for iterparse(), in case it won't fit.
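Something along these lines should work for the validation part (untested; the file names and the "record" tag are just placeholders for whatever your export actually contains):

    from lxml import etree

    # Load the schema once (file names are made up for this example).
    schema = etree.XMLSchema(etree.parse("export.xsd"))

    # Whole-tree parse with validation, if the data fits into memory:
    parser = etree.XMLParser(schema=schema)
    tree = etree.parse("export.xml", parser)

    # Incremental parse with validation, if it doesn't:
    for event, element in etree.iterparse("export.xml", schema=schema,
                                          tag="record"):
        # ... handle one record ...
        element.clear()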
4. Then I need to import it into the other system:
4a. check the values of the corresponding data in the XML and the SQL database, compare them, do some action, write to a log, etc.
4b. put them into the SQL database (update existing records, insert new ones)
I don't know how Firebird handles this, but you should check whether a) it has a direct way to import XML data in some form, or b) you can get away with generating "INSERT OR UPDATE" statements.

In general, a large number of database roundtrips will make your program much slower than a direct import of a database dump, sometimes by orders of magnitude. Generating an SQL dump file and letting the DB load that directly is bound to be much faster.

However, if the diff is a real requirement, you may still have to find a way to compare the data manually. What I did in a project once was to dump the database content in SQL format, then line-diff that against a dump I had provided myself, after running both through Unix sort. So, one approach would be to write a script that converts the XML data into SQL statements that match those that your DB dumps itself, and then either import them directly or dump the current DB content next to them and run a diff.
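As a rough, untested sketch of the "generate an SQL dump" approach (the element and column names here are invented, and you should double-check Firebird's exact upsert syntax for your server version):

    from lxml import etree

    def xml_to_sql(xml_path, sql_path):
        with open(sql_path, "w", encoding="utf-8") as out:
            for event, record in etree.iterparse(xml_path, tag="record"):
                values = (
                    record.findtext("id"),
                    record.findtext("name").replace("'", "''"),  # naive quoting
                    record.findtext("price"),
                )
                out.write(
                    "UPDATE OR INSERT INTO items (id, name, price) "
                    "VALUES (%s, '%s', %s) MATCHING (id);\n" % values
                )
                record.clear()  # keep memory usage flat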
Validating is simple (point 3). Then I need to traverse the XML, record by record, do something with each and translate it into an SQL query.
The most obvious approach to that is iterparse().
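For completeness, the usual iterparse() memory-cleanup idiom looks like this (the "record" tag is again just a placeholder):

    from lxml import etree

    for event, elem in etree.iterparse("export.xml", tag="record"):
        # ... do something with the record, build the SQL statement, etc. ...
        elem.clear()                  # drop the element's own content
        while elem.getprevious() is not None:
            del elem.getparent()[0]   # drop already-processed siblings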
From this point of view, IMHO, what is the right way: use lxml.objectify or etree?
Sadly, objectify still doesn't support iterparse() directly, but it should be possible to install objectify's element class lookup scheme as the default lookup scheme and then run iterparse().

http://lxml.de/objectify.html#advanced-element-class-lookup
http://lxml.de/api/lxml.etree-module.html#set_element_class_lookup
http://lxml.de/element_classes.html#setting-up-a-class-lookup-scheme
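An untested sketch of that idea (note that this changes the module-wide default lookup, and the "record"/"id"/"name" names are made up):

    from lxml import etree, objectify

    # Make objectify's element classes the default for all parsing,
    # so that iterparse() hands out objectify elements as well.
    etree.set_element_class_lookup(objectify.ObjectifyElementClassLookup())

    for event, record in etree.iterparse("export.xml", tag="record"):
        # Attribute-style child access with objectify's value conversion.
        print(record.id, record.name)
        record.clear()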
I don't care about efficiency. Instead, it should be as simple as possible (to modify, read, etc.).
Both etree and objectify can be quite readable. If you go the "generate SQL dump" road, etree may be simpler, because objectify's automatic data conversion may get in the way when you are only handling strings. If you take the database-roundtrips road instead, your code may turn out to be more concise with objectify. Apart from that, choose what you like better.

Stefan
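P.S. To illustrate the difference with a made-up record:

    from lxml import etree, objectify

    xml = b"<record><id>42</id><price>9.90</price></record>"

    # etree: everything is a string, which is handy when you only
    # assemble SQL text anyway.
    rec = etree.fromstring(xml)
    print(rec.findtext("id"), rec.findtext("price"))   # '42' '9.90'

    # objectify: values come back as Python types, which is handy when
    # you compare them against data fetched through a DB driver.
    rec = objectify.fromstring(xml)
    print(rec.id + 1, rec.price * 2)                   # 43 19.8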