Some newbie questions on lxml
I'd search the archives for answers to these questions (and to find out whether the broken links have already been reported), but the links to the archive site appear to be broken on both the http://lxml.de/3.4/ and https://mailman-mail5.webfaction.com/listinfo/lxml web pages, and a Google search didn't do much better.

That said, I'm new to both Python and lxml, and mostly new to XML/SVG too. I've been fighting with the Perl bindings to the libxml2 C library without much success, so I decided to see what Python had to offer. I'm trying to make changes to the SVG file output from Inkscape, so I need to read the SVG, make changes (which I have figured out how to do in Perl) and then write the file back out. That all works in Perl, except that pretty printing is broken. In Python, I can somewhat make pretty printing work, but I am struggling with the output file.

Using this SVG file as input (partial here; I fed the complete file to the script):

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <svg xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:cc="http://creativecommons.org/ns#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:svg="http://www.w3.org/2000/svg" xmlns="http://www.w3.org/2000/svg" id="svg" enable-background="new 0 0 22.484 43.997" xml:space="preserve" height="0.80000007in" viewBox="0 0 71.999791 57.600007" width="1in" version="1.1" docname="Ammeter_4wire_led_1_schematic.svg" y="0" x="0" gorn="0"><metadata id="metadata3411"><rdf:RDF><cc:Work rdf:about=""><dc:format>image/svg+xml</dc:format><dc:type rdf:resource="http://purl.org/dc/dcmitype/StillImage" /><dc:title></dc:title></cc:Work></rdf:RDF>
    ...

and feeding it into this Python script (under Cygwin on Windows; I removed the userland pretty-printing function that actually pretty prints the bulk of the SVG code after the namespace stuff):

    #!/usr/bin/python

    # A python xml pretty_print script (which works on svg files unlike the
    # builtin version!).

    # Import os and sys to get file rename and the argv stuff

    import os,sys

    # and the lxml library for the xml

    from lxml import etree

    # Start of the main script

    if len(sys.argv) < 2:
        # No input file so print a usage message and exit.
        s = str(sys.argv[0]) + ': Usage: ' + str(sys.argv[0]) + ' xml_file'
        print s
        sys.exit()

    # Parse the xml document

    try:
        file = sys.argv[1]
        parser = etree.XMLParser(remove_blank_text=True)
        doc = etree.parse(file)
    except IOError:
        print 'Error: can\'t find or read file ' + file
        sys.exit()
    else:
        doc.write(sys.stdout, pretty_print=True)
    #   doc.write(sys.stdout, xml_declaration=True, encoding="UTF-8", standalone=False, pretty_print=True)
        sys.exit()

produces this output:

    <svg xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:cc="http://creativecommons.org/ns#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:svg="http://www.w3.org/2000/svg" xmlns="http://www.w3.org/2000/svg" id="svg" enable-background="new 0 0 22.484 43.997" xml:space="preserve" height="0.80000007in" viewBox="0 0 71.999791 57.600007" width="1in" version="1.1" docname="Ammeter_4wire_led_1_schematic.svg" y="0" x="0" gorn="0">
      <metadata id="metadata3411">
        <rdf:RDF>
          <cc:Work rdf:about="">
            <dc:format>image/svg+xml</dc:format>
            <dc:type rdf:resource="http://purl.org/dc/dcmitype/StillImage"/>
            <dc:title/>
          </cc:Work>
        </rdf:RDF>

There are two things wrong here. First, the <?xml version="1.0" encoding="UTF-8" standalone="no"?> line is missing in the output. I have figured out that it can be added via the commented-out output statement in the Python code as long as I know (or assume I know) what the encoding was, but I would prefer to copy the values from the input line through (which is what the Perl bindings do). Is there a way to do that that I'm not finding?

Second, the pretty printing has failed on the namespace XML at the top of the file (possibly because I haven't done something necessary to register the namespaces?).
For now I just want to read in the SVG file and write it out without making any changes, so I know that at least that works before I start changing things, but I am not having much luck at it. Google finds almost no examples of code that does this. (I found the output parameters in the svgutils package on GitHub, which uses lxml, but its author is also assuming he knows what the encoding is and doesn't copy the input settings.) Can anyone suggest a solution, or point me at documentation or preferably examples of how it is done (or at a more suitable mailing list if this isn't it)?

Peter Van Epp
Hi Peter,
> [...] the <?xml version="1.0" encoding="UTF-8" standalone="no"?> line is missing in the output. I have figured out that it can be added via the commented out output statement in the python code as long as I know (or assume I know) what the encoding was, but I would prefer to copy the values from the input line through (which is what the perl bindings do). Is there a way to do that that I'm not finding?
Once parsed, the XML data is basically Unicode; it is encoded only during the write, using the encoding requested. Because of that, the decoding during the read is independent of the encoding on write. However, you can get all the information in the XML declaration from the parsed document:
    >>> from lxml import etree
    >>> x = etree.parse('/tmp/test.xml')
    >>> x.docinfo.encoding
    'utf-8'
    >>> x.docinfo.standalone
    False
    >>> x.docinfo.xml_version
    '1.0'
    >>> x.write('/tmp/2.xml', encoding='UTF-16')
    >>> y = etree.parse('/tmp/2.xml')
    >>> y.docinfo.encoding
    'UTF-16'
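Putting the two together, here is a minimal sketch of how the declaration values read from the input could be passed straight to the write call. The function name `copy_with_declaration` and the sample document are made up for illustration; `docinfo.encoding`, `docinfo.standalone` and the `xml_declaration`, `encoding`, `standalone` and `pretty_print` arguments of `write()` are existing lxml features. It assumes the input actually carries a declaration; if it does not, `docinfo` reports defaults, so the output may gain a declaration the input never had.

```python
import io
from lxml import etree


def copy_with_declaration(source, target):
    """Parse `source` and write it to `target`, reusing the input's declaration values."""
    # remove_blank_text only has an effect if the parser is passed to parse()
    parser = etree.XMLParser(remove_blank_text=True)
    doc = etree.parse(source, parser)
    info = doc.docinfo
    doc.write(
        target,
        xml_declaration=True,
        encoding=info.encoding,      # encoding as reported for the input
        standalone=info.standalone,  # True, False, or None if unspecified
        pretty_print=True,
    )
    return info


# Hypothetical sample input, standing in for a real SVG file
sample = b'<?xml version="1.0" encoding="UTF-8" standalone="no"?><a><b>text</b></a>'
out = io.BytesIO()
copy_with_declaration(io.BytesIO(sample), out)
print(out.getvalue().decode("utf-8"))
```

File names work in place of the `BytesIO` objects, since `etree.parse()` and `write()` accept both paths and file-like objects.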
> Then the pretty printing has failed on the name space xml at the top of the file (possibly because I haven't done something necessary to register the name spaces?)
The exact arrangement of whitespace between attributes is not saved, as far as I know. Because of that, pretty printing does not arrange attributes in any special way; it just indents the elements.

jens
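A small illustration of the point about attributes, using a made-up three-element document rather than the real SVG file: the attributes are spread over several lines in the input, yet the serialized start tag comes out on a single line, and only the child elements are indented.

```python
from lxml import etree

# Hypothetical input with one attribute per line
src = b'<svg\n   id="svg"\n   width="1in"><g><rect/></g></svg>'

root = etree.fromstring(src)
out = etree.tostring(root, pretty_print=True).decode("ascii")

# The line breaks between attributes are not kept by the parser
first_line = out.splitlines()[0]
print(first_line)
print(out)
```

Getting one attribute per line would therefore need custom post-processing of the serialized text; nothing in this thread suggests a built-in option for it.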
Participants (2): Jens Quade, Peter Van Epp