[lxml-dev] ElementTree pretty printing (serialisation)

Hello again,

My constant spamming of this list has finally paid off and I have something to show for all my questions. I have attached the source of a fairly advanced pretty printing serialiser for ElementTree (and ResultTree) objects.

The script (after `chmod 755`) can be called from the command line like so:

    ./prettyprint.py [DOCUMENT]

The script can also be imported and used like so:

    import sys
    from lxml import etree
    import prettyprint

    document = etree.parse(...)
    serialiser = prettyprint.ElementTreeSerialiser()
    serialiser.write(document, sys.stdout)

I am new to Python, and even newer to lxml and the ElementTree API. As a consequence I may be missing some obvious optimisations. This is only my first stab at this problem and I would love any feedback you care to offer.

Is pretty printing something that is on the lxml time line - and if not, would my method demonstrated here interest you from an implementation point of view?

Thanks so much for your time.
Noah
-- "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. Stallman

Hi Noah, Noah Slater wrote:
My constant spamming of this list has finally paid off and I have something to show for all my questions.
I have attached the source of a fairly advanced pretty printing serialiser for ElementTree (and ResultTree) objects.
Thanks for sharing this.
The script (after `chmod 755`) can be called from the command line like so: ./prettyprint.py [DOCUMENT]
Admittedly, that is shorter than this:

    python -c 'from lxml.etree import parse; \
        parse("myfile.xml").write("mynewfile.xml", pretty_print=True)'

Is pretty printing something that is on the lxml time line
It's in 1.0.beta. http://codespeak.net/svn/lxml/trunk/CHANGES.txt
- and if not, would my method demonstrated here interest you from an implementation point of view?
Ok, I looked through it and the only difference I could see compared to the pretty_print keyword is that you also wrap the data (.text) unless prevented by the list in 'preformated_elements'. You also use a hook for treating the serialised byte stream, although I think it's a bad idea to do this for splitting elements between attributes. So, my impression is that you are duplicating the pretty print code that we already have in lxml.
I really think you should decide which way you go: serialising 'by hand' or treating the XML byte stream. If you want to work on the byte stream, you might consider using the pretty printer of libxml2 and then check each line if it is already short enough before treating it. Splitting long lines at whitespace is a pretty simple thing to do.
If you want to serialize by walking the tree, then you should do that completely at the element level, preferably with code that also works for the original ElementTree.
So, I think there are a lot of possible simplifications for your code.
Stefan

Hi,
Is pretty printing something that is on the lxml time line
It's in 1.0.beta.
Oh, I feel a little silly now - I had no idea it was already included. I took this opportunity to easy_install the 1.0beta version.
Ok, I looked through it and the only difference I could see compared to the pretty_print keyword is that you also wrap the data (.text) unless prevented by the list in 'preformated_elements'.
Not true - I have examined the lxml pretty print output and there is quite some difference. In fact, with the exception of a few simplistic documents with no textual element content I cannot see what effect pretty_print has.
You also use a hook for treating the serialised byte stream, although I think it's a bad idea to do this for splitting elements between attributes.
Why do you think this is a bad idea? I was a little hesitant about doing it in the first place because technically it is no longer XML processing, but string processing. As safe as I think my code is, it does feel like there ought to be a better way of doing it. Or do you think it is a bad idea because you don't think element tags should be wrapped on a per-attribute basis?
So, my impression is that you are duplicating the pretty print code that we already have in lxml. I really think you should decide which way you go: serialising 'by hand' or treating the XML byte stream.
Like I stated above - I still cannot see what pretty_print is actually doing. I do know that it is unsuitable for my purposes because of its inability to word wrap and indent element tags (which is the definition of pretty printing IMHO). In addition, I am curious why you think the combination of ElementTree navigation and byte stream manipulation is a bad one. I would love to wrap element tags on a per-attribute basis in a programmatic way - but string processing seemed like my only option.
If you want to work on the byte stream, you might consider using the pretty printer of libxml2 and then check each line if it is already short enough before treating it. Splitting long lines at whitespace is a pretty simple thing to do.
You lost me here... I tried using help() and google but got lost trying to find a reference to the libxml2 pretty printer you speak of.
If you want to serialize by walking the tree, then you should do that completely at the element level, preferably with code that also works for the original ElementTree.
Sorry, could you clarify your point here? I am easily confused. :)
Thanks for the feedback Stefan!
Regards, Noah

Hi Noah, Noah Slater wrote:
I have examined the lxml pretty print output and there is quite some difference. In fact, with the exception of a few simplistic documents with no textual element content I cannot see what effect pretty_print has.
Interesting. Are you suggesting that this feature is actually not working for you? As far as I understand, you say that you get the normal one-line output when there is textual content in the XML? Because I cannot reproduce that:

    >>> from lxml.etree import XML, tostring
    >>> print tostring(XML("<a><b>test test test</b></a>"), pretty_print=True)
    <a>
      <b>test test test</b>
    </a>

This is absolutely the expected result. May I ask what version of libxml2 you are using?
You also use a hook for treating the serialised byte stream, although I think it's a bad idea to do this for splitting elements between attributes.
Why do you think this is a bad idea?
A better place to do this would be a filter in the form of a file-like object, as this is much more generic and memory efficient. Something like

    class FileFilter(object):
        def __init__(self, out_file):
            self.out_file = out_file

        def write(self, data):
            new_data = data  # treat data here
            self.out_file.write(new_data)

This is a totally generic approach that does not rely on lxml and works with any XML stream (as long as you take care of encodings), even when copied directly from a file or things like that.
But I still think that it would be better to do such adaptations based on walking the XML tree rather than the byte stream. One reason is that you are duplicating considerations about encodings and parsing that you wouldn't have in the XML infoset. You can always write your own serialiser based on element.getiterator(). Also, feel free to take a look at how ElementTree does it.
I was a little hesitant about doing it in the first place because technically it is no longer XML processing, but string processing. As safe as I think my code is, it does feel like there ought to be a better way of doing it.
How do you deal with, say, a UTF-16 encoded XML byte stream?
Like I stated above - I still cannot see what pretty_print is actually doing. I do know that it is unsuitable for my purposes because of its inability to word wrap and indent element tags (which is the definition of pretty printing IMHO).
Well, it does indent element tags on my side, which (IMHO) is the definition of XML pretty printing. How is it supposed to know that adding whitespace to the data between tags does not break anything?
If you want to work on the byte stream, you might consider using the pretty printer of libxml2 and then check each line if it is already short enough before treating it. Splitting long lines at whitespace is a pretty simple thing to do.
You lost me here... I tried using help() and google but got lost trying to find a reference to the libxml2 pretty printer you speak of.
I meant the pretty_print option. Set it to true and use the above file filter approach by looking for "\n". But make sure the serialized result is in UTF-8 or unicode or another byte format that's readily usable by Python in a portable way. Stefan
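The filter approach described above can be taken one step further. What follows is only a minimal sketch, independent of lxml, and it assumes the serialiser hands the filter decoded text (str) rather than bytes. The names LineFilter, treat_line and wrap_long_line are hypothetical and are not part of lxml or of the attached script.

```python
import io
import textwrap


class LineFilter:
    """File-like wrapper that hands complete lines to a callback before writing."""

    def __init__(self, out_file, treat_line):
        self.out_file = out_file
        self.treat_line = treat_line  # hypothetical per-line hook
        self._pending = ""            # text seen after the last newline

    def write(self, data):
        # A serialiser may call write() with arbitrary chunks, so buffer
        # until at least one full line is available.
        self._pending += data
        *lines, self._pending = self._pending.split("\n")
        for line in lines:
            self.out_file.write(self.treat_line(line) + "\n")

    def close(self):
        # Flush a final line that had no trailing newline.
        if self._pending:
            self.out_file.write(self.treat_line(self._pending))
            self._pending = ""


def wrap_long_line(line, width=40):
    """Leave short lines alone; wrap long ones at whitespace, keeping the indent."""
    if len(line) <= width:
        return line
    indent = line[: len(line) - len(line.lstrip())]
    return "\n".join(
        textwrap.wrap(line, width=width, subsequent_indent=indent + "  ",
                      break_long_words=False, break_on_hyphens=False)
    )


out = io.StringIO()
filt = LineFilter(out, wrap_long_line)
# The chunks are deliberately split mid-line to exercise the buffering.
for chunk in ["<a>\n  <b>one two three four ",
              "five six seven eight nine ten</b>\n",
              "</a>\n"]:
    filt.write(chunk)
filt.close()
result = out.getvalue()
print(result)
```

Note that wrapping at whitespace like this changes the whitespace inside text content and attribute values, so it is only safe for documents where that whitespace is known not to matter.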

Hello,
Interesting. Are you suggesting that this feature is actually not working for you?
Kind of - I can get it to indent simple documents. I have attached a sample file - if you run this through etree.tostring with pretty printing enabled it doesn't alter the output in any way (with the exception of converting some chars to entities, obviously).
This is absolutely the expected result. May I ask what version of libxml2 you are using?
2.6.24.dfsg-1
But I still think that it would be better to do such adaptations based on walking the XML tree rather than the byte stream.
I am as much as I can - but it is not possible to wrap element attributes programmatically - thus I have to rely on byte stream post-processing.
duplicating considerations about encodings and parsing that you wouldn't have in the XML infoset.
How do you deal with, say, a UTF-16 encoded XML byte stream?
The code I submitted seems to handle UTF-16 just fine, try throwing a UTF-16 document at it. Why would you think this was an issue? I18n issues are really very important to me - so I really need to understand this one... still trying to get my head around character encodings in general. I use a few regexes that search for '<' '>' '"' and '=' characters. Is this not safe across all encoded byte streams? These characters seem to match just fine using ASCII, LATIN-1, UTF-8 and UTF-16. If that is not so - would it make sense to decode the byte stream to a Unicode object to perform string operations before encoding back to the original charset requested?
Well, it does indent element tags on my side, which (IMHO) is the definition of XML pretty printing. How is it supposed to know that adding whitespace to the data between tags does not break anything?
Okay... a few points on this one. I am still unable to figure out the rules it is using. As I previously mentioned - it does not seem to alter the document I have attached with this email. Secondly, your last point confuses me a little. The very act of indenting tags requires the addition of XML text nodes to the document tree - so by virtue of the fact you have implemented a pretty printer you are already adding white space to the document. Or am I missing something? All white space is significant in XML - so once you have decided to alter the document, the actual wrapping and indentation styles you choose are by the by. I suppose in this way, the only real difference between my pretty printer and the one built into lxml is the ability to control which element types are altered. Please feel free to knock me back into line... I'm probably missing something embarrassingly obvious.
Thanks! Noah :)

Hi Noah, Noah Slater wrote:
Interesting. Are you suggesting that this feature is actually not working for you?
Kind of - I can get it to indent simple documents. I have attached a sample file - if you run this through etree.tostring with pretty printing enabled it doesn't alter the output in any way (with the exception of converting some chars to entities, obviously).
I tried the file and I guess the problem is the whitespace it contains. I guess libxml2 will simply refuse to alter your data if it can't distinguish between relevant and ignorable whitespace. It will only add new whitespace text nodes. Note also that you likely parsed it without DTD. That prevents libxml2 from knowing where whitespace matters and where it doesn't.
But I still think that it would be better to do such adaptations based on walking the XML tree rather than the byte stream.
I am as much as I can - but it is not possible to wrap element attributes programmatically - thus I have to rely on byte stream post-processing.
If you implement a custom tree-walking serialiser, you have to write one attribute after the other anyway. So just check if the next one fits into the line and otherwise add a newline+indent first. How is that impossible?
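The "check whether the next attribute still fits" rule can be sketched as follows. This is a rough illustration using the standard library's ElementTree rather than lxml, and the function names start_tag_lines and serialise are made up for this example. It assumes element-only or text-only content where whitespace is insignificant, so it strips text and ignores mixed content, tails, namespaces, comments and processing instructions.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def start_tag_lines(elem, indent="", width=60):
    """Build the start tag of elem, moving an attribute to a new line
    whenever it would push the current line past width."""
    lines = [indent + "<" + elem.tag]
    # Continuation lines are aligned under the first attribute.
    hang = indent + " " * (len(elem.tag) + 2)
    for name, value in elem.attrib.items():
        piece = f"{name}={quoteattr(value)}"
        if len(lines[-1]) + 1 + len(piece) > width:
            lines.append(hang + piece)   # newline + indent first
        else:
            lines[-1] += " " + piece     # still fits on the current line
    return lines


def serialise(elem, level=0, width=60):
    """Return elem as a list of output lines (see the assumptions above)."""
    indent = "  " * level
    lines = start_tag_lines(elem, indent, width)
    children = list(elem)
    text = (elem.text or "").strip()
    if not children and not text:
        lines[-1] += "/>"
    elif not children:
        lines[-1] += ">" + escape(text) + "</" + elem.tag + ">"
    else:
        lines[-1] += ">"
        for child in children:
            lines.extend(serialise(child, level + 1, width))
        lines.append(indent + "</" + elem.tag + ">")
    return lines


root = ET.fromstring(
    '<chapter id="intro" status="draft" revision="2006-05-29" '
    'role="appendix"><title lang="en">Pretty &amp; printed</title>'
    '<para/></chapter>'
)
output = "\n".join(serialise(root, width=40))
print(output)
```

Because the attributes are written one at a time from the tree, no post-processing of the serialised bytes is needed, and the encoding question only arises when the final string is written out.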
The code I submitted seems to handle UTF-16 just fine, try throwing a UTF-16 document at it.
Why would you think this was an issue? I18n issues are really very important to me - so I really need to understand this one... still trying to get my head around character encodings in general.
I use a few regexes that search for '<' '>' '"' and '=' characters.
Is this not safe across all encoded byte streams? These characters seem to match just fine using ASCII, LATIN-1, UTF-8 and UTF-16.
Not every encoding makes the ASCII characters '<' etc. readily visible at a byte level. UTF-8 is perfect here, the ASCII-derived ISO-8859 charsets are also nice, but I guess EBCDIC is pretty much resistant and some Asian encodings will likely be, too.
If that is so - would it make sense to decode the byte stream to a Unicode object to perform string operations before encoding back to the original charset requested?
Use UTF-8, that's fast and perfectly suited for that purpose.
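The byte-level point can be checked with the codecs that ship with Python. This is a small sketch with a made-up one-element document; the helper name treat is hypothetical, and cp500 is used here as one example of an EBCDIC code page.

```python
doc = '<a href="x"/>'

# ASCII-compatible encodings keep '<' and 'a' as the bytes 0x3C 0x61.
utf8 = doc.encode("utf-8")
latin1 = doc.encode("iso-8859-1")

# UTF-16 pads every ASCII character with a zero byte, so a byte pattern
# longer than one character no longer matches.
utf16 = doc.encode("utf-16-le")

# EBCDIC uses entirely different byte values for the same characters.
ebcdic = doc.encode("cp500")

found = {
    "utf-8": b"<a" in utf8,
    "iso-8859-1": b"<a" in latin1,
    "utf-16-le": b"<a" in utf16,
    "cp500": b"<" in ebcdic,
}
print(found)


def treat(raw, encoding, func):
    """Decode to text, apply func to the str, then re-encode."""
    return func(raw.decode(encoding)).encode(encoding)


# A text-level operation works the same way whatever the byte encoding is.
spaced = treat(ebcdic, "cp500", lambda s: s.replace(" href", "\n   href"))
print(spaced.decode("cp500"))
```

Working on decoded text (or on UTF-8 only, as suggested above) keeps the string handling independent of how the document happens to be encoded on disk.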
Secondly, your last point confuses me a little. The very act of indenting tags requires the addition of XML text nodes to the document tree - so by virtue of the fact you have implemented a pretty printer you are already adding white space to the document. Or am I missing something? All white space is significant in XML - so once you have decided to alter the document, the actual wrapping and indentation styles you choose are by the by.
I guess the rule is: If you don't know what the document is supposed to look like, 'adding whitespace nodes' is (most likely) less harmful than 'changing data between tags'.
I suppose in this way, the only real difference between my pretty printer and the one built into lxml is the ability to control which element types are altered.
libxml2 knows about the docbook namespace, too. Just try:

    from lxml.etree import parse, XMLParser

    tree = parse("document.utf-8.xml", XMLParser(load_dtd=True))
    tree.write("testout.xml", encoding="UTF-8", pretty_print=True)

That will nicely pretty print the document you sent me. Stefan

Hi Stefan, Thank you for a great reply - I appreciate the effort you're going to.
I tried the file and I guess the problem is the whitespace it contains. I guess libxml2 will simply refuse to alter your data if it can't distinguish between relevant and ignorable whitespace. It will only add new whitespace text nodes.
Aha, I see what you mean. This is a limitation with libxml2 IMO - quite different from xml.dom.minidom.toprettyxml and xml.dom.ext.PrettyPrint
Note also that you likely parsed it without DTD. That prevents libxml2 from knowing where whitespace matters and where it doesn't.
Wow, old skool - I am using Relax NG with my documents, aren't DTDs deprecated? :)
If you implement a custom tree-walking serialiser, you have to write one attribute after the other anyway. So just check if the next one fits into the line and otherwise add a newline+indent first. How is that impossible?
I can be simple sometimes - this angle never occurred to me! Thanks so much for pointing it out - your suggestion makes perfect sense. The idea of writing my own serialiser does daunt me a little - I was hoping to piggy back on someone else's efforts (at least that way I would be more certain the code is conformant... heh). Do you have any ideas on where to start on this? Is there some baseclass I could extend? Is this possible with SAX (which I know nothing about) - so many questions.
Not every encoding makes the ASCII characters '<' etc. readily visible at a byte level. UTF-8 is perfect here, the ASCII-derived ISO-8859 charsets are also nice, but I guess EBCDIC is pretty much resistant and some Asian encoding will likely be, too.
Please pardon my naivety - you are of course 100% correct on this issue.
I guess the rule is: If you don't know what the document is supposed to look like, 'adding whitespace nodes' is (most likely) less harmful than 'changing data between tags'.
Hmm... not sure I agree on that, but overall I guess I take your point. Kind of irrelevant in my case however as I am writing a serialiser for myself and hence know exactly what is and isn't significant whitespace. Heh.
libxml2 knows about the docbook namespace, too. Just try:
As I mentioned before - does this work with Relax NG? That would be amazing! Once again, thank you for your continued help and for such a great package.
Noah :)

Hi Noah, Noah Slater wrote:
I tried the file and I guess the problem is the whitespace it contains. I guess libxml2 will simply refuse to alter your data if it can't distinguish between relevant and ignorable whitespace. It will only add new whitespace text nodes.
Aha, I see what you mean. This is a limitation with libxml2 IMO -
No, absolutely not. If your library modifies your /data/ when you tell it to do non-intrusive indenting, that's just wrong. It doesn't break HTML (which is rendered for people anyway), but it breaks more or less everything else.
Note also that you likely parsed it without DTD. That prevents libxml2 from knowing where whitespace matters and where it doesn't.
Wow, old skool - I am using Relax NG with my documents, aren't DTDs deprecated? :)
Don't think so, but who cares? This is not about validation, only about access to structural information - all that must be known is which tags will never contain (textual) data so that whitespace can be added to these without breaking the 'real' data.
If you implement a custom tree-walking serialiser, you have to write one attribute after the other anyway. So just check if the next one fits into the line and otherwise add a newline+indent first. How is that impossible?
I can be simple sometimes - this angle never occurred to me! Thanks so much for pointing it out - your suggestion makes perfect sense. The idea of writing my own serialiser does daunt me a little - I was hoping to piggy back on someone else's efforts (at least that way I would be more certain the code is conformant... heh).
Do you have any ideas on where to start on this? Is there some baseclass I could extend? Is this possible with SAX (which I know nothing about) - so many questions.
SAX is one way of doing it. It mimics a parser and is therefore pretty well suited for serialization, but tends to require loads of code. If you prefer the ElementTree API, look at Element.getiterator(), which is even extremely fast in lxml (there should be code examples on the web). Note also that the original ElementTree library has a Python implemented serializer. You should look at it before writing your own one.
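As a starting point for the tree-walking route, here is a small sketch with the standard library's ElementTree. In current ElementTree the getiterator() method has been superseded by iter() (it was removed from the standard library in Python 3.9), so the sketch uses iter(). The generator name walk and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<a><b><c/></b><d/></a>")


def walk(elem, depth=0):
    """Yield (depth, element) pairs in document order."""
    yield depth, elem
    for child in elem:
        yield from walk(child, depth + 1)


# iter() visits every element in document order but does not report depth,
# so a serialiser that indents needs a recursive walk like the one above.
flat = [elem.tag for elem in root.iter()]
outline = ["  " * depth + elem.tag for depth, elem in walk(root)]
print("\n".join(outline))
```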
libxml2 knows about the docbook namespace, too. Just try:
As I mentioned before - does this work with Relax NG? That would be amazing!
Technically, it could, but there is no reason why libxml2 should support it. DTDs are available for virtually any well-known document type, including the docbook type you are using. Setting the "load_dtd" keyword on the parser should be relatively cheap (note that there is a separate "dtd_validation" keyword for validation). It relies on libxml2's catalog feature, though, and on the relevant DTDs being installed on the system. You may want to read about DTDs, catalogs and XML Infosets for further information about the topics involved. Stefan

Hi Noah, Noah Slater wrote:
I have examined the lxml pretty print output and there is quite some difference. In fact, with the exception of a few simplistic documents with no textual element content I cannot see what effect pretty_print has.
I had to take another look at this while I was rewriting a part of the API for thread clean-ness. The easiest way to get your parsed document formatted is this:

    parser = etree.XMLParser(ignore_blanks=True)
    tree = etree.parse(file, parser)
    tree.write(newfile, pretty_print=True)

The "ignore_blanks" option is new in the trunk and removes blank text nodes from the parsed tree. This allows libxml2 to add new white space for indentation without conflicting with left-overs from the original document. No DTD parsing needed. Stefan
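The effect of dropping blank text nodes can be seen on a small in-memory document. This is a sketch that assumes a current lxml release, where the parser option doing this is spelled remove_blank_text; the ignore_blanks name above is the trunk spelling from the message, and the sample document is made up.

```python
from lxml import etree

source = b"<a>\n<b>test test test</b>\n<c><d/></c>\n</a>"

# Default parser: the newlines between the tags stay in the tree as text
# nodes, so the serialiser leaves the content of <a> alone.
kept = etree.tostring(etree.fromstring(source), pretty_print=True)

# remove_blank_text drops whitespace-only text nodes while parsing, which
# frees the serialiser to add its own indentation.
parser = etree.XMLParser(remove_blank_text=True)
cleaned = etree.tostring(etree.fromstring(source, parser), pretty_print=True)

print(kept.decode("ascii"))
print(cleaned.decode("ascii"))
```

Text that contains anything other than whitespace, such as the content of the b element, is left untouched in both cases.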

Hi Stefan, Thanks for that! :)
As it happens, I have now moved away from lxml for my pretty printing. Instead, especially given my desire to output well formed HTML/XHTML, I have written an ElementTree wrapper around uTidylib which takes care of pretty much anything I could possibly imagine.
Thanks again, Noah
On 29/05/06, Stefan Behnel <behnel_ml@gkec.informatik.tu-darmstadt.de> wrote:
Hi Noah,
Noah Slater wrote:
I have examined the lxml pretty print output and there is quite some difference. In fact, with the exception of a few simplistic documents with no textual element content I cannot see what effect pretty_print has.
I had to take another look at this while I was rewriting a part of the API for thread clean-ness. The easiest way to get your parsed document formatted is this:

    parser = etree.XMLParser(ignore_blanks=True)
    tree = etree.parse(file, parser)
    tree.write(newfile, pretty_print=True)

The "ignore_blanks" option is new in the trunk and removes blank text nodes from the parsed tree. This allows libxml2 to add new white space for indentation without conflicting with left-overs from the original document. No DTD parsing needed.
Stefan

Hi Noah, Noah Slater wrote:
As it happens, I have now moved away from lxml for my pretty printing. Instead, especially given my desire to output well formed HTML/XHTML, I have written an ElementTree wrapper around uTidylib which takes care of pretty much anything I could possibly imagine.
Hmmm, you know about the HTMLParser in lxml, don't you? Stefan
participants (2)
- Noah Slater
- Stefan Behnel