[lxml-dev] PATCH for formatting XML output
Hi, I'm a bit new to the project, but had a need to nicely format the output from lxml. Also, I have a need to leave the <?xml version="1.0"?> header at the top. So, I made a few changes to the write procedure. There are now two more arguments that can be passed:
format=0/1 (0 = default) whether or not the output should be pretty printed
strip=0/1 (1 = default) whether or not the xml document definition should be stripped when using us-ascii or utf-8 encoding.
I've attached the patch to this email. --Patrick
Patrick Wagstrom wrote:
I'm a bit new to the project, but had a need to nicely format the output from lxml. Also, I have a need to leave the <?xml version="1.0"?> header at the top. So, I made a few changes to the write procedure. There are now two more arguments that can be called:
format=0/1 (0 = default) whether or not the output should be pretty printed
strip=0/1 (1 = default) whether or not the xml document definition should be stripped when using us-ascii or utf-8 encoding.
Since this isn't part of the ElementTree API (which lxml is heading to conform to), I'd personally prefer having output formatters implemented in an external formatting class rather than _ElementTree. Something like PrettyPrint in xml.dom.ext. Others may have different opinions on this. Stefan
On Sun, 2005-10-30 at 14:40 +0100, Stefan Behnel wrote:
Patrick Wagstrom wrote:
I'm a bit new to the project, but had a need to nicely format the output from lxml. Also, I have a need to leave the <?xml version="1.0"?> header at the top. So, I made a few changes to the write procedure. There are now two more arguments that can be called:
format=0/1 (0 = default) whether or not the output should be pretty printed
strip=0/1 (1 = default) whether or not the xml document definition should be stripped when using us-ascii or utf-8 encoding.
Since this isn't part of the ElementTree API (which lxml is heading to conform to), I'd personally prefer having output formatters implemented in an external formatting class rather than _ElementTree. Something like PrettyPrint in xml.dom.ext.
Others may have different opinions on this.
Once again, newbie disclaimer applies. I've done some digging on this, and all of the external approaches seem likely to carry a large performance hit, mainly because I'd be reimplementing a large portion of what libxml2 already does underneath lxml. That's why I decided to add the two extra arguments to write, and make them optional, so in the purest sense, compatibility is maintained (ElementTree programs should work fine with lxml, though not 100% the other way around, I guess). This made sense to me because I sorta saw lxml as a bit of a successor.
I did, however, see one bright spot. Apparently there may be a pretty printer in ElementTree at some point in the future as part of the ElementLib module. However, the last posting I can find relating to this is from March 2004[1], and I haven't been able to find the development version of ElementTree (maybe I'm just not looking in the right spot). If we're shooting for full compatibility, then lxml should follow the same syntax. Anyone know where I could find the proposed changes to ElementTree? Thanks! --Patrick
[1] http://effbot.org/zone/element-lib.htm
Patrick Wagstrom wrote:
On Sun, 2005-10-30 at 14:40 +0100, Stefan Behnel wrote:
Patrick Wagstrom wrote:
I made a few changes to the write procedure. There are now two more arguments that can be called:
format=0/1 (0 = default) whether or not the output should be pretty printed
strip=0/1 (1 = default) whether or not the xml document definition should be stripped when using us-ascii or utf-8 encoding.
Since this isn't part of the ElementTree API (which lxml is heading to conform to), I'd personally prefer having output formatters implemented in an external formatting class rather than _ElementTree. Something like PrettyPrint in xml.dom.ext.
I've done some digging on this, and all of the methods seem like they're going to require some sort of large performance hit in order to do, mainly because I'm going to be reimplementing a large portion of what libxml2 does underneath lxml to begin with.
I didn't mean to use those classes, I was just commenting on the API. In the background, you'd obviously reuse what libxml2 has to offer.
That's why I decided to add the extra two arguments to write, and make them optional, so in the most pure sense, compatibility is maintained (ElementTree programs should work fine with lxml, not 100% the other way I guess).
Well, having given it a bit more thought, I don't oppose your way of adding it anymore. Since it's both backwards compatible and an obvious enhancement of the API, why not just add it? Still, one thing: do not use 0/1 for the format argument. That's C-ish. You're working on a Python API here, so make that True/False. And: please, add a test case in tests/test_etree.py !
This made sense to me because I sorta saw lxml as a bit of a successor.
Actually, the real successor is cElementTree. :)
I did however, see one bright spot. Apparently there may be a pretty printer in ElementTree at some point in the future as part of the ElementLib module. However, the last posting I can find relating to this is from March 2004[1], and I haven't been able to find where I can find the development version of ElementTree (maybe I'm just not looking in the right spot). If we're shooting for full compatibility, then lxml should follow the same syntax. Anyone know where I could find the proposed changes to ElementTree? [1] http://effbot.org/zone/element-lib.htm
Since this is pretty old and V1.3 still seems to be pretty far from the door, maybe you'd have to ask Fredrik Lundh to see how real this extension has become. Otherwise, just go with the keyword arguments. Stefan
Hey, I just read the interesting discussion in the thread. Thanks guys! Using keyword arguments (with True/False, or perhaps some status code if more options are possible) seems like a reasonable approach. Perhaps the default signature should include: pretty_print=False and prologue=False or something like that. The whole prologue story is a bit messy in lxml by the way; I did some hackery to create ElementTree compatibility in not showing the prologue but perhaps we can do something saner than what I did... Regards, Martijn
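A minimal sketch of the keyword-argument signature suggested above, written against the standard library's xml.etree.ElementTree rather than lxml, so it can run without the patch. The helper name serialize is hypothetical, and pretty_print and prologue are simply the names floated in this message, not a confirmed lxml API. ET.indent needs Python 3.9 or later.

```python
import copy
import xml.etree.ElementTree as ET


def serialize(root, pretty_print=False, prologue=False):
    """Serialize an element to a str (hypothetical helper, not lxml API)."""
    if pretty_print:
        # Indent a copy so the caller's tree is left untouched.
        root = copy.deepcopy(root)
        ET.indent(root, space="  ")
    # xml_declaration=True emits the <?xml ...?> header; False suppresses it.
    return ET.tostring(root, encoding="unicode", xml_declaration=prologue)


root = ET.fromstring("<a><b>text</b><c/></a>")
compact = serialize(root)                                   # defaults: one line, no header
pretty = serialize(root, pretty_print=True, prologue=True)  # indented, with header
```

Both options default to False, so existing callers that pass only the element keep the old behaviour, which is the backwards-compatibility argument made earlier in the thread.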
Martijn Faassen wrote:
The whole prologue story is a bit messy in lxml by the way; I did some hackery to create ElementTree compatibility in not showing the prologue but perhaps we can do something saner than what I did...
My branch contains a fix for that. See the diff of revisions 18903:18905. It is related to the bug found by Carlos Pita. Stefan
Hi there, Thinking about this some more, it might be nice to get a document/doctest particularly about serializing XML in various ways, pretty printing, c14n, and so on, all in one place. This we can then include in the doc directory. Any volunteers to write a little story with examples? Regards, Martijn
Martijn Faassen wrote:
Thinking about this some more, it might be nice to get a document/doctest particularly about serializing XML in various ways, pretty printing, c14n, and so on, all in one place. This we can then include in the doc directory.
Now that you mention it: Some of the non output-related test cases look sub-optimal to me as they test for a specific XML output instead of specific properties of the result tree. These are easily broken when we start fiddling around with the XML serialization, even without changing the parts they are supposed to test... I was too lazy to change them so far; my remaining patch is sufficiently big and conflict-prone already. Maybe it's best to use reverse-test-driven development: If the test breaks while you were doing something unrelated, it's time to fix it. :) That's also the simplest way of determining the right person for the clean-up. Stefan :]
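To illustrate the distinction drawn here, a small sketch using the standard library's xml.etree.ElementTree (not lxml's actual test suite; the element and attribute names are made up for the example). The commented-out assertion pins the exact serialized bytes, while the live assertions re-parse the output and check properties of the tree.

```python
import xml.etree.ElementTree as ET

# Build a tiny tree; tag and attribute names are illustrative only.
root = ET.Element("doc")
child = ET.SubElement(root, "item", {"id": "1", "kind": "x"})
child.text = "hello"

serialized = ET.tostring(root)

# Fragile: depends on attribute order, quoting and whitespace choices
# made by the serializer, none of which this test is about.
# assert serialized == b'<doc><item id="1" kind="x">hello</item></doc>'

# More robust: re-parse and check properties of the resulting tree.
reparsed = ET.fromstring(serialized)
items = reparsed.findall("item")
assert len(items) == 1
assert items[0].get("id") == "1"
assert items[0].text == "hello"
```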
Stefan Behnel wrote:
Martijn Faassen wrote:
Thinking about this some more, it might be nice to get a document/doctest particularly about serializing XML in various ways, pretty printing, c14n, and so on, all in one place. This we can then include in the doc directory.
Now that you mention it: Some of the non output-related test cases look sub-optimal to me as they test for a specific XML output instead of specific properties of the result tree. These are easily broken when we start fiddling around with the XML serialization, even without changing the parts they are supposed to test...
I'm actually fine with testing XML output; such tests are easier to write and read both. Note that in many of those tests I run the output through c14n to make sure that differences in serialization are eliminated. Regards, Martijn
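A rough sketch of the c14n idea, assuming the standard library's xml.etree.ElementTree.canonicalize (Python 3.8 or later, implementing C14N 2.0) as a stand-in for whatever the lxml tests call. The two input strings are invented for the example; they differ in attribute order, quote style and empty-element syntax, yet compare equal after canonicalization.

```python
import xml.etree.ElementTree as ET

# Two serializations of the same document that differ only in surface form.
a = '<doc><item kind="x" id="1">hello</item><empty/></doc>'
b = "<doc><item id='1' kind='x'>hello</item><empty></empty></doc>"

# canonicalize() returns the canonical form as a str when no output
# file is given.
canonical_a = ET.canonicalize(a)
canonical_b = ET.canonicalize(b)

assert a != b
assert canonical_a == canonical_b
```

Whitespace between elements is still significant by default, so a pretty-printed document would not match a compact one unless the text is normalized as well (for example with the strip_text=True option).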
participants (3)
- Martijn Faassen
- Patrick Wagstrom
- Stefan Behnel