[lxml-dev] Difference between ResultTree.__str__ and ResultTree.write

Hi,
Could we get the difference between ResultTree.__str__ and ResultTree.write properly documented somewhere ([1]?) on the website or via docstrings etc., as there only seems to be one reference to it that I can find [2].
By the way, the ability to change "output" parameters with write is an excellent addition to this package - such a time saver for me! :)
Thanks, Noah
[1] http://codespeak.net/lxml/api.html
[2] http://codespeak.net/pipermail/lxml-dev/2006-May.txt
-- "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. Stallman

Oh, and a question: How do I force the output method (such as "html" or "text") from a ResultTree object? Thanks, Noah p.s. Congratulations Stefan! :) On 29/05/06, Noah Slater <nslater@gmail.com> wrote:

Noah Slater wrote:
By using a sensible xsl:output element in the stylesheet?
How's this for a starter? http://codespeak.net/svn/lxml/trunk/doc/FAQ.txt Stefan

Hello Stefan,
Interesting response. If I had total control over the stylesheets I would do this, but even then it would still not offer a solution for me.
As I have mentioned before, I am writing an HTTP publication framework. All resources are stored and manipulated internally as DocBook. They are transformed towards the end of the request processing into a resource representation. The type of resource representation is determined using HTTP content negotiation. The result of the negotiation could request gzip'ed PDF, Shift-JIS encoded plain text or regular ISO 8859-1 encoded XHTML 1.1. Either way, I have a plethora of XSLT stylesheets which can perform these transformations at my request.
The best part about using lxml for my purposes is the ability to use ResultTree.write with an encoding to get the desired result. This saves me from constructing a post-processing XSLT stylesheet in memory (as I did when I was using the libxsl bindings) just to change the character set at run-time.
Now you see, here is my actual problem - the use case if you will - for being able to control the output method at runtime via ResultTree.write: If I want to convert my DocBook file to HTML 4.01 I face a bit of a problem. HTML 4.01 should not have any PIs (such as the <?xml?> declaration), so this rules out any ability to declare the character encoding that way. Oh no! What do we do? Without declaring the character encoding in the file, not only does it fail to validate as HTML (check with the W3C validation service), but it reduces the interoperability of my application because clients and UAs have to guess the character encoding - which is a Bad Thing.
You may come back with two responses to this:
"Well who cares if you don't validate? It's only a small error." It may only be HTML, but I don't want Tim Bray to call me a bozo. [1]
"Why don't you specify the correct encoding in the HTTP headers?" I already do. But what happens when the user saves the document to disk? What happens when the UA otherwise copies the file to a local repository? The character encoding is lost - never to be found again. Bozo time!
But wait! There is a solution: just specify the output method as "html" in the stylesheet, that should do it. Let's test with xsltproc from the command line. Yup, we get this in the HTML head generated for us: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> Okay, cool - let's test in my app. Oh wait, it's gone! Time to look through all the documentation. Starting off with [2] doesn't help me much however, so I continued searching. 15 minutes with Google later and I find out it's because of a difference between ResultTree.write and ResultTree.__str__.
So it would seem that I have two options. I can serve documents up in different character encodings, but only as XHTML. I just lost 90% of the web who can't view XHTML properly. [3] Alternatively, I could serve up documents in HTML and other formats, but only in UTF-8 (or another fixed encoding), which kinda sucks. Some clients may not be able to handle UTF-8, so I just lost some more of my audience. I could always serve up using ASCII or some other variant - but I just lost over half the world's population as potential users.
Once again, thank you for pointing this out - I had read it before as it happens. Unfortunately I don't have the source of lxml checked out - I'm sure I must have it on my system somewhere - but I don't know, and don't care, where. My point was that I, presumably like most other users, turn to the website for documentation which is sadly missing this information. I propose that even if it helps just one other person save those 15 minutes googling we should have this FAQ on the web with some obvious way into it from the home page.
Thank you for your time. Regards, Noah
[1] http://www.tbray.org/ongoing/When/200x/2004/01/11/PostelPilgrim
[2] http://codespeak.net/lxml/api.html
[3] http://hixie.ch/advocacy/xhtml

Hi Noah, Noah Slater wrote:
The result of an XSLT is an ElementTree. Feel free to do with it what you like. It's no problem to specify the output method in a second XSLT step. This is untested:
The little stylesheets are small enough to be a) generated on the fly b) cached c) quickly compiled by XSLT() after adapting the xsl:output element of a base XSL document (note that XSLT copies the document internally, so this works). Feel free to find out which is fastest. If you're working on a publishing framework, not being able to run a stylesheet cascade on a document would be pretty much of a no-go in my eyes.
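A second XSLT step of the kind described here might look roughly like the sketch below. It is not the snippet from the original message; the names OUTPUT_XSL, output_step and render are made up for illustration. It assumes a reasonably current lxml on Python 3, where bytes() on the result of a transformation serializes according to the stylesheet's xsl:output element.

```python
from lxml import etree

# Hypothetical output-only stylesheet: it copies the input unchanged and
# only exists to carry an xsl:output element chosen at run time.
OUTPUT_XSL = """\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="{method}" encoding="{encoding}"/>
  <xsl:template match="/">
    <xsl:copy-of select="."/>
  </xsl:template>
</xsl:stylesheet>
"""

# Compiled stylesheets, keyed by (method, encoding), so each variant
# is generated and compiled only once.
_compiled = {}


def output_step(method, encoding):
    """Return a cached XSLT object that only sets the output parameters."""
    key = (method, encoding)
    if key not in _compiled:
        xsl = etree.XML(OUTPUT_XSL.format(method=method, encoding=encoding))
        _compiled[key] = etree.XSLT(xsl)
    return _compiled[key]


def render(tree, method, encoding):
    """Run the output step on a tree and serialize it as xsl:output says."""
    result = output_step(method, encoding)(tree)
    return bytes(result)


doc = etree.XML(
    "<html><head><title>Demo</title></head>"
    "<body><p>Hello</p></body></html>"
)
html_bytes = render(doc, "html", "ISO-8859-1")
xml_bytes = render(doc, "xml", "ISO-8859-1")
```

In a cascade, `doc` would be the ResultTree of the main DocBook transformation rather than a hand-built tree, and the method and encoding would come from content negotiation.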
I think that's a pretty quick thing to do with lxml.
Right, libxml2 actually has a special HTML output API, but that's not currently wrapped by lxml. I don't know what it does exactly, though. http://xmlsoft.org/html/libxml-HTMLtree.html
and I guess you meant that you used write() instead of str(). Because "str() does what you want"^TM. :) I think you still have some easy choices: post-process with an output stylesheet, keep multiple slightly different stylesheets in memory that you generate at startup time, modify the ResultTree before serializing (I never tested this, but I wouldn't know why it should not work), ...
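The "multiple slightly different stylesheets" option could be sketched as follows, assuming the base stylesheet is available as a parsed root element. The helper name compile_variant and the tiny BASE stylesheet are hypothetical; the sketch works on a copy of the base so the original stays untouched, and it appends a new xsl:output at the end of the stylesheet so that any xsl:import elements remain first.

```python
import copy

from lxml import etree

XSL_NS = "http://www.w3.org/1999/XSL/Transform"


def compile_variant(base_root, method, encoding):
    """Compile a copy of a base stylesheet with adapted xsl:output settings."""
    root = copy.deepcopy(base_root)
    output = root.find("{%s}output" % XSL_NS)
    if output is None:
        # No xsl:output yet: add one as the last top-level element.
        output = etree.SubElement(root, "{%s}output" % XSL_NS)
    output.set("method", method)
    output.set("encoding", encoding)
    return etree.XSLT(root)


# Hypothetical base stylesheet standing in for a real DocBook stylesheet.
BASE = etree.XML("""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <html><head><title>Demo</title></head>
    <body><p><xsl:value-of select="/doc"/></p></body></html>
  </xsl:template>
</xsl:stylesheet>
""")

# Generate the variants once, e.g. at startup, and pick one per request.
variants = {
    (method, encoding): compile_variant(BASE, method, encoding)
    for method in ("html", "xml")
    for encoding in ("UTF-8", "ISO-8859-1")
}

source = etree.XML("<doc>Hello</doc>")
html_out = bytes(variants[("html", "UTF-8")](source))
xml_out = bytes(variants[("xml", "ISO-8859-1")](source))
```

Whether this beats a separate post-processing step is something to measure; it trades memory for one transformation less per request.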
Yeah, Martijn is sometimes a bit slow with updating the web page, but this time it's not really his fault. The FAQ is constantly evolving and since 1.0 is pretty close, we'll update the web page in one step when it comes out.
The "obvious way" is a link that's in the pages of the trunk. Just wait a few days and it will be online. Since you're working with a recent version anyway, you should not hesitate to refer to the SVN version of the documentation, though. Remember that the official current version is still 0.9.2, which the online documentation describes. Stefan

Thanks for your help, as always, Stefan. I read and understood all that you said - however, I am still left wondering if this is something we will ever be able to do via ResultTree.write? Is this a hard problem? Friendly, Noah :)

participants (2): Noah Slater, Stefan Behnel