[lxml-dev] XML Processing Instructions
Hello, When I use the write method of the ElementTree class, why does it strip out the XML processing instructions? I would like my documents to start with the processing instruction so I can specify encodings other than UTF-8. Thanks, Noah -- "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. Stallman
Hi Noah, Noah Slater wrote:
When I use the write method of the ElementTree class, why does it strip out the XML processing instructions?
I would like my documents to start with the processing instruction so I can specify encodings other than UTF-8.
Hmm, I didn't verify this, although I actually thought lxml produced a declaration here. If not, this should be considered a bug, as it is likely inconsistent with ElementTree. I guess this is the same problem as for tostring(), which only started having the expected behaviour fairly recently. I'll see what I can do about that and try to fix it on the SVN trunk as soon as I find the time. Stefan
Hi again, Stefan Behnel wrote:
Noah Slater wrote:
When I use the write method of the ElementTree class, why does it strip out the XML processing instructions?
I would like my documents to start with the processing instruction so I can specify encodings other than UTF-8.
I guess this is the same problem as for tostring(), which only started having the expected behaviour fairly recently.
Yes, it /is/ the same problem. You will also notice problems when you serialise trees to XML byte streams containing 0-bytes. Both problems have been fixed on the trunk recently, but after the release of 0.9.2. Please use the Subversion trunk for now, until we have decided if it's worth releasing a 0.9.3 before we have 1.0 ready. http://codespeak.net/svn/lxml/trunk Stefan
Hello, If I understand your email correctly, the behaviour I describe is not intentional and will be fixed shortly? Just to clarify - I can parse any file, but when I serialise I lose any processing instructions. This includes the <?xml ... ?> declaration. This also happens with the ResultTree (?) when I transform using XSLT. As an example, I use the DocBook XSLT stylesheets to transform DocBook XML. These can often set various things up with the processing instructions - character encoding being the most important. When I perform these transformations using ElementTree I lose this information. As a workaround at the moment I am using lxml.etree to do the transformations using UTF-8 as the encoding. I am then using libxml2 and libxslt to transform the serialised document bytestream a second time, with the only operation being the conversion from UTF-8 to another (variable) character encoding. This feels quite hackish - and to be honest the whole point of me moving to lxml was because I find the libxml2 and libxslt bindings hateful. To summarise, in an ideal world I would like to be able to transform a document using XSLT, specifying an encoding at transformation time, and have the ResultTree serialise with all processing instructions intact. Additionally I would like to be able to access these programmatically - which I don't think is possible at the moment. I hope all this makes sense. Thanks, Noah
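For readers on a current lxml release (an assumption; this is not the 0.9.x behaviour discussed in this thread), the following is a minimal sketch of reading a top-level processing instruction and keeping it through serialisation. The stylesheet name style.xsl and the sample document are hypothetical.

```python
from lxml import etree

# Hypothetical sample document: an XML declaration, one top-level
# processing instruction, then the root element.
source = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<?xml-stylesheet type="text/xsl" href="style.xsl"?>\n'
    b'<article><title>Example</title></article>'
)

root = etree.fromstring(source)
tree = root.getroottree()

# The encoding named in the parsed declaration is exposed on docinfo.
source_encoding = tree.docinfo.encoding

# A processing instruction placed before the root element is reachable
# as the root's preceding sibling.
pi = root.getprevious()
pi_target = pi.target
pi_text = pi.text

# Serialising the tree (rather than only the root element) keeps the
# top-level processing instruction and can name another encoding.
output = etree.tostring(tree, encoding="iso-8859-1", xml_declaration=True)
```

This only covers parsing and serialising; it does not show the XSLT ResultTree case described above.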
Hi guys, Sorry, just noticed this now:
I would like my documents to start with the processing instruction so I can specify encodings other than UTF-8.
Hmm, I didn't verify this, although I actually thought lxml produced a declaration here. If not, this should be considered a bug, as it is likely inconsistent with ElementTree. I guess this is the same problem as for tostring(), which only started having the expected behaviour fairly recently.
I disagree on your last point - I think tostring's utility comes from its standalone nature - i.e. no XML declaration, PIs etc. While I think the write/write_c14n methods on an ElementTree should produce the PIs (XML declaration included), I do not think that simple Element serialisation should include an XML declaration. I am not sure how other people use it, but in my case I am using etree.tostring to generate and analyse XML fragments in isolation - I would not want an XML declaration messing things up. I intend to post some code to this list in the next few days, which by coincidence will demonstrate my particular use case for tostring. I hope this makes sense. Thanks, Noah
Hi Noah, Noah Slater wrote:
I would like my documents to start with the processing instruction so I can specify encodings other than UTF-8.
Hmm, I didn't verify this, although I actually thought lxml produced a declaration here. If not, this should be considered a bug, as it is likely inconsistent with ElementTree. I guess this is the same problem as for tostring(), which only started having the expected behaviour fairly recently.
I disagree on your last point - I think tostring's utility comes from its standalone nature - i.e. no XML declaration, PIs etc. While I think the write/write_c14n methods on an ElementTree should produce the PIs (XML declaration included), I do not think that simple Element serialisation should include an XML declaration.
tostring() and write() now produce XML declarations just as ElementTree does. You can switch them off for tostring() by passing "xml_declaration=False", which is consistent with ET 1.3 (as Fredrik told me). Stefan
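A minimal sketch of the switch described above, assuming a release in which the xml_declaration keyword is available. The fallback import of the standard library's xml.etree.ElementTree (Python 3.8 or later) is an assumption added so the example runs without lxml; the element names and text are made up.

```python
import io

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: fall back to the standard library, which accepts the
    # same keyword arguments for these two calls on Python 3.8+.
    import xml.etree.ElementTree as etree

# Made-up sample content with one non-ASCII character.
root = etree.Element("doc")
title = etree.SubElement(root, "title")
title.text = "caf\u00e9"

# Fragment-style output: the declaration is switched off explicitly.
fragment = etree.tostring(root, encoding="iso-8859-1", xml_declaration=False)

# Document-style output: the declaration names the non-UTF-8 encoding.
document = etree.tostring(root, encoding="iso-8859-1", xml_declaration=True)

# write() on an ElementTree accepts the same two keyword arguments.
buffer = io.BytesIO()
etree.ElementTree(root).write(buffer, encoding="iso-8859-1", xml_declaration=True)
written = buffer.getvalue()
```

Passing xml_declaration=False keeps tostring() usable for isolated fragments, which is the use case Noah raised, while write() still produces a full document with its declaration.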
participants (2)
- Noah Slater
- Stefan Behnel