[lxml-dev] Question about newlines
Hey,

When serialising a document there are two places where I would expect lxml to insert newlines, and yet there are none.

1) When adding a PI via the element.addprevious method, the PI has its tail trimmed, so when serialising, the PI runs into the root element.

2) At the very end of the document. POSIX states that all files must end in a newline so I consider this to be a bug.

Perhaps I am missing something, help is much appreciated! :)

Thanks,

-- Noah Slater <http://bytesexual.org/> "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. Stallman
Hi, Noah Slater wrote:
When serialising a document there are two places where I would expect lxml to insert newlines, and yet there are none.
Serialisation will never alter content. That said, there is a separate serialisation API in libxml2 (the xmlSave* functions) that inserts a newline at the end, maybe also after PIs (don't know). But it will not be used in lxml for a while due to API stability issues.
1) When adding a PI via the element.addprevious method, the PI has its tail trimmed, so when serialising, the PI runs into the root element.
2) At the very end of the document. POSIX states that all files must end in a newline so I consider this to be a bug.
XML works on more systems than those that support POSIX. :) One reason these don't happen automatically is that ET doesn't insert newlines either. This is not a hard reason, and maybe we could even change this in 2.0. I'll think about it. Any other opinions on this? Stefan
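The point about ET can be checked with the standard library's xml.etree.ElementTree. The sketch below is only an illustration: the helper name ensure_trailing_newline is hypothetical and not part of ET or lxml, and it works on the serialised bytes rather than changing the serialiser itself.

```python
import xml.etree.ElementTree as ET


def ensure_trailing_newline(data: bytes) -> bytes:
    """Return `data` unchanged if it ends in a newline, else append one.

    Hypothetical helper: it post-processes serialised output and never
    touches the tree, so no element content is altered.
    """
    return data if data.endswith(b"\n") else data + b"\n"


root = ET.Element("root")
ET.SubElement(root, "child").text = "text"

plain = ET.tostring(root)               # ET adds no newline after the root element
fixed = ensure_trailing_newline(plain)  # same bytes plus one final newline
```

Anything that follows the root element's end tag is outside the document content, so appending a newline there does not change what a parser sees.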
On Sun, Dec 09, 2007 at 08:48:17AM +0100, Stefan Behnel wrote:
Serialisation will never alter content. [snip]
1) When adding a PI via the element.addprevious method, the PI has its tail trimmed, so when serialising, the PI runs into the root element.
Well, this is well and good but lxml REMOVES the PI tail so I cannot insert a newline even if I want to. -- Noah Slater <http://bytesexual.org/> "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. Stallman
Noah Slater wrote:
On Sun, Dec 09, 2007 at 08:48:17AM +0100, Stefan Behnel wrote:
Serialisation will never alter content. [snip]
1) When adding a PI via the element.addprevious method, the PI has its tail trimmed, so when serialising, the PI runs into the root element.
Well, this is well and good but lxml REMOVES the PI tail so I cannot insert a newline even if I want to.
Ah, got it. Thanks for insisting. :)

lxml.etree does this on purpose. If you allow character data around the processing instructions that you add as siblings of the root node, you need to make sure it's only whitespace (not 'real' data) to keep the in-memory tree well-formed and to serialise well-formed XML. So the behaviour would be: strip the tail, but keep it if it's whitespace. Sounds a bit ugly to me...

I also noted that libxml2's parser drops whitespace at the root level, which is perfectly fine, as it is the most definitely ignorable whitespace there is.

I personally prefer having lxml add a line break when serialising processing instructions and comments at the root level, and consistently dropping all tail text of PIs and comments appended/prepended to a root node. So the behaviour for the root level would be: drop all whitespace when parsing, and add line breaks around PIs and comments on serialisation.

There's also the document ending issue. The document serialiser of libxml2 does append a newline, and one day, lxml may switch to using it. So I added this behaviour now - and had to adapt tons of test cases that compare serialised XML between ET and lxml. But I don't mind having white-space differences in the serialisation as long as it's well-formed, equivalent XML.

Stefan
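The tail handling described in this message can be tried out as follows. This is a minimal sketch, assuming a reasonably recent lxml in which addprevious() discards the tail text of a PI added as a sibling of the root element; the stylesheet name style.xsl is made up for illustration.

```python
from lxml import etree

root = etree.Element("root")

# Hypothetical stylesheet PI; "style.xsl" is an invented file name.
pi = etree.ProcessingInstruction(
    "xml-stylesheet", 'type="text/xsl" href="style.xsl"'
)
pi.tail = "\n"            # try to keep a line break after the PI

root.addprevious(pi)      # the PI becomes a sibling preceding the root element
tail_after = pi.tail      # tail text is discarded at the root level

# Serialise the whole document so that root-level siblings are included.
serialised = etree.tostring(root.getroottree())
```

Comparing tail_after with the value assigned earlier shows whether the installed version keeps or drops the whitespace, and printing serialised shows whether the PI runs into the root element.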
Stefan Behnel wrote:
Noah Slater wrote:
On Sun, Dec 09, 2007 at 08:48:17AM +0100, Stefan Behnel wrote:
Serialisation will never alter content. [snip]
1) When adding a PI via the element.addprevious method, the PI has its tail trimmed, so when serialising, the PI runs into the root element. Well, this is well and good but lxml REMOVES the PI tail so I cannot insert a newline even if I want to.
Ah, got it. Thanks for insisting. :)
So the behaviour for the root level would be: drop all whitespace when parsing, and add line breaks around PIs and comments on serialisation.
..., but only if pretty printing is requested. I think that's all that's needed here. Stefan
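A small sketch of the pretty-printing variant, assuming an lxml release in which this proposal is in effect (line breaks after root-level PIs and comments only when pretty_print is requested). The comment text and the stylesheet name are invented for the example.

```python
from lxml import etree

root = etree.Element("root")
etree.SubElement(root, "child")

# Root-level siblings, in document order: comment, PI, root element.
root.addprevious(etree.Comment(" generated file "))
root.addprevious(etree.ProcessingInstruction("xml-stylesheet", 'href="style.xsl"'))

tree = root.getroottree()

compact = etree.tostring(tree)                    # default serialisation
pretty = etree.tostring(tree, pretty_print=True)  # pretty printing requested
```

Printing compact and pretty side by side shows where the installed version places line breaks in each mode.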
Noah Slater <nslater@bytesexual.org> writes:
2) At the very end of the document. POSIX states that all files must end in a newline
I disagree. Its definition of Text File does strictly say that. However the rationale implies that the special cases of empty file and file ending in an incomplete line might also be added to the strict definition.

The specifications for ex and sort explicitly allow for an incomplete line, i.e. file not ending with newline. The specs for other utilities that read text files are silent on the matter. That means that it is left to the quality of implementation whether such files are read correctly.

In any case, XML files are probably better thought of as binary from a POSIX point of view. The line length must be restricted to LINE_MAX to qualify as text.

-- Pete Forman, WesternGeco <pete.forman@westerngeco.com> <http://petef.port5.com>
Disclaimer: This post is originated by myself and does not represent the opinion of Schlumberger or WesternGeco.
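To make the text-file criteria mentioned here concrete, the following rough sketch checks serialised bytes against them. The function names are hypothetical, the 2048 fallback is an assumption based on the POSIX minimum for LINE_MAX, and the check is an approximation rather than a full implementation of the POSIX definition.

```python
import io
import os


def line_max(default: int = 2048) -> int:
    """Return the system's LINE_MAX, or `default` where it is unavailable."""
    try:
        value = os.sysconf("SC_LINE_MAX")
    except (AttributeError, ValueError, OSError):
        return default  # e.g. non-POSIX platforms
    return value if value > 0 else default


def text_file_problems(data: bytes, limit: int) -> list:
    """List the reasons why `data` would not count as a text file."""
    problems = []
    if b"\0" in data:
        problems.append("contains NUL bytes")
    if data and not data.endswith(b"\n"):
        problems.append("ends in an incomplete line")
    # BytesIO iteration splits on b"\n" only and keeps the newline,
    # so each length includes the line terminator.
    longest = max((len(line) for line in io.BytesIO(data)), default=0)
    if longest > limit:
        problems.append("line longer than %d bytes" % limit)
    return problems
```

For example, text_file_problems(b"<root/>", line_max()) reports the incomplete final line, while a long single-line XML document with a trailing newline can still fail the line-length check.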
participants (3)
- Noah Slater
- Pete Forman
- Stefan Behnel