[lxml-dev] Parsing Information

Hi all,
I have recently reimplemented RML (Reportlab's XML format to generate PDFs) using lxml. All works well.
Now, I would like to give my users some more information when an error occurs. For a pure XML parsing error, everything is fine (though I found the failure points hard to interpret at times). But what if the XML parses correctly, but while working with the element tree an error occurs? In this case I would like to tell the user not only the error message, but also the line/column and filename of the point of failure.
Ideally I would have the filename, start row and start column of each element available as part of the etree Element. I have tried to find this information or hooks for it, unsuccessfully.
Could someone help me out here?
Regards, Stephan

Hi,
Stephan Richter wrote:
I have recently reimplemented RML (Reportlab's XML format to generate PDFs) using lxml. All works well.
Interesting. Any chance you could provide a link?
Now, I would like to give my users some more information when an error occurs. For a pure XML parsing error, everything is fine (though I found the failure points hard to interpret at times). But what if the XML parses correctly, but while working with the element tree an error occurs? In this case I would like to tell the user not only the error message, but also the line/column and filename of the point of failure.
This sounds a lot like a problem you could try to solve with validation.
Ideally I would have the filename, start row and start column of each element available as part of the etree Element. I have tried to find this information or hooks for it, unsuccessfully.
There is no API for it, but internally, we have this information for parsed trees, at least the line number - note that exceptions contain the line number already. So we could easily add a property "_line" to elements that returns the line number at which the element was parsed (*if* it was parsed). I don't like the fact so much that libxml2 puts a zero there if the node was created by hand, but I assume that is not too much of a problem either.
I personally prefer "_line" over "line", as this only applies to parsed elements, not all of them, so this is more of a half-working API. Additionally, any additional attribute there goes off the list of children accessible in objectify.
We could also consider adding an external utility module to provide helpers like this that are not really worth polluting the API. Something like
lxml.tools.lineof(element)
But maybe that's just overkill...
Any comments?
Stefan

On Friday 16 March 2007 11:49, Stefan Behnel wrote:
Hi,
Stephan Richter wrote:
I have recently reimplemented RML (Reportlab's XML format to generate PDFs) using lxml. All works well.
Interesting. Any chance you could provide a link?
Sure: http://svn.zope.org/z3c.rml/trunk/src/z3c/rml/
Now, I would like to give my users some more information when an error occurs. For a pure XML parsing error, everything is fine (though I found the failure points hard to interpret at times). But what if the XML parses correctly, but while working with the element tree an error occurs? In this case I would like to tell the user not only the error message, but also the line/column and filename of the point of failure.
This sounds a lot like a problem you could try to solve with validation.
No, I cannot, since some stuff cannot be decided until I do Python calls. For example, I look up colors by names, but this is not a static list.
Ideally I would have the filename, start row and start column of each element available as part of the etree Element. I have tried to find this information or hooks for it, unsuccessfully.
There is no API for it, but internally, we have this information for parsed trees, at least the line number - note that exceptions contain the line number already. So we could easily add a property "_line" to elements that returns the line number at which the element was parsed (*if* it was parsed). I don't like the fact so much that libxml2 puts a zero there if the node was created by hand, but I assume that is not too much of a problem either.
I think a zero is no problem. None would be better. :-)
I personally prefer "_line" over "line", as this only applies to parsed elements, not all of them, so this is more of a half-working API.
That would be perfect.
Additionally, any additional attribute there goes off the list of children accessible in objectify.
I don't understand this sentence. :-)
We could also consider adding an external utility module to provide helpers like this that are not really worth polluting the API. Something like
lxml.tools.lineof(element)
That would be icing on the cake; either way is fine. If you consider such a tool, I would probably call it "parseInfo" or so, where maybe the filename, endline, and column info is available too.
Any comments?
How fast can you do this? :-)
Regards, Stephan

Hi,
Stephan Richter wrote:
Ideally I would have the filename, start row and start column of each element available as part of the etree Element. I have tried to find this information or hooks for it, unsuccessfully.
There is no API for it, but internally, we have this information for parsed trees, at least the line number - note that exceptions contain the line number already. So we could easily add a property "_line" to elements that returns the line number at which the element was parsed (*if* it was parsed). I don't like the fact so much that libxml2 puts a zero there if the node was created by hand, but I assume that is not too much of a problem either.
I think a zero is no problem. None would be better. :-)
Problem is: how would you distinguish 'parsed in line 0' from 'not parsed at all' in this case?
Additionally, any additional attribute there goes off the list of children accessible in objectify.
I don't understand this sentence. :-)
I was talking about lxml.objectify that uses Python object attributes to access XML element children (sort of like data binding to an object tree). Every name that is used as a Python attribute of the _Element class shadows XML children that would otherwise be accessible under that name. Check out the objectify docs to see what I mean.
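The shadowing described here can be seen in a small experiment. This is only a sketch, assuming a reasonably recent lxml with `lxml.objectify` installed; the element names `text` and `item` are made up for illustration and are not taken from this thread.

```python
from lxml import objectify

# Hypothetical document: one child is named like an existing Python
# attribute of the element class ("text"), the other is not ("item").
root = objectify.fromstring("<root><text>hello</text><item>5</item></root>")

# "item" is not an attribute of the element class, so attribute access
# falls through to the XML child of that name.
item_value = root.item

# "text" is already the ElementTree-style text property, so it shadows
# the <text> child. It reports the root's own leading text, and this
# root has none.
shadowed = root.text

# The shadowed child is still reachable by name through item access.
text_child = root["text"]
```

Any new property on the element class (such as a line number) would hide a same-named child in the same way, which is the concern raised above.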
We could also consider adding an external utility module to provide helpers like this that are not really worth polluting the API. Something like
lxml.tools.lineof(element)
That would be icing on the cake; either way is fine. If you consider such a tool, I would probably call it "parseInfo" or so, where maybe the filename, endline, and column info is available too.
The filename would be available from documents, I don't know what you mean by "endline" (the last line number?) and the parser column is not available from libxml2 (at least not once the parser has passed the element...).
So, what about an 'lxml.docinfo' module then that provides this kind of info helper functions? I was never really happy with the DocInfo class, so it might be a good idea to just move this kind of information to a separate module that people can use if they need it.
I'm pretty confident that there is even more that we could provide at that level. And it would help us in keeping the already bigger-than-big-enough API of lxml at least a little smaller.
Stefan

Stefan Behnel wrote: [snip]
We could also consider adding an external utility module to provide helpers like this that are not really worth polluting the API. Something like
lxml.tools.lineof(element)
That would be icing on the cake; either way is fine. If you consider such a tool, I would probably call it "parseInfo" or so, where maybe the filename, endline, and column info is available too.
The filename would be available from documents, I don't know what you mean by "endline" (the last line number?) and the parser column is not available from libxml2 (at least not once the parser has passed the element...).
So, what about an 'lxml.docinfo' module then that provides this kind of info helper functions? I was never really happy with the DocInfo class, so it might be a good idea to just move this kind of information to a separate module that people can use if they need it.
I'm pretty confident that there is even more that we could provide at that level. And it would help us in keeping the already bigger-than-big-enough API of lxml at least a little smaller.
I really think this is overkill. I think an attribute 'line' is fine. lxml has an explicit mission to take ElementTree and expand its API with more functionality. We do this with namespaces, we do this with xpath, and why wouldn't we do this with line numbers? I don't understand how line numbers are different.
By the way, even if 0 is both used for line 0 and elements that have an unknown line number, it seems actually possible to distinguish between the two! What would be required if 'line 0' is found is to go backwards in document order, until a textnode is found that contains a newline. If so, the answer is None. If not (and this can be done quickly), the answer is 0. Oh, possibly even more efficient would be to look for *another* node. If this node contains a line number that's non-0, you know you can return None. That would make the 'line' API pretty reliable.
Regards,
Martijn

Stefan Behnel wrote: [snip]
I personally prefer "_line" over "line", as this only applies to parsed elements, not all of them, so this is more of a half-working API. Additionally, any additional attribute there goes off the list of children accessible in objectify.
I really don't like _line. By strong Python convention, a leading underscore indicates "implementation details", and code external to a class should *not* be touching attributes which start with an underscore unless it knows it's going to do something evil. Initial underscores are not meant to indicate half-working APIs or something.
Accessing _line is not evil, it's just not guaranteed to be correct if you manipulate a parsed tree, or create a tree from scratch. This should simply be documented.
(Are we sure it's half-working, anyway? Does libxml2 start counting lines at 0 or at 1? If at 1 then 0 is entirely unambiguous and we may be able to return None instead reliably)
Anyway, to conclude, I think 'line' is just fine - I believe that's a complete API, if only not a great one if we can't distinguish between line 0 and "no line known".
Regards,
Martijn

Hi everyone,
Stefan Behnel wrote:
There is no API for it, but internally, we have this information for parsed trees, at least the line number - note that exceptions contain the line number already. So we could easily add a property "_line" to elements that returns the line number at which the element was parsed (*if* it was parsed). I don't like the fact so much that libxml2 puts a zero there
Sorry for the FUD. I just checked and found that libxml2 is actually smarter than I remembered from the last time I looked at this. It gives you a 1 for the first line in the parser. So it's actually easy to distinguish between "no line known" and "parsed in line x".
That makes "el.line" a perfectly working API. I called it "el.sourceline" though, to make it clearer that only parsing XML source produces it, not creating Elements in any other way. I also made it writable, just in case someone wants to add line numbers to generated trees or something.
I'll commit it to the trunk. Note that it's not yet supported in objectify, which requires explicit special casing.
Have fun, Stefan
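For readers trying this out, here is a minimal sketch of the `sourceline` property announced above. It assumes a current lxml release; the `None` result for elements that were never parsed is the behaviour of such releases as far as I know, and the sample document is invented.

```python
from lxml import etree

# Hypothetical three-line document.
xml = b"<root>\n  <a/>\n  <b/>\n</root>"
root = etree.fromstring(xml)

# Parsed elements report the 1-based line on which their start tag was found.
parsed_lines = [(el.tag, el.sourceline) for el in root.iter()]

# An element created through the API was never parsed, so no line is known.
made = etree.Element("made")
line_before = made.sourceline

# The property is writable, e.g. to annotate a generated tree.
made.sourceline = 42
line_after = made.sourceline
```

Because line counting starts at 1, "no line known" and "parsed in line x" cannot be confused.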

On Tuesday 20 March 2007 18:14, Stefan Behnel wrote:
I'll commit it to the trunk. Note that it's not yet supported in objectify, which requires explicit special casing.
Cool, I'll check out the trunk then.
Regards, Stephan

Stefan Behnel wrote:
Hi everyone,
Stefan Behnel wrote:
There is no API for it, but internally, we have this information for parsed trees, at least the line number - note that exceptions contain the line number already. So we could easily add a property "_line" to elements that returns the line number at which the element was parsed (*if* it was parsed). I don't like the fact so much that libxml2 puts a zero there
Sorry for the FUD. I just checked and found that libxml2 is actually smarter than I remembered from the last time I looked at this. It gives you a 1 for the first line in the parser. So it's actually easy to distinguish between "no line known" and "parsed in line x".
That makes "el.line" a perfectly working API. I called it "el.sourceline" though, to make it clearer that only parsing XML source produces it, not creating Elements in any other way. I also made it writable, just in case someone wants to add line numbers to generated trees or something.
Is there a file or resource name in there somewhere too? This would be nice to have if, say, you were using xinclude to combine elements from different sources.

Hi Ian,
Ian Bicking wrote:
Stefan Behnel wrote:
There is no API for it, but internally, we have this information for parsed trees, at least the line number - note that exceptions contain the line number
Is there a file or resource name in there somewhere too? This would be nice to have if, say, you were using xinclude to combine elements from different sources.
No, that's only stored at a per-document level (which makes sense IMHO).
Regards, Stefan
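Since the source name lives on the document and the line number on the element, an error message of the kind asked for at the start of this thread can combine the two. The following is a sketch only: `describe_location` is a hypothetical helper, not part of lxml, and the URL and sample markup are made up. It relies on `getroottree()`, `docinfo.URL` and the `base_url` argument of `etree.fromstring` as found in current lxml releases.

```python
from lxml import etree


def describe_location(element):
    """Return 'source:line' for an element, with placeholders for unknowns.

    Hypothetical helper for error messages; not an lxml API.
    """
    # The source name is stored once per document, not per element.
    url = element.getroottree().docinfo.URL or "<unknown source>"
    # The line number is stored per element and is None if never parsed.
    line = element.sourceline
    return "%s:%s" % (url, line if line is not None else "?")


# Invented input; base_url tells the parser where the data came from.
xml = b"<doc>\n  <para color='no-such-color'/>\n</doc>"
root = etree.fromstring(xml, base_url="http://example.com/report.rml")

location = describe_location(root[0])
unknown = describe_location(etree.Element("generated"))
```

A caller could then raise something like `ValueError("unknown color (%s)" % location)` when a lookup in Python code fails after parsing succeeded.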

Stefan Behnel wrote:
Hi Ian,
Ian Bicking wrote:
Stefan Behnel wrote:
There is no API for it, but internally, we have this information for parsed trees, at least the line number - note that exceptions contain the line number
Is there a file or resource name in there somewhere too? This would be nice to have if, say, you were using xinclude to combine elements from different sources.
No, that's only stored at a per-document level (which makes sense IMHO).
What would you do then if you create a document with multiple sources? E.g., if you use xinclude to include elements from different sources into a single document. The line numbers will be nonsense at that point, and there's no clear place to keep track of the real source.

Ian Bicking wrote:
Stefan Behnel wrote:
Hi Ian,
Ian Bicking wrote:
Stefan Behnel wrote:
There is no API for it, but internally, we have this information for parsed trees, at least the line number - note that exceptions contain the line number
Is there a file or resource name in there somewhere too? This would be nice to have if, say, you were using xinclude to combine elements from different sources.
No, that's only stored at a per-document level (which makes sense IMHO).
What would you do then if you create a document with multiple sources? E.g., if you use xinclude to include elements from different sources into a single document. The line numbers will be nonsense at that point, and there's no clear place to keep track of the real source.
Logically, wouldn't the xincluded node have its "own" document reference, with correct filename / URL, since it is just "borrowed" into the including document? I don't know if lxml's / ETree's semantics support such a notion, however.
Tres.

Tres Seaver wrote:
Ian Bicking wrote:
Is there a file or resource name in there somewhere too? This would be nice to have if, say, you were using xinclude to combine elements from different sources.
No, that's only stored at a per-document level (which makes sense IMHO).
What would you do then if you create a document with multiple sources? E.g., if you use xinclude to include elements from different sources into a single document. The line numbers will be nonsense at that point, and there's no clear place to keep track of the real source.
Right. What else should the line number be? It's the line in which the element was found by the parser. If you mix elements from different documents, this information becomes meaningless.
Logically, wouldn't the xincluded node have its "own" document reference, with correct filename / URL, since it is just "borrowed" into the including document?
No. It will refer to the document that contains it (after the inclusion).
I don't know if lxml's / ETree's semantics support such a notion, however.
No. All elements in a document should always refer to this document.
Stefan
participants (5): Ian Bicking, Martijn Faassen, Stefan Behnel, Stephan Richter, Tres Seaver