Could the documentation of lxml include some explanation and examples of how to format output? There is very little of it in the current documentation, and I haven’t found anything about it on Shipman’s otherwise excellent site. There is something about it on Fredrik Lundh’s ElementTree site, but I can’t quite make sense of it, and since lxml seems to have established itself as the go-to Python place for munging XML, it should have self-contained documentation about every aspect of that munging.

lxml has a “pretty_print=True” keyword. It would be better if the ‘magic’ of it were explained and it were modifiable. It works for most of my purposes, but has some quirks. My files are all text files where every token is a <w> element. Sometimes I split an existing token and create a new token using ‘addnext’ to insert it. Whenever I do that, pretty_print doesn’t put the new element on a new line. This puzzles me because, as far as I know, pretty_print operates on a tree when it is written out. How would it know or care whether a <w> element has been in that tree forever or has just been added?

Many formatting routines seem to rely on the nesting depth, adding a space or tab character for every additional level. This does not make a whole lot of sense for linguistically annotated texts where every token is a leaf node wrapped in a <w> or <pc> element and that leaf node may be eight or more levels below the top level. Nor does it make sense in that environment to distinguish between inline and block elements for the purpose of formatting. The simplest, most economical, and most readable formatting would take the form of saying:

1. Start all new elements at the left margin, except for <pc> and <w>
2. Indent <pc> and <w> by two spaces

Or alternately:

1. Put all <w> and <pc> elements at the beginning of a line
2. Indent all other elements by two spaces

I’d be very grateful for any advice on how to do this.
And I think it would be a good thing if the lxml documentation included a section on formatting choices, preferably with an explanation and examples.
Hi Martin,

Martin Mueller wrote on 28.03.19 at 19:03:
> Could the documentation of lxml include some explanation and examples of how to format output?
It … could, sure. :)
> There is very little of it in the current documentation, and I haven’t found anything about it on Shipman’s otherwise excellent site. There is something about it on Fredrik Lundh’s ElementTree site, but I can’t quite make sense of it
You probably mean this formatting recipe:
http://effbot.org/zone/element-lib.htm#prettyprint

lxml refers to that in its FAQ:
https://lxml.de/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml...
> and since lxml seems to have established itself as the go-to Python place for munging XML, it should have self-contained documentation about every aspect of that munging.
> lxml has a “pretty_print=True” keyword. It would be better if the ‘magic’ of it were explained and it were modifiable. It works for most of my purposes, but has some quirks.
It's C implemented, fast, and not very configurable. For everything else, there's the Python recipe.
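For reference, a minimal sketch of what the built-in option does on a tree that contains no whitespace between elements. The sample document is made up for illustration; `pretty_print` is simply an on/off keyword of `etree.tostring()`.

```python
from lxml import etree

# A tree with no whitespace between its elements (made-up sample data).
root = etree.fromstring("<s><w>can</w><w>not</w></s>")

# Without pretty_print everything is serialised on one line.
compact = etree.tostring(root, encoding="unicode")

# With pretty_print, child elements are put on their own lines and
# indented by two spaces per nesting level.
pretty = etree.tostring(root, pretty_print=True, encoding="unicode")

print(compact)
print(pretty)
```

The text inside each `<w>` stays on one line because an element that contains text is left unformatted.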
> My files are all text files where every token is a <w> element. Sometimes I split an existing token and create a new token using ‘addnext’ to insert it. Whenever I do that, pretty_print doesn’t put the new element on a new line. This puzzles me because, as far as I know, pretty_print operates on a tree when it is written out. How would it know or care whether a <w> element has been in that tree forever or has just been added?
You can influence output formatting through the ".tail" text. As a heuristic, lxml's pretty-print assumes that a series of sibling elements with text between them (including whitespace-only text left over from parsing) is document-style XML, which would suffer from inserted whitespace. So it doesn't insert any.
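A small sketch of that effect, under the assumption that the file was parsed with its original line breaks still in place. The helper name `strip_blank_whitespace` and the sample data are made up for this example. Once the whitespace-only `.text` and `.tail` strings are removed, nothing is left between the siblings and `pretty_print` indents the whole tree again, including the token added with `addnext`.

```python
from lxml import etree


def strip_blank_whitespace(root):
    """Drop whitespace-only .text/.tail so pretty_print can re-indent.

    Caveat: a token whose content is only whitespace would be emptied too.
    """
    for el in root.iter("*"):
        if el.text is not None and not el.text.strip():
            el.text = None
        if el.tail is not None and not el.tail.strip():
            el.tail = None


# Parsed input keeps its line breaks as .text/.tail strings.
root = etree.fromstring("<s>\n  <w>cannot</w>\n</s>")

# Split the token: the new <w> has no whitespace tail of its own.
first = root[0]
first.text = "can"
new = etree.Element("w")
new.text = "not"
first.addnext(new)

# The leftover whitespace makes <s> look like mixed content,
# so pretty_print leaves it exactly as it is.
before = etree.tostring(root, pretty_print=True, encoding="unicode")

strip_blank_whitespace(root)
after = etree.tostring(root, pretty_print=True, encoding="unicode")
print(after)
```

The FAQ entry linked above describes a parser-level alternative, `etree.XMLParser(remove_blank_text=True)`, which discards ignorable whitespace at parse time instead.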
If you need a special-purpose indentation algorithm for your specific XML format, it's usually best to adapt the Python recipe in a way that inserts indentation in the right places.

Does this help?

Stefan
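As a rough illustration of such an adaptation, here is a sketch of the first layout Martin describes: every element starts at the left margin, and only `<w>` and `<pc>` are indented by two spaces. It is not taken from lxml or from the recipe itself. `flat_indent` and `TOKEN_TAGS` are names invented here, the tags are assumed to be un-namespaced (a TEI file would need the `{namespace}w` form), and it assumes no meaningful text lives outside the tokens.

```python
from lxml import etree

TOKEN_TAGS = {"w", "pc"}  # assumption: tokens are un-namespaced <w> and <pc>


def _blank(s):
    """True for missing or whitespace-only text."""
    return s is None or not s.strip()


def flat_indent(el, token_indent="  "):
    """Rewrite whitespace-only .text/.tail so each element starts a line.

    Tokens are indented by token_indent; everything else starts at the
    left margin, regardless of nesting depth.
    """
    if el.tag in TOKEN_TAGS:
        return  # never touch the inside of a token
    children = list(el.iterchildren("*"))  # elements only, skips comments/PIs
    if not children:
        return

    def lead(child):
        return "\n" + (token_indent if child.tag in TOKEN_TAGS else "")

    if _blank(el.text):
        el.text = lead(children[0])
    for child, following in zip(children, children[1:] + [None]):
        flat_indent(child, token_indent)
        if _blank(child.tail):
            # whitespace before the next sibling, or before el's closing tag
            child.tail = lead(following) if following is not None else "\n"


root = etree.fromstring(
    "<text><p><s><w>can</w><w>not</w><pc>.</pc></s></p></text>"
)
flat_indent(root)
result = etree.tostring(root, encoding="unicode")  # note: no pretty_print
print(result)
```

Because only whitespace-only `.text` and `.tail` values are rewritten, calling `flat_indent` again after an `addnext` should bring the new token into line with its neighbours.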
For what it's worth, I needed some special-purpose formatting that I couldn't figure out how to do with the listed Python function, so I hacked something together in a really ugly way (I hadn't used Python or lxml before). It is available in the PP section (which pretty-prints an SVG file) of this GitHub repo, if it helps:

https://github.com/vanepp/FritzingCheckPart

If you find a better way to do it I'd appreciate hearing of it (there almost certainly is one :-) ).

Peter Van Epp
Participants (3): Martin Mueller, Peter Van Epp, Stefan Behnel