Mailman 3 formatting lxml output - lxml - The Python XML Toolkit

March 28, 2019

Could the documentation of lxml include some explanation and examples of how to format output? There is very little of it in the current documentation, and I haven’t found anything about it on Shipman’s otherwise excellent site. There is something about it on Fredrik Lundh’s ElementTree site, but I can’t quite make sense of it, and since lxml seems to have established itself as the go-to Python place for munging XML, it should have self-contained documentation about every aspect of that munging.

lxml has a “pretty_print=True” keyword. It would be better if the ‘magic’ of it were explained and it were modifiable. It works for most of my purposes, but has some quirks. My files are all text files where every token is a <w> element. Sometimes I split an existing token and create a new token, using ‘addnext’ to insert it. Whenever I do that, pretty_print doesn’t put the new element on a new line. This puzzles me because, as far as I know, pretty_print operates on a tree when it is written out. How would it know or care whether a <w> element has been in that tree forever or has just been added?
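
A likely explanation, going by the lxml FAQ entry on pretty printing: pretty_print does not add whitespace inside an element that already contains text, and that includes the whitespace-only text the parser keeps between elements. An element inserted with ‘addnext’ brings no such whitespace of its own, so it ends up flush against a neighbour. The sketch below reproduces this with a made-up three-token sentence (the SOURCE string and the split_first_token helper are illustrative, not taken from the files described above) and shows the remedy the FAQ suggests, parsing with remove_blank_text=True.

```python
from lxml import etree

# Hypothetical sample: a sentence that already contains indentation whitespace.
SOURCE = "<s>\n  <w>cannot</w>\n  <w>go</w>\n</s>"


def split_first_token(root):
    """Split the first <w> into 'can' + 'not', inserting the new token with addnext."""
    first = root[0]
    first.text = "can"
    new_token = etree.Element("w")
    new_token.text = "not"
    first.addnext(new_token)


# Default parser: whitespace between elements is kept as text, so
# pretty_print leaves the content of <s> exactly as it finds it.
kept = etree.fromstring(SOURCE)
split_first_token(kept)
kept_out = etree.tostring(kept, pretty_print=True, encoding="unicode")

# Parser that drops whitespace-only text, leaving pretty_print free to indent.
parser = etree.XMLParser(remove_blank_text=True)
stripped = etree.fromstring(SOURCE, parser)
split_first_token(stripped)
stripped_out = etree.tostring(stripped, pretty_print=True, encoding="unicode")

print(kept_out)      # two <w> elements share a line
print(stripped_out)  # every <w> is on its own line
```

If re-parsing is not convenient, newer lxml releases (4.5 and later, as far as I know) also provide etree.indent(), which rewrites the whitespace of an existing tree in place before it is serialised.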

Many formatting routines seem to rely on the nesting depth, adding a space or tab character for every additional level. This does not make a whole lot of sense for linguistically annotated texts where every token is a leaf node wrapped in a <w> or <pc> element and that leaf node may be eight or more levels below the top level. Nor does it make sense in that environment to distinguish between inline and block elements for the purpose of formatting. The simplest, most economical, and most readable formatting would take the form of saying:

  1.  Start all new elements at the left margin, except for <pc> and <w>
  2.  Indent <pc> and <w> by two spaces

Or alternatively:

  1.  Put all <w> and <pc> elements at the beginning of a line
  2.  Indent all other elements by two spaces
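
Either rule set could be sketched as a small function that chooses indentation by tag name rather than by depth. This is only a sketch under stated assumptions: format_by_tag, TOKEN_TAGS and the sample document are hypothetical names and data, not part of lxml. It relies only on the .text and .tail attributes of elements, and it leaves any non-blank text alone, so mixed content is not reflowed.

```python
from lxml import etree

TOKEN_TAGS = {"w", "pc"}  # token-level elements named in the lists above


def _indent_for(element, token_indent, other_indent):
    """Return the leading whitespace for an element, chosen by tag, not depth."""
    tag = element.tag
    if not isinstance(tag, str):          # comments and processing instructions
        return other_indent
    local_name = tag.rsplit("}", 1)[-1]   # drop a "{namespace}" prefix if present
    return token_indent if local_name in TOKEN_TAGS else other_indent


def _is_blank(text):
    """True for missing or whitespace-only text."""
    return text is None or not text.strip()


def format_by_tag(root, token_indent="  ", other_indent=""):
    """Rewrite whitespace-only text/tail so that every element starts a new line."""
    # Snapshot the elements first so the loop never depends on live iteration.
    for parent in list(root.iter()):
        children = list(parent)
        if not children:
            continue                      # leaf: keep its text untouched
        # Whitespace before the first child.
        if _is_blank(parent.text):
            parent.text = "\n" + _indent_for(children[0], token_indent, other_indent)
        # Whitespace after each child is the indent of whatever comes next:
        # the following sibling, or the parent's closing tag.
        followers = children[1:] + [parent]
        for child, follower in zip(children, followers):
            if _is_blank(child.tail):
                child.tail = "\n" + _indent_for(follower, token_indent, other_indent)


# Hypothetical sample document, serialised without pretty_print.
doc = etree.fromstring(
    "<text><body><p><s><w>I</w><w>can</w><pc>.</pc></s></p></body></text>"
)
format_by_tag(doc)  # first rule set: tokens indented, everything else at the margin
print(etree.tostring(doc, encoding="unicode"))
```

Calling format_by_tag(doc, token_indent="", other_indent="  ") gives the second rule set. Because the function only touches whitespace-only text, it can be re-run after tokens have been split or inserted.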

I’d be very grateful for any advice on how to do this. And I think it would be a good thing if the lxml documentation included a section on formatting choices, preferably with an explanation and examples.

Martin Mueller

Stefan Behnel

Peter Van Epp

participants (3)