a question about pretty_print
There is something I don’t understand about the behaviour of the pretty_print function (or is it a method?). I work exclusively with linguistically annotated texts where every token is wrapped in a <w> element, and pretty_print does a nice job with them. I often edit these files, updating, splitting, joining, or deleting particular elements. If I create another element and use ‘addnext’ to insert it as a right sibling, pretty_print fails and doesn’t print it on a new line. Something like

<w lemma="the" pos="d" xml:id="b2afn-048-a-0570" ana="the/d">The</w>
<w xml:id="b2afn-048-a-0580" lemma="〈…〉" pos="zz" ana="〈…〉/zz">〈…〉</w>

becomes

<w lemma="there" pos="av" xml:id="b2afn-048-a-0570" ana="the/d">There</w>
<w xml:id="b2afn-048-a-0580" lemma="be" pos="vvb" ana="〈…〉/zz">be</w><w xml:id="b2afn-048-a-0581" lemma="n2" pos="n2" reg="ravens">rauyns</w>

As I write this, it occurs to me that this may have nothing to do with pretty_print but with what addnext does or doesn’t do. But is there a routine that would guarantee that a newly inserted element would by default display with the same indentation as its left sibling?

MM
Frederik Elwert wrote on 18.07.19 at 11:01:
> The issue here is that lxml uses some kind of heuristic to determine if the
> whitespace already present in a document is meaningful and should be preserved
> as is, or if it can be changed during prettyprinting.

Correct. Here is an example:

… <x>hel</x><y>lo</y> …

In this case, adding a newline between the "x" and "y" tags would cut the word "hello" in two, and that is something that the heuristic tries to avoid.

But if you set the tail text of the "x" tag above to some whitespace string (space, newline, etc.), then it can't hurt much to insert a newline and further indentation in this spot, so the pretty-printer would do that.

This is pretty much what happens when new tags are inserted that do not have any tail text set, which is why the pretty-printer handles them more carefully.

Stefan
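To make the role of the tail text concrete, here is a minimal sketch of an insertion routine that gives a new element the same whitespace as its left sibling. The helper name insert_after and the sample data are hypothetical and not part of lxml. The sketch assumes element-only content (nothing but whitespace between sibling elements), and it sets both tails explicitly instead of relying on what addnext() does with an existing tail.

```python
from lxml import etree

# Hypothetical sample data: element-only content, one <w> per line.
SOURCE = """<s>
  <w lemma="there">There</w>
  <w lemma="be">be</w>
</s>"""


def insert_after(left, new):
    """Insert `new` as the right sibling of `left`, reusing existing whitespace.

    Hypothetical helper, not part of lxml. It assumes element-only content,
    i.e. the text between sibling elements is whitespace or absent.
    """
    prev = left.getprevious()
    # Whitespace in front of `left`: the previous sibling's tail, or the
    # parent's text if `left` is the first child.
    before_left = prev.tail if prev is not None else left.getparent().text
    after_left = left.tail
    left.addnext(new)
    # Set both tails explicitly rather than relying on what addnext()
    # does with the existing tail text.
    left.tail = before_left   # whitespace that now separates left and new
    new.tail = after_left     # new inherits whatever followed left
    return new


root = etree.fromstring(SOURCE)
be = root[1]
ravens = etree.Element("w", reg="ravens")
ravens.text = "rauyns"
insert_after(be, ravens)

# Serialized without pretty_print: the whitespace shown is exactly the
# text and tail strings stored in the tree.
result = etree.tostring(root, encoding="unicode")
print(result)
```

Because the whitespace is stored in the tree itself, the new element ends up on its own line at the same indentation as its neighbours, whatever the serializer's heuristic decides.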
Now I understand. In my work, everything is wrapped in elements. Is there a way of overriding this precaution through another heuristic? It would be simple enough to fix things at a later stage by putting a line break into any sequence like </w><w, but there ought to be a more elegant solution.

MM

On 7/20/19, 2:23 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
> [...] This is pretty much what happens when new tags are inserted that do not
> have any tail text set, which is why the pretty-printer handles them more
> carefully.
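For element-only documents like the ones described here, one way to override the precaution is to re-indent the whole tree before serializing. The sketch below uses etree.indent(), which is available in lxml 4.5 and later, so it postdates this thread. The sample data is hypothetical. This is only reasonable when no element mixes text with child elements, because the call rewrites every whitespace-only or missing text and tail.

```python
from lxml import etree

# Hypothetical sample: the last two <w> elements share a line,
# as in the output shown in the original question.
SOURCE = """<s>
  <w lemma="there">There</w>
  <w lemma="be">be</w><w reg="ravens">rauyns</w>
</s>"""

root = etree.fromstring(SOURCE)

# etree.indent() (lxml 4.5+) rewrites whitespace-only or missing text and
# tails in place. Only appropriate for element-only content: in mixed
# content it can add whitespace between pieces of running text.
etree.indent(root, space="  ")

result = etree.tostring(root, encoding="unicode")
print(result)
```

Another commonly used route is to parse with etree.XMLParser(remove_blank_text=True) and then serialize with pretty_print=True, which discards the existing ignorable whitespace so that the pretty-printer is free to lay out the elements itself.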
participants (3)
- Frederik Elwert
- Martin Mueller
- Stefan Behnel