I have texts that are linguistically annotated, and every token is wrapped in a <w> or <pc> element, like so:

    <l xml:id="c0abe-e140">
      <w lemma="and" pos="cc" xml:id="c0abe-001-a-0670">and</w>
      <w lemma="therein" pos="av" xml:id="c0abe-001-a-0680">therein</w>
      <w lemma="take" pos="vvb" xml:id="c0abe-001-a-0690">take</w>
      <w lemma="delight" pos="n1" xml:id="c0abe-001-a-0700">delight</w>
      <pc xml:id="c0abe-001-a-0710">:</pc>
    </l>

I'd like to strip the annotation and generate a plain file in which the <w> and <pc> elements are replaced by just their content. Some years ago I used the strip_tags method, and I remember it working. By this I mean that the words in a line of text appeared with a space between them.

But when I use it now, it does strip the tags, but the text appears verticalized, with one token on each line. I've tried various methods of getting rid of the line breaks, but none of them work.

For instance, I used this bit of code to replace the line break with nothing:

    for element in tree.iter('*'):
        plaintext = element.text
        if plaintext is not None:
            plaintext = plaintext.replace('\n', '')

This does something: if I print out each element, the line breaks go away. But if I then print out the whole file, using

    print(etree.tostring(tree, encoding='unicode', pretty_print=True))

the result is a verticalized text. Which means that the deletion of the line breaks has to happen at a later stage, but I don't know how to do that.

I'll be grateful for advice.

Martin Mueller
Professor English and Classics
Northwestern University
You could do something like:
    >>> ' '.join([s.strip() for s in etree.parse('/tmp/x.xml').xpath('//*/text()')])
    ' and therein take delight : '
Or you can try using the parser option to remove whitespace text nodes:
    >>> etree.parse('/tmp/x.xml', etree.XMLParser(remove_blank_text=True)).xpath('//*/text()')
    ['and', 'therein', 'take', 'delight', ':']
— Steve M.
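Both suggestions can be tried without a file on disk. The following is only a minimal sketch, assuming the sample line from the question is held in a string (the names SAMPLE, tokens, and tokens2 are illustrative and not from the thread). The first variant filters out the whitespace-only text nodes after stripping, so the joined line has no stray spaces; the second relies on the remove_blank_text parser option.

```python
from lxml import etree

# The sample line from the question, held in a string (an assumption made
# here so that the sketch does not depend on /tmp/x.xml).
SAMPLE = """<l xml:id="c0abe-e140">
  <w lemma="and" pos="cc" xml:id="c0abe-001-a-0670">and</w>
  <w lemma="therein" pos="av" xml:id="c0abe-001-a-0680">therein</w>
  <w lemma="take" pos="vvb" xml:id="c0abe-001-a-0690">take</w>
  <w lemma="delight" pos="n1" xml:id="c0abe-001-a-0700">delight</w>
  <pc xml:id="c0abe-001-a-0710">:</pc>
</l>"""

# Variant 1: collect every text node, strip it, and skip the ones that
# contained nothing but whitespace (the indentation between tokens).
root = etree.fromstring(SAMPLE)
tokens = [s.strip() for s in root.xpath("//text()") if s.strip()]
line = " ".join(tokens)
print(line)

# Variant 2: let the parser discard whitespace-only text nodes up front.
parser = etree.XMLParser(remove_blank_text=True)
root2 = etree.fromstring(SAMPLE, parser)
tokens2 = [str(s) for s in root2.xpath("//text()")]
print(tokens2)
```

Either way the punctuation token ends up separated from the preceding word by a space, because every token is joined with the same separator.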
Hi Martin,

Martin Mueller wrote on 14.09.20 at 22:50:
The line breaks in your example above are stored in the "w" elements' ".tail", not in their contained ".text".

The following might help you to get the word spacings normalised:

    for element in tree.iter("w"):
        if element.tail:
            element.tail = " " + element.tail.strip()
Note that "pretty_print=True" reformats the XML by inserting line breaks, which is exactly what you wanted to avoid here. If you really want a plain-text file in the end, with no markup at all, then use tostring(…, method="text").

https://lxml.de/apidoc/lxml.etree.html#lxml.etree.tostring

Stefan
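To make the ".text" versus ".tail" distinction concrete, here is a small sketch that combines the two pieces of advice above. It assumes the sample line is held in a string and, as an extension not in the original snippet, also normalises the tail of the "pc" element; the names SAMPLE and plain are illustrative.

```python
from lxml import etree

# The sample line from the question, held in a string (an assumption).
SAMPLE = """<l xml:id="c0abe-e140">
  <w lemma="and" pos="cc" xml:id="c0abe-001-a-0670">and</w>
  <w lemma="therein" pos="av" xml:id="c0abe-001-a-0680">therein</w>
  <w lemma="take" pos="vvb" xml:id="c0abe-001-a-0690">take</w>
  <w lemma="delight" pos="n1" xml:id="c0abe-001-a-0700">delight</w>
  <pc xml:id="c0abe-001-a-0710">:</pc>
</l>"""

root = etree.fromstring(SAMPLE)

# The word itself is the element's .text; the line break and indentation
# that follow the closing tag are stored in the element's .tail.
first = root[0]
print(repr(first.text), repr(first.tail))

# Replace each whitespace-only tail with a single space.
for element in root.iter("w", "pc"):
    if element.tail:
        element.tail = " " + element.tail.strip()

# Serialise as plain text, with no markup and no pretty-printing.
plain = etree.tostring(root, method="text", encoding="unicode").strip()
print(plain)
```

The final strip() only removes the whitespace that belongs to the "l" element itself, namely its leading indentation and the trailing space after the last token.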
Thank you, Jens, Stefan, and Steven, for your prompt and helpful responses. I ended up using a variant of Stefan's recommendation:

    tree = etree.parse(filename)
    for w in tree.iter(tei + 'w', tei + 'pc'):
        w.tail = ''
        if w.tag == tei + 'w':
            w.text = ' ' + w.text
    etree.strip_tags(tree, tei + 'w', tei + 'pc')

That said, I wonder whether that problem can be solved more flexibly using XSLT, a language that makes my brain freeze whenever I see a template. But would XSLT in principle be a more powerful way of formatting the tree?
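For readers who want to run this variant, the following is a hedged sketch rather than the exact script from the thread. It assumes that `tei` stands for the TEI namespace in Clark notation, that the sample line carries that namespace, and that the input comes from a string instead of `filename`; the guard against an empty `.text` is also an addition.

```python
from lxml import etree

# Assumption: `tei` in the message above is the TEI namespace in Clark notation.
TEI = "{http://www.tei-c.org/ns/1.0}"

# The sample line from the question, here with the TEI namespace declared.
SAMPLE = """<l xmlns="http://www.tei-c.org/ns/1.0" xml:id="c0abe-e140">
  <w lemma="and" pos="cc" xml:id="c0abe-001-a-0670">and</w>
  <w lemma="therein" pos="av" xml:id="c0abe-001-a-0680">therein</w>
  <w lemma="take" pos="vvb" xml:id="c0abe-001-a-0690">take</w>
  <w lemma="delight" pos="n1" xml:id="c0abe-001-a-0700">delight</w>
  <pc xml:id="c0abe-001-a-0710">:</pc>
</l>"""

root = etree.fromstring(SAMPLE)

for token in root.iter(TEI + "w", TEI + "pc"):
    # Discard the line break and indentation after each token.
    token.tail = ""
    # Words get a leading space; punctuation attaches to the preceding word.
    if token.tag == TEI + "w":
        token.text = " " + (token.text or "")  # guard against empty <w/>

# Remove the <w> and <pc> tags but keep their text in the parent <l>.
etree.strip_tags(root, TEI + "w", TEI + "pc")

plain = etree.tostring(root, method="text", encoding="unicode").strip()
print(plain)
```

Unlike the join-based approaches, this keeps the colon attached to "delight", because only "w" elements receive a leading space.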
On Sep 15, 2020, at 4:53 PM, Martin Mueller <martinmueller@northwestern.edu> wrote:
> That said, I wonder whether that problem can be solved more flexibly using XSLT, a language that makes my brain freeze whenever I see a template. But would XSLT in principle be a more powerful way of formatting the tree?
If XSLT makes your brain freeze, there is also XQuery, a functional language with a more procedural look (it is not encoded as XML). It includes XPath and the XPath functions, so it is very good at traversing the tree. (The current versions of Saxon, BaseX, and eXist all support XPath 3.x, whereas lxml supports XPath 1.0, I believe.)

However, the XSLT push style of template processing is very handy for documents whose content is highly variable and unpredictable, which in a procedural language would usually require a lot of case statements or a hand-written template expander.

-- Steve
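As an illustration of the push style described above, here is a minimal sketch of an XSLT 1.0 stylesheet run through lxml's etree.XSLT. It is one possible formulation under stated assumptions, not a recommendation from the thread: the sample line has no namespace (as posted in the question), and the names STYLESHEET, transform, and result are illustrative. A TEI-namespaced document would need prefix-qualified match patterns.

```python
from lxml import etree

# The sample line from the question, without a namespace (as posted).
SAMPLE = """<l xml:id="c0abe-e140">
  <w lemma="and" pos="cc" xml:id="c0abe-001-a-0670">and</w>
  <w lemma="therein" pos="av" xml:id="c0abe-001-a-0680">therein</w>
  <w lemma="take" pos="vvb" xml:id="c0abe-001-a-0690">take</w>
  <w lemma="delight" pos="n1" xml:id="c0abe-001-a-0700">delight</w>
  <pc xml:id="c0abe-001-a-0710">:</pc>
</l>"""

# One small template per element type; the processor "pushes" each node
# to whichever template matches it.
STYLESHEET = etree.XML("""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- A verse line: process its children, then end the line. -->
  <xsl:template match="l">
    <xsl:apply-templates/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- Text directly inside l is only indentation between tokens: drop it. -->
  <xsl:template match="l/text()"/>

  <!-- A word: a leading space, then its content. -->
  <xsl:template match="w">
    <xsl:text> </xsl:text>
    <xsl:value-of select="."/>
  </xsl:template>

  <!-- Punctuation: its content, with no leading space. -->
  <xsl:template match="pc">
    <xsl:value-of select="."/>
  </xsl:template>
</xsl:stylesheet>
""")

transform = etree.XSLT(STYLESHEET)
result = str(transform(etree.fromstring(SAMPLE)))
print(result.strip())
```

Handling a further element type then means adding one more template instead of another branch in a Python loop, which is the flexibility the push style offers for variable content.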
Participants (3):
- Majewski, Steven Dennis (sdm7g)
- Martin Mueller
- Stefan Behnel