Mailman 3 A strip_tags question - lxml - The Python XML Toolkit

14 Sep 2020

I have texts that are linguistically annotated, and every token is wrapped in a <w> or <pc> element, like so:

<l xml:id="c0abe-e140">
      <w lemma="and" pos="cc" xml:id="c0abe-001-a-0670">and</w>
      <w lemma="therein" pos="av" xml:id="c0abe-001-a-0680">therein</w>
      <w lemma="take" pos="vvb" xml:id="c0abe-001-a-0690">take</w>
      <w lemma="delight" pos="n1" xml:id="c0abe-001-a-0700">delight</w>
      <pc xml:id="c0abe-001-a-0710">:</pc>
     </l> 

I'd like to strip the annotation and generate a plain-text file in which the <w> and <pc> elements are replaced by just their content. Some years ago I used the strip_tags method, and I remember it working: the words in a line of text appeared on one line, with a space between them.
But when I use it now, it does strip the tags, but the text appears verticalized, with one token on each line. I've tried various methods of getting rid of the line breaks, but none of them work.

For instance, I used this bit of code to replace the line break with nothing:

    for element in tree.iter('*'):
        plaintext = element.text
        if plaintext is not None:
            plaintext = plaintext.replace('\n', '')

This does something: if I print out each element, the line breaks go away. But if I then print out the whole file, using 

    print(etree.tostring(tree, encoding='unicode', pretty_print=True))

the result is a verticalized text. Which means that the deletion of the line breaks has to happen at a later stage, but I don't know how to do that.

I'll be grateful for any advice.

Martin Mueller
Professor English and Classics
Northwestern University
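
One possible explanation, offered as a sketch rather than a confirmed diagnosis: in lxml the line break that follows each token is stored in that element's .tail (and the very first one in the parent's .text), not in the token's own .text. The loop above also only rebinds a local variable and never assigns the result back to the element, so the tree itself is left unchanged. The example below assumes the sample is not in a namespace (TEI files usually declare one, in which case the tag names would have to be namespace-qualified). The helper name strip_token_tags is hypothetical; etree.strip_tags is the existing lxml function.

```python
from lxml import etree

SAMPLE = """<l xml:id="c0abe-e140">
      <w lemma="and" pos="cc" xml:id="c0abe-001-a-0670">and</w>
      <w lemma="therein" pos="av" xml:id="c0abe-001-a-0680">therein</w>
      <w lemma="take" pos="vvb" xml:id="c0abe-001-a-0690">take</w>
      <w lemma="delight" pos="n1" xml:id="c0abe-001-a-0700">delight</w>
      <pc xml:id="c0abe-001-a-0710">:</pc>
     </l>"""


def strip_token_tags(root, token_tags=("w", "pc")):
    """Hypothetical helper: drop token markup, then collapse whitespace.

    strip_tags merges each removed element's text and tail into its
    parent, so the indentation and newlines between tokens end up
    inside the parent's text. Collapsing whitespace runs afterwards
    puts the tokens back on one line.
    """
    etree.strip_tags(root, *token_tags)
    for element in root.iter("*"):
        # Leave whitespace-only text alone so the layout between
        # line-level elements is not disturbed.
        if element.text and element.text.strip():
            element.text = " ".join(element.text.split())
    return root


root = etree.fromstring(SAMPLE)

# Before stripping: the newline lives in the tail of the first <w>.
first_tail = root[0].tail

strip_token_tags(root)
print(root.text)  # and therein take delight :
```

Note that this joins every token with a single space, so punctuation such as the colon is preceded by a space. Attaching <pc> content to the preceding word would need an extra step that is not shown here.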
