[Tutor] Removing Content from Lines....

Fri Mar 25 04:30:51 EDT 2016

Martin A. Brown wrote:

> 
> Greetings Sam,
> 
>>Hello,I am hoping you experts can help out a newbie.I have written
>>some python code to gather a bunch of sentences from many files.
>>These sentences contain the content:
>>
>>blah blah blah blah <uicontrol>1-up printing</uicontrol>blah blah blah
>>blah blah blah blah blah blah blah <uicontrol>Preset</uicontrol>blah blah
>>blah blah blah blah blah <uicontrol>Preset</uicontrol> blah blah blah
>>
>>What I want to do is remove the words before the <uicontrol> and
>>after the </uicontrol>. How do I do that in Python?  Thanks for any
>>and all help.
> 
> That looks like DocBook markup to me.
> 
> If you actually wish only to mangle the strings, you can do the
> following:
> 
>   def show_just_uicontrol_elements(fin):
>       for line in fin:
>           s = line.index('<uicontrol>')
>           e = line.index('</uicontrol>') + len('</uicontrol>')
>           print(line[s:e])
> 
> (Given your question, I'm assuming you can figure out how to open a
> file and read line by line in Python.)
> 
> If it's XML, though, you might consider more carefully Alan's
> admonitions and his suggestion of lxml, as a bit of a safer choice
> for handling XML data.
> 
>   def listtags(filename, tag=None):
>       if tag is None:
>           sys.exit("Provide a tag name to this function, e.g. uicontrol")
>       doc = lxml.etree.parse(filename)
>       sought = list(doc.getroot().iter(tag))
>       for element in sought:
>           print(element.tag, element.text)
> 
> If you were to call the above with:
> 
>   listtags('/path/to/the/data/file.xml', 'uicontrol')
> 
> You should see the name 'uicontrol' and the contents of each tag,
> stripped of all surrounding context.
> 
> The above snippet is really just an example to show you how easy it
> is (from a coding perspective) to use lxml.  You still have to make
> the investment to understand how lxml processes XML and what your
> data processing needs are.
> 
> Good luck in your explorations,

Another approach that uses XPath, also provided by lxml:

$ cat tmp.xml
<foo>
blah blah blah blah <uicontrol>1-up printing</uicontrol>blah blah blah blah 
blah blah blah blah
blah blah <uicontrol>Preset</uicontrol>blah blah blah blah
blah blah blah <uicontrol>Preset</uicontrol> blah blah blah
</foo>
$ python3
[...]
>>> from lxml import etree
>>> root = etree.parse("tmp.xml")
>>> root.xpath("//uicontrol/text()")
['1-up printing', 'Preset', 'Preset']

https://en.wikipedia.org/wiki/XPath