[Tutor] Removing Content from Lines....
Peter Otten
__peter__ at web.de
Fri Mar 25 04:30:51 EDT 2016
Martin A. Brown wrote:
>
> Greetings Sam,
>
>>Hello,I am hoping you experts can help out a newbie.I have written
>>some python code to gather a bunch of sentences from many files.
>>These sentences contain the content:
>>
>>blah blah blah blah <uicontrol>1-up printing</uicontrol>blah blah blah
>>blah blah blah blah blah blah blah <uicontrol>Preset</uicontrol>blah blah
>>blah blah blah blah blah <uicontrol>Preset</uicontrol> blah blah blah
>>
>>What I want to do is remove the words before the <uicontrol> and
>>after the </uicontrol>. How do I do that in Python? Thanks for any
>>and all help.
>
> That looks like DocBook markup to me.
>
> If you actually wish only to mangle the strings, you can do the
> following:
>
> def show_just_uicontrol_elements(fin):
> for line in fin:
> s = line.index('<uicontrol>')
> e = line.index('</uicontrol>') + len('</uicontrol>')
> print(line[s:e])
>
> (Given your question, I'm assuming you can figure out how to open a
> file and read line by line in Python.)
>
> If it's XML, though, you might consider more carefully Alan's
> admonitions and his suggestion of lxml, as a bit of a safer choice
> for handling XML data.
>
> def listtags(filename, tag=None):
> if tag is None:
> sys.exit("Provide a tag name to this function, e.g. uicontrol")
> doc = lxml.etree.parse(filename)
> sought = list(doc.getroot().iter(tag))
> for element in sought:
> print(element.tag, element.text)
>
> If you were to call the above with:
>
> listtags('/path/to/the/data/file.xml', 'uicontrol')
>
> You should see the name 'uicontrol' and the contents of each tag,
> stripped of all surrounding context.
>
> The above snippet is really just an example to show you how easy it
> is (from a coding perspective) to use lxml. You still have to make
> the investment to understand how lxml processes XML and what your
> data processing needs are.
>
> Good luck in your explorations,
Another approach that uses XPath, also provided by lxml:
$ cat tmp.xml
<foo>
blah blah blah blah <uicontrol>1-up printing</uicontrol>blah blah blah blah
blah blah blah blah
blah blah <uicontrol>Preset</uicontrol>blah blah blah blah
blah blah blah <uicontrol>Preset</uicontrol> blah blah blah
</foo>
$ python3
[...]
>>> from lxml import etree
>>> root = etree.parse("tmp.xml")
>>> root.xpath("//uicontrol/text()")
['1-up printing', 'Preset', 'Preset']
https://en.wikipedia.org/wiki/XPath
More information about the Tutor
mailing list