[Tutor] program that processes tokenized words in xml
Danny Yoo
dyoo@hkn.eecs.berkeley.edu
Tue May 6 15:49:01 2003
On Tue, 6 May 2003, Abdirizak abdi wrote:
> Hi everyone, I was working on a program that indexes a file that has a
> tokenizedwords such as the following: <S ID='S-0'>
> <W>Similarity-Based</W> <W>Estimation</W> <W>of</W> <W>Word</W> what my
> program needs to do is to index the words in between <W>...</W> I
> already set up class that reads the file line by line,Can anyone suggest
> how I can incorporate a regular expression for eliminating these tags?
> I have attached the program with this e-mail ....Please help and have a
> look thanks in advance
Hi Abdirizak,
Wow, you're doing a lot of language stuff nowadays! Very cool. Out
of curiosity: do you know of a good sparse matrix multiplication module?
Cameron's question on search engines a few days ago got me interested in
doing vector-based search engines. At the moment, I'm using pysparse:
http://www.inf.ethz.ch/personal/geus/pyfemax/pysparse.html
and I'm getting very awesome results... Yikes, I'm getting too excited
about this stuff. Sorry for going off topic! If anyone's interested, I
can do a small tutorial on a vector-based search engine in Python now, I
think. *grin*
Anyway, you can probably use the function re.findall() to grab all words
between '<W>' tags. Here's one way to do it:
###
>>> import re
>>> w_regex = re.compile(r'<W>(.+?)</W>')
>>> w_regex.findall('<W>similarity-based</W> <W>Estimation</W>')
['similarity-based', 'Estimation']
###
The trick here is to make sure the regular expression knows that it needs
to be "nongreedy". That is, if we give it something like:
<W>similarity-based</W> <W>Estimation</W>
we want to make sure that it does:
<W>similarity-based</W> <W>Estimation</W>
|---------------------| |---------------|
       (match 1)            (match 2)
and not,
<W>similarity-based</W> <W>Estimation</W>
|---------------------------------------|
                (match 1)
Compare the results above to the regular expression:
w_regex_broken = re.compile(r'<W>(.+)</W>')
and the idea of greedy-vs-nongreedy matching should make sense.
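To see the difference side by side, here is a minimal sketch that runs both patterns over the same line. The names line, nongreedy and greedy are only for illustration; they are not from the attached program:

```python
import re

line = '<W>similarity-based</W> <W>Estimation</W>'

# Nongreedy: .+? stops at the first </W> it can reach,
# so each <W>...</W> pair becomes its own match.
nongreedy = re.compile(r'<W>(.+?)</W>').findall(line)

# Greedy: .+ grabs as much as it can and only backs off to the
# last </W> on the line, so everything collapses into one match.
greedy = re.compile(r'<W>(.+)</W>').findall(line)

print(nongreedy)  # ['similarity-based', 'Estimation']
print(greedy)     # ['similarity-based</W> <W>Estimation']
```

The greedy version still "works" on a line with a single word, which is why this kind of bug can hide until a line with two or more <W> tags shows up.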
However, if we're guaranteed that our input is XML, I'd heavily recommend
looking into using an XML parser instead:
###
>>> import xml.dom.minidom
>>> dom = xml.dom.minidom.parseString('''
... <S ID='S-0'>
... <W>similarity-based</W> <W>Estimation</W>
... </S>''')
>>>
>>> dom
<xml.dom.minidom.Document instance at 0x8205834>
>>>
>>> all_word_nodes = dom.getElementsByTagName('W')
>>> all_word_nodes
[<DOM Element: W at 136824556>, <DOM Element: W at 136841844>]
>>>
>>> def getText(node):
...     text_nodes = [n for n in node.childNodes
...                   if n.nodeType == n.TEXT_NODE]
...     texts = [n.data for n in text_nodes]
...     return ''.join(texts)
...
>>> map(getText, all_word_nodes)
[u'similarity-based', u'Estimation']
###
... ok, part of this does look more complicated than the regular
expression stuff. *grin* But it might be worth learning how to use an XML
parser if you're planning to do any deep diving into the structure of your
documents.
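For anyone running Python 3, a rough equivalent of the minidom session might look like the following. It is only a sketch: get_text, words_in and sample are made-up names, and it assumes the whole document fits comfortably in one string.

```python
import xml.dom.minidom


def get_text(node):
    """Join the data of the text nodes sitting directly under node."""
    return ''.join(child.data for child in node.childNodes
                   if child.nodeType == child.TEXT_NODE)


def words_in(xml_text):
    """Return the text of every <W> element, in document order."""
    dom = xml.dom.minidom.parseString(xml_text)
    # A list comprehension gives a real list; in Python 3, map()
    # would return a lazy iterator instead.
    return [get_text(w) for w in dom.getElementsByTagName('W')]


# Hypothetical sample input, shaped like the fragment in the question.
sample = """<S ID='S-0'>
<W>similarity-based</W> <W>Estimation</W>
</S>"""

words = words_in(sample)
print(words)  # ['similarity-based', 'Estimation']
```

Note that the strings come back without the u'' prefix, since every str is Unicode in Python 3.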
Good luck to you!