[Tutor] program that processes tokenized words in xml

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Tue May 6 15:49:01 2003


On Tue, 6 May 2003, Abdirizak abdi wrote:

> Hi everyone, I was working on a program that indexes a file that has a
> tokenizedwords such as the following: <S ID='S-0'>
> <W>Similarity-Based</W> <W>Estimation</W> <W>of</W> <W>Word</W> what my
> program needs to do is to index the words in between <W>...</W> I
> already set up class that reads the file line by line,Can anyone suggest
> how I can incorporate a regular expression for eliminating these tags?
> I have attached the program with this e-mail ....Please help and have a
> look thanks in advance


Hi Abdirizak,


Wow, you're doing a lot of language stuff nowadays!  Very cool.  Out
of curiosity: do you know of a good sparse matrix multiplication module?
Cameron's question on search engines a few days ago got me interested in
doing vector-based search engines.  At the moment, I'm using pysparse:

    http://www.inf.ethz.ch/personal/geus/pyfemax/pysparse.html

and I'm getting very awesome results... Yikes, I'm getting too excited
about this stuff.  Sorry for going off topic!  If anyone's interested, I
can do a small tutorial on a vector-based search engine in Python now, I
think.  *grin*



Anyway, you can probably use the function re.findall() to grab all words
between '<W>' tags.  Here's one way to do it:

###
>>> import re
>>> w_regex = re.compile(r'<W>(.+?)</W>')
>>> w_regex.findall('<W>similarity-based</W> <W>Estimation</W>')
['similarity-based', 'Estimation']
###
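
Since your program already reads the file line by line, the same compiled
pattern can be applied to each line in turn.  Here's a rough sketch of that
idea.  I'm assuming that "index" means a mapping from each word to the line
numbers it appears on; index_words() and sample_lines are names I made up
for illustration, not anything from your attached program:

```python
import re
from collections import defaultdict

# Nongreedy pattern: capture the text inside each <W>...</W> pair.
W_REGEX = re.compile(r'<W>(.+?)</W>')


def index_words(lines):
    """Map each word found inside <W>...</W> to the line numbers it is on.

    Hypothetical helper: 'lines' can be any iterable of strings, such as
    an open file object.
    """
    index = defaultdict(list)
    for line_number, line in enumerate(lines, start=1):
        for word in W_REGEX.findall(line):
            index[word].append(line_number)
    return dict(index)


# Stand-in for the lines of the real file.
sample_lines = [
    "<S ID='S-0'>",
    "<W>Similarity-Based</W> <W>Estimation</W> <W>of</W> <W>Word</W>",
]
word_index = index_words(sample_lines)
print(word_index)
```

Passing an open file object instead of sample_lines should work the same
way, since iterating over a file yields one line at a time.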


The trick here is to make sure the regular expression knows that it needs
to be "nongreedy".  That is, if we give it something like:

    <W>similarity-based</W> <W>Estimation</W>



we want to make sure that it matches like this:

    <W>similarity-based</W> <W>Estimation</W>
    |---------------------| |---------------|
          (match 1)              (match 2)


and not like this:

    <W>similarity-based</W> <W>Estimation</W>
    |---------------------------------------|
                    (match 1)


Compare the results above to what you get from the greedy version of the
regular expression:

    w_regex_broken = re.compile(r'<W>(.+)</W>')

and the idea of greedy-vs-nongreedy matching should make sense.
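
If you want to see the difference for yourself, here's a small sketch that
runs both patterns over the same line.  The variable names are just ones I
picked for illustration:

```python
import re

line = '<W>similarity-based</W> <W>Estimation</W>'

# Nongreedy: '.+?' stops at the first '</W>' it can reach.
nongreedy = re.compile(r'<W>(.+?)</W>')
# Greedy: '.+' grabs as much as possible, backing off only to the last '</W>'.
greedy = re.compile(r'<W>(.+)</W>')

nongreedy_words = nongreedy.findall(line)
greedy_words = greedy.findall(line)

print(nongreedy_words)  # ['similarity-based', 'Estimation']
print(greedy_words)     # ['similarity-based</W> <W>Estimation']
```

The greedy pattern returns a single string that swallows the inner tags,
which is exactly the "one big match" case drawn above.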




However, if we're guaranteed that our input is XML, I'd heavily recommend
looking into using an XML parser instead:

###
>>> import xml.dom.minidom
>>> dom = xml.dom.minidom.parseString('''
... <S ID='S-0'>
... <W>similarity-based</W> <W>Estimation</W>
... </S>''')
>>>
>>> dom
<xml.dom.minidom.Document instance at 0x8205834>
>>>
>>> all_word_nodes = dom.getElementsByTagName('W')
>>> all_word_nodes
[<DOM Element: W at 136824556>, <DOM Element: W at 136841844>]
>>>
>>> def getText(node):
...     text_nodes = [n for n in node.childNodes
...                   if n.nodeType == n.TEXT_NODE]
...     texts = [n.data for n in text_nodes]
...     return ''.join(texts)
...
>>> map(getText, all_word_nodes)
[u'similarity-based', u'Estimation']
###
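
One note on the session above: it comes from an older Python.  On Python 3,
map() returns a lazy iterator instead of a list, and strings no longer show
the u'' prefix.  Here's a sketch of the same idea written as a script for
Python 3, assuming the same sample input; get_text() is just a renamed
version of the helper above:

```python
import xml.dom.minidom

dom = xml.dom.minidom.parseString("""<S ID='S-0'>
<W>similarity-based</W> <W>Estimation</W>
</S>""")


def get_text(node):
    """Join the data of all text nodes directly under 'node'."""
    return ''.join(child.data for child in node.childNodes
                   if child.nodeType == child.TEXT_NODE)


# A list comprehension builds the list right away, unlike map() on Python 3.
words = [get_text(node) for node in dom.getElementsByTagName('W')]
print(words)  # ['similarity-based', 'Estimation']
```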


... ok, part of this does look more complicated than the regular
expression stuff.  *grin* But it might be worth learning how to use an XML
parser if you're planning to do any deep diving into the structure of your
documents.


Good luck to you!