[Tutor] about Regular Expression in Class function

Abdirizak abdi a_abdi406@yahoo.com
Wed Apr 30 17:09:51 2003


Hi,

I have a class that indexes some text. The class has a method called readline() that reads a line from a file. I need to apply an RE (regular expression) in it so that only normal words are extracted, because the input file (i.e. the file to be read) contains some XML tags. For example, the file has this:

<S ID='S-0'> <W>Similarity-Based</W> <W>Estimation</W> <W>of</W> <W>Word</W> <W>Cooccurrence</W> <W>Probabilities</W> </S>

and I want to extract ['Similarity based', 'Estimation'.....]

I think the problem is that it doesn't accept class instance attributes such as self.line:

    aword = re.compile(r'<W>([^<]+)</W>')
    self.line = aword.findall(self.line)

Can anyone suggest how I can incorporate this RE? Here is the function:

class text_word_iterator:
    ......
    def readline(self):
        # aword = re.compile(r'<[^<>]*>|\b[\w-]+\b')  # |<W>([^<]+)</W>')  # added now
        aword = re.compile(r'<W>([^<]+)</W>')  # added now
        self.line = self.file.readline()  # already there
        self.line = aword.findall(self.line)  # added now for XML; should work
        self.line = ' '.join(self.line)  # this is an extra line added; works
        print(self.line)

        if self.line == '': self.words = None
        else: self.words = filter(None, re.split(r'\W+', self.line))  # mod from \W -> \s+
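
For reference, here is a minimal, self-contained sketch of the same idea, assuming Python 3. The TextWordIterator class and the io.StringIO sample are hypothetical stand-ins for the real class and input file, not the actual indexing code. It also checks for end of file before stripping the tags, since a line that contains no <W> elements would otherwise collapse to '' and be mistaken for the end of the file.

```python
import io
import re

# Captures the text inside each <W>...</W> element (pattern from the post).
W_TAG = re.compile(r'<W>([^<]+)</W>')


class TextWordIterator:
    """Hypothetical stand-in for the text_word_iterator class in the post."""

    def __init__(self, file):
        self.file = file
        self.line = ''
        self.words = None

    def readline(self):
        raw = self.file.readline()
        if raw == '':
            # readline() returns an empty string only at end of file.
            self.line = ''
            self.words = None
            return
        # Keep only the tagged words, then split them on non-word characters.
        self.line = ' '.join(W_TAG.findall(raw))
        self.words = [w for w in re.split(r'\W+', self.line) if w]


# Assumed sample input, shortened from the example line in the post.
sample = ("<S ID='S-0'> <W>Similarity-Based</W> <W>Estimation</W> "
          "<W>of</W> <W>Word</W> </S>\n")
it = TextWordIterator(io.StringIO(sample))
it.readline()
print(it.words)
```

Note that splitting on \W+ breaks 'Similarity-Based' into two words; using the result of findall() directly would keep the hyphenated form in one piece.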

Thanks in advance.
