[Tutor] about regular expression

Sat Mar 22 10:40:02 2003

--0-1009596418-1048347552=:3501
Content-Type: multipart/alternative; boundary="0-2055281175-1048347552=:3501"

--0-2055281175-1048347552=:3501
Content-Type: text/plain; charset=us-ascii

Hi everyone

Thanks, Anton, for your help.

I am working on a program that combines several regular expressions. Consider the following:

exp_token = re.compile(r"""
               ([-a-zA-Z0-9_]+|   # word characters
               [\"\'.\(),:!\?]|    # symbol characters
              <REF SELF='YES'>.*?</REF>)     
               """, re.VERBOSE )

The first two alternatives work fine for my program, but I am having a problem with the third one, <REF SELF='YES'>.*?</REF>

I want it to match this pattern as a single token: <REF SELF='YES'>Dagan et al. 1993</REF>
Because of the first two alternatives, the program tokenises it this way:

<W>REF</W>
<W>SELF</W>
<W>'</W>
<W>YES</W>
<W>'</W>
<W>Dagan</W>
<W>et</W>
<W>al</W>
<W>.</W>
<W>1993</W>
<W>REF</W>

instead of this:    <W> <REF SELF='YES'>Dagan et al. 1993</REF> </W>

In case someone wants to help and have a look, my program is attached to this e-mail to show what I am trying to achieve.

Thanks in advance.
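
One possible explanation, offered as a sketch rather than a confirmed diagnosis of the attached program: with re.VERBOSE, unescaped whitespace inside the pattern is ignored, so the space in <REF SELF='YES'> is dropped and the third alternative can never match the input. Escaping that space (written here as "\ ") keeps it. The names exp_token_fixed and sample below are made up for illustration; everything else is the pattern shown above.

```python
import re

# Sketch only: the same three alternatives as exp_token, in the same order.
# The single change is the escaped space in the REF alternative, because
# re.VERBOSE discards unescaped whitespace in the pattern.
exp_token_fixed = re.compile(r"""
    ([-a-zA-Z0-9_]+|               # word characters
    [\"\'.\(),:!\?]|               # symbol characters
    <REF\ SELF='YES'>.*?</REF>)    # whole self-reference element
    """, re.VERBOSE)

# Hypothetical input line containing one self-reference.
sample = "as in <REF SELF='YES'>Dagan et al. 1993</REF>, we use"

tokens = exp_token_fixed.findall(sample)
for token in tokens:
    print('<W>%s</W>' % token)
```

With the space escaped, the whole REF element should come back from findall() as one token, alongside the ordinary word and symbol tokens around it.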

--0-2055281175-1048347552=:3501--
--0-1009596418-1048347552=:3501
Content-Type: text/plain; name="Ass-01.py"
Content-Description: Ass-01.py
Content-Disposition: inline; filename="Ass-01.py"

import re

def markup(line, tag='W'):
   """ this function tags <w> and </w> to a text """

   exp_token = re.compile(r"""
               ([-a-zA-Z0-9_]+|   # word characters
               [\"\'.\(),:!\?]|    # symbol characters
               (?<=<EQN/>).*?(?=<EQN/>)| #matches <
              <REF SELF='YES'>.*?</REF>)     
               """, re.VERBOSE )

   spacing = " "
   result  = []

   #collect every token matched by the regular expression

   token_list = exp_token.findall(line)
   for token in token_list:
       #for testing purposes
       print('<%s>%s</%s>%s' %(tag,token,tag,spacing))
       result.append('<%s>%s</%s>%s' %(tag,token,tag,spacing) )

   return result

#---------------------------------------------------------------------------
def Process_file(input_file):
   """ this function takes file input and and process by calling
       another function called markup()  """

   token_list = []

   #open the file for processing
   infile = open(input_file)
   line = infile.readline()

   #scan each line and call the markup function to tag it
   while line:
      token_list += markup(line)
      line = infile.readline()
   return token_list
#------------------------------------------------------------------------------
#setup a regular expression for pattern matching
#first_match  = "[a-zA-Z]+"
#exp_one = re.compile(r"[a-zA-Z]+")
# exp_two = re.compile(r"[a-zA-Z]+\-[a-zA-Z]+")

#------------------------------------------------------------------------------

# this function tags every sentence with <S ID='S-X'> ... </S> after all
# tokens have been wrapped with <W>...</W>

#def TagSentence(text, sen_tag='S ID = \'S-', theTag= 'S'):
#    """ this function tags all sentences with <SID = 'S-0'> """

#   # get a list of lines removing the leading and trailing lines
#   lines = text.splitlines(1)
#   while lines and lines[-1].isspace():
#       lines.pop()
#   while lines and lines[0].isspace():
#       lines.pop(0)
#       
#   #initialize the line number for display    
#   SenNumb = 0    
#   #iterate through each sentence and wrap it in tags
#   for i in range(len(lines)):
#       lines[i] = '<%s%d\'>%s</%s>'%(sen_tag,SenNumb,lines[i],theTag)
#      #increment
#       SenNumb +=1 
#       # join the text   
#       text =''.join(lines)
#
#   return text

if __name__ == '__main__':
    import sys
    for arg in sys.argv[1:]:
        Process_file(arg)

--0-1009596418-1048347552=:3501--