[Tutor] about regular expression
Abdirizak abdi
a_abdi406@yahoo.com
Sat Mar 22 10:40:02 2003
Hi everyone,

Thanks, Anton, for your help.

I am working on a program that combines several regular expressions. Consider the following:

exp_token = re.compile(r"""
    ([-a-zA-Z0-9_]+|              # character set
    [\"\'.\(),:!\?]|              # symbol characters
    <REF SELF='YES'>.*?</REF>)
    """, re.VERBOSE)

The first two alternatives work fine in my program, but I am having a problem with the third, <REF SELF='YES'>.*?</REF>, which I want to match text such as <REF SELF='YES'>Dagan et al. 1993</REF>.
I think the first two expressions are causing the program to tokenise it this way:
<W>REF</W>
<W>SELF</W>
<W>'</W>
<W>YES</W>
<W>'</W>
<W>Dagan</W>
<W>et</W>
<W>al</W>
<W>.</W>
<W>1993</W>
<W>REF</W>
instead of this:
<W> <REF SELF='YES'>Dagan et al. 1993</REF> </W>

In case someone wants to help and have a look, my program is attached to this e-mail so you can see what I am trying to achieve.

Thanks in advance.
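
One thing that may be worth checking, offered as a sketch rather than a confirmed diagnosis of the attached program: under re.VERBOSE, unescaped whitespace in the pattern is ignored, so the space in <REF SELF='YES'> is dropped and the third alternative effectively looks for <REFSELF='YES'>. Escaping the space (or writing it as [ ] or \s) keeps it in the pattern. The name exp_token_fixed and the test line below are assumptions made for illustration only.

```python
import re

# The pattern as posted: under re.VERBOSE the unescaped space in
# "<REF SELF" is ignored, so the third alternative never matches.
exp_token = re.compile(r"""
    ([-a-zA-Z0-9_]+|                # words and numbers
    [\"\'.\(),:!\?]|                # punctuation symbols
    <REF SELF='YES'>.*?</REF>)
    """, re.VERBOSE)

# Hypothetical variant (exp_token_fixed is an illustrative name):
# the space is escaped, and the REF alternative is listed first.
exp_token_fixed = re.compile(r"""
    (<REF\ SELF='YES'>.*?</REF>|    # whole REF element as one token
    [-a-zA-Z0-9_]+|                 # words and numbers
    [\"\'.\(),:!\?])                # punctuation symbols
    """, re.VERBOSE)

line = "<REF SELF='YES'>Dagan et al. 1993</REF>"
print(exp_token.findall(line))        # REF element is split into pieces
print(exp_token_fixed.findall(line))  # REF element comes back whole
```

Listing the REF alternative first is a readability choice here: neither of the other two alternatives can match a "<", so the escaped space alone should be enough to change the result.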
Attachment: Ass-01.py
import re


def markup(line, tag='W'):
    """Wrap every token found in line in <tag>...</tag> markup."""
    exp_token = re.compile(r"""
        ([-a-zA-Z0-9_]+|              # character set
        [\"\'.\(),:!\?]|              # symbol characters
        (?<=<EQN/>).*?(?=<EQN/>)|     # text between two <EQN/> markers
        <REF SELF='YES'>.*?</REF>)
        """, re.VERBOSE)
    spacing = " "
    result = []
    # collect every substring matched by the regular expression
    token_list = exp_token.findall(line)
    for token in token_list:
        tagged = '<%s>%s</%s>%s' % (tag, token, tag, spacing)
        # for testing purposes
        print(tagged)
        result.append(tagged)
    return result

#---------------------------------------------------------------------------
def Process_file(input_file):
    """Read input_file line by line and mark up each line
    by calling markup()."""
    token_list = []
    # open the file for processing; it is closed when the block ends
    with open(input_file) as infile:
        # scan the file and call markup() on each line
        for line in infile:
            token_list += markup(line)
    return token_list
#------------------------------------------------------------------------------
#setup a regular expression for pattern matching
#first_match = "[a-zA-Z]+"
#exp_one = re.compile(r"[a-zA-Z]+")
# exp_two = re.compile(r"[a-zA-Z]+\-[a-zA-Z]+")
#------------------------------------------------------------------------------
# this function tags every sentence with <S ID='S-X'> ... </S> after all
# tokens have been wrapped with <W>...</W>
#def TagSentence(text, sen_tag='S ID = \'S-', theTag='S'):
#    """ this function tags all sentences with <S ID = 'S-0'> """
#    # get a list of lines, removing the leading and trailing blank lines
#    lines = text.splitlines(1)
#    while lines and lines[-1].isspace():
#        lines.pop()
#    while lines and lines[0].isspace():
#        lines.pop(0)
#
#    # initialize the sentence number for display
#    SenNumb = 0
#    # iterate through each sentence and tag it
#    for i in range(len(lines)):
#        lines[i] = '<%s%d\'>%s</%s>' % (sen_tag, SenNumb, lines[i], theTag)
#        # increment
#        SenNumb += 1
#    # join the text
#    text = ''.join(lines)
#
#    return text
if __name__ == '__main__':
    import sys
    for arg in sys.argv[1:]:
        Process_file(arg)