[Tutor] Iterating over a long list with regular expressions and changing each item?
Dan Liang
danliang20 at gmail.com
Mon May 4 03:59:23 CEST 2009
Hi tutors,
I am working on a file and need to replace each occurrence of a certain
label (part of speech tag in this case) by a number of sub-labels. The file
has the following format:
word1 \t Tag1
word2 \t Tag2
word3 \t Tag3
The tags are complex, and I want to split each one into its parts in a
tab-delimited fashion, so that each line looks like this:
word1 \t Tag1Part1 \t Tag1Part2 \t Tag1Part3
I searched online for a solution and found the code below, which uses a
dictionary that stores the tags I want to replace as keys and the sub-tags
as values. The problem is that it sometimes replaces tags that are not
surrounded by whitespace (that is, it also matches inside longer strings),
which I do not want. I also want each new sub-tag to be followed by a tab,
so that the new items I end up with in my file are tab-delimited. For this,
I put tabs between the items of each value in the dictionary. I have started
to think that this is not the best solution to the problem and that a script
using regular expressions might be better. Since I am new to Python, I
thought I should ask for your thoughts on the best solution. There are about
150 items I want to replace, and I do not know how to iterate over them with
regular expressions. Below is my previous code:
#!/usr/bin/python
import sys

# read the whole input file named on the command line
f = open(sys.argv[1])
readed = f.read()
f.close()

def replace_words(text, word_dic):
    # replace every occurrence of each key, as a plain substring, by its value
    for k, v in word_dic.items():
        text = text.replace(k, v)
    return text

# the dictionary has target_word:replacement_word pairs
word_dic = {
    'abbrev': 'abbrev null null',
    'adj': 'adj null null',
    'adv': 'adv null null',
    'case_def_acc': 'case_def acc null',
    'case_def_gen': 'case_def gen null',
    'case_def_nom': 'case_def nom null',
    'case_indef_acc': 'case_indef acc null',
    'verb_part': 'verb_part null null'}

# call the function and get the changed text
myString = replace_words(readed, word_dic)
fout = open(sys.argv[2], "w")
fout.write(myString)
fout.close()
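
For comparison, here is a rough sketch of the regular-expression direction I
have in mind. It is not something I have run on my real data. The names
build_tag_pattern and replace_tags are made up for illustration, the
dictionary is a three-entry sample, and writing the sub-tags with "\t"
between them is my assumption about the output format. The idea is to join
all keys into one alternation and to require whitespace (or the start or end
of the text) on both sides of a match:

```python
import re

# Sample of the tag table; tab-separated sub-tags are an assumption.
word_dic = {
    'adj': 'adj\tnull\tnull',
    'case_def_acc': 'case_def\tacc\tnull',
    'verb_part': 'verb_part\tnull\tnull',
}

def build_tag_pattern(word_dic):
    # Try longer keys first so a long tag wins over a shorter tag it starts with.
    keys = sorted(word_dic, key=len, reverse=True)
    alternation = '|'.join(re.escape(k) for k in keys)
    # (?<!\S) and (?!\S) allow a match only when the tag is delimited by
    # whitespace or by the start/end of the text.
    return re.compile(r'(?<!\S)(?:' + alternation + r')(?!\S)')

def replace_tags(text, word_dic):
    pattern = build_tag_pattern(word_dic)
    # One pass over the text; each matched tag is looked up in the dictionary.
    return pattern.sub(lambda m: word_dic[m.group(0)], text)

sample = "adjective\tadj\nword2\tcase_def_acc\nxadj\tverb_part\n"
result = replace_tags(sample, word_dic)
print(result)
```

With this, "adjective" and "xadj" in the word column would be left alone
because the tag text there is not delimited by whitespace, and the text is
scanned only once, so an already inserted replacement is not replaced again.
A word that is exactly equal to a tag would still be changed, so this is
only a partial answer.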
--dan