[Tutor] Iterating over a long list with regular expressions and changing each item?

Dan Liang danliang20 at gmail.com
Mon May 4 03:59:23 CEST 2009


Hi tutors,

I am working on a file and need to replace each occurrence of a certain
label (part of speech tag in this case) by a number of sub-labels. The file
has the following format:

word1  \t    Tag1
word2  \t    Tag2
word3  \t    Tag3

Now the tags are complex and I wanted to split them in a tab-delimited
fashion to have this:

word1   \t   Tag1Part1   \t   Tag1Part2   \t   Tag1Part3
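
To make the goal concrete, here is a minimal sketch of the transformation
for a single line. The helper name split_tag_line and the small sub_tags
table are made up for illustration; I am assuming each line holds exactly
one tab and that every tag expands to three columns padded with "null".

```python
# Hypothetical helper: expand the tag column of one "word<TAB>tag" line.
# sub_tags is an illustrative table, not the full set of ~150 tags.
sub_tags = {
    "adj": ["adj", "null", "null"],
    "case_def_acc": ["case_def", "acc", "null"],
}

def split_tag_line(line, table):
    """Return the line with its tag replaced by tab-separated sub-tags."""
    # Assumes exactly one tab per line (word, then tag).
    word, tag = line.rstrip("\n").split("\t")
    # Strip padding around both fields; unknown tags are left unchanged.
    word, tag = word.strip(), tag.strip()
    parts = table.get(tag, [tag])
    return "\t".join([word] + parts)

print(split_tag_line("word1\tcase_def_acc\n", sub_tags))
```

Because the tag is looked up as a whole field, a tag that is only part of a
longer string is never touched.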

I searched online and found the code below, which uses a dictionary whose
keys are the tags I want to replace and whose values are the sub-tags. The
problem is that it sometimes replaces tags that are not surrounded by
whitespace (that is, tags that are only part of a longer string), which I do
not want. I also want each sub-tag to be followed by a tab so that the new
items in my file are tab-delimited; for this, I put tabs between the items
in each value of the dictionary. I have started to think this is not the
best solution, and that a script using regular expressions might be better.
Since I am new to Python, I thought I should ask for your thoughts on the
best approach. There are about 150 items to replace, and I do not know how
to iterate over them with regular expressions. Below is my previous code:


#!/usr/bin/python

import sys

# read the whole input file into one string
with open(sys.argv[1]) as f:
    readed = f.read()

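def replace_words(text, word_dic):
    # replace every occurrence of each key with its value;
    # note that str.replace matches substrings, not whole tags
    for k, v in word_dic.items():
        text = text.replace(k, v)
    return text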

# the dictionary has target_tag: replacement pairs;
# the sub-tags in each replacement are separated by tabs

word_dic = {
'abbrev': 'abbrev\tnull\tnull',
'adj': 'adj\tnull\tnull',
'adv': 'adv\tnull\tnull',
'case_def_acc': 'case_def\tacc\tnull',
'case_def_gen': 'case_def\tgen\tnull',
'case_def_nom': 'case_def\tnom\tnull',
'case_indef_acc': 'case_indef\tacc\tnull',
'verb_part': 'verb_part\tnull\tnull'}


# call the function and get the changed text

myString = replace_words(readed, word_dic)


fout = open(sys.argv[2], "w")
fout.write(myString)
fout.close()
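
For comparison, this is roughly what I imagine a regular-expression version
would look like: one pattern built from all the dictionary keys, matching a
tag only when it is not glued to other non-space characters. It is only a
sketch and I have not run it on the real file; the names build_tag_pattern
and replace_tags are my own inventions, and I am assuming that tags never
contain whitespace.

```python
import re

def build_tag_pattern(word_dic):
    """Compile one alternation that matches any key as a whole token."""
    # Longest keys first; the lookahead below already rejects partial
    # matches, so this ordering only reduces backtracking.
    keys = sorted(word_dic, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in keys)
    # (?<!\S) and (?!\S): the match must be bounded by whitespace
    # or by the start/end of the text.
    return re.compile(r"(?<!\S)(?:%s)(?!\S)" % alternation)

def replace_tags(text, word_dic):
    """Replace every whole-token tag in a single pass over the text."""
    pattern = build_tag_pattern(word_dic)
    return pattern.sub(lambda m: word_dic[m.group(0)], text)

# Small made-up sample; "adjective" must stay untouched.
sample = "word1\tadj\nword2\tcase_def_acc\nadjective\tadv\n"
tags = {"adj": "adj\tnull\tnull",
        "adv": "adv\tnull\tnull",
        "case_def_acc": "case_def\tacc\tnull"}
print(replace_tags(sample, tags))
```

Since the substitution happens in one pass, text that has already been
replaced is not scanned again, so one replacement cannot be altered by a
later key the way it can with repeated str.replace calls.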

--dan