Complicated string substitution
Tim Chase
python.list at tim.thechases.com
Wed Feb 13 20:16:31 EST 2008
> I have a file with a lot of the following occurrences:
>
> denmark.handa.1-10
> denmark.handa.1-12344
> denmark.handa.1-4
> denmark.handa.1-56
Each on its own line? Scattered throughout the text? With other
content that needs to be un-changed? With other stuff on the
same line?
> denmark.handa.1-10_1
> denmark.handa.1-12344_1
> denmark.handa.1-4_1
> denmark.handa.1-56_1
>
> so basically I add "_1" at the end of each occurrence.
>
> I thought about using sed, but as each "root" is different I have no
> clue how to go through this.
How are the roots different? Do they all begin with
"denmark.handa."? Or can they be found by a pattern of "stuff
period stuff period number dash number"?
A couple sed solutions, since you considered them first:
  sed '/denmark\.handa/s/$/_1/'
  sed -E 's/denmark\.handa\.[0-9]+-[0-9]+/&_1/g'
  sed -E 's/[a-z]+\.[a-z]+\.[0-9]+-[0-9]+/&_1/g'
Or are you just looking for "number dash number" and want to
suffix the "_1"?
  sed -E 's/[0-9]+-[0-9]+/&_1/g'
Most of the sed versions translate pretty readily into Python
regexps in the .sub() call.
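  import re

  # "stuff period stuff period number dash number"
  r = re.compile(r'[a-z]+\.[a-z]+\.\d+-\d+')
  with open('in.txt') as infile, open('out.txt', 'w') as out:
      for line in infile:
          # \g<0> is the whole match; append "_1" to it
          out.write(r.sub(r'\g<0>_1', line))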
Tweak the regexps accordingly.
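For instance, here is a rough sketch of how the pattern-based sed
variants might look side by side in Python. The PATTERNS dict, the
add_suffix() helper and the sample text are made-up names and data for
illustration, not anything from your file, so adjust to taste:

```python
import re

# Hypothetical names; each pattern mirrors one of the sed variants.
PATTERNS = {
    'fixed_root': re.compile(r'denmark\.handa\.\d+-\d+'),
    'any_root': re.compile(r'[a-z]+\.[a-z]+\.\d+-\d+'),
    'numbers_only': re.compile(r'\d+-\d+'),
}

def add_suffix(text, pattern, suffix='_1'):
    """Append suffix to every match of pattern in text."""
    # A function replacement avoids escaping backslashes in suffix.
    return pattern.sub(lambda m: m.group(0) + suffix, text)

# Made-up sample input: one known root, one different root.
sample = 'denmark.handa.1-10\nsweden.volvo.2-7 and other text\n'
for name, pattern in PATTERNS.items():
    print(name, repr(add_suffix(sample, pattern)))
```

The stricter the pattern, the less likely it is to tag something you
wanted left alone, so start with the tightest one that covers your data.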
-tkc