help with search & replace

John Machin sjmachin at lexicon.net
Thu Dec 12 16:16:50 EST 2002


Bill Blue <billblue at cham.com> wrote in message news:<7u6hvu8as4laud5jpp9qapi2kopj1dlsif at 4ax.com>...
> I have a text file that has numbers in many places throughout the file,
> e.g. "20:22:12".  I need to be able to create a new file and make
> changes only to those numbers.  I used this to find them:
> 
> import re,string,sys

You don't use the string module, so don't import it.
> 
> if = open("infile.txt","r")

Don't use keywords like "if" as variable names; "if = ..." is a syntax
error.

> of = open("outfile.txt","w")
> for line in if.xreadlines():
> 	fs = re.compile(r"\d+:\d+:\d+").findall(line)  # there may be more than one in a line.
> 	if fs:
> 		for num in fs:
> 			of.write(line.replace(num,num[0:5]+"00")) # the replace text is not always the same
> 	else:
> 		of.write(line)  # for line where no match is found
> 
> The above works if there is only 1 match or number on a line; if there
> is more than one then I get multiple lines in the output file.  In
> most cases there may be 3-6 matches on one line.
> 
> input line is:
> another time  01:12:36 the normal time for these is  01:12:45
> 
> here is the busted output:
> 
> >another time  01:12:00 the normal time for these is  01:12:45  
> >another time  01:12:36  the normal time for these is  01:12:00 
> 
> I would like to be able to replace all occurrences in one line without
> generating multiple output lines.
> 
> 
>  Any help on this would be appreciated.
> 

Here is one solution, which addresses the immediate programming
problem of writing lines inside the innermost loop:
pat = re.compile(r"\d+:\d+:\d+")
for line in open("infile.txt"):
   for num in pat.findall(line):
      # num[0:6] keeps "hh:mm:"; num[0:5] would drop the second colon
      line = line.replace(num,num[0:6]+"00")
   of.write(line)

Warning: if the input contains, say, "1234:4567:8901" (or "1:2:3"), then
your pattern will match it but the replace() will give "1234:400"
(resp. "1:2:300") -- this is probably not what you want. If, as it
appears, you are changing seconds to "00" in hh:mm:ss timestamps, then
use \d\d instead of \d+ for each component.
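
To make the warning concrete, here is a small sketch for current Python
that runs the findall()/replace() loop with both patterns. The names
loose, strict and zero_seconds_by_slice are made up for illustration,
and the slice is assumed to be num[0:6] so that the colon after the
minutes survives in a well-formed hh:mm:ss value.

```python
import re

loose = re.compile(r"\d+:\d+:\d+")       # any run of digits per component
strict = re.compile(r"\d\d:\d\d:\d\d")   # exactly two digits per component

def zero_seconds_by_slice(line, pat):
    """findall() + replace(); the caller writes the line once, afterwards."""
    for num in pat.findall(line):
        # Assumption: keep the first six characters ("hh:mm:") and append "00".
        line = line.replace(num, num[0:6] + "00")
    return line

# Compare what each pattern does to well-formed and malformed input.
for sample in ("01:12:36 and 01:12:45", "1234:4567:8901", "1:2:3"):
    print(repr(zero_seconds_by_slice(sample, loose)),
          repr(zero_seconds_by_slice(sample, strict)))
```

With the strict pattern the two malformed samples are simply left alone,
because nothing in them matches.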

However findall() followed by multiple calls to replace() is neither
succinct nor robust -- the replace goes searching through the input
string again and in general there is no guarantee that it won't find
(and mangle) a substring that isn't in the list returned by findall()
but has been created by an earlier call to replace().
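
A contrived sketch of that failure, for current Python (the function
names and the sample line are made up for illustration): the same short
value appears twice, so the second replace() call finds it again inside
text that the first call produced. A single sub() call with a function
as the replacement rewrites each match once and never rescans its own
output.

```python
import re

pat = re.compile(r"\d+:\d+:\d+")

def two_step(line):
    """findall() followed by one replace() call per match."""
    for num in pat.findall(line):
        line = line.replace(num, num[0:6] + "00")
    return line

def single_pass(line):
    """One sub() call; the function receives each match object in turn."""
    return pat.sub(lambda m: m.group()[0:6] + "00", line)

line = "1:2:3 1:2:3"
print(two_step(line))     # 1:2:30000 1:2:30000
print(single_pass(line))  # 1:2:300 1:2:300
```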

Read these topics in the re documentation:
(1) re.sub(pattern, replacement, input_string)
and pattern_object.sub(replacement, input_string)
(2) pattern grouping using parentheses
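Then you can do something like this (writing the group reference as
\g<1> keeps it separate from the literal "00" that follows; a bare
\100 can be read as an octal escape instead):
>>> import re
>>> x = re.compile(r"(\d\d:\d\d:)\d\d")
>>> x.sub(r"\g<1>00", "12:34:56 blah blah 98:76:54 foo bar 23:45:6z")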
'12:34:00 blah blah 98:76:00 foo bar 23:45:6z'
>>>
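
For a whole file, the same substitution might be wrapped up roughly as
follows in current Python. This is only a sketch: TIMESTAMP,
zero_seconds and convert are hypothetical names, and io.StringIO stands
in for the real input and output files so that the example runs on its
own. The replacement is spelled \g<1>00 so the group reference cannot
run into the digits that follow it.

```python
import io
import re

# Group 1 captures "hh:mm:"; the trailing \d\d (the seconds) is dropped.
TIMESTAMP = re.compile(r"(\d\d:\d\d:)\d\d")

def zero_seconds(line):
    """Return line with the seconds of every hh:mm:ss value set to 00."""
    return TIMESTAMP.sub(r"\g<1>00", line)

def convert(infile, outfile):
    """Copy infile to outfile, writing each line exactly once."""
    for line in infile:
        outfile.write(zero_seconds(line))

# In-memory stand-ins for open("infile.txt") and open("outfile.txt", "w").
src = io.StringIO("another time  01:12:36 the normal time for these is  01:12:45\n")
dst = io.StringIO()
convert(src, dst)
print(dst.getvalue(), end="")
```

Because sub() handles every match on the line in one call, there is no
inner loop and so no chance of writing the same line more than once.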

Alternatively (for something that simple), use gawk ...

C:\junk>type input.txt
12:34:56 blah blah 98:76:54 foo bar 23:45:6z
12:34:56 blah blah 98:76:54 foo bar 23:45:65
C:\junk>gawk <input.txt "{print gensub(\"([0-9][0-9]:[0-9][0-9]:)[0-9][0-9]\",\"\\100\",\"g\")}"
12:34:00 blah blah 98:76:00 foo bar 23:45:6z
12:34:00 blah blah 98:76:00 foo bar 23:45:00


