Counting and replacing strings in a text file

Greg Jorgensen gregj at pobox.com
Wed Jan 24 14:58:02 EST 2001


In article <94mans01r5p at news2.newsguy.com>,
  "Alex Martelli" <aleaxit at yahoo.com> wrote:
> ...
> I like this general approach better than the RE-based
> ones; however, building up the 'result' string by
> successive concatenations is apt to be pretty slow --
> remember the typical start string was said to be over
> 120k and to contain 'several' occurrences of '%id%'.
>
> In general, building a big string by successive + or
> += of many small pieces is O(N squared).

I follow the K&R maxim: keep it simple, make it work first, and if
it's too slow, speed up the slow parts. If this is a one-off,
performance is probably of little concern. In my experience using
Python for text processing on lots of little files and a few big ones,
I haven't seen any performance problems worth worrying about. But your
analysis is correct: successive concatenation of chunks of text is not
the most efficient technique, and there is plenty of room for speeding
up that bit if necessary. Pre-allocating the result string is probably
the best and easiest optimization. In this case the maximum size of
the result string can be calculated in advance:

  n = len(original_string)
  x = original_string.count("%id%")  # number of occurrences of %id%
  resultlen = n - (x * len("%id%")) + (x * len(str(x)))
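A quick sanity check of that upper bound, using a made-up input (the
100-character padding and the 12 occurrences are hypothetical):

```python
original_string = ("x" * 100) + ("%id%" * 12)  # hypothetical test input

n = len(original_string)                 # 148 characters
x = original_string.count("%id%")        # 12 occurrences
resultlen = n - (x * len("%id%")) + (x * len(str(x)))
# Worst case assumes every replacement is as wide as str(x) ("12"):
# 148 - 12*4 + 12*2 = 124

# The real result (numbers 1..12 spend 9*1 + 3*2 = 15 characters)
# fits comfortably under that bound:
actual = 100 + sum(len(str(i)) for i in range(1, x + 1))  # 115
```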

Then just use rstrip() on the result to remove the trailing spaces. Or
get the exact length:

  m = 0
  for i in range(1, x+1):
      m += len(str(i))

  resultlen = n - (x * len("%id%")) + m

But I'm assuming that allocating the maximum length, then doing one
rstrip() on the result string is faster than iterating over the list
just to get the exact length.
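For what it's worth, the replacement itself (numbering each %id% in
turn) can be sketched with split and join, which also avoids the
quadratic concatenation mentioned above; number_ids is a made-up name,
not anything from the original poster's code:

```python
def number_ids(text, token="%id%"):
    # Replace the i-th occurrence of token with str(i), counting from 1.
    # Splitting on the token yields one more piece than there are
    # occurrences; interleave the numbers between the pieces.
    parts = text.split(token)
    pieces = [parts[0]]
    for i, part in enumerate(parts[1:], 1):
        pieces.append(str(i))
        pieces.append(part)
    return "".join(pieces)
```

Collecting the pieces in a list and joining once at the end keeps the
whole operation linear in the size of the input.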

The mathematically inclined can calculate the exact number of
characters in the sequence 1..x without iterating over the list.
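One such closed form groups the numbers 1..x by digit length: the
d-digit numbers run from 10**(d-1) to 10**d - 1, so each full group
contributes 9 * 10**(d-1) * d characters (digits_in_sequence is a
made-up helper name):

```python
def digits_in_sequence(x):
    # Total characters in str(1) + str(2) + ... + str(x),
    # computed by digit-length group rather than by iterating 1..x.
    total, d = 0, 1
    while 10 ** (d - 1) <= x:
        lo = 10 ** (d - 1)          # smallest d-digit number
        hi = min(x, 10 ** d - 1)    # largest d-digit number <= x
        total += (hi - lo + 1) * d
        d += 1
    return total
```

For example, 1..99 costs 9 one-digit numbers plus 90 two-digit
numbers, or 9 + 180 = 189 characters.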

--
Greg Jorgensen
Portland, Oregon, USA
gregj at pobox.com


