doing hundreds of re.subs efficiently on large strings
Anders J. Munch
andersjm at dancontrol.dk
Wed Mar 26 06:13:52 EST 2003
"nihilo" <exnihilo at NOmyrealCAPSbox.com> wrote:
> I have a simple application that converts a file in one format to a file
> in another format using string.replace and re.sub. There are about 150
> replacements occurring on files that are on the order of a few KB. Right
> now, I represent the file contents as a string, and naively use
> string.replace and re.sub to replace all the substrings that I need to
> replace. I knew that I would have to come up with a faster, more memory
> friendly solution eventually, but I just wanted to implement the basic
> idea as simply as possible at first.
Hmm, 150 replacements on a few-KB text shouldn't take long. Why
do you need it to be faster and more memory-friendly? Are there
real-time constraints? Do you have a lot of these files that you need
to process in batch?
<plug>
If so, it might be worth taking a look at my Msub
http://home.inbound.dk/~andersjm/provide/msub/
in which all 150 replacements can be done efficiently in parallel.
</plug>
Staying in Python, and assuming that there are no overlaps between
matches, here's a way to do a single-pass replace of all your
regexps (a sketch follows below).
Combine your 150 regexps into one big regexp. (Say regexps A, B and C
combine into (A)|(B)|(C); you get the picture.) Then use re.finditer
(assuming Python version >= 2.2) to locate the matching substrings one
by one; the unmatched stretches in between can be recovered from the
start/end offsets of consecutive matches. Which component was matched
can be identified by looking at the position of the first non-None
entry in groups(). Collect the output (unmatched parts + replacement
strings) in a list of string fragments which you ''.join into one big
string at the end.
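Here's a rough sketch of that approach. The three rules are made-up
stand-ins for your real 150, and it assumes the individual patterns
contain no capturing groups of their own (otherwise the group-to-rule
mapping below is thrown off):

import re

# Hypothetical replacement rules standing in for the real 150.
# Each entry is (pattern, replacement); the patterns must not contain
# capturing groups of their own, or the group->rule mapping breaks.
rules = [
    (r'foo',    'bar'),
    (r'\d+',    'NUM'),
    (r'ba[rz]', 'qux'),
]

# Combine into one big regexp: (A)|(B)|(C)
combined = re.compile('|'.join(['(%s)' % pat for pat, repl in rules]))

def replace_all(text):
    fragments = []
    pos = 0
    for m in combined.finditer(text):
        # keep the unmatched stretch before this match
        fragments.append(text[pos:m.start()])
        # the first non-None group tells us which rule matched
        which = [i for i, g in enumerate(m.groups()) if g is not None][0]
        fragments.append(rules[which][1])
        pos = m.end()
    fragments.append(text[pos:])   # trailing unmatched part
    return ''.join(fragments)

print(replace_all('foo 123 baz'))   # -> 'bar NUM qux'

If your patterns do need grouping internally, use non-capturing
(?:...) groups inside them so the outer group numbering stays intact.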
- Anders