Python replace multiple strings (m*n) combination

MRAB python at mrabarnett.plus.com
Fri Feb 24 17:36:18 EST 2017


On 2017-02-24 22:18, kar wrote:
> On Friday, February 24, 2017 at 11:48:22 AM UTC-8, MRAB wrote:
>> On 2017-02-24 18:54, kar6308 at gmail.com wrote:
>> > I have a task to search for multiple patterns in an incoming string and replace them with their matched replacements. I'm storing all the patterns as keys in a dict and the replacements as values, compiling all the patterns into one regex, and using the sub method on the pattern object for the replacement. The problem is that I have tens of millions of rows to check against about 1000 patterns, and this turns out to be a very expensive operation.
>> >
>> > What can be done to optimize it? Also, I have special characters to match; where can I specify the raw-string prefix?
>> >
>> > For example, if the search string is not a variable, we can write
>> >
>> > re.sub(r"\$%^search_text", "replace_text", "some_text")
>> >
>> > but when I read the pattern from the dict, where should I place the "r" prefix? Unfortunately, putting it inside the key, like "r key", doesn't work....
>> >
>> > Pseudo code
>> >
>> > pattern = re.compile('|'.join(regex_map.keys()))
>> > for string in genobj_of_million_strings:
>> >     yield pattern.sub(lambda m: regex_map[m.group()], string)
>> >
>> Here's an example:
>>
>> import re
>>
>> # A dict of the replacements.
>> mapping = {'one': 'unu', 'two': 'du', 'three': 'tri', 'four': 'kvar',
>> 'five': 'kvin'}
>>
>> # The text that we're searching.
>> text = 'one two three four five six seven eight nine ten'
>>
>> # It's best to put the strings we're looking for into reverse order in
>> # case one of the keys is the prefix of another.
>> ordered_keys = sorted(mapping.keys(), reverse=True)
>> ordered_values = [mapping[key] for key in ordered_keys]
>>
>> # Build the pattern, putting each key in its own group.
>> # I'm assuming that the keys are all pure literals, that they don't
>> # contain anything that's treated specially by regex. You could escape
>> # the key (using re.escape(...)) if that's not the case.
>> pattern = re.compile('|'.join('(%s)' % key for key in ordered_keys))
>>
>> # When we find a match, the match object's .lastindex attribute will
>> # tell us which group (i.e. key) matched. We can then look up the
>> # replacement. We also need to take into account that the groups are
>> # numbered from 1, whereas list items are numbered from 0.
>> new_text = pattern.sub(lambda m: ordered_values[m.lastindex - 1], text)
>>
>>
>> It might be faster (timing it would be a good idea) if you could put all
>> of the rows into a single string (or a number of rows into a single
>> string), process that string, and then split up the result. If none of
>> the rows contain '\n', then you could join them together with that,
>> otherwise just pick some other character.
>
> Thanks. What is the idea behind storing the keys and values in lists? I assume looking up a value in a map is faster than getting the value from a list.
>
If you're only looking for _exact_ literal matches, then you can use a 
dict directly, because you know exactly what the pattern will find. But 
if you're ignoring the case, or are going to, say, look for any digits 
using \d, then you don't know exactly what it'll find (it might be 
uppercase, might be lowercase, might be any digit), so the matched text 
can't be used as a dict key, and this group-based technique is better. 
I'm just hedging my bets!
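
For the exact-literal case, here's a minimal sketch (using the same 
mapping and text as the example above; sorting by length rather than in 
reverse is a variation, but it serves the same purpose):

import re

# A dict of the replacements (same as in the example above).
mapping = {'one': 'unu', 'two': 'du', 'three': 'tri', 'four': 'kvar',
           'five': 'kvin'}

text = 'one two three four five six seven eight nine ten'

# Longest keys first, so that a key that's a prefix of another key
# can't steal the match.
ordered_keys = sorted(mapping, key=len, reverse=True)

# Escape the keys in case any of them contain regex metacharacters;
# escaping doesn't change what text is matched.
pattern = re.compile('|'.join(re.escape(key) for key in ordered_keys))

# Every key is a pure literal, so whatever the pattern matched is
# guaranteed to be a key of the dict, and m.group() can be used
# directly; no groups or .lastindex needed.
new_text = pattern.sub(lambda m: mapping[m.group()], text)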

> Also, I like the idea of combining multiple rows into one string and passing it. I would batch up the million rows into strings and give it a shot.
>
If your computer has enough memory for it, then yes.
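
Something along these lines, say (just a sketch; the name 
sub_in_batches and the batch_size of 10000 are illustrative, and it 
assumes that neither the rows, the replacements, nor any match contains 
'\n'):

def sub_in_batches(rows, pattern, replace, batch_size=10000):
    # 'rows' is any iterable of strings, and 'replace' is the same
    # callable you'd pass to pattern.sub() for a single row.
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            # One sub() call over the whole batch, then split the
            # result back into individual rows.
            yield from pattern.sub(replace, '\n'.join(batch)).split('\n')
            batch = []
    if batch:
        yield from pattern.sub(replace, '\n'.join(batch)).split('\n')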



