passing multiple strings to string.find()

Francois Pinard pinard at iro.umontreal.ca
Fri Aug 8 08:41:15 EDT 2003


[Bengt Richter]

> If some search strings have a common prefix, you'll have to put the
> longest first in the regex, since re grabs the first match it sees.
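
A small illustration of the point quoted above, as a sketch only: the keyword strings and the variable names are made up for the example.  The alternation takes the first alternative that matches, so a short keyword listed first hides a longer one sharing its prefix.

```python
import re

text = 'the-other'

# Shorter alternative listed first: it wins, even though a longer one fits.
short_first = re.compile('the|the-other')
# Longer alternative listed first: the whole keyword is matched.
long_first = re.compile('the-other|the')

print(short_first.match(text).group())
print(long_first.match(text).group())

# Sorting by decreasing length is one simple way to get a safe order.
keywords = ['the', 'the-other', 'this']
safe = re.compile('|'.join(re.escape(k)
                           for k in sorted(keywords, key=len, reverse=True)))
print(safe.match(text).group())
```

The first pattern stops after 'the'; the other two match the full 'the-other'.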

Hi, gang.

I recently wanted to match among a list of keywords, repeatedly, and wanted
to help the Python `re' module a bit, speed-wise.  I wrote the following
helper function (I stole the idea of this from Markku Rossi's `enscript'):


import re


def build_regexp(tokens):
    # Build an optimised regular expression able to recognise all TOKENS.
    # Tokens are grouped by first character; each group recurses on the
    # remainders, so a shared prefix appears only once in the pattern.
    tokens_by_first = {}
    empty = False
    for token in tokens:
        if token == '':
            # Some token ends here, so what follows is optional.
            empty = True
        else:
            tokens_by_first.setdefault(token[0], []).append(token[1:])
    if not tokens_by_first:
        return ''
    # items() works on both old and current Pythons (iteritems() does not).
    fragments = [re.escape(letter) + build_regexp(rests)
                 for letter, rests in tokens_by_first.items()]
    if empty:
        # The empty alternative goes last, so longer matches are tried first.
        fragments.append('')
    if len(fragments) == 1:
        return fragments[0]
    return '(?:%s)' % '|'.join(fragments)


Given the above,


    build_regexp(['this', 'that', 'the-other'])


yields the string 'th(?:is|at|e\\-other)', which one may choose to
`re.compile' before use.  Here is a real pattern produced using the
above device, meant for a LilyPond note in English notation:


"((?:a(?:s(?:harp(?:sharp|)|s|)|f(?:lat(?:flat|)|f|)|)|c(?:s(?:harp(?:sharp|)|s|)|f(?:lat(?:flat|)|f|)|)|b(?:s(?:harp(?:sharp|)|s|)|f(?:lat(?:flat|)|f|)|)|e(?:s(?:harp(?:sharp|)|s|)|f(?:lat(?:flat|)|f|)|)|d(?:s(?:harp(?:sharp|)|s|)|f(?:lat(?:flat|)|f|)|)|g(?:s(?:harp(?:sharp|)|s|)|f(?:lat(?:flat|)|f|)|)|f(?:s(?:harp(?:sharp|)|s|)|f(?:lat(?:flat|)|f|)|)))(,+|'*)([!?]?)((?:64|32|16|8|4|2|1|)\\.*)"


You will agree that this would be fairly tedious to type in with an editor! :-)
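
For what it is worth, here is a minimal sketch of compiling and using the result.  The helper is repeated, in a form that runs on current Pythons, only to keep the example self-contained; the keyword list and the names `keywords' and `pattern' are hypothetical.

```python
import re


def build_regexp(tokens):
    # Same idea as the helper above: group tokens by first character.
    tokens_by_first = {}
    empty = False
    for token in tokens:
        if token == '':
            empty = True
        else:
            tokens_by_first.setdefault(token[0], []).append(token[1:])
    if not tokens_by_first:
        return ''
    fragments = [re.escape(letter) + build_regexp(rests)
                 for letter, rests in tokens_by_first.items()]
    if empty:
        fragments.append('')
    if len(fragments) == 1:
        return fragments[0]
    return '(?:%s)' % '|'.join(fragments)


# Hypothetical keyword list, with one keyword a prefix of another.
keywords = ['the', 'the-other', 'this', 'that']
pattern = re.compile(build_regexp(keywords))

# The empty alternative is appended last, so the longer keyword is preferred.
print(pattern.match('the-other day').group())
# fullmatch() checks that a whole string is exactly one of the keywords.
print(pattern.fullmatch('this') is not None)
print(pattern.fullmatch('thus') is not None)
```

Since the empty alternative always comes last within a group, a keyword that is a prefix of another needs no manual reordering here: the match on 'the-other day' covers all of 'the-other', not just 'the'.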

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard




