passing multiple strings to string.find()
Francois Pinard
pinard at iro.umontreal.ca
Fri Aug 8 08:41:15 EDT 2003
[Bengt Richter]
> If some search strings have a common prefix, you'll have to put the
> longest first in the regex, since re grabs the first match it sees.
Hi, gang.
I recently wanted to match among a list of keywords, repeatedly, and wanted
to help the Python `re' module a bit, speed-wise. I wrote the following
helper function (I stole the idea of this from Markku Rossi's `enscript'):
import re

def build_regexp(tokens):
    # Build an optimised regular expression able to recognise all TOKENS.
    # Tokens are grouped by first character; each group is handled
    # recursively on the remaining characters of its tokens.
    tokens_by_first = {}
    empty = False
    for token in tokens:
        if token == '':
            empty = True
        else:
            sequence = tokens_by_first.get(token[0])
            if sequence is None:
                sequence = tokens_by_first[token[0]] = []
            sequence.append(token[1:])
    if not tokens_by_first:
        return ''
    fragments = [re.escape(letter) + build_regexp(tails)
                 for letter, tails in tokens_by_first.items()]
    # The empty alternative goes last, so longer matches are tried first.
    if empty:
        fragments.append('')
    if len(fragments) == 1:
        return fragments[0]
    return '(?:%s)' % '|'.join(fragments)
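
As a small aside on Bengt's remark, here is a minimal sketch of why the order of alternatives matters to `re'. The patterns are written out by hand for illustration (they are not output captured from the function), and the variable names are mine:

```python
import re

# `re` takes the first alternative that lets the match succeed,
# not the longest one.
plain = re.match(r'foo|foobar', 'foobar').group()

# With the empty alternative last, the longer spellings are tried first.
empty_last = re.match(r's(?:harp|s|)', 'sharp').group()

# With the empty alternative first, the match stops right after the 's'.
empty_first = re.match(r's(?:|harp|s)', 'sharp').group()

print(plain, empty_last, empty_first)
```

Once common prefixes are factored out, the alternatives at a given level all start with different characters, so as far as I can tell the empty alternative is the only one that could shadow another, which is why it is appended at the end.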
Given the above,
build_regexp(['this', 'that', 'the-other'])
yields the string 'th(?:is|at|e\\-other)', which one may choose to
`re.compile' before use. Here is a real pattern produced using the
above device, meant for a LilyPond note in English notation:
"((?:a(?:s(?:harp(?:sharp|)|s|)|f(?:lat(?:flat|)|f|)|)|c(?:s(?:harp(?:sharp|)|s|)|f(?:lat(?:flat|)|f|)|)|b(?:s(?:harp(?:sharp|)|s|)|f(?:lat(?:flat|)|f|)|)|e(?:s(?:harp(?:sharp|)|s|)|f(?:lat(?:flat|)|f|)|)|d(?:s(?:harp(?:sharp|)|s|)|f(?:lat(?:flat|)|f|)|)|g(?:s(?:harp(?:sharp|)|s|)|f(?:lat(?:flat|)|f|)|)|f(?:s(?:harp(?:sharp|)|s|)|f(?:lat(?:flat|)|f|)|)))(,+|'*)([!?]?)((?:64|32|16|8|4|2|1|)\\.*)"
You will agree that this would be fairly tedious to type in with an editor! :-)
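
For what it is worth, here is a rough sketch of using such a pattern once compiled. It takes the short pattern string quoted earlier as a literal, and compares it with a plain longest-first alternation; the sample text and the variable names are made up for the example:

```python
import re

tokens = ['this', 'that', 'the-other']

# The factored pattern quoted above, written as a raw string.
factored = re.compile(r'th(?:is|at|e\-other)')

# A plain alternation for comparison, longest token first.
by_length = sorted(tokens, key=len, reverse=True)
naive = re.compile('|'.join(re.escape(token) for token in by_length))

text = 'this and that or the-other, but not those'
found = factored.findall(text)
found_naive = naive.findall(text)

print(found)
print(found == found_naive)
```

Both patterns should find the same keywords; the factored one merely spares the engine from re-checking the shared `th' prefix for every alternative.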
--
François Pinard http://www.iro.umontreal.ca/~pinard