aligning a set of word substrings to sentence
steven.bethard at gmail.com
Fri Dec 2 16:34:18 CET 2005
Fredrik Lundh wrote:
> Steven Bethard wrote:
>>>>I feel like there should be a simpler solution (maybe with the re
>>>>module?) but I can't figure one out. Any suggestions?
>>>using the finditer pattern I just posted in another thread:
>>>import re
>>>tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
>>>text = '''\
>>>She's gonna write
>>>a book?'''
>>>tokens.sort() # lexical order
>>>tokens.reverse() # look for longest match first
>>>pattern = "|".join(map(re.escape, tokens))
>>>pattern = re.compile(pattern)
>>>print [m.span() for m in pattern.finditer(text)]
>>>[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
>>>which seems to match your version pretty well.
>>That's what I was looking for. Thanks!
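For anyone reading this in the archive, here is a rough Python 3 sketch of the finditer approach quoted above. The function name token_spans_any_order is made up for illustration, and the sample text is reconstructed from the spans in the quoted output, so treat both as assumptions rather than the original code.

```python
import re

def token_spans_any_order(tokens, text):
    """Return the span of every token occurrence found in text.

    Hypothetical helper: tokens may be listed in any order.
    """
    # Reverse lexical order puts a longer token before any token that is
    # a prefix of it, so the alternation tries the longer match first.
    alternatives = sorted(tokens, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in alternatives))
    return [m.span() for m in pattern.finditer(text)]

tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
text = "She's gonna write\na book?"  # assumed sample text
spans = token_spans_any_order(tokens, text)
print(spans)
```

Note that finditer skips over any text that matches none of the tokens, so this version does not check that every token was actually found.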
> except that I misread your problem statement; the RE solution above allows the
> tokens to be specified in arbitrary order. if they're always ordered, you can
> replace the code with something like:
> # match tokens plus optional whitespace between each token
> pattern = r"\s*".join("(" + re.escape(token) + ")" for token in tokens)
> m = re.match(pattern, text)
> result = (m.span(i+1) for i in range(len(tokens)))
> which is 6-7 times faster than the previous solution, on my machine.
Ahh yes, that's faster for me too. Thanks again!
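A self-contained Python 3 sketch of the ordered version, in case it is useful. The name align_tokens and the ValueError on a failed match are my own additions, not part of the code quoted above, and the sample text is assumed from the spans shown earlier in the thread.

```python
import re

def align_tokens(tokens, text):
    """Return one (start, end) span per token.

    Hypothetical helper: assumes the tokens occur in text in the given
    order, separated only by optional whitespace, starting at offset 0.
    """
    # One capturing group per token, with optional whitespace in between.
    pattern = r"\s*".join("(" + re.escape(t) + ")" for t in tokens)
    m = re.match(pattern, text)
    if m is None:
        raise ValueError("tokens do not align with text")
    # Group numbers start at 1; group 0 is the whole match.
    return [m.span(i + 1) for i in range(len(tokens))]

tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
text = "She's gonna write\na book?"  # assumed sample text
result = align_tokens(tokens, text)
print(result)
```

Unlike the finditer version, this one fails outright if a token is missing or out of order, which is probably what you want when the tokens are supposed to cover the sentence.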