aligning a set of word substrings to sentence
Michael Spencer
mahs at telcopartners.com
Thu Dec 1 22:18:06 EST 2005
Steven Bethard wrote:
> I've got a list of word substrings (the "tokens") which I need to align
> to a string of text (the "sentence"). The sentence is basically the
> concatenation of the token list, with spaces sometimes inserted between
> tokens. I need to determine the start and end offsets of each token in
> the sentence. For example::
>
> py> tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
> py> text = '''\
> ... She's gonna write
> ... a book?'''
> py> list(offsets(tokens, text))
> [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
>
> Here's my current definition of the offsets function::
>
> py> def offsets(tokens, text):
> ...     start = 0
> ...     for token in tokens:
> ...         while text[start].isspace():
> ...             start += 1
> ...         text_token = text[start:start+len(token)]
> ...         assert text_token == token, (text_token, token)
> ...         yield start, start + len(token)
> ...         start += len(token)
> ...
>
> I feel like there should be a simpler solution (maybe with the re
> module?) but I can't figure one out. Any suggestions?
>
> STeVe
Hi Steve:
Any reason you can't simply use str.find in your offsets function?
>>> def offsets(tokens, text):
...     ptr = 0
...     for token in tokens:
...         fpos = text.find(token, ptr)
...         if fpos != -1:
...             end = fpos + len(token)
...             yield (fpos, end)
...             ptr = end
...
>>> list(offsets(tokens, text))
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
>>>
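One difference from the original function is worth noting: the str.find version silently skips a token it cannot find, and it will also jump over non-whitespace text to reach a later match. If the assertion behaviour of the original matters, a stricter variant might look like the sketch below. The name offsets_strict and the choice of ValueError are assumptions made for illustration; they do not appear anywhere in the thread.

```python
def offsets_strict(tokens, text):
    """Yield (start, end) per token, allowing only whitespace between tokens.

    Hypothetical variant of the str.find approach; not from the original post.
    """
    ptr = 0
    for token in tokens:
        fpos = text.find(token, ptr)
        if fpos == -1:
            raise ValueError("token %r not found after offset %d" % (token, ptr))
        # Whatever lies between the previous token and this one must be whitespace.
        gap = text[ptr:fpos]
        if gap and not gap.isspace():
            raise ValueError("unexpected text %r before token %r" % (gap, token))
        end = fpos + len(token)
        yield fpos, end
        ptr = end
```

Called with the tokens and text from the example, this should yield the same offsets as the versions above, while a misaligned token list raises instead of being quietly skipped.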
and then, for an entry in the wacky category, a difflib solution:
>>> def offsets(tokens, text):
...     from difflib import SequenceMatcher
...     s = SequenceMatcher(None, text, "\t".join(tokens))
...     for start, _, length in s.get_matching_blocks():
...         if length:
...             yield start, start + length
...
>>> list(offsets(tokens, text))
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
>>>
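Since the question mentions the re module, here is a minimal sketch of how a regex-based version could work, assuming every token appears in order and only whitespace separates tokens. Each token is escaped and wrapped in its own capture group, the groups are joined with an optional-whitespace pattern, and the span of each group gives the offsets. The name offsets_re and the ValueError on mismatch are illustrative assumptions, and very long token lists may run into limits on the number of groups in older Python versions.

```python
import re


def offsets_re(tokens, text):
    """Return [(start, end), ...] for each token using a single regex match.

    Hypothetical sketch; not part of the original thread.
    """
    # One capture group per token; \s* permits optional whitespace between them.
    pattern = r"\s*".join("(%s)" % re.escape(token) for token in tokens)
    match = re.match(r"\s*" + pattern, text)
    if match is None:
        raise ValueError("tokens do not align with text")
    # Group 0 is the whole match, so token groups are numbered from 1.
    return [match.span(i) for i in range(1, len(tokens) + 1)]
```

Unlike the generator versions, this builds the whole result at once and fails as a unit if any token does not line up, so it cannot report which token caused the mismatch.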
cheers
Michael