aligning a set of word substrings to sentence

Thu Dec 1 22:18:06 EST 2005

Steven Bethard wrote:
> I've got a list of word substrings (the "tokens") which I need to align 
> to a string of text (the "sentence").  The sentence is basically the 
> concatenation of the token list, with spaces sometimes inserted between 
> tokens.  I need to determine the start and end offsets of each token in 
> the sentence.  For example::
> 
> py> tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
> py> text = '''\
> ... She's gonna write
> ... a book?'''
> py> list(offsets(tokens, text))
> [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
> 
> Here's my current definition of the offsets function::
> 
> py> def offsets(tokens, text):
> ...     start = 0
> ...     for token in tokens:
> ...         while text[start].isspace():
> ...             start += 1
> ...         text_token = text[start:start+len(token)]
> ...         assert text_token == token, (text_token, token)
> ...         yield start, start + len(token)
> ...         start += len(token)
> ...
> 
> I feel like there should be a simpler solution (maybe with the re 
> module?) but I can't figure one out.  Any suggestions?
> 
> STeVe

Hi Steve:

Any reason you can't simply use str.find in your offsets function?

  >>> def offsets(tokens, text):
  ...     ptr = 0
  ...     for token in tokens:
  ...         fpos = text.find(token, ptr)
  ...         if fpos != -1:
  ...             end = fpos + len(token)
  ...             yield (fpos, end)
  ...             ptr = end
  ...
  >>> list(offsets(tokens, text))
  [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
  >>>
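
One caveat with the str.find version above: a token that is not found is
silently skipped, and find will happily jump over non-whitespace text to
reach a later occurrence. If you want to keep the strictness of your
assert, something like the following might do. This is only a sketch; the
name strict_offsets and the choice of ValueError are my own assumptions,
not anything from your code.

```python
def strict_offsets(tokens, text):
    """Yield (start, end) for each token; raise if a token cannot be aligned."""
    ptr = 0
    for token in tokens:
        fpos = text.find(token, ptr)
        # Only whitespace may sit between the previous token and this one.
        if fpos == -1 or text[ptr:fpos].strip():
            raise ValueError("cannot align %r at offset %d" % (token, ptr))
        end = fpos + len(token)
        yield fpos, end
        ptr = end


tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
text = "She's gonna write\na book?"
result = list(strict_offsets(tokens, text))
```

The text[ptr:fpos].strip() check is what stops the search from skipping
over real characters: the gap before each match has to be empty or all
whitespace.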

And then, as an entry in the wacky category, a difflib solution:

  >>> def offsets(tokens, text):
  ...     from difflib import SequenceMatcher
  ...     s = SequenceMatcher(None, text, "\t".join(tokens))
  ...     for start, _, length in s.get_matching_blocks():
  ...         if length:
  ...             yield start, start + length
  ...
  >>> list(offsets(tokens, text))
  [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
  >>>
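
Since you asked about the re module: one possible approach is to build a
single pattern with one group per token, joined by optional whitespace,
and read the offsets from the match object. This is a rough sketch under
my own assumptions (the name re_offsets is made up, and raising ValueError
on a failed match is just one choice). I believe older Python versions
limit how many groups one pattern may contain, so it is probably best
suited to short sentences.

```python
import re


def re_offsets(tokens, text):
    """Return a list of (start, end) spans, one per token, using one regex."""
    # One capturing group per escaped token, with optional whitespace between.
    pattern = r'\s*'.join('(%s)' % re.escape(token) for token in tokens)
    match = re.match(r'\s*' + pattern, text)
    if match is None:
        raise ValueError("tokens do not align with text")
    # Group 0 is the whole match, so token groups are numbered from 1.
    return [match.span(i) for i in range(1, len(tokens) + 1)]


tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
text = "She's gonna write\na book?"
spans = re_offsets(tokens, text)
```

Unlike the generator versions, this one does all the work in a single
match, so it either aligns every token or fails as a whole.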

cheers
Michael