Best practice for operations on streams of text

MRAB google at mrabarnett.plus.com
Thu May 7 17:07:42 EDT 2009


James wrote:
> Hello all,
> I'm working on some NLP code - what I'm doing is passing a large
> number of tokens through a number of filtering / processing steps.
> 
> The filters take a token as input, and may or may not yield a token as
> a result. For example, I might have filters which lowercase the
> input, filter out boring words, and filter out duplicates, chained
> together.
> 
> I originally had code like this:
> for t0 in token_stream:
>   for t1 in lowercase_token(t0):
>     for t2 in remove_boring(t1):
>       for t3 in remove_dupes(t2):
>         yield t3
> 
> Apart from being ugly as sin, I only get one token out, as
> StopIteration is raised before the whole token stream is consumed.
> 
> Any suggestions on an elegant way to chain together a bunch of
> generators, with processing steps in between?
> 
What you should be doing is letting the filters accept an iterator and
yield values on demand:

def lowercase_token(stream):
    for t in stream:
        yield t.lower()

def remove_boring(stream):
    # 'boring' is assumed to be a set of uninteresting words,
    # defined elsewhere.
    for t in stream:
        if t not in boring:
            yield t

def remove_dupes(stream):
    # Yield each token only the first time it's seen.
    seen = set()
    for t in stream:
        if t not in seen:
            yield t
            seen.add(t)

def compound_filter(token_stream):
    stream = lowercase_token(token_stream)
    stream = remove_boring(stream)
    stream = remove_dupes(stream)
    for t in stream:
        yield t
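
For example, with a made-up stop-word set (purely illustrative), the
whole pipeline run together:

boring = {'the', 'on', 'a'}  # hypothetical stop words

tokens = ['The', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'cat']
print(list(compound_filter(tokens)))
# -> ['cat', 'sat', 'mat']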


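If you end up with many such filters, you can also chain them
generically (a sketch; 'pipeline' is just a name I've made up, not a
standard function):

def pipeline(stream, *filters):
    # Feed the output of each filter generator into the next one.
    for f in filters:
        stream = f(stream)
    return stream

# Equivalent to compound_filter:
# tokens = pipeline(token_stream, lowercase_token, remove_boring,
#                   remove_dupes)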
