Best practice for operations on streams of text

Thu May 7 18:32:25 EDT 2009

MRAB wrote:
> James wrote:
>> Hello all,
>> I'm working on some NLP code - what I'm doing is passing a large
>> number of tokens through a number of filtering / processing steps.
>>
>> The filters take a token as input, and may or may not yield a token as
>> a result. For example, I might have filters which lowercase the
>> input, filter out boring words and filter out duplicates chained
>> together.
>>
>> I originally had code like this:
>> for t0 in token_stream:
>>   for t1 in lowercase_token(t0):
>>     for t2 in remove_boring(t1):
>>       for t3 in remove_dupes(t2):
>>         yield t3

For that to work at all, the three functions would have to turn each
token into an iterable of 0 or 1 tokens, so each inner 'loop' would
execute 0 or 1 times.  Better to return a token or None, and replace the
three inner 'loops' with three conditional statements (ugly too) or, less
efficiently (no short-circuiting, so every function must accept None and
pass it through),

t = remove_dupes(remove_boring(lowercase_token(t0)))
if t is not None: yield t
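
For comparison, here is a minimal sketch of the three-conditional version,
assuming each filter returns a token or None.  The BORING set, the
make_remove_dupes factory and the filter_tokens name are my own
placeholders, not James's code.  The 'continue' statements supply the
short-circuiting that the nested call lacks.

```python
BORING = {"the", "a", "of"}  # hypothetical stop-word set


def lowercase_token(t):
    return t.lower()


def remove_boring(t):
    # Return the token, or None to drop it.
    return None if t in BORING else t


def make_remove_dupes():
    # Hypothetical factory: keeps the 'seen' set private to one pass.
    seen = set()

    def remove_dupes(t):
        if t in seen:
            return None
        seen.add(t)
        return t

    return remove_dupes


def filter_tokens(token_stream):
    remove_dupes = make_remove_dupes()
    for t in token_stream:
        t = lowercase_token(t)
        if t is None:
            continue  # short-circuit: skip the remaining filters
        t = remove_boring(t)
        if t is None:
            continue
        t = remove_dupes(t)
        if t is not None:
            yield t
```

It works, but every new filter costs another if/continue pair, which is
the ugliness I mean.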

>> Apart from being ugly as sin, I only get one token out as
>> StopIteration is raised before the whole token stream is consumed.

That puzzles me.  Your actual code must be slightly different from the
above and from what I imagine the functions to be.  But never mind, because

>> Any suggestions on an elegant way to chain together a bunch of
>> generators, with processing steps in between?

MRAB's suggestion is the way to go.  You automatically get
short-circuiting because each generator only receives the tokens that the
previous one passed on.  And resuming a generator is much faster than
re-calling a function.

> What you should be doing is letting the filters accept an iterator and
> yield values on demand:
> 
> def lowercase_token(stream):
>     for t in stream:
>         yield t.lower()
> 
> def remove_boring(stream):
>     for t in stream:
>         if t not in boring:
>             yield t
> 
> def remove_dupes(stream):
>     seen = set()
>     for t in stream:
>         if t not in seen:
>             yield t
>             seen.add(t)
> 
> def compound_filter(token_stream):
>     stream = lowercase_token(token_stream)
>     stream = remove_boring(stream)
>     stream = remove_dupes(stream)
>     for t in stream:
>         yield t
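
To check the short-circuiting claim against filters like the ones quoted
above, here is a self-contained sketch.  The contents of 'boring', the
tap() helper and the extra 'log' parameter are assumptions of mine, added
only to record which tokens actually reach remove_dupes.

```python
boring = {"the", "of", "a"}  # hypothetical stop-word set


def lowercase_token(stream):
    for t in stream:
        yield t.lower()


def remove_boring(stream):
    for t in stream:
        if t not in boring:
            yield t


def remove_dupes(stream):
    seen = set()
    for t in stream:
        if t not in seen:
            seen.add(t)
            yield t


def tap(stream, log):
    # Hypothetical helper: record each token that passes this point.
    for t in stream:
        log.append(t)
        yield t


def compound_filter(token_stream, log):
    stream = lowercase_token(token_stream)
    stream = remove_boring(stream)
    stream = tap(stream, log)  # what remove_dupes gets to see
    stream = remove_dupes(stream)
    for t in stream:
        yield t


reached_dupes = []
result = list(compound_filter(["The", "Cat", "of", "cat", "Dog"], reached_dupes))
print(result)         # ['cat', 'dog']
print(reached_dupes)  # ['cat', 'cat', 'dog']
```

The boring words never show up in reached_dupes: remove_dupes is only
resumed for tokens that survived the earlier stages.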

I also recommend the Beazley reference Herron gave.

tjr