Candidate for a new itertool

Sun Mar 8 22:54:01 EDT 2009

On Mar 7, 8:47 pm, Raymond Hettinger <pyt... at rcn.com> wrote:
> The existing groupby() itertool works great when every element in a
> group has the same key, but it is not so handy when groups are
> determined by boundary conditions.
>
> For edge-triggered events, we need to convert a boundary-event
> predicate to groupby-style key function.  The code below encapsulates
> that process in a new itertool called split_on().
>
> Would love you guys to experiment with it for a bit and confirm that
> you find it useful.  Suggestions are welcome.
>
> Raymond
>
> -----------------------------------------
>
> from itertools import groupby
>
> def split_on(iterable, event, start=True):
>     'Split iterable on event boundaries (either start events or stop events).'
>     # split_on('X1X23X456X', 'X'.__eq__, True)  --> X1 X23 X456 X
>     # split_on('X1X23X456X', 'X'.__eq__, False) --> X 1X 23X 456X
>     def transition_counter(x, start=start, cnt=[0]):
>         before = cnt[0]
>         if event(x):
>             cnt[0] += 1
>         after = cnt[0]
>         return after if start else before
>     return (g for k, g in groupby(iterable, transition_counter))
>
> if __name__ == '__main__':
>     for start in True, False:
>         for g in split_on('X1X23X456X', 'X'.__eq__, start):
>             print list(g)
>         print
>
>     from pprint import pprint
>     boundary = '--===============2615450625767277916==\n'
>     email = open('email.txt')
>     for mime_section in split_on(email, boundary.__eq__):
>         pprint(list(mime_section)[1:])
>         print '= = ' * 30

I've found this type of splitting quite useful when grouping sections
of a text file. I ended up calling groupby directly for that, when I
would rather have used something like this.

However, I wonder if it would be helpful to break that function into
two instead of having the "start" flag. The flag feels odd to me
(maybe it's the name?), and two separately documented functions might
read better from a newcomer's perspective. Also, it would be cool if
the function accepted keyword arguments; I wonder why most of the
other functions in the itertools module don't.
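
To make the two-function idea concrete, here is a rough sketch. The
names split_before(), split_after() and the private helper _split()
are my own invention, not anything proposed for itertools. Both are
plain Python functions, so they accept keyword arguments for free.

```python
from itertools import groupby


def _split(iterable, event, event_starts_group):
    # Hypothetical shared helper: label each element with the number of
    # boundary events seen so far, then let groupby do the splitting.
    count = 0

    def marker(x):
        nonlocal count
        before = count
        if event(x):
            count += 1
        return count if event_starts_group else before

    return (g for _, g in groupby(iterable, key=marker))


def split_before(iterable, event):
    'Start a new group at each element for which event(x) is true.'
    return _split(iterable, event, True)


def split_after(iterable, event):
    'End the current group at each element for which event(x) is true.'
    return _split(iterable, event, False)


starts = [''.join(g) for g in split_before('X1X23X456X', event='X'.__eq__)]
stops = [''.join(g) for g in split_after('X1X23X456X', event='X'.__eq__)]
print(starts)  # ['X1', 'X23', 'X456', 'X']
print(stops)   # ['X', '1X', '23X', '456X']
```

As with groupby itself, each group has to be consumed before moving on
to the next one.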

I wouldn't split out the keys separately from the groups. But the idea
of a flag to exclude the keys sounds interesting to me.
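
Here is one way such a flag might look. I am reading "keys" as the
boundary elements themselves, and the name keep_events is made up for
this sketch. Unlike the original, it yields lists rather than lazy
groups, and it drops groups that become empty once the boundary
elements are removed; both are arbitrary choices on my part.

```python
from itertools import groupby


def split_on_filtered(iterable, event, start=True, keep_events=True):
    'Like split_on(), optionally dropping the boundary elements.'

    def tagged():
        # Yield (group number, is-boundary flag, element) triples.
        count = 0
        for x in iterable:
            hit = bool(event(x))
            before = count
            count += hit
            yield (count if start else before), hit, x

    for _, group in groupby(tagged(), key=lambda t: t[0]):
        items = [x for _, hit, x in group if keep_events or not hit]
        if items:  # assumption: skip groups left empty after filtering
            yield items


kept = list(split_on_filtered('X1X23X456X', 'X'.__eq__))
dropped = list(split_on_filtered('X1X23X456X', 'X'.__eq__, keep_events=False))
print(kept)     # [['X', '1'], ['X', '2', '3'], ['X', '4', '5', '6'], ['X']]
print(dropped)  # [['1'], ['2', '3'], ['4', '5', '6']]
```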

Thank you for giving me the opportunity to use the nonlocal keyword
for the first time since trying out Python 3.0. I hope this is an
appropriate usage:

import itertools as it

def split_on(iterable, key=bool, start=True):
    'Split iterable on boundaries (either start events or stop events).'
    # split_on('X1X23X456X', 'X'.__eq__, True)  --> X1 X23 X456 X
    # split_on('X1X23X456X', 'X'.__eq__, False) --> X 1X 23X 456X
    flag = 0  # number of boundary events seen so far

    def event_marker(x, start_flag=start):
        # Only flag is rebound here, so only flag needs nonlocal;
        # key is merely read and resolves through the closure.
        nonlocal flag
        before = flag
        if key(x):
            flag += 1
        after = flag

        return after if start_flag else before

    return (g for k, g in it.groupby(iterable, key=event_marker))
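
For what it's worth, here is a quick check of the nonlocal approach on
the text-file case I mentioned. The definition is repeated in compact
form so the snippet runs on its own, and the "[name]" header format
and the StringIO stand-in for a file are just assumptions for the
demo.

```python
import io
import itertools as it


def split_on(iterable, key=bool, start=True):
    flag = 0  # a fresh counter is created on every call to split_on

    def event_marker(x):
        nonlocal flag
        before = flag
        if key(x):
            flag += 1
        return flag if start else before

    return (g for k, g in it.groupby(iterable, key=event_marker))


# Hypothetical input: each section begins with a "[name]" header line.
text = io.StringIO("[a]\n1\n2\n[b]\n3\n")
sections = [[line.strip() for line in g]
            for g in split_on(text, key=lambda s: s.startswith("["))]
print(sections)  # [['[a]', '1', '2'], ['[b]', '3']]

# A second call starts counting from zero again.
again = [''.join(g) for g in split_on('X1X', 'X'.__eq__)]
print(again)  # ['X1', 'X']
```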