Candidate for a new itertool

pruebauno at latinmail.com pruebauno at latinmail.com
Thu Mar 12 18:36:30 CET 2009

```On Mar 7, 8:47 pm, Raymond Hettinger <pyt... at rcn.com> wrote:
> The existing groupby() itertool works great when every element in a
> group has the same key, but it is not so handy when groups are
> determined by boundary conditions.
>
> For edge-triggered events, we need to convert a boundary-event
> predicate to groupby-style key function.  The code below encapsulates
> that process in a new itertool called split_on().
>
> Would love you guys to experiment with it for a bit and confirm that
> you find it useful.  Suggestions are welcome.
>
> Raymond
>
> -----------------------------------------
>
> from itertools import groupby
>
> def split_on(iterable, event, start=True):
>     'Split iterable on event boundaries (either start events or stop
> events).'
>     # split_on('X1X23X456X', 'X'.__eq__, True)  --> X1 X23 X456 X
>     # split_on('X1X23X456X', 'X'.__eq__, False) --> X 1X 23X 456X
>     def transition_counter(x, start=start, cnt=[0]):
>         before = cnt[0]
>         if event(x):
>             cnt[0] += 1
>         after = cnt[0]
>         return after if start else before
>     return (g for k, g in groupby(iterable, transition_counter))
>
> if __name__ == '__main__':
>     for start in True, False:
>         for g in split_on('X1X23X456X', 'X'.__eq__, start):
>             print list(g)
>         print
>
>     from pprint import pprint
>     boundary = '--===============2615450625767277916==\n'
>     email = open('email.txt')
>     for mime_section in split_on(email, boundary.__eq__):
>         pprint(list(mime_section, 1, None))
>         print '= = ' * 30

For me your examples don't justify why you would need such a general
algorithm. A split function that works on iterables instead of just
strings seems straightforward, so maybe we should have that and
another one function with examples of problems where a plain split
does not work.
Something like this should work for the two examples you gave were the
boundaries are a known constants (and therefore there is really no
need to keep them. I can always add them later):

def split_on(iterable, boundary):
l=[]
for el in iterable:
if el!=boundary:
l.append(el)
else:
yield l
l=[]
yield l

def join_on(iterable, boundary):
it=iter(iterable)
firstel=it.next()
for el in it:
yield boundary
for x in el:
yield x

if __name__ == '__main__':
lst=[]
for g in split_on('X1X23X456X', 'X'):
print list(g)
lst.append(g)
print
print list(join_on(lst,'X'))

```