String Splitter Brain Teaser

Mon Mar 28 12:40:51 EST 2005

On Mon, 28 Mar 2005 09:18:38 -0800, Michael Spencer
<mahs at telcopartners.com> wrote:
> Bill Mill wrote:
> 
> > for very long genomes he might want a generator:
> >
> > def xgen(s):
> >     l = len(s) - 1
> >     e = enumerate(s)
> >     for i,c in e:
> >         if i < l and s[i+1] == '/':
> >             e.next()
> >             i2, c2 = e.next()
> >             yield [c, c2]
> >         else:
> >             yield [c]
> >
> >
> > >>> for g in xgen('ATT/GATA/G'): print g
> > ...
> > ['A']
> > ['T']
> > ['T', 'G']
> > ['A']
> > ['T']
> > ['A', 'G']
> >
> > Peace
> > Bill Mill
> > bill.mill at gmail.com
> 
> works according to the original spec, but there are a couple of issues:
> 
> 1. the output is specified to be a list, so delaying the creation of the list
> isn't a win

True. However, if it is a really long genome, he's not going to want
to hold both the genome string and a list of the genome in memory.
A generator lets him iterate through the genome without ever building
that list. Since he didn't specify what he wants the list for, it's
possible that he just needs to iterate through the genome, grouping
degeneracies as he goes.
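
To illustrate the point about iterating instead of building a list,
here is a minimal sketch in Python 3 syntax. It assumes the
single-degeneracy generator quoted above, ported as xgen_py3, and a
made-up consumer, count_degenerate; both names are hypothetical and
not part of the original thread.

```python
def xgen_py3(s):
    """Python 3 port of the quoted generator (single degeneracies only)."""
    last = len(s) - 1
    e = enumerate(s)
    for i, c in e:
        if i < last and s[i + 1] == '/':
            next(e)            # skip the '/' itself
            _, c2 = next(e)    # take the alternative base after the '/'
            yield [c, c2]
        else:
            yield [c]


def count_degenerate(genome):
    """Count positions with more than one base, one group at a time."""
    # Hypothetical consumer: no list of groups is ever materialized.
    return sum(1 for group in xgen_py3(genome) if len(group) > 1)


print(count_degenerate('ATT/GATA/G'))
```

Only one group is alive at a time here, so the extra memory stays
small regardless of genome length. Like the quoted version, this port
assumes the input does not end with a '/'.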

> 
> 2. this version falls down in the presence of "double degeneracies" (if that's
> what they should be called) - which were not in the OP spec, but which cropped
> up in a later post:
>   >>> list(xgen("AGC/C/TGA/T"))
>   [['A'], ['G'], ['C', 'C'], ['/'], ['T'], ['G'], ['A', 'T']]

This is simple enough to fix, in basically the same way your function
works. I think it actually makes the function simpler:

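def xgen(s):
    e = enumerate(s)
    stack = [e.next()[1]]  # push the first char onto the stack
    for i, c in e:
        if c != '/':
            # an ordinary base starts a new group, so emit the finished one
            yield stack
            stack = [c]
        else:
            # '/' means the next char is an alternative for the current group
            stack.append(e.next()[1])
    yield stack  # emit the final group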

>>> gn
'ATT/GATA/G/AT'
>>> for g in xgen(gn): print g
...
['A']
['T']
['T', 'G']
['A']
['T']
['A', 'G', 'A']
['T']
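
For readers on Python 3, the same stack idea could be sketched as
below. This is an adaptation, not the code from the post: the name
xgen3 is hypothetical, and the handling of empty input and of a
trailing '/' (silently ignored here) is an assumption, since the
thread does not specify either case.

```python
def xgen3(s):
    """Yield groups of bases; 'A/G/T' style runs become one group."""
    it = iter(s)
    stack = []
    for c in it:
        if c == '/':
            # Assumption: a '/' with nothing after it is ignored.
            alt = next(it, None)
            if alt is not None:
                stack.append(alt)
        else:
            if stack:
                yield stack    # emit the finished group
            stack = [c]        # start a new group with this base
    if stack:
        yield stack            # emit the final group, if any


for g in xgen3('ATT/GATA/G/AT'):
    print(g)
```

Using next(it, None) avoids a StopIteration escaping from inside the
generator, which Python 3.7 and later turn into a RuntimeError.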

Peace
Bill Mill
bill.mill at gmail.com