Need a specific sort of string modification. Can someone help?

Nick Mellor thebalancepro at gmail.com
Mon Jan 7 06:28:45 CET 2013


Note that the multi-digit version (quoted below) tolerates missing digits: if the number is missing after the '+/-' it doesn't skip any letters.

Brief explanation of the multi-digit version:

+/- are converted to spaces and used to split the string into sections. The split process effectively swallows the +/- characters.
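
A minimal sketch of just the split step, for illustration. It assumes Python 3, where str.maketrans takes the place of the string.maketrans import used in the quoted code; the name split_sections is hypothetical and not part of the original class.

```python
# Python 3 sketch: str.maketrans replaces Python 2's string.maketrans.
TRANTAB = str.maketrans('+-', '  ')  # map '+' and '-' to spaces


def split_sections(s):
    """Turn every '+' or '-' into a space, then split on whitespace.

    The signs themselves disappear; each section after the first starts
    with whatever followed a sign (normally the digits).
    """
    return s.translate(TRANTAB).split()


print(split_sections(".+3ACG.+5CAACG.+3ACG.+3ACG"))
# ['.', '3ACG.', '5CAACG.', '3ACG.', '3ACG']
```

Because .split() splits on any whitespace, this only works if the input contains no spaces of its own.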

The complication of multi-digits is that you need to skip the (possibly multiple) digits, which adds another stage to the calculation. In:

+3ACG. -> .

you skip 1 + 3 characters, 1 for the digit, 3 for the following letters as specified by the digit 3. In:

-11ACGACGACGACG. -> G.

you skip 2 + 11 characters, 2 digits in "11" and 11 letters following. And incidentally in:

+ACG. -> ACG.

there's no digit, so you skip 0 digits + 0 letters.
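
The skip arithmetic in the three examples can be checked with a short sketch like the one below. It assumes Python 3 and works on one section at a time, after the sign has already been removed; the helper names skip_counts and strip_section are hypothetical, and the explicit loop is used here only to make the counting visible (the quoted code uses takewhile instead).

```python
def skip_counts(section):
    """Return (number of leading digits, number of letters to skip)."""
    n_digits = 0
    while n_digits < len(section) and section[n_digits].isdigit():
        n_digits += 1
    # No digits means nothing to skip: 0 digits + 0 letters.
    n_letters = int(section[:n_digits]) if n_digits else 0
    return n_digits, n_letters


def strip_section(section):
    """Drop the leading digits and the letters they count."""
    n_digits, n_letters = skip_counts(section)
    return section[n_digits + n_letters:]


for sec in ("3ACG.", "11ACGACGACGACG.", "ACG."):
    print(sec, skip_counts(sec), strip_section(sec))
# 3ACG. (1, 3) .
# 11ACGACGACGACG. (2, 11) G.
# ACG. (0, 0) ACG.
```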

Having split on +/- using .translate() and .split(), I use takewhile to separate the zero or more digits from the following letters. If takewhile doesn't find any digits at the start of the sequence, list(takewhile(...)) is the empty list []. The 'if r else 0' test then turns that empty list into the number 0 (int('') would raise an error), so takewhile and that test cover the no-digit case between them. If a lack of digits is a data error then it would be easy to test for: just look for an empty list in 'digits'.
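
For readers on Python 3, here is a rough sketch of the same staged approach as a standalone function. It is an adaptation, not the quoted code itself: str.maketrans and str.isdigit stand in for string.maketrans and the is_digit helper, and the variable names are my own.

```python
from itertools import takewhile

TRANTAB = str.maketrans('+-', '  ')  # Python 3 spelling of maketrans


def reduce_plusminus_multi_digit(s):
    sections = s.translate(TRANTAB).split()
    # Leading digit characters of each section; [] when there are none.
    digits = [list(takewhile(str.isdigit, sec)) for sec in sections]
    # The 'if d' guard maps an empty digit list to 0 instead of calling int('').
    numbers = [int(''.join(d)) if d else 0 for d in digits]
    # Skip the digits themselves plus the letters they count.
    skips = [len(d) + n for d, n in zip(digits, numbers)]
    return ''.join(sec[k:] for k, sec in zip(skips, sections))


print(reduce_plusminus_multi_digit(".+12ACGACGACGACG.+5CAACG.+3ACG.+3ACG"))  # ....
print(reduce_plusminus_multi_digit("tA.-2AG.-2AG,-2ag"))                     # tA..,
print(reduce_plusminus_multi_digit(".+ACG.+3ACG."))                          # .ACG..
```

The last call shows the no-digit case: the section "ACG." has an empty digit list, so nothing is skipped and it passes through unchanged.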

I was pleasantly surprised to find that using list comprehensions, zip, join (all highly optimised in Python) and several intermediate lists still works at a fairly decent speed, despite using more stages to handle multi-digits. But it is about 4x slower than the less flexible 1-digit version on my hardware (about 25,000 per second.)

Nick

On Monday, 7 January 2013 14:40:02 UTC+11, Nick Mellor  wrote:
> Hi Sia,
>
> Find a multi-digit method in this version:
>
> from string import maketrans
> from itertools import takewhile
>
> def is_digit(s): return s.isdigit()
>
> class redux:
>
>     def __init__(self):
>         intab = '+-'
>         outtab = '  '
>         self.trantab = maketrans(intab, outtab)
>
>     def reduce_plusminus(self, s):
>         list_form = [r[int(r[0]) + 1:] if r[0].isdigit() else r
>                      for r
>                      in s.translate(self.trantab).split()]
>         return ''.join(list_form)
>
>     def reduce_plusminus_multi_digit(self, s):
>         spl = s.translate(self.trantab).split()
>         digits = [list(takewhile(is_digit, r))
>                   for r
>                   in spl]
>         numbers = [int(''.join(r)) if r else 0
>                    for r
>                    in digits]
>         skips = [len(dig) + num for dig, num in zip(digits, numbers)]
>         return ''.join([s[r:] for r, s in zip(skips, spl)])
>
> if __name__ == "__main__":
>     p = redux()
>     print p.reduce_plusminus(".+3ACG.+5CAACG.+3ACG.+3ACG")
>     print p.reduce_plusminus("tA.-2AG.-2AG,-2ag")
>     print 'multi-digit...'
>     print p.reduce_plusminus_multi_digit(".+3ACG.+5CAACG.+3ACG.+3ACG")
>     print p.reduce_plusminus_multi_digit(".+12ACGACGACGACG.+5CAACG.+3ACG.+3ACG")
>
> HTH,
>
> Nick
>
> On Saturday, 5 January 2013 19:35:26 UTC+11, Sia  wrote:
> 
> > I have strings such as:
> >
> > tA.-2AG.-2AG,-2ag
> >
> > or
> >
> > .+3ACG.+5CAACG.+3ACG.+3ACG
> >
> > The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become:
> >
> > tA..,
> >
> > and
> >
> > ...
> >
> > How can I do that?
> >
> > Thanks.


