Extracting subsequences composed of the same character

candide candide at free.invalid
Fri Apr 1 02:43:38 CEST 2011


Suppose you have a string, for instance

"pyyythhooonnn ---> ++++"

and you are looking for the subsequences made up of one character
repeated two or more times in a row. Here you get:

'yyy', 'hh', 'ooo', 'nnn', '---', '++++'

It's not difficult to write Python code that solves the problem, for
instance:

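def f(text):
    r = []
    if not text:
        return r
    x = text[0]  # character of the current run
    i = 0        # length of the current run
    for c in text:
        if c != x:
            # The run of x has ended; keep it only if it is longer than 1.
            if i > 1:
                r.append(x * i)
            x = c
            i = 1
        else:
            i += 1
    # The loop never closes the last run, so flush it here.
    if i > 1:
        r.append(x * i)
    return r

print(f("pyyythhooonnn ---> ++++"))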


I must confess that this code is rather cumbersome, so I am looking
for an alternative. I imagine that a regular expression approach could
provide a better method. Does such code exist? Note that the string
is not restricted to the ASCII charset.
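
For illustration, here is a minimal sketch of what such a regular
expression approach might look like. It is only one possible answer, not
a definitive one. The helper name runs and the pattern are assumptions
of this sketch: the pattern captures one character and uses a
backreference to require that the same character repeats at least once.
It assumes Python 3, where str holds Unicode code points, so the input
is not limited to ASCII. A character typed as a base letter plus a
combining mark counts as two code points here.

```python
import re

# Hypothetical helper for illustration; not part of the original post.
# (.)  captures any single character (DOTALL lets "." match newlines too)
# \1+  requires that same character to follow one or more times
RUN = re.compile(r"(.)\1+", re.DOTALL)

def runs(text):
    """Return each maximal run of two or more identical characters."""
    return [m.group(0) for m in RUN.finditer(text)]

print(runs("pyyythhooonnn ---> ++++"))
```

Because the + quantifier is greedy, each match extends to the end of its
run, so a run is never split into smaller pieces.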


