Splitting a sequence into pieces with identical elements

MRAB python at mrabarnett.plus.com
Tue Aug 10 21:30:49 EDT 2010


Tim Chase wrote:
> On 08/10/10 19:37, candide wrote:
>> Suppose you have a sequence s, say a string, for instance this
>> one:
>>
>> spppammmmegggssss
>>
>> We want to split s into the following parts:
>>
>> ['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']
>>
>> i.e. each part is a run of a single repeated character.
> 
> While I'm not sure it's idiomatic, the overabuse of regexps in Python 
> certainly seems prevalent enough to be idiomatic ;-)
> 
> As such, you can use:
> 
>   import re
>   r = re.compile(r'((.)\1*)')
>   #r = re.compile(r'((\w)\1*)')

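That should be \2, not \1. In that pattern group 1 is the outer group,
which is still open at that point; the single character is group 2.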

Alternatively, drop the outer group, since m.group(0) already gives the
whole match:

     r = re.compile(r'(.)\1*')
     #r = re.compile(r'(\w)\1*')
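
A quick sketch to check both points, assuming a reasonably current
CPython (the names nested, simple and compiled are just mine for
illustration):

```python
import re

s = 'spppammmmegggssss'

# \1 inside group 1 refers to a group that is still open,
# so the original pattern is rejected at compile time.
try:
    re.compile(r'((.)\1*)')
    compiled = True
except re.error:
    compiled = False

# Corrected nested form: group 2 is the character, group 1 the whole run.
nested = re.compile(r'((.)\2*)')
# Simpler form: group 1 is the character, group 0 the whole run.
simple = re.compile(r'(.)\1*')

parts_nested = [m.group(1) for m in nested.finditer(s)]
parts_simple = [m.group(0) for m in simple.finditer(s)]
print(compiled)
print(parts_nested)
print(parts_simple)
```

Both corrected patterns should give the list the original poster asked
for.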

>   s = 'spppammmmegggssss'
>   results = [m.group(0) for m in r.finditer(s)]
> 
> Additionally, you have all the properties of the match object (which
> include the start/end) available too, if you need them.
> 
> You don't specify what you want to have happen with non-letters 
> (whitespace, punctuation, etc).  The above just treats them like any 
> other character, finding repeats.  If you just want "word" characters, 
> you can use the 2nd ("\w") version, or adjust accordingly.
> 
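
To illustrate the last two points, here is a small sketch of what the
match objects give you and how the "." and "\w" versions differ. The
sample string and the variable names are made up for the example:

```python
import re

# Hypothetical sample containing spaces and punctuation.
s = 'aa  bb!!c'

any_char = re.compile(r'(.)\1*')    # runs of any character (except newline)
word_only = re.compile(r'(\w)\1*')  # runs of "word" characters only

# Each match object carries the run itself plus its start/end offsets.
runs_any = [(m.group(0), m.start(), m.end()) for m in any_char.finditer(s)]
runs_word = [(m.group(0), m.start(), m.end()) for m in word_only.finditer(s)]

print(runs_any)
print(runs_word)
```

With "." the spaces and the "!!" come back as runs of their own; with
"\w" they are simply skipped, so the offsets are what tell you where
each remaining run sat in the original string.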


