regex help: splitting string gets weird groups

Thu Apr 8 16:02:22 EDT 2010

gry wrote:
> [ python3.1.1, re.__version__='2.2.1' ]
> I'm trying to use re to split a string into (any number of) pieces of
> these kinds:
> 1) contiguous runs of letters
> 2) contiguous runs of digits
> 3) single other characters
> 
> e.g.   555tHe-rain.in#=1234   should give:   [555, 'tHe', '-', 'rain',
> '.', 'in', '#', '=', 1234]
> I tried:
> >>> re.match('^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$', '555tHe-rain.in#=1234').groups()
> ('1234', 'in', '1234', '=')
> 
> Why is 1234 repeated in two groups?  and why doesn't "tHe" appear as a
> group?  Is my regexp illegal somehow and confusing the engine?

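Well, the regex is legal; the surprise comes from how capture groups
behave under repetition.  A group that is matched more than once keeps
only the text from its last match, so .groups() returns one value per
group, not one per iteration.  Here the outer group and the digits
group both last matched '1234' (hence the duplicate), the letters group
last matched 'in' (so 'tHe' was overwritten), and the punctuation group
last matched '='.  Additionally, you're only asking for one match
(.match() returns a single match object or None, not every piece
consumed by the repeated outer group).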
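
A quick way to check what each group ends up holding is to run the
original pattern and then compare it with an un-anchored, un-repeated
version walked by finditer().  This is only a sketch: the names
pattern, piece, kinds and tagged are mine, not from the original post,
and the labels in kinds are just illustrative.

```python
import re

s = '555tHe-rain.in#=1234'
pattern = r'^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$'

m = re.match(pattern, s)
# One value per group: each holds the text from its last successful match.
groups = m.groups()
print(groups)

# Without the outer repetition, finditer() yields one match per piece,
# and lastindex tells which alternative (group number) matched.
piece = re.compile(r'([A-Za-z]+)|([0-9]+)|([-.#=])')
kinds = {1: 'letters', 2: 'digits', 3: 'other'}  # illustrative labels
tagged = [(kinds[t.lastindex], t.group()) for t in piece.finditer(s)]
print(tagged)
```

The first print shows the same four values as in the question; the
second lists every piece in order, each paired with the kind of run it
came from.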

> I *would* like to understand what's wrong with this regex, though if
> someone has a neat other way to do the above task, I'm also interested
> in suggestions.

Tweaking your original, I used

   >>> s='555tHe-rain.in#=1234'
   >>> import re
   >>> r=re.compile(r'([a-zA-Z]+|\d+|.)')
   >>> r.findall(s)
   ['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']

The only difference between my results and your results is that the 555 
and 1234 come back as strings, not ints.
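
If you do want the digit runs as ints, one possible approach is to name
the digits alternative and convert only those matches.  This is a
sketch under my own assumptions: the function name split_runs, the
group name "digits", and the use of re.DOTALL (so that a newline is
kept as a single "other" character rather than silently skipped) are
not part of the original pattern.

```python
import re

# Digits first, then letters, then any single character.
# re.DOTALL is an added assumption so '.' also matches a newline.
TOKEN = re.compile(r'(?P<digits>[0-9]+)|[A-Za-z]+|.', re.DOTALL)

def split_runs(text):
    """Split text into letter runs, digit runs (as ints), and single other chars."""
    pieces = []
    for m in TOKEN.finditer(text):
        digits = m.group('digits')
        # Only the named alternative is converted; everything else stays a str.
        pieces.append(int(digits) if digits is not None else m.group())
    return pieces

result = split_runs('555tHe-rain.in#=1234')
print(result)
```

Checking the named group rather than calling str.isdigit() on each
piece avoids trying to convert characters that Python counts as digits
but int() rejects.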

-tkc