regex help: splitting string gets weird groups

MRAB python at mrabarnett.plus.com
Thu Apr 8 15:40:17 EDT 2010


gry wrote:
> [ python3.1.1, re.__version__='2.2.1' ]
> I'm trying to use re to split a string into (any number of) pieces of
> these kinds:
> 1) contiguous runs of letters
> 2) contiguous runs of digits
> 3) single other characters
> 
> e.g.   555tHe-rain.in#=1234   should give:   [555, 'tHe', '-', 'rain',
> '.', 'in', '#', '=', 1234]
> I tried:
>>>> re.match('^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$', '555tHe-rain.in#=1234').groups()
> ('1234', 'in', '1234', '=')
> 
> Why is 1234 repeated in two groups?  and why doesn't "tHe" appear as a
> group?  Is my regexp illegal somehow and confusing the engine?
> 
> I *would* like to understand what's wrong with this regex, though if
> someone has a neat other way to do the above task, I'm also interested
> in suggestions.

If the regex was illegal then it would raise an exception. It's doing
exactly what you're asking it to do!

First of all, there are 4 groups, with group 1 containing groups 2..4 as
alternatives, so group 1 will match whatever groups 2..4 match:

Group 1: (([A-Za-z]+)|([0-9]+)|([-.#=]))
Group 2: ([A-Za-z]+)
Group 3: ([0-9]+)
Group 4: ([-.#=])

It matches like this:

Group 1 and group 3 match '555'.
Group 1 and group 2 match 'tHe'.
Group 1 and group 4 match '-'.
Group 1 and group 2 match 'rain'.
Group 1 and group 4 match '.'.
Group 1 and group 2 match 'in'.
Group 1 and group 4 match '#'.
Group 1 and group 4 match '='.
Group 1 and group 3 match '1234'.

If a group matches then any earlier match of that group is discarded,
so:

Group 1 finishes with '1234'.
Group 2 finishes with 'in'.
Group 3 finishes with '1234'.
Group 4 finishes with '='.

A solution is:

 >>> re.findall('[A-Za-z]+|[0-9]+|[-.#=]', '555tHe-rain.in#=1234')
['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']

Note: re.findall() returns a list of matches, so if the regex doesn't
contain any groups then it returns the matched substrings. Compare:

 >>> re.findall("a(.)", "ax ay")
['x', 'y']
 >>> re.findall("a.", "ax ay")
['ax', 'ay']



More information about the Python-list mailing list