Regexp: unexpected splitting of string in several groups

Roy Smith roy at panix.com
Mon May 31 15:13:39 EDT 2004


pit.grinja at gmx.de (Piet) wrote:

> vartypePattern = re.compile("([a-zA-Z]+)(\(.*\))*([^(].*[^)])")
> [...]
> simple one-string expressions like
> vartypeSplit = vartypePattern.match("float")
> are always splitted into two strings. The result is:
> vartypeSplit.groups() = ('flo', None, 'at').

I think I see your problem.

Let's take a simpler pattern, "(a*)(a*)", that says to match "zero or 
more a's followed by zero or more a's".  If you feed it a string like 
"aaa", it's ambiguous.  Any of the following groupings would satisfy the 
basic pattern:

()(aaa)
(a)(aa)
(aa)(a)
(aaa)()
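
For what it's worth, the re module does settle on one of these in a 
predictable way: quantifiers are greedy, so the first group takes as 
much as it can and the second gets whatever is left.  A quick check, 
using nothing beyond the standard re module:

```python
import re

# Both groups can match any number of a's; the greedy first group wins.
m = re.match(r"(a*)(a*)", "aaa")
print(m.groups())   # ('aaa', '')
```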

This is sort of what's going on here.  Your regex is:

([a-zA-Z]+)(\(.*\))*([^(].*[^)])

which breaks down into three groups:

([a-zA-Z]+)      # one or more letters
(\(.*\))*        # any string inside ()'s, zero or more times
([^(].*[^)])     # any char not '(', any string, any char not ')'

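In your case, the first group matches "flo", the next group matches 
nothing, and the third group matches "at".  It's not what you expected, 
but you've got an ambiguous RE, and this is one of the (many) ways the 
group matches could be satisfied.  Note that the third group needs at 
least two characters (one for [^(] and one for [^)]), so the first group 
can never keep all of "float"; it has to give some letters back.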
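
Here is a small session-style sketch that reproduces the split with your 
pattern.  The inputs "ab" and "abc" are my own made-up test strings, 
added only to show that the last group always wants two characters of 
its own:

```python
import re

# The pattern from the original post, written as a raw string.
vartypePattern = re.compile(r"([a-zA-Z]+)(\(.*\))*([^(].*[^)])")

# The first group backs off until the third group can get two characters.
print(vartypePattern.match("float").groups())   # ('flo', None, 'at')

# Too short: one letter for group 1 leaves only one character for group 3.
print(vartypePattern.match("ab"))               # None

# Three characters is the minimum that can match at all.
print(vartypePattern.match("abc").groups())     # ('a', None, 'bc')
```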

Looking at your English description of the pattern:

> vartype is a simple string(varchar, tinyint ...) which might be
> followed by a string in curved brackets. This bracketed string is
> either composed of a single number, two numbers separated by a comma,
> or a list of strings separated by a comma. After the bracketed string,
> there might be a list of further strings (separated by blanks)

I don't think you described it right.  If the bracketed string is 
missing, and "list of further strings" is present, there needs to be 
whitespace between the vartype and the beginning of the list, right?  
I'm not completely sure this can be expressed in a single regex.  I 
suspect it can, but I also suspect it's more trouble than it's worth.

I think it would be simpler (and clearer) to break this up into a 
couple of steps.  First match the vartype and remove it from the string 
(re.split, slicing, whatever).  Then see if what you've got left starts 
with a '('.  If it does, match everything up to the ')' and remove it 
from the string.  What's left is the "list of further strings".  It's a 
bit more verbose than one huge regex, and most likely slower too, but 
it'll be a lot easier to read and debug.  You should only worry about it 
being slower if profiling shows that this is a critical section of code.
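
A rough sketch of those steps, in case it helps.  The function name 
split_vartype and the sample strings are my own inventions, and I'm 
assuming the bracketed part never contains a ')' of its own (say, inside 
a quoted string); if it can, the second step needs more care.

```python
import re

def split_vartype(spec):
    """Return (vartype, bracketed, extras) for a column type string."""
    # Step 1: peel the leading vartype off the front of the string.
    m = re.match(r"[a-zA-Z]+", spec)
    if m is None:
        raise ValueError("no vartype at start of %r" % spec)
    vartype = m.group()
    rest = spec[m.end():]

    # Step 2: if what's left starts with '(', take everything up to
    # the first ')' and remove it from the string.
    bracketed = None
    if rest.startswith("("):
        close = rest.find(")")
        if close == -1:
            raise ValueError("missing ')' in %r" % spec)
        bracketed = rest[1:close]
        rest = rest[close + 1:]

    # Step 3: what's left is the blank-separated list of further strings.
    extras = rest.split()
    return vartype, bracketed, extras

print(split_vartype("float"))                 # ('float', None, [])
print(split_vartype("varchar(20) not null"))  # ('varchar', '20', ['not', 'null'])
print(split_vartype("int unsigned"))          # ('int', None, ['unsigned'])
```

The bracketed string comes back unparsed; splitting it on ',' would be 
one more easy step if you need the individual numbers or strings.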


