Regular expression bug?

Thu Feb 19 15:03:45 EST 2009

Ron Garret wrote:
> I'm trying to split a CamelCase string into its constituent components.  
> This kind of works:
> 
> >>> re.split('[a-z][A-Z]', 'fooBarBaz')
> ['fo', 'a', 'az']
> 
> but it consumes the boundary characters.  To fix this I tried using 
> lookahead and lookbehind patterns instead, but it doesn't work:
> 
> >>> re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')
> ['fooBarBaz']
> 
> However, it does seem to work with findall:
> 
> >>> re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')
> ['', '']
> 
> So the regular expression seems to be doing the Right Thing.  Is this a 
> bug in re.split, or am I missing something?
> 
> (BTW, I tried looking at the source code for the re module, but I could 
> not find the relevant code.  re.split calls sre_compile.compile().split, 
> but the string 'split' does not appear in sre_compile.py.  So where does 
> this method come from?)
> 
> I'm using Python 2.5.
> 
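I, amongst others, think it's a bug (or 'misfeature'): re.split never
splits on a zero-width match, so your lookaround pattern leaves the
string whole. Guido thinks it might be intentional, and in any case
changing it could break some existing code. You could do this instead,
provided '@' (or whatever marker you pick) never occurs in the input: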

 >>> re.sub('(?<=[a-z])(?=[A-Z])', '@', 'fooBarBaz').split('@')
['foo', 'Bar', 'Baz']
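
If you would rather not depend on a marker character, here is a minimal
sketch of another way to do it: use finditer to locate the zero-width
boundaries and slice the string yourself. The names BOUNDARY and
split_camel_case are made up for illustration; they are not part of the
re module.

```python
import re

# Zero-width boundary: a lowercase letter behind, an uppercase letter ahead.
BOUNDARY = re.compile(r'(?<=[a-z])(?=[A-Z])')

def split_camel_case(text):
    """Split text at each lowercase-to-uppercase boundary (hypothetical helper)."""
    parts = []
    start = 0
    for match in BOUNDARY.finditer(text):
        # The match is empty, so match.start() is just the boundary position.
        parts.append(text[start:match.start()])
        start = match.start()
    # Whatever follows the last boundary (or the whole string if none matched).
    parts.append(text[start:])
    return parts

print(split_camel_case('fooBarBaz'))  # ['foo', 'Bar', 'Baz']
```

Since nothing is substituted into the string, an input that already
contains '@' is handled the same as any other. As far as I know, later
Python releases (3.7 and up) changed re.split to split on empty matches,
so there the lookaround pattern without the capturing group should work
with re.split directly.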