Regular expression bug?

Ron Garret rNOSPAMon at flownet.com
Thu Feb 19 22:02:39 CET 2009


In article <mailman.281.1235073821.11746.python-list at python.org>,
 MRAB <google at mrabarnett.plus.com> wrote:

> Ron Garret wrote:
> > I'm trying to split a CamelCase string into its constituent components.  
> > This kind of works:
> > 
> >>>> re.split('[a-z][A-Z]', 'fooBarBaz')
> > ['fo', 'a', 'az']
> > 
> > but it consumes the boundary characters.  To fix this I tried using 
> > lookahead and lookbehind patterns instead, but it doesn't work:
> > 
> >>>> re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')
> > ['fooBarBaz']
> > 
> > However, it does seem to work with findall:
> > 
> >>>> re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')
> > ['', '']
> > 
> > So the regular expression seems to be doing the Right Thing.  Is this a 
> > bug in re.split, or am I missing something?
> > 
> > (BTW, I tried looking at the source code for the re module, but I could 
> > not find the relevant code.  re.split calls sre_compile.compile().split, 
> > but the string 'split' does not appear in sre_compile.py.  So where does 
> > this method come from?)
> > 
> > I'm using Python2.5.
> > 
> I, amongst others, think it's a bug (or 'misfeature'); Guido thinks it
> might be intentional, but changing it could break some existing code.

That seems unlikely.  It would only break where people had code invoking 
re.split on empty matches, which at the moment is essentially a no-op.  
It's hard to imagine there's a lot of code like that around.  What would 
be the point?

> You could do this instead:
> 
>  >>> re.sub('(?<=[a-z])(?=[A-Z])', '@', 'fooBarBaz').split('@')
> ['foo', 'Bar', 'Baz']

Blech!  ;-)  But thanks for the suggestion.

rg



More information about the Python-list mailing list