Regular expression bug?

Thu Feb 19 14:26:46 EST 2009

On Thu, 2009-02-19 at 10:55 -0800, Ron Garret wrote:
> I'm trying to split a CamelCase string into its constituent components.  
> This kind of works:
> 
> >>> re.split('[a-z][A-Z]', 'fooBarBaz')
> ['fo', 'a', 'az']
> 
> but it consumes the boundary characters.  To fix this I tried using 
> lookahead and lookbehind patterns instead, but it doesn't work:

That's how re.split works, same as str.split: whatever the pattern
matches is treated as the separator and dropped from the result.
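
To make that concrete, here is a quick sketch comparing the two. The
'foo-Bar-Baz' string is just an illustrative input, not something from
the original question:

```python
import re

# str.split drops the separator it splits on.
parts_str = 'foo-Bar-Baz'.split('-')

# re.split likewise drops whatever the pattern matched, which here is
# one lowercase letter followed by one uppercase letter.
parts_re = re.split('[a-z][A-Z]', 'fooBarBaz')

print(parts_str)  # ['foo', 'Bar', 'Baz']
print(parts_re)   # ['fo', 'a', 'az']
```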

> >>> re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')
> ['fooBarBaz']
> 
> However, it does seem to work with findall:
> 
> >>> re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')
> ['', '']

Wow!

To tell you the truth, I can't even read that... but one wonders why
you don't just do:

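import string

def ccsplit(s):
    """Split a CamelCase string at each uppercase ASCII letter."""
    cclist = []
    current_word = ''
    for char in s:
        if char in string.ascii_uppercase:
            # An uppercase letter starts a new word; save the previous one.
            if current_word:
                cclist.append(current_word)
            current_word = char
        else:
            current_word += char
    # Append the final word, which no later uppercase letter flushes.
    if current_word:
        cclist.append(current_word)
    return cclist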

>>> ccsplit('fooBarBaz')
--> ['foo', 'Bar', 'Baz']

This is arguably *much* easier to read than the re example, and it
doesn't require one to look ahead in the string.
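
For what it's worth, if you do want a regex, one that matches the words
themselves rather than the boundaries between them is easier on the
eyes. A rough sketch, assuming every word is an optional capital
followed by lowercase ASCII letters (the name ccsplit_re is mine, purely
for illustration):

```python
import re

def ccsplit_re(s):
    # Hypothetical helper: return each run of lowercase ASCII letters,
    # together with the single capital letter in front of it, if any.
    return re.findall('[A-Z]?[a-z]+', s)

print(ccsplit_re('fooBarBaz'))  # ['foo', 'Bar', 'Baz']
```

Note that this quietly drops runs of capitals and digits (e.g. the
'HTTP' in 'parseHTTPResponse'), so it is not a drop-in replacement for
the loop above.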

-a