Regular expression bug?
Kurt Smith
kwmsmith at gmail.com
Thu Feb 19 14:41:17 EST 2009
On Thu, Feb 19, 2009 at 12:55 PM, Ron Garret <rNOSPAMon at flownet.com> wrote:
> I'm trying to split a CamelCase string into its constituent components.
> This kind of works:
>
> >>> re.split('[a-z][A-Z]', 'fooBarBaz')
> ['fo', 'a', 'az']
>
> but it consumes the boundary characters. To fix this I tried using
> lookahead and lookbehind patterns instead, but it doesn't work:
>
> >>> re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')
> ['fooBarBaz']
>
> However, it does seem to work with findall:
>
> >>> re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')
> ['', '']
>
> So the regular expression seems to be doing the Right Thing. Is this a
> bug in re.split, or am I missing something?

From what I can tell, re.split can't split on zero-length matches.
It needs something to split on, like str.split. Is this a bug?
Possibly. The docs for re.split say:

    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.

Note that it does not say that zero-length matches won't work.
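
For what it's worth, re.finditer gives a quick way to confirm where the zero-length matches fall. This is only a sketch, and the variable names are mine:

```python
import re

# Zero-width pattern: the position just after a lowercase letter
# and just before an uppercase one.
boundary = re.compile(r'(?<=[a-z])(?=[A-Z])')

# Each match is empty, so start() == end(); only the position matters.
positions = [m.start() for m in boundary.finditer('fooBarBaz')]
print(positions)  # [3, 6]
```

So the pattern does find both boundaries; the question is only what re.split does with them.
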
I can work around the problem thusly:

    re.sub(r'(?<=[a-z])(?=[A-Z])', '_', 'fooBarBaz').split('_')

Which is ugly. I reckon you can use re.findall with a pattern that
matches the components and not the boundaries, but you have to take
care of the beginning and end as special cases.
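
Here is a minimal sketch of the findall idea, assuming the input is plain ASCII letters and every component is an optional capital followed by lowercase letters. The name camel_components is made up for illustration, and runs of capitals (e.g. 'HTTPServer') or digits would need a different pattern:

```python
import re

def camel_components(s):
    # Hypothetical helper: match the components themselves rather than
    # the boundaries between them.
    # Assumes each component is an optional uppercase letter followed by
    # one or more lowercase letters; anything else is silently skipped.
    return re.findall(r'[A-Z]?[a-z]+', s)

print(camel_components('fooBarBaz'))  # ['foo', 'Bar', 'Baz']
```

The optional leading capital is what covers the beginning of the string, where the first component may or may not be capitalized.
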
Kurt