Regular expression bug?
Peter Otten
__peter__ at web.de
Thu Feb 19 14:52:58 EST 2009
Ron Garret wrote:
> I'm trying to split a CamelCase string into its constituent components.
How about
>>> re.compile("[A-Za-z][a-z]*").findall("fooBarBaz")
['foo', 'Bar', 'Baz']
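A minimal sketch of the same findall approach wrapped in a function, in case you want to reuse it. The name split_camel_findall is only illustrative. The pattern assumes the input consists of ASCII letters only: anything else, such as a digit, is silently dropped, and a run of capitals comes out one letter at a time.

```python
import re

# Hypothetical helper name; the pattern is the one shown above.
_WORD = re.compile("[A-Za-z][a-z]*")

def split_camel_findall(s):
    # Each match is one letter followed by any run of lowercase letters.
    return _WORD.findall(s)

print(split_camel_findall("fooBarBaz"))   # ['foo', 'Bar', 'Baz']
print(split_camel_findall("fooBar2Baz"))  # ['foo', 'Bar', 'Baz'] (the 2 is lost)
```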
> This kind of works:
>
> >>> re.split('[a-z][A-Z]', 'fooBarBaz')
> ['fo', 'a', 'az']
>
> but it consumes the boundary characters. To fix this I tried using
> lookahead and lookbehind patterns instead, but it doesn't work:
>
> >>> re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')
> ['fooBarBaz']
>
> However, it does seem to work with findall:
>
> >>> re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')
> ['', '']
>
> So the regular expression seems to be doing the Right Thing. Is this a
> bug in re.split, or am I missing something?
IIRC the split pattern must consume at least one character, but I can't find
the reference.
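If re.split really does ignore zero-width matches, one possible workaround (a sketch, not necessarily the best way) is to let re.sub insert a separator at each lowercase-to-uppercase boundary and then split on that separator with str.split. The function name and the choice of a space as separator are my assumptions; pick a separator that cannot occur in the input. Newer Python versions (3.7 and later) let re.split split on empty matches as well, and the sketch below should behave the same either way.

```python
import re

# Zero-width pattern from the original post: a position preceded by a
# lowercase letter and followed by an uppercase letter.
_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")

def split_camel(s, sep=" "):
    # re.sub does act on empty matches, so it can mark each boundary;
    # str.split then cuts the string at the markers.
    return _BOUNDARY.sub(sep, s).split(sep)

print(split_camel("fooBarBaz"))  # ['foo', 'Bar', 'Baz']
```

Unlike the character-class split, this keeps the boundary characters, because the pattern itself consumes nothing.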
> (BTW, I tried looking at the source code for the re module, but I could
> not find the relevant code. re.split calls sre_compile.compile().split,
> but the string 'split' does not appear in sre_compile.py. So where does
> this method come from?)
It's coded in C. The source is Modules/_sre.c.
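A quick way to check this from the interpreter, assuming CPython: inspect.isbuiltin reports whether a callable is implemented in C rather than in Python. The variable names are only illustrative.

```python
import inspect
import re

pattern = re.compile("[a-z][A-Z]")

# On CPython this is True: the bound split method is a built-in
# (C-implemented) method, so there is no Python source for it.
is_c_method = inspect.isbuiltin(pattern.split)
print(is_c_method)
```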
Peter