[Python-Dev] re.split on empty patterns
A.M. Kuchling
amk at amk.ca
Sat Aug 7 16:51:42 CEST 2004
The re.split() method ignores zero-length pattern matches. Patch
#988761 adds an emptyok flag to split that causes zero-length matches
to trigger a split. For example:
>>> re.split(r'\b', 'this is a sentence')  # does nothing; \b is always length 0
['this is a sentence']
>>> re.split(r'\b', 'this is a sentence', emptyok=True)
['', 'this', ' ', 'is', ' ', 'a', ' ', 'sentence', '']
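For anyone who wants to experiment with these semantics without applying the patch, here is a minimal sketch built on re.finditer(). The helper name split_emptyok is made up for illustration; this is only an approximation of the behaviour described above, not the code from patch #988761, and it ignores maxsplit.

```python
import re

def split_emptyok(pattern, string, emptyok=False):
    """Approximate the proposed re.split(..., emptyok=...) semantics.

    Hypothetical helper for illustration; not the code from the patch.
    """
    pieces = []
    last = 0  # end of the previous match that caused a split
    for m in re.finditer(pattern, string):
        if m.start() == m.end() and not emptyok:
            continue  # current behaviour: zero-length matches never split
        pieces.append(string[last:m.start()])
        pieces.extend(m.groups())  # like re.split, keep captured groups
        last = m.end()
    pieces.append(string[last:])
    return pieces

print(split_emptyok(r'\b', 'this is a sentence'))
print(split_emptyok(r'\b', 'this is a sentence', emptyok=True))
```

The two calls correspond to the two re.split() calls in the example above.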
Without the patch, the various zero-length assertions are
pretty useless as split() patterns; with it, they can serve a purpose:
>>> re.split(r'(?m)$', 'line1\nline2\n', emptyok=True)
['line1', '\nline2', '\n', '']
>>> # Split file into sections
>>> re.split("(?m)(?=^[[])", """[section1]
foo=bar

[section2]
coyote=wiley
""", emptyok=True)
['', '[section1]\nfoo=bar\n\n', '[section2]\ncoyote=wiley\n']
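For comparison, this is roughly what the section-splitting case takes today without the patch: find the start of every line beginning with '[' and slice between those offsets. It is only a sketch; split_sections is a hypothetical name, and unlike the split above it drops any text before the first section instead of returning a leading ''.

```python
import re

def split_sections(text):
    """Cut text into chunks, each starting at a line that begins with '['.

    Illustrative workaround only; anything before the first '[' line is dropped.
    """
    # Offsets of every '[' that sits at the start of a line.
    starts = [m.start() for m in re.finditer(r'(?m)^\[', text)]
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]

config = "[section1]\nfoo=bar\n\n[section2]\ncoyote=wiley\n"
for section in split_sections(config):
    print(repr(section))
```

A one-line split on a zero-length lookahead replaces all of this bookkeeping, which is the main argument for the flag.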
Zero-length matches often result in a '' at the beginning or end, or
between characters, but I think users can handle that. IMHO this
feature is clearly useful, and I would be happy to commit the patch
as-is.
Question: do we want to make this option the new default? Existing
patterns that can produce zero-length matches would change their
meanings:
>>> re.split('x*', 'abxxxcdefxxx')
['ab', 'cdef', '']
>>> re.split('x*', 'abxxxcdefxxx', emptyok=True)
['', 'a', 'b', '', 'c', 'd', 'e', 'f', '', '']
(I think the result of the second call points up a bug in the patch;
the empty strings in the middle seem wrong to me. Assume that gets
fixed.)
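To judge whether those empty strings are really a bug, it can help to look at where the engine reports matches for 'x*'. The sketch below lists every match span and flags the zero-length ones; match_spans is a hypothetical helper name. Exactly which zero-length matches get reported right next to a non-empty match may depend on the version of the regex engine, so no output is shown here.

```python
import re

def match_spans(pattern, string):
    """Return (start, end, is_empty) for every match; illustrative helper."""
    return [(m.start(), m.end(), m.start() == m.end())
            for m in re.finditer(pattern, string)]

for start, end, is_empty in match_spans('x*', 'abxxxcdefxxx'):
    kind = 'empty' if is_empty else 'non-empty'
    print(start, end, kind)
```

Every zero-length span that is allowed to split contributes one more boundary, so an empty match sitting directly beside an 'xxx' run would show up as a '' in the result.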
Anyway, we therefore can't just make this the default in 2.4. We
could trigger a warning when emptyok is not supplied and a split
pattern results in a zero-length match; users could supply
emptyok=False to avoid the warning. Patterns that never have a
zero-length match would never get the warning. 2.5 could then set
emptyok to True.
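The transition could look something like the sketch below. Everything here is an assumption made for illustration: the wrapper name, the sentinel default, the message text, and the choice of FutureWarning are not taken from the patch.

```python
import re
import warnings

_UNSET = object()  # sentinel meaning "caller did not pass emptyok"

def split_with_warning(pattern, string, emptyok=_UNSET):
    """Hypothetical 2.4-style split: old behaviour by default, plus a warning."""
    explicit = emptyok is not _UNSET
    split_on_empty = explicit and bool(emptyok)
    pieces, last, saw_empty = [], 0, False
    for m in re.finditer(pattern, string):
        if m.start() == m.end():
            saw_empty = True
            if not split_on_empty:
                continue  # keep the old behaviour: ignore zero-length matches
        pieces.append(string[last:m.start()])
        pieces.extend(m.groups())
        last = m.end()
    pieces.append(string[last:])
    if saw_empty and not explicit:
        # Only patterns that actually hit a zero-length match are affected.
        warnings.warn(
            "split pattern produced a zero-length match; pass emptyok "
            "explicitly to choose the behaviour",
            FutureWarning,
            stacklevel=2,
        )
    return pieces
```

Passing emptyok=False or emptyok=True silences the warning, and a pattern such as ' ' that never matches the empty string never triggers it. Flipping the default later would only mean changing how the unset sentinel is interpreted.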
Note: raising the warning might cause a serious performance hit for
patterns that get zero-length matches a lot, which would make 2.4
slower in certain cases.
Thoughts? Does this need a PEP?
--amk