Behavior of re.split on empty strings is unexpected
MRAB
python at mrabarnett.plus.com
Mon Aug 2 14:02:47 EDT 2010
John Nagle wrote:
> The regular expression "split" behaves slightly differently than string
> split:
>
> >>> import re
> >>> kresplit2 = re.compile(r'[^\w\&]+',re.UNICODE)
>
> >>> kresplit2.split(" HELLO THERE ")
> ['', 'HELLO', 'THERE', '']
>
> >>> kresplit2.split("VERISIGN INC.")
> ['VERISIGN', 'INC', '']
>
> I'd thought that "split" would never produce an empty string, but
> it will.
>
> The regular string split operation doesn't yield empty strings:
>
> >>> " HELLO THERE ".split()
> ['HELLO', 'THERE']
>
Yes it does, if you pass an explicit separator:
>>> "   HELLO    THERE   ".split(" ")
['', '', '', 'HELLO', '', '', '', 'THERE', '', '', '']
> If I try to get the functionality of string split with re:
>
> >>> s2 = " HELLO THERE "
> >>> kresplit4 = re.compile(r'\W+', re.UNICODE)
> >>> kresplit4.split(s2)
> ['', 'HELLO', 'THERE', '']
>
> I still get empty strings.
>
> The documentation just describes re.split as "Split string by the
> occurrences of pattern", which is not too helpful.
>
It's the plain str.split() which is unusual in that:
1. it splits on sequences of whitespace instead of one per occurrence;
2. it discards leading and trailing sequences of whitespace.
Compare:
>>> "  A  B  ".split(" ")
['', '', 'A', '', 'B', '', '']
with:
>>> "  A  B  ".split()
['A', 'B']
['A', 'B']
It just happens that the unusual one is the most commonly used one, if
you see what I mean! :-)
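If you want re.split to give you what the plain str.split() gives, one option is to drop the empty strings afterwards; another is to match the words themselves with re.findall instead of matching the separators. The following is only a minimal sketch: the helper name split_like_str is made up for illustration, and it assumes the pattern has no capturing groups (those would put extra items in the result).

```python
import re

def split_like_str(pattern, text):
    """Split text on a regex and discard any empty pieces."""
    # re.split yields '' wherever a separator sits at the start or end
    # of the string; filtering them out mimics the plain str.split().
    return [piece for piece in re.split(pattern, text) if piece]

print(split_like_str(r'\W+', " HELLO THERE "))   # ['HELLO', 'THERE']

# Alternative: match the words rather than the gaps between them.
print(re.findall(r'\w+', "VERISIGN INC."))       # ['VERISIGN', 'INC']
```

The findall form is usually the shorter of the two when the separator pattern is simply "anything that isn't a word character".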