Behavior of re.split on empty strings is unexpected

samwyse samwyse at
Tue Aug 3 02:53:12 CEST 2010

On Aug 2, 12:34 pm, John Nagle <na... at> wrote:
> The regular expression "split" behaves slightly differently than string
> split:

I'm going to argue that it's the string split that's behaving oddly.
To see why, let's first look at some simple CSV values:

cat,dog
,missing,,values,

How many fields are on each line and what are they?  Here's what
re.split(',') says:

>>> re.split(',', 'cat,dog')
['cat', 'dog']
>>> re.split(',', ',missing,,values,')
['', 'missing', '', 'values', '']

Note that missing values are clearly flagged by empty strings in the
results.  Now let's look at string split:

>>> 'cat,dog'.split(',')
['cat', 'dog']
>>> ',missing,,values,'.split(',')
['', 'missing', '', 'values', '']

The results are the same.  Let's try it again, but replacing the commas
with spaces.

>>> re.split(' ', 'cat dog')
['cat', 'dog']
>>> re.split(' ', ' missing  values ')
['', 'missing', '', 'values', '']
>>> 'cat dog'.split(' ')
['cat', 'dog']
>>> ' missing  values '.split(' ')
['', 'missing', '', 'values', '']

Again the results are the same; however, many people don't like them
because they feel that whitespace occupies a privileged role.  People
generally agree that a string of consecutive commas means missing
values, but a string of consecutive spaces just means someone held the
space-bar down too long.  To accommodate this viewpoint, the string
split is special-cased to behave differently when None is passed as a
separator.  First, it splits on any run of whitespace characters,
much as this regular expression does:

>>> re.split(r'\s+', ' missing  values ')
['', 'missing', 'values', '']
>>> re.split(r'\s+', 'cat dog')
['cat', 'dog']

But it also eliminates any empty strings from the head and tail of the
list, because that's what people generally expect when splitting on
whitespace:

>>> 'cat dog'.split(None)
['cat', 'dog']
>>> ' missing  values '.split(None)
['missing', 'values']
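
Putting the two steps together, the special case can be roughly emulated
with re.split.  This is only a sketch: split_like_str is a hypothetical
helper name, not anything in the standard library, and it assumes that
r'\s+' and str.split(None) agree on what counts as whitespace, which I
have only checked for ordinary spaces, tabs and newlines.

```python
import re

def split_like_str(text):
    """Roughly emulate text.split(None) with re.split (hypothetical helper)."""
    # Splitting on runs of whitespace can leave '' only at the head and
    # tail of the list, so dropping empty fields removes exactly those.
    return [field for field in re.split(r'\s+', text) if field]

# Compare the emulation with the built-in on a few sample strings.
for sample in ('cat dog', ' missing  values ', '', '   '):
    print(repr(sample), split_like_str(sample), sample.split(None))
```

The filter also covers empty and all-whitespace input, where re.split
would otherwise return a list of empty strings while str.split(None)
returns an empty list.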
