Behaviour of str.split

David Fraser davidf at sjsoft.com
Wed Apr 20 09:03:29 EDT 2005


Bengt Richter wrote:
> On Wed, 20 Apr 2005 10:55:18 +0200, David Fraser <davidf at sjsoft.com> wrote:
> 
> 
>>Greg Ewing wrote:
>>
>>>Will McGugan wrote:
>>>
>>>
>>>>Hi,
>>>>
>>>>I'm curious about the behaviour of the str.split() when applied to 
>>>>empty strings.
>>>>
>>>>"".split() returns an empty list, however..
>>>>
>>>>"".split("*") returns a list containing one empty string.
>>>
>>>
>>>Both of these make sense as limiting cases.
>>>
>>>Consider
>>>
>>> >>> "a b c".split()
>>>['a', 'b', 'c']
>>> >>> "a b".split()
>>>['a', 'b']
>>> >>> "a".split()
>>>['a']
>>> >>> "".split()
>>>[]
>>>
>>>and
>>>
>>> >>> "**".split("*")
>>>['', '', '']
>>> >>> "*".split("*")
>>>['', '']
>>> >>> "".split("*")
>>>['']
>>>
>>>The split() method is really doing two somewhat different things
>>>depending on whether it is given an argument, and the end-cases
>>>come out differently.
>>>
>>
>>You don't really explain *why* they make sense as limiting cases, as 
>>your examples are quite different.
>>
>>Consider
>>
>> >>> "a*b*c".split("*")
>>['a', 'b', 'c']
>> >>> "a*b".split("*")
>>['a', 'b']
>> >>> "a".split("*")
>>['a']
>> >>> "".split("*")
>>['']
>>
>>Now how is this logical when compared with split() above?
> 
> 
> The trouble is that s.split(arg) and s.split() are two different functions.
> 
> The first is 1:1 and reversible like arg.join(s.split(arg))==s
> The second is not 1:1 nor reversible: '<<various whitespace>>'.join(s.split()) == s ?? Not usually.
> 
> I think you can do it with the equivalent whitespace regex, preserving the splitout whitespace
> substrings and ''.joining those back with the others, but not with split(). I.e.,
> 
>  >>> def splitjoin(s, splitter=None):
>  ...     return (splitter is None and '<<whitespace>>' or splitter).join(s.split(splitter))
>  ...
>  >>> splitjoin('a*b*c', '*')
>  'a*b*c'
>  >>> splitjoin('a*b', '*')
>  'a*b'
>  >>> splitjoin('a', '*')
>  'a'
>  >>> splitjoin('', '*')
>  ''
>  >>> splitjoin('a b    c')
>  'a<<whitespace>>b<<whitespace>>c'
>  >>> splitjoin('a b    ')
>  'a<<whitespace>>b'
>  >>> splitjoin('  b    ')
>  'b'
>  >>> splitjoin('')
>  ''
> 
>  >>> splitjoin('*****','*')
>  '*****'
> Note why that works:
> 
>  >>> '*****'.split('*')
>  ['', '', '', '', '', '']
>  >>> '*a'.split('*')
>  ['', 'a']
>  >>> 'a*'.split('*')
>  ['a', '']
> 
>  >>> splitjoin('*a','*')
>  '*a'
>  >>> splitjoin('a*','*')
>  'a*'
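
For what it's worth, the regex approach mentioned above (keeping the split-out whitespace so the pieces can be ''.joined back) might look roughly like this. The helper names are mine, not from the thread, and this is only a sketch assuming the standard re module's behaviour with a capturing group:

```python
import re

def split_keep_whitespace(s):
    # Hypothetical helper: the capturing group makes re.split return
    # the whitespace runs as list items alongside the other pieces.
    return re.split(r'(\s+)', s)

def roundtrip(s):
    # Joining every piece, separators included, rebuilds the input.
    return ''.join(split_keep_whitespace(s))

for sample in ['a b    c', 'a b    ', '  b    ', '']:
    assert roundtrip(sample) == sample
```

Unlike s.split(), this keeps enough information to reverse the split, at the cost of empty strings at the ends when the input starts or ends with whitespace.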

Thanks, this makes sense.
So ideally, if we weren't constrained by backward compatibility, these two 
behaviours would have different names: "split" (with an argument) and 
"spacesplit" (without one).
In fact, it would be nice to let "spacesplit" take an argument specifying 
which characters count as 'space'.
But none of this is worth breaking existing code :-)
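
A rough sketch of what such a "spacesplit" might look like, purely as illustration. The function and its "chars" parameter are hypothetical, not anything in the standard library; the idea is that runs of separator characters collapse and empty pieces are dropped, as with a bare split():

```python
def spacesplit(s, chars=None):
    # Hypothetical function: with no chars, defer to the built-in
    # whitespace behaviour of str.split().
    if chars is None:
        return s.split()
    pieces, current = [], []
    for ch in s:
        if ch in chars:
            # A separator ends the current piece; runs of separators
            # produce no empty pieces.
            if current:
                pieces.append(''.join(current))
                current = []
        else:
            current.append(ch)
    if current:
        pieces.append(''.join(current))
    return pieces

print(spacesplit('**a***b*', '*'))   # compare with '**a***b*'.split('*')
print(spacesplit('', '*'))           # compare with ''.split('*')
```

With this, the empty string gives an empty list whether or not chars is given, which is the limiting case that differs from "".split("*").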

David


