[issue28937] str.split(): allow removing empty strings (when sep is not None)
Mark Bell
report at bugs.python.org
Tue May 18 09:13:51 EDT 2021
Mark Bell <mark00bell at googlemail.com> added the comment:
So I have taken a look at the original patch that was provided and I have been able to update it so that it is compatible with the current release. I have also flipped the logic in the wrapping functions so that they take a `keepempty` flag (which is the opposite of the `prune` flag).
I had to make a few extra changes, since there are now some shortcuts in functions such as PyUnicode_Split which spot that if len(self) < len(sep) then they can just return [self]. However, that now needs an extra test, since with keepempty=False the shortcut can only be used if len(self) > 0. You can find the code here: https://github.com/markcbell/cpython/tree/split-keepempty
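To make the extra test concrete, here is a rough pure-Python sketch of the early exit as I understand it. The helper name `shortcut_result` is hypothetical and the real logic lives in C, so treat this as an illustration of the idea rather than the code in the branch.

```python
def shortcut_result(s, sep, keepempty=True):
    """Hypothetical model of the early exit: return the result list when
    the shortcut applies, or None when a full split is still needed."""
    if len(s) >= len(sep):
        return None  # sep might occur in s, so no shortcut
    if not s and not keepempty:
        return []    # an empty string must not be returned as [''] here
    return [s]       # s is shorter than sep, so sep cannot occur in it


# With keepempty=True the shortcut agrees with current str.split.
print(shortcut_result('', ' ') == ''.split(' '))
print(shortcut_result('a', 'ab') == 'a'.split('ab'))
# With keepempty=False an empty input gives an empty list.
print(shortcut_result('', ' ', keepempty=False))
```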
However in exploring this, I'm not sure that this patch interacts correctly with maxsplit. For example,
'  x y z'.split(maxsplit=1, keepempty=True)
results in
['', '', 'x', 'y z']
since the first two empty-string items are "free" and don't count towards maxsplit. I think the length of the returned list must be <= maxsplit + 1; is this right?
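For comparison, the current str.split documentation says that at most maxsplit splits are done, so the list has at most maxsplit + 1 elements. A quick check of that bound on an unpatched interpreter might look like this (the sample strings and the helper name `within_bound` are just my choices for illustration):

```python
samples = ['', '  ', '  x y z', '  a b c  ']


def within_bound(s, sep, maxsplit):
    """True if the current str.split result respects len <= maxsplit + 1."""
    return len(s.split(sep, maxsplit)) <= maxsplit + 1


# Collect every (string, sep, maxsplit) combination that breaks the bound.
violations = [
    (s, sep, n)
    for s in samples
    for sep in (None, ' ')
    for n in range(8)
    if not within_bound(s, sep, n)
]
print(violations)  # expected: []
```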
I'm about to rework the logic to avoid this, but before I go too far, could someone please double-check my test cases to make sure that I have the correct idea about how this is supposed to work? Only the 8 lines marked "New case" show new behaviour; all the others come from how str.split works currently. Of course, the same patterns should apply to bytes and bytearray objects.
''.split() == []
''.split(' ') == ['']
''.split(' ', keepempty=False) == [] # New case
'  '.split(' ') == ['', '', '']
'  '.split(' ', maxsplit=1) == ['', ' ']
'  '.split(' ', maxsplit=1, keepempty=False) == [' '] # New case
'  a b c  '.split() == ['a', 'b', 'c']
'  a b c  '.split(maxsplit=0) == ['a b c  ']
'  a b c  '.split(maxsplit=1) == ['a', 'b c  ']
'  a b c  '.split(' ') == ['', '', 'a', 'b', 'c', '', '']
'  a b c  '.split(' ', maxsplit=0) == ['  a b c  ']
'  a b c  '.split(' ', maxsplit=1) == ['', ' a b c  ']
'  a b c  '.split(' ', maxsplit=2) == ['', '', 'a b c  ']
'  a b c  '.split(' ', maxsplit=3) == ['', '', 'a', 'b c  ']
'  a b c  '.split(' ', maxsplit=4) == ['', '', 'a', 'b', 'c  ']
'  a b c  '.split(' ', maxsplit=5) == ['', '', 'a', 'b', 'c', ' ']
'  a b c  '.split(' ', maxsplit=6) == ['', '', 'a', 'b', 'c', '', '']
'  a b c  '.split(' ', keepempty=False) == ['a', 'b', 'c'] # New case
'  a b c  '.split(' ', maxsplit=0, keepempty=False) == ['  a b c  '] # New case
'  a b c  '.split(' ', maxsplit=1, keepempty=False) == ['a', 'b c  '] # New case
'  a b c  '.split(' ', maxsplit=2, keepempty=False) == ['a', 'b', 'c  '] # New case
'  a b c  '.split(' ', maxsplit=3, keepempty=False) == ['a', 'b', 'c', ' '] # New case
'  a b c  '.split(' ', maxsplit=4, keepempty=False) == ['a', 'b', 'c'] # New case
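For anyone wanting to try this without building the branch: the cases that don't use keepempty can be checked on a stock interpreter, and the two new cases without maxsplit match the usual filtering idiom. The helper name `split_drop_empty` below is made up for this sketch; it does not attempt to model how keepempty=False interacts with maxsplit, which is the part still under discussion.

```python
s = '  a b c  '  # two leading and two trailing spaces

# Existing behaviour, runnable on an unpatched interpreter.
assert ''.split(' ') == ['']
assert s.split(' ') == ['', '', 'a', 'b', 'c', '', '']
assert s.split(' ', maxsplit=3) == ['', '', 'a', 'b c  ']


def split_drop_empty(text, sep):
    """Emulate the proposed keepempty=False (no maxsplit) by filtering."""
    return [part for part in text.split(sep) if part]


print(split_drop_empty('', ' '))  # []
print(split_drop_empty(s, ' '))   # ['a', 'b', 'c']
```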
----------
nosy: +Mark.Bell
_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue28937>
_______________________________________