[issue28937] str.split(): allow removing empty strings (when sep is not None)

Mark Bell report at bugs.python.org
Tue May 18 09:13:51 EDT 2021


Mark Bell <mark00bell at googlemail.com> added the comment:

I have taken a look at the original patch and updated it so that it is compatible with the current release. I have also flipped the logic in the wrapping functions so that they take a `keepempty` flag (the opposite of the `prune` flag).

I had to make a few extra changes, since there are now some shortcuts in functions such as PyUnicode_Split which spot that if len(self) < len(sep) then they can just return [self]. That shortcut now needs an extra test, since with keepempty=False it can only be used if len(self) > 0 (an empty string must then give [] rather than ['']). You can find the code here: https://github.com/markcbell/cpython/tree/split-keepempty
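
As a rough illustration of that shortcut and the extra test, here is a pure-Python sketch. The helper name `split_shortcut` is hypothetical and this is not the C code from the branch; it only mirrors the reasoning that a separator longer than the string cannot occur in it.

```python
def split_shortcut(s, sep, keepempty=True):
    """Return the split result if the length shortcut applies, else None.

    Hypothetical helper: a sketch of the reasoning, not the C implementation.
    """
    if len(s) < len(sep):
        # sep cannot occur in s, so s is the only piece ...
        if not s and not keepempty:
            # ... unless s is empty and empty pieces are being dropped.
            return []
        return [s]
    return None  # shortcut does not apply; a real split is needed


print(split_shortcut('ab', 'abc'))                 # shortcut applies
print(split_shortcut('', ' ', keepempty=False))    # the extra test
print(split_shortcut('a b', ' '))                  # no shortcut
```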

However, in exploring this, I'm not sure that the patch interacts correctly with maxsplit. For example,
    '   x y z'.split(maxsplit=1, keepempty=True)
results in
    ['', '', 'x', 'y z']
since the first two empty-string items are "free" and don't count towards maxsplit. I think the length of the returned result must be <= maxsplit + 1. Is this right?
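
For what it's worth, the existing str.split already respects that bound, which can be checked with a few lines of Python. The helper name `within_limit` is made up for this sketch.

```python
text = '  a b c  '


def within_limit(parts, maxsplit):
    """True if a split result respects len(result) <= maxsplit + 1."""
    return len(parts) <= maxsplit + 1


# Current behaviour, with an explicit separator and with sep=None.
for maxsplit in range(8):
    assert within_limit(text.split(' ', maxsplit), maxsplit)
    assert within_limit(text.split(None, maxsplit), maxsplit)

# The result produced by the patched build for maxsplit=1 breaks the bound.
assert not within_limit(['', '', 'x', 'y z'], 1)
```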

I'm about to rework the logic to avoid this, but before I go too far, could someone double-check my test cases to make sure that I have the correct idea about how this is supposed to work? Only the 8 lines marked "New case" show new behaviour; all the others come from how str.split currently works. Of course, the same patterns should apply to bytes and bytearray objects.

    ''.split() == []
    ''.split(' ') == ['']
    ''.split(' ', keepempty=False) == []    # New case

    '  '.split(' ') == ['', '', '']
    '  '.split(' ', maxsplit=1) == ['', ' ']
    '  '.split(' ', maxsplit=1, keepempty=False) == [' ']    # New case

    '  a b c  '.split() == ['a', 'b', 'c']
    '  a b c  '.split(maxsplit=0) == ['a b c  ']
    '  a b c  '.split(maxsplit=1) == ['a', 'b c  ']

    '  a b c  '.split(' ') == ['', '', 'a', 'b', 'c', '', '']
    '  a b c  '.split(' ', maxsplit=0) == ['  a b c  ']
    '  a b c  '.split(' ', maxsplit=1) == ['', ' a b c  ']
    '  a b c  '.split(' ', maxsplit=2) == ['', '', 'a b c  ']
    '  a b c  '.split(' ', maxsplit=3) == ['', '', 'a', 'b c  ']
    '  a b c  '.split(' ', maxsplit=4) == ['', '', 'a', 'b', 'c  ']
    '  a b c  '.split(' ', maxsplit=5) == ['', '', 'a', 'b', 'c', ' ']
    '  a b c  '.split(' ', maxsplit=6) == ['', '', 'a', 'b', 'c', '', '']

    '  a b c  '.split(' ', keepempty=False) == ['a', 'b', 'c']    # New case
    '  a b c  '.split(' ', maxsplit=0, keepempty=False) == ['  a b c  ']    # New case
    '  a b c  '.split(' ', maxsplit=1, keepempty=False) == ['a', 'b c  ']    # New case
    '  a b c  '.split(' ', maxsplit=2, keepempty=False) == ['a', 'b', 'c  ']    # New case
    '  a b c  '.split(' ', maxsplit=3, keepempty=False) == ['a', 'b', 'c', ' ']    # New case
    '  a b c  '.split(' ', maxsplit=4, keepempty=False) == ['a', 'b', 'c']    # New case
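
To make the cases easier to check without building the branch, here is a pure-Python sketch of one possible reading of the rule for an explicit separator: dropped empty pieces do not count towards maxsplit, and whatever remains once the limit is reached is kept as-is. The function name `split_keepempty` is hypothetical, and this is an assumption about the intended semantics, not the logic of the patch.

```python
def split_keepempty(s, sep, maxsplit=-1, keepempty=True):
    """Emulate str.split(sep, maxsplit) with a keepempty flag (sep required).

    Assumption: a dropped empty piece does not use up one of the maxsplit
    splits. With keepempty=True this behaves like the built-in str.split.
    """
    if not sep:
        raise ValueError("empty separator")
    parts = []
    pos = 0      # start of the piece currently being scanned
    splits = 0   # number of pieces emitted so far
    while maxsplit < 0 or splits < maxsplit:
        idx = s.find(sep, pos)
        if idx < 0:
            break
        piece = s[pos:idx]
        pos = idx + len(sep)
        if piece or keepempty:
            parts.append(piece)
            splits += 1
    rest = s[pos:]   # unsplit remainder, kept verbatim
    if rest or keepempty:
        parts.append(rest)
    return parts


text = '  a b c  '
for n in range(5):
    print(n, split_keepempty(text, ' ', maxsplit=n, keepempty=False))
```

This reading reproduces the `'  a b c  '` cases above and keeps len(result) <= maxsplit + 1. For `'  '` with maxsplit=1 and keepempty=False it gives [] rather than [' '], so that case depends on which counting rule is intended.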

----------
nosy: +Mark.Bell

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue28937>
_______________________________________

