Optional keepsep argument in str.split()
What do you think about an optional `keepsep` argument in str.split(), in order to keep the separator? Something like the `keepends` of str.splitlines():
>>> 'I am\ngoing\nto...'.splitlines(keepends=True)
['I am\n', 'going\n', 'to...']
For instance:
>>> 'python3'.split('n')
['pytho', '3']
>>> 'python3'.split('n', keepsep=True)
['python', '3']
Regards, Marco
--
Marco Buttu
INAF Osservatorio Astronomico di Cagliari
Loc. Poggio dei Pini, Strada 54 - 09012 Capoterra (CA) - Italy
Phone: +39 070 71180255
Email: mbuttu@oa-cagliari.inaf.it
On 28/08/2013 19:42, Marco Buttu wrote:
What do you think about an optional `keepsep` argument in str.split(), in order to keep the separator? Something like the `keepends` of str.splitlines():
>>> 'I am\ngoing\nto...'.splitlines(keepends=True)
['I am\n', 'going\n', 'to...']
For instance:
>>> 'python3'.split('n')
['pytho', '3']
>>> 'python3'.split('n', keepsep=True)
['python', '3']
If it's a _separator_, should it be attached to the previous part? Shouldn't it be:
>>> 'python3'.split('n', keepsep=True)
['pytho', 'n', '3']
That might be why the keyword argument of the .splitlines() method is called 'keepends'. Usually you're not interested in the separator itself, but only in what it separates. What's your use-case?
On 2013-08-28, at 20:57 , MRAB wrote:
On 28/08/2013 19:42, Marco Buttu wrote:
What do you think about an optional `keepsep` argument in str.split(), in order to keep the separator? Something like the `keepends` of str.splitlines():
>>> 'I am\ngoing\nto...'.splitlines(keepends=True)
['I am\n', 'going\n', 'to...']
For instance:
>>> 'python3'.split('n')
['pytho', '3']
>>> 'python3'.split('n', keepsep=True)
['python', '3']
If it's a _separator_, should it be attached to the previous part? Shouldn't it be:
>>> 'python3'.split('n', keepsep=True)
['pytho', 'n', '3']
Which, for what it's worth, is already covered by re.split:
>>> re.split(r"(n)", "python3")
['pytho', 'n', '3']
and the "keeping" split can be handled via findall:
>>> re.findall(r'([^n]+(?:n|$))', "python3")
['python', '3']
On 08/28/2013 09:36 PM, Masklinn wrote:
and the "keeping" split can be handled via findall:
>>> re.findall(r'([^n]+(?:n|$))', "python3")
['python', '3']
Of course, but this is not built-in and not obvious. Furthermore, a regex is not as trivial as the built-in solution would be, as you can see:
>>> re.findall(r'([^n]+(?:n|$))', "pythonn3")
['python', '3']
>>> 'pythonn3'.split(sep='n', keepsep=True)
['python', 'n', '3']
Regards, Marco
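The quoted findall pattern drops the second 'n' of "pythonn3" because `[^n]+` requires at least one non-separator character before each separator. A minimal sketch of a pattern that appears to give the output described above, assuming a literal, non-empty separator; the helper name `findall_keepsep` is hypothetical and is not part of `re` or `str`:

```python
import re


def findall_keepsep(text, sep):
    """Hypothetical helper: split text, leaving each separator attached
    to the piece that precedes it."""
    if not sep:
        raise ValueError("empty separator")
    # First alternative: the shortest run (possibly empty) ending in sep.
    # Second alternative: a non-empty tail, tried only when no further
    # separator can be found from the current position.
    pattern = re.compile(r".*?%s|.+\Z" % re.escape(sep), re.DOTALL)
    return pattern.findall(text)


print(findall_keepsep("python3", "n"))   # ['python', '3']
print(findall_keepsep("pythonn3", "n"))  # ['python', 'n', '3']
```

The lazy `.*?` is what lets an empty piece between two adjacent separators survive as a bare 'n'.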
On 2013-08-28, at 23:14 , Marco Buttu wrote:
On 08/28/2013 09:36 PM, Masklinn wrote:
and the "keeping" split can be handled via findall:
>>> re.findall(r'([^n]+(?:n|$))', "python3")
['python', '3']
Of course, but this is not built-in
How is the re module not built-in?
and not obvious.
Neither is the behavior you want out of "keepsep".
On 30/08/13 02:19, Masklinn wrote:
On 2013-08-28, at 23:14 , Marco Buttu wrote:
On 08/28/2013 09:36 PM, Masklinn wrote:
and the "keeping" split can be handled via findall:
>>> re.findall(r'([^n]+(?:n|$))', "python3")
['python', '3']
Of course, but this is not built-in
How is the re module not built-in?
py> import builtins  # use __builtin__ in Python 2
py> 're' in vars(builtins)
False
Of course, not everything needs to be a builtin.
-- Steven
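The distinction drawn here is between names that live in the `builtins` namespace and modules that ship with the standard library but must be imported first. A small sketch of that check; `is_builtin_name` and `is_importable` are illustrative names only:

```python
import builtins
import importlib.util


def is_builtin_name(name):
    """True if name is available without any import (e.g. len, str)."""
    return hasattr(builtins, name)


def is_importable(name):
    """True if a top-level module called name can be located for import."""
    return importlib.util.find_spec(name) is not None


print(is_builtin_name("re"))   # False
print(is_importable("re"))     # True
print(is_builtin_name("str"))  # True
```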
On 08/28/2013 08:57 PM, MRAB wrote:
On 28/08/2013 19:42, Marco Buttu wrote:
What do you think about an optional `keepsep` argument in str.split(), in order to keep the separator? Something like the `keepends` of str.splitlines():
>>> 'I am\ngoing\nto...'.splitlines(keepends=True)
['I am\n', 'going\n', 'to...']
For instance:
>>> 'python3'.split('n')
['pytho', '3']
>>> 'python3'.split('n', keepsep=True)
['python', '3']
If it's a _separator_, should it be attached to the previous part? Shouldn't it be:
>>> 'python3'.split('n', keepsep=True)
['pytho', 'n', '3']
It should be attached to the previous part, exactly as in my example.
What's your use-case?
I think it could be useful in a lot of use-cases where you have to parse a string. For instance, if you have some source code and you want to lay it out better:
>>> source_code = "int a = 33;cout << a << endl;return 0;"
>>> print('\n'.join(source_code.split(';')))
int a = 33
cout << a << endl
return 0
>>> print('\n'.join(source_code.split(';', keepsep=True)))
int a = 33;
cout << a << endl;
return 0;
On Aug 28, 2013, at 12:40, Marco Buttu wrote:
On 08/28/2013 08:57 PM, MRAB wrote:
On 28/08/2013 19:42, Marco Buttu wrote:
What do you think about an optional `keepsep` argument in str.split(), in order to keep the separator? Something like the `keepends` of str.splitlines():
>>> 'I am\ngoing\nto...'.splitlines(keepends=True)
['I am\n', 'going\n', 'to...']
For instance:
>>> 'python3'.split('n')
['pytho', '3']
>>> 'python3'.split('n', keepsep=True)
['python', '3']
If it's a _separator_, should it be attached to the previous part? Shouldn't it be:
>>> 'python3'.split('n', keepsep=True)
['pytho', 'n', '3']
It should be attached to the previous part, exactly as in my example.
What's your use-case?
I think it could be useful in a lot of use-cases, when you have to parse a string. For instance, if you have some source code, and you want to write it better:
>>> source_code = "int a = 33;cout << a << endl;return 0;"
>>> print('\n'.join(source_code.split(';')))
int a = 33
cout << a << endl
return 0
>>> print('\n'.join(source_code.split(';', keepsep=True)))
int a = 33;
cout << a << endl;
return 0;
Split and join are inverses, so it's lossless. You get this behavior by putting the semicolon in:
>>> print(';\n'.join(source_code.split(';')))
int a = 33;
cout << a << endl;
return 0;
So I'm not sure this particular use-case is compelling.
Jared
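For a literal, non-empty separator, the join-after-split idiom above should give the same string as a plain `str.replace`, which is another way to see that this reformatting task does not need a `keepsep` flag. A sketch under that assumption; the `reflow` name is illustrative:

```python
def reflow(source, terminator=";"):
    """Illustrative helper: put a newline after every terminator."""
    return (terminator + "\n").join(source.split(terminator))


source_code = "int a = 33;cout << a << endl;return 0;"

# Splitting on the terminator and re-joining with terminator + newline
# is the same as replacing each terminator with terminator + newline.
assert reflow(source_code) == source_code.replace(";", ";\n")

print(reflow(source_code), end="")
```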
On Wed, Aug 28, 2013 at 10:49 PM, Jared Grubb wrote:
Split and join are inverses, so it's lossless.
That's not true. If a separator is specified, it's lossless, but not in this case:
>>> 'a  b   c d'.split()
['a', 'b', 'c', 'd']
>>> ' '.join('a  b   c d'.split())
'a b c d'
I don't see a compelling use case for modifying split though. If I wanted to keep separators around, I'd probably want to work with lists like these:
['a', ' ', 'b', ' ', 'c', ' ', 'd']
or
[['a', ' '], ['b', ' '], ['c', ' '], ['d', '']]
and changing split to return either of those would be a bad idea.
--- Bruce
I'm hiring: http://www.cadencemd.com/info/jobs
Latest blog post: Alice's Puzzle Page http://www.vroospeak.com
Learn how hackers think: http://j.mp/gruyere-security
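Both list shapes mentioned above can be built today with `re.split` and a capturing group, without touching `str.split`. A rough sketch, assuming whitespace runs are the separators; the function names are made up for illustration:

```python
import re
from itertools import zip_longest


def tokens_and_separators(text):
    """Flat list alternating token, separator, token, ..."""
    # The capturing group makes re.split keep each whitespace run.
    return re.split(r"(\s+)", text)


def token_separator_pairs(text):
    """List of [token, separator] pairs; the last separator is ''."""
    flat = tokens_and_separators(text)
    tokens, separators = flat[0::2], flat[1::2]
    return [list(pair) for pair in zip_longest(tokens, separators, fillvalue="")]


text = "a  b   c d"
print(tokens_and_separators(text))   # ['a', '  ', 'b', '   ', 'c', ' ', 'd']
print(token_separator_pairs(text))   # [['a', '  '], ['b', '   '], ['c', ' '], ['d', '']]
print("".join(tokens_and_separators(text)) == text)  # True: nothing is lost
```

Unlike `' '.join(text.split())`, concatenating the flat list reproduces the input exactly, runs of whitespace included.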
Rob Cliffe writes:
On 28/08/2013 20:40, Marco Buttu wrote:
It should be attached to the previous part, exactly as in my example.
If there is a leading separator in your original string, you will have to decide whether to keep it prefixed to the first element of your split list.
No, he wants it affixed to the null first element, not prefixed to the first non-null element. That's not a problem with his proposal.
The problem with his proposal is that it's quite incoherent. The semantics of 'separator' is precisely that it doesn't belong to the preceding element nor to the following element, but rather is an emergent property of the juxtaposition of *two* items (either of which might be null!) The C semicolon that he uses as an example is syntactically not a separator, it's a terminator. That's precisely why he wants it affixed!
Also, his "use case" isn't really one. "Nobody" really wants "a;b;c;" to become ["a;", "b;", "c;"] (consider s/a/if var/), and they "certainly" don't want "a; b; c;" to become ["a;", " b;", " c;"].
Finally, if you *do* for some reason (despite the absolute confidence that I know better than you that I display above, I'm probably wrong :-), re.findall("[^;]*;", "a;b;c;") does exactly what you want.
-1 on keepsep in str.split().
Steve
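The terminator reading can be sketched with `re.findall`, trimming the leading whitespace that the "a; b; c;" case would otherwise keep. This is only an illustration; `terminated_statements` is a made-up name, and it assumes ';' never appears inside a statement (for example in a string literal):

```python
import re

# Zero or more non-semicolon characters followed by one semicolon.
STATEMENT = re.compile(r"[^;]*;")


def terminated_statements(text):
    """Illustrative helper: each ';'-terminated chunk, whitespace trimmed."""
    return [chunk.strip() for chunk in STATEMENT.findall(text)]


print(STATEMENT.findall("a;b;c;"))          # ['a;', 'b;', 'c;']
print(STATEMENT.findall("a; b; c;"))        # ['a;', ' b;', ' c;']
print(terminated_statements("a; b; c;"))    # ['a;', 'b;', 'c;']
```

Because the pattern insists on a closing ';', any text after the last terminator is silently dropped, which matches terminator semantics rather than separator semantics.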
Sounds interesting...not sure about how often it'd be used, since I could always use re:
re.split('(n)', 'python3')
On Wed, Aug 28, 2013 at 1:42 PM, Marco Buttu wrote:
What do you think about an optional `keepsep` argument in str.split(), in order to keep the separator? Something like the `keepends` of str.splitlines():
>>> 'I am\ngoing\nto...'.splitlines(keepends=True)
['I am\n', 'going\n', 'to...']
For instance:
>>> 'python3'.split('n')
['pytho', '3']
>>> 'python3'.split('n', keepsep=True)
['python', '3']
Regards, Marco
-- Ryan
On 08/28/2013 09:44 PM, Ryan Gonzalez wrote:
Sounds interesting...not sure about how often it'd be used, since I could always use re:
re.split('(n)', 'python3')
It is not the same. As I wrote in the first message, the separator has to be attached to its token, in the same way the `keepends` argument of str.splitlines() works:
>>> data = "{1: 'one', 2: 'two'}{3: 'three', 4: 'four'}"
>>> import re
>>> for item in re.split('(})', data):
...     print(item)
...
{1: 'one', 2: 'two'
}
{3: 'three', 4: 'four'
}
>>> for item in data.split(sep='}', keepsep=True):
...     print(item)
...
{1: 'one', 2: 'two'}
{3: 'three', 4: 'four'}
Regards, Marco
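The behaviour requested in this message can be approximated today with a few lines on top of the existing `str.split`. A minimal sketch, assuming a trailing empty piece should be dropped the way `splitlines(keepends=True)` drops it; `split_keepsep` is a hypothetical helper, not an existing `str` method or argument:

```python
def split_keepsep(text, sep):
    """Hypothetical helper: like text.split(sep), but each separator stays
    attached to the end of the piece that precedes it."""
    if not sep:
        raise ValueError("empty separator")
    pieces = text.split(sep)
    # Every piece except the last was followed by a separator in the input.
    result = [piece + sep for piece in pieces[:-1]]
    # Keep the tail only if something follows the final separator.
    if pieces[-1]:
        result.append(pieces[-1])
    return result


data = "{1: 'one', 2: 'two'}{3: 'three', 4: 'four'}"
for item in split_keepsep(data, "}"):
    print(item)
# {1: 'one', 2: 'two'}
# {3: 'three', 4: 'four'}
```

Joining the result with the empty string gives back the original input, which the plain `split` output cannot do without knowing the separator.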
participants (9)
- Bruce Leban
- Jared Grubb
- Marco Buttu
- Masklinn
- MRAB
- Rob Cliffe
- Ryan Gonzalez
- Stephen J. Turnbull
- Steven D'Aprano