Proposal: Tuple of str with w'list of words'
Hi,

I looked around for a while but didn't see this proposed anywhere. I apologize if I missed an existing discussion.

I do a fair amount of work with pandas and data munging. This means that I'm often doing things like:

mydf = df[ ['field1', 'field2', 'field3'] ]

This is a little ugly, so if the list is long enough, I do:

mydf = df[ 'field1 field2 field3'.split() ]

This is a little more readable, but still a bit ugly. What I'm proposing here is:

mydf = df[ w'field1 field2 field3' ]

This would be identical in all ways (compile-time) to:

mydf = df[ ('field1', 'field2', 'field3') ]

This should work with all the Python quote variations (w''', w""", etc). The only internal escapes are \\ indicating a \ and <backslash><space> indicating a non-splitting space:

songs = w'My\ Bloody\ Valentine Blue\ Suede\ Shoes'

One question is whether to have w'' be a list or a tuple. I leaned slightly towards tuple because it's faster on internal loops:

In [1]: %timeit a=('this','is','a','test')
100000000 loops, best of 3: 11.3 ns per loop

In [2]: %timeit a=['this','is','a','test']
10000000 loops, best of 3: 74.3 ns per loop

However, I mostly see lists used in the data science community, so a tuple is a little less convenient:

other_fields = df.columns[-3:]
new_columns = w'field1 field2' + other_fields        # ERROR - can't concatenate list to tuple
new_columns = list(w'field1 field2') + other_fields

I honestly could go either way with lists or tuples.

Other languages: Perl has the qw operator:

@a = qw(field1 field2 field3);

Ruby has %w:

a = %w{field1 field2}

Thanks for reading this far :-)

Regards,
Gary Godfrey
Austin, TX
Just spend the extra two characters to do this with existing syntax: w('field1 field2 field3'). Implementation of the w() function is trivial.
On Sat, Nov 12, 2016 at 11:33 AM David Mertz <mertz@gnosis.cx> wrote:
Just spend the extra two characters to do this with existing syntax: w('field1 field2 field3'). Implementation of the w() function is trivial.
I've done that as well, and I've seen it done. Sometimes with s('field1 field2'), sometimes with other characters. Sometimes with two-letter functions. Part of what I'd like to do here is make it so there's one way that most people do this.

If you look through data analytics books and examples, the default is ['field1', 'field2' 'field3']. That's a lot of little characters to type correctly and it's error prone (for instance, did you immediately notice that I was missing a "," in that example? Python will silently make that ['field1','field2field3']).

Cheers,
Gary
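For concreteness, a minimal sketch of the kind of helper being discussed here (the name w is arbitrary, and df stands for any pandas DataFrame; returning a list rather than a tuple is just a convenience so the result can be passed straight to DataFrame indexing):

def w(s):
    """Split a whitespace-separated string of column names into a list."""
    return s.split()

# Hypothetical usage:
# mydf = df[w('field1 field2 field3')]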
On Sat, Nov 12, 2016 at 05:01:00PM +0000, Gary Godfrey wrote:
I do a fair amount of work with pandas and data munging. This means that I'm often doing things like:
mydf = df[ ['field1', 'field2', 'field3' ] ]
This is a little ugly, so if the list is long enough, I do:
mydf=df[ 'field1 field2 field3'.split() ]
I consider the need for that to indicate a possibly poor design of pandas. Unless there is a good reason not to, I believe that any function that requires a list of strings should also accept a single space-delimited string instead. Especially if the strings are intended as names or labels. So that:

func(['fe', 'fi', 'fo', 'fum'])

and

func('fe fi fo fum')

should be treated the same way.

Of course, it may be that pandas has a good reason for not supporting that. But in general, we don't change the language to make up for deficiencies in third-party library functionality.
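As a sketch of that calling convention (this is only the general pattern, not anything pandas actually implements):

def func(names):
    """Accept either 'fe fi fo fum' or ['fe', 'fi', 'fo', 'fum']."""
    if isinstance(names, str):
        names = names.split()
    return list(names)

assert func('fe fi fo fum') == func(['fe', 'fi', 'fo', 'fum'])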
This is a little more readable, but still a bit ugly.
I don't agree that it's ugly. I think that 'fe fi fo fum'.split() is nicely explicit about what it is doing. It's also a candidate for compile-time optimization since the argument is a literal.
What I'm proposing here is:
mydf = df[ w'field1 field2 field3' ]
This would be identical in all ways (compile-time) to:
mydf = df[ ('field1', 'field2', 'field3') ]
Are your field names usually constants known when you write the script? I would have thought they'd more often be variables that you read from your data.
This should work with all the python quote variations (w''', w""", etc). The only internal escapes are \\ indicating a \ and <backslash><space> indicating a non-splitting space:
So not only do we have to learn yet another special kind of string:

- unicode strings
- byte strings
- raw strings (either unicode or bytes)
- f-strings
- and now w-strings

but this one has different escaping rules from the others. I expect that there will be a huge number of confused questions about why people cannot use standard escapes in their "word" strings.
songs = w'My\ Bloody\ Valentine Blue\ Suede\ Shoes'
I think that escaping spaces like that will be an attractive nuisance. I had to read your example three times before I noticed that the space between Valentine and Blue was not escaped.

I would prefer a simple, straight-forward rule: it unconditionally splits on whitespace. If you need to include non-splitting spaces, use a proper non-breaking space \u00A0, or split the words into a tuple by hand, like you're doing now. I don't think it is worth complicating the feature to support non-splitting spaces.

(Hmmm... I see that str.split() currently splits on non-breaking spaces. That feels wrong to me: although the NBSP character is considered whitespace, its whole purpose is to avoid splitting.)
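That behaviour is easy to check interactively; with no separator argument, str.split() treats U+00A0 as whitespace:

>>> '\u00a0'.isspace()
True
>>> 'Blue\u00a0Suede Shoes'.split()
['Blue', 'Suede', 'Shoes']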
Other Languages:
perl has the qw operator:
@a = qw(field1 field2 field3);
ruby has %w
a=%w{field1 field2}
The fact that other languages do something like this is a (weak) point in its favour. But I see that there are a few questions on Stackoverflow asking what %w means, how it is different from %W, etc. For example:

http://stackoverflow.com/questions/1274675/what-does-warray-mean
http://stackoverflow.com/questions/690794/ruby-arrays-w-vs-w

and I notice this comment from the second link:

    "%w" is my usual retort to people who get a little too cocky about the readability of Ruby. Works every time.

That's a point against this proposal: the feature seems to be a bit puzzling to users in languages that implement it (at least Ruby).

I'm rather luke-warm on this proposal, although I might be convinced to support it if:

- w'...' unconditionally split on any whitespace (possibly excluding NBSP);
- and normal escapes worked.

Even then I'm not really convinced this needs to be a language feature.

-- 
Steve
On Sat, Nov 12, 2016 at 12:06 PM Steven D'Aprano <steve@pearwood.info> wrote:
I consider the need for that to indicate a possibly poor design of pandas. Unless there is a good reason not to, I believe that any function that requires a list of strings should also accept a single space-delimited string instead. Especially if the strings are intended as names or labels. So that:
func(['fe', 'fi', 'fo', 'fum'])
and
func('fe fi fo fum')
should be treated the same way.
They don't, because df['Column Name'] is a valid way to get a single column's worth of data when the column name contains spaces (not encouraged, but it is valid).
mydf = df[ ('field1', 'field2', 'field3') ]
Are your field names usually constants known when you write the script?
Yes. All the time. When I'm on the side of creating APIs for data analysts to use, I think of the columns abstractly. When they're writing scripts to analyze data, it's all very explicit and in the domain of the data. Things like:

df[df.age > 10]
adf = df.pivot_table( ['runid','block'] )

are common and the "right" way to do things in the problem domain.
So not only do we have to learn yet another special kind of string:
- unicode strings
- byte strings
- raw strings (either unicode or bytes)
- f-strings
- and now w-strings
Very valid point. I also was considering (and rejected) a 'wb' for tuple of bytes.
I would prefer a simple, straight-forward rule: it unconditionally splits on whitespace. If you need to include non-splitting spaces, use a proper non-breaking space \u00A0, or split the words into a tuple by hand, like you're doing now. I don't think it is worth complicating the feature to support non-splitting spaces.
You're right there. If there are spaces in the columns, make it explicit and don't use the w''. I withdraw the <backslash><space> "feature". And I think you're right that all the existing escape rules should work in the same way they do for regular unicode strings (don't go the raw strings route). Basically, w'foo bar' == tuple('foo bar'.split())
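To state that equivalence concretely for the escape-free case (how escapes interact with splitting is discussed further down the thread), any run of whitespace would separate words and the result would be a tuple:

assert tuple('foo  bar\nbaz'.split()) == ('foo', 'bar', 'baz')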
The fact that other languages do something like this is a (weak) point in its favour. But I see that there are a few questions on Stackoverflow asking what %w means, how it is different from %W, etc. For example:
http://stackoverflow.com/questions/1274675/what-does-warray-mean
http://stackoverflow.com/questions/690794/ruby-arrays-w-vs-w
Well, I'd lean towards not having a W'fields' that does something funky :-). But your point is well taken.
... I'm rather luke-warm on this proposal, although I might be convinced to support it if:
- w'...' unconditionally split on any whitespace (possibly excluding NBSP);
- and normal escapes worked.
Even then I'm not really convinced this needs to be a language feature.
I'm realizing that a lot of the reason I'm seeing this is that it seems to be a particular issue with using Python for data science. In some ways, they're pushing the language a bit beyond what it's designed to do (the df[ (df.age > 10) & (df.gender=="F")] idiom is amazing and troubling). Since I'm doing a lot of this, these little language issues loom a bit larger than they would with "normal" programming. Thanks for responding.
mydf = df[ ['field1', 'field2', 'field3'] ]
mydf = df[ 'field1 field2 field3'.split() ]
I consider the need for that to indicate a possibly poor design of pandas. Unless there is a good reason not to, I believe that any function that requires a list of strings should also accept a single space-delimited string instead. Of course, it may be that pandas has a good reason for not supporting that. But in general, we don't change the language to make up for deficiencies in third-party library functionality.
Yes... Pandas has a very good reason. And the general principle is wrong too. A list of strings can contain strings with spaces in them. It's only in rare cases like collections.namedtuple where the whole point of strings is to be valid Python identifiers that you can rule out spaces being inside them. E.g.:
>>> df = pd.DataFrame({"Quant":[1,2,3], "Approx Price":[4,5,6], "Weight":[7,8,9]})
>>> df[['Quant','Approx Price']]
   Quant  Approx Price
0      1             4
1      2             5
2      3             6
>>> df['Quant Approx Price'.split()]
...
KeyError: "['Approx' 'Price'] not in index"
The hypothetical w-string would have the same ambiguity with w'Quant Approx Price'. Even with namedtuple, you sometimes have to massage strings to get rid of spaces (and other stray punctuation), e.g.:
>>> from collections import namedtuple
>>> Item = namedtuple('Item', 'Quant Approx_Price Weight')
>>> Item(4, 5, 6)
Item(Quant=4, Approx_Price=5, Weight=6)
I use namedtuple fairly often dynamically, for example to read a novel CSV file or query a novel SQL table. When I do so, I need to check the "columns" for being valid identifiers, and then massage them to make them so.
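A sketch of that kind of massaging (the cleanup rule here is one arbitrary choice; note also that namedtuple itself accepts rename=True, which replaces invalid field names with positional ones like _1 rather than cleaning them up):

import re
from collections import namedtuple

def make_identifier(name):
    # Collapse runs of non-identifier characters to "_" and avoid a leading digit.
    cleaned = re.sub(r'\W+', '_', name.strip()) or '_'
    return '_' + cleaned if cleaned[0].isdigit() else cleaned

columns = ['Quant', 'Approx Price', 'Weight (kg)']
Row = namedtuple('Row', [make_identifier(c) for c in columns])
print(Row._fields)    # ('Quant', 'Approx_Price', 'Weight_kg_')

Row2 = namedtuple('Row2', columns, rename=True)
print(Row2._fields)   # ('Quant', '_1', '_2')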
On Sat, Nov 12, 2016, at 13:05, Steven D'Aprano wrote:
I'm rather luke-warm on this proposal, although I might be convinced to support it if:
- w'...' unconditionally split on any whitespace (possibly excluding NBSP);
- and normal escapes worked.
Is there any particular objection to allowing the backslash-space escape (and for escapes that mean whitespace characters, such as \t, \x20, to not split, if you meant to imply that they do)? That would provide the extra push to this being beneficial over split().

I also have an alternate idea: sl{word1 word2 'string 3' "string 4"}
Random832 writes:
Is there any particular objection to allowing the backslash-space escape (and for escapes that mean whitespace characters, such as \t, \x20, to not split, if you meant to imply that they do)? That would provide the extra push to this being beneficial over split().
You're suggesting that (1) most escapes would be processed after splitting while (2) backslash-space (what about backslash-tab?) would be treated as an escape during splitting?
I also have an alternate idea: sl{word1 word2 'string 3' "string 4"}
word1 and word2 are what Perl would term "barewords"? I.e., treated as strings?

-1 to w"", -1 to inconsistent interpretation of escapes, and -1 to a completely new syntax.

" ", "\x20", "\u0020", and "\U00000020" currently are different representations of the same string, so it would be confusing if the same notations meant different things in this context. Another syntax plus overloading standard string notation with yet another semantics (strings, rawstrings) doesn't seem like a win to me.

As I accept the usual Pythonic aversion to mere abbreviations, I don't see any benefit to these notations, except for the case where a list just won't do, so you can avoid a call to tuple. We already have three good ways to do this:

wordlist = ["word1", "word2", "string 3", "string 4"]
wordlist = "word1,word2,string 3,string 4".split(",")
wordlist = open(word_per_line_file).readlines()

and for maximum Unicode-conforming generality with compact notation:

wordlist = "word1\uFFFFword2\uFFFFstring 3\uFFFFstring 4".split("\uFFFF")

More seriously, in most use cases there will be ASCII control characters that you could use, which most editors can enter (though they might be visually unattractive in many editors, eg, \x0C).

Steve
On Tue, Dec 6, 2016, at 19:51, Stephen J. Turnbull wrote:
Random832 writes:
Is there any particular objection to allowing the backslash-space escape (and for escapes that mean whitespace characters, such as \t, \x20, to not split, if you meant to imply that they do)? That would provide the extra push to this being beneficial over split().
You're suggesting that (1) most escapes would be processed after splitting while (2) backslash-space (what about backslash-tab?) would be treated as an escape during splitting?
I don't understand what this "after splitting" you're talking about is. It would be a single pass through the characters of the token, with space alone meaning "eat all whitespace, next string" and space in backslash state meaning "next character of current string is space", just as "t" alone means "next character of current string is letter t" and t in backslash state means "next character of current string is tab".

I mean, even the idea that there would be a separate "splitting step" at all makes no sense to me; this implies building an "un-split string" as if the w weren't present, processing escapes as part of that, and then parsing the resulting string in a second pass, which is something we don't do for r"..." and *shouldn't* do for f"..."

If you insist on consistency, backslash-space can mean space *everywhere* [once we've gotten through the deprecation cycle of backslash-unknown inserting a literal backslash], just like "\'" works fine despite double quotes not requiring it.

As for backslash-tab, we already have \t. Maybe you'd like \s better for space.
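A sketch of that single pass, written as an ordinary function over the (already unquoted) body of a hypothetical w'...' literal — a real implementation would live in the tokenizer and handle the full escape table, so this only illustrates the "no separate splitting step" point:

ESCAPES = {' ': ' ', 't': '\t', 'n': '\n', '\\': '\\'}

def scan_w(body):
    words, current = [], []
    chars = iter(body)
    for ch in chars:
        if ch == '\\':
            # Backslash state: the next character is content, possibly translated.
            nxt = next(chars, '')
            current.append(ESCAPES.get(nxt, '\\' + nxt))
        elif ch.isspace():
            # Unescaped whitespace is syntax: finish the current word, if any.
            if current:
                words.append(''.join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append(''.join(current))
    return tuple(words)

assert scan_w(r'My\ Bloody\ Valentine Blue\ Suede\ Shoes') == \
    ('My Bloody Valentine', 'Blue Suede Shoes')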
I also have an alternate idea: sl{word1 word2 'string 3' "string 4"}
word1 and word2 are what perl would term "barewords"? Ie treated as strings?
The name "sl" was meant to evoke shlex (the syntax itself was also inspired by perl's qw{...} though perl doesn't provide any way of escaping whitespace). And I also meant this as a launching-off point for a general suggestion of word{ ... } as a readable syntax that doesn't collide with any currently valid constructs, for new kinds of literals (e.g. frozenset{a, b, c} and so on) So the result would be, more or less, the sequence that shlex.split('''word1 word2 'string 3' "string 4"''') gives.
-1 to w"", -1 to inconsistent interpretation of escapes, and -1 to a completely new syntax.
" ", "\x20", "\u0020", and "\U00000020" currently are different representations of the same string, so it would be confusing if the same notations meant different things in this context.
"'" and "\x39" (etc) are representations of the same string, but '...\x39 doesn't act as an end quote. Unescaped whitespace within a w"" literal would be *syntax*, not *content*. (Whereas in a regular literal backslash is syntax but in a r'...' literal it's content)
Another syntax plus overloading standard string notation with yet another semantics (strings, rawstrings) doesn't seem like a win to me.
As I accept the usual Pythonic aversion to mere abbreviations, I don't see any benefit to these notations, except for the case where a list just won't do, so you can avoid a call to tuple. We already have three good ways to do this:
wordlist = ["word1", "word2", "string 3", "string 4"] wordlist = "word1,word2,string 3,string 4".split(",") wordlist = open(word_per_line_file).readlines()
and for maximum Unicode-conforming generality with compact notation:
wordlist = "word1\UFFFFword2\UFFFFstring 3\UFFFFstring 4".split("\UFFFF")
You and I have very different definitions of the word "compact". In fact, this is *so obviously* non-compact that I find it hard to believe that you're being serious, but I don't think the joke's very funny if it's intended as one.
More seriously, in most use cases there will be ASCII control characters that you could use, which most editors can enter (though they might be visually unattractive in many editors, eg, \x0C).
The point of using space is readability. (The point of returning a tuple is to avoid the disadvantage that the list returned by split must be built at runtime and can't be loaded as a constant, or perhaps turned into a frozenset constant by the optimizer in cases like "if x in w'foo bar baz':".)
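That difference is visible with dis — the exact bytecode varies across CPython versions, but the tuple of constants is folded into a single LOAD_CONST while the list literal has to be rebuilt every time the expression runs:

import dis

dis.dis(compile("('foo', 'bar', 'baz')", '<tuple>', 'eval'))
dis.dis(compile("['foo', 'bar', 'baz']", '<list>', 'eval'))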
Random832 writes:
I don't understand what this "after splitting" you're talking about is. It would be a single pass through the characters of the token,
Which may as well be thought of as a string (not a str). Although you can implement this process in one pass, you can also think of it in terms of two passes that give the same result. I suspect many people will think in terms of two passes, and I certainly do. Steven d'Aprano appears to, as well (he also used the "before splitting" terminology). Of course, he may find "the implementation will be single pass" persuasive, even though I don't.
You and I have very different definitions of the word "compact". In fact, this is *so obviously* non-compact
I used \u notation to ensure that people would understand that the separator is a non-character. (Emacs allows me to enter it, and with my current font it displays an empty box. I could fiddle with my PYTHONIOENCODING to use some sort of escape error handler to make it convenient, but I won't use w"" anyway so the point is sort of moot.)
(The point of returning a tuple is to avoid the disadvantage that the list returned by split must be built at runtime and can't be loaded as a constant, or perhaps turned into a frozenset constant by the optimizer in cases like "if x in w'foo bar baz':".)
That's true, but where's the use case where that optimization matters?
On Tue, Dec 06, 2016 at 04:01:24PM -0500, Random832 wrote:
On Sat, Nov 12, 2016, at 13:05, Steven D'Aprano wrote:
I'm rather luke-warm on this proposal, although I might be convinced to support it if:
- w'...' unconditionally split on any whitespace (possibly excluding NBSP);
- and normal escapes worked.
Is there any particular objection to allowing the backslash-space escape (and for escapes that mean whitespace characters, such as \t, \x20, to not split, if you meant to imply that they do)?
I hadn't actually considered the question of whether w-strings should split before, or after, applying the escapes. (Or if I had, it was so long ago that I forgot what I decided.)

I suppose there's no good reason for them to apply before splitting. I cannot think of any reason why you would write:

w"Nobody expects the Spanish\x20Inquisition!"

expecting to split "Spanish" and "Inquisition!". It's easier to just press the spacebar. So let's suppose that escapes are processed after the string is split, so that the w-string above becomes:

['Nobody', 'expects', 'the', 'Spanish Inquisition!']

Do we still need a new "\ " escape for a literal space? We clearly don't *need* it, since the user can write \x20 or \040 or even '\N{SPACE}'. I'm *moderately* against it, since it's hard to spot escaped spaces in a forest of unescaped ones, or vice versa:

# example from the OP
songs = w'My\ Bloody\ Valentine Blue\ Suede\ Shoes'

I think that escaping spaces like that will be an attractive nuisance. I had to read the OP's example three times before I noticed that the space between Valentine and Blue was not escaped.

What about ordinary strings? What is 'spam\ eggs'? It could be:

- allow the escape and return 'spam eggs', even though it is pointless;
- disallow the escape, and raise an exception, even though that's inconsistent with w-strings.

I'm not really happy with either of those solutions (although I'm slightly less unhappy with the first). So in order of preference, from least to most preferred:

- strong opposition: -1 to the original proposal of w-strings with no escapes except for \space;
- weak opposition: -0.25 for w-strings where \space behaves differently (raises an exception) in regular strings;
- mildly negative indifference: -0 for w-strings with \space allowed in regular strings as well;
- mildly positive approval: +0 for w-strings without bothering to allow \space at all (the user can use \x20 or equivalent).

For the avoidance of doubt, by \space I mean a backslash followed by a literal space character.
That would provide the extra push to this being beneficial over split().
True, but it's not a lot of extra value over split(). If Python had this feature, I'd probably use it, but since it doesn't, I cannot in fairness ask somebody else to do the work on the basis that it is needed. I still think the existing solutions are Good Enough:

- use split when you don't have a space in any term: "fe fi fo fum".split()
- use a list of manually split terms when you care about spaces: ['spam and eggs', 'cheese', 'tomato']
I also have an alternate idea: sl{word1 word2 'string 3' "string 4"}
Why "sl"? That looks like a set or a dict. Its bad enough that w-strings return a list, but to have "sl-sets" return a list is just weird :-) -- Steve
On Tue, Dec 6, 2016, at 20:03, Steven D'Aprano wrote:
I also have an alternate idea: sl{word1 word2 'string 3' "string 4"}
Why "sl"?
Well, shlex was one of the inspirations.
That looks like a set or a dict. It's bad enough that w-strings return a list, but to have "sl-sets" return a list is just weird :-)
My idea was to have word{...} as a grand unifying solution for "we want a new kind of literal but can't think of a syntax for it that doesn't either look like grit on the screen or already means something", with this as one of the first examples. I think it's better than using word"..." for things that aren't strings.
On 12 November 2016 at 17:01, Gary Godfrey <g.pythonideas@wamp.us> wrote:
Hi,
This is a little more readable, but still a bit ugly. What I'm proposing here is:
mydf = df[ w'field1 field2 field3' ]
This would be identical in all ways (compile-time) to:
mydf = df[ ('field1', 'field2', 'field3') ]
If using a tuple as an index expression, wouldn't it be ok for you to use:

mydf = df['field1', 'field2', 'field3']

? That should be equivalent to the second example, but without the double bracketing.

Best,
D.

-- 
Daniel F. Moisset - UK Country Manager
www.machinalis.com
Skype: @dmoisset
participants (6)
- Daniel Moisset
- David Mertz
- Gary Godfrey
- Random832
- Stephen J. Turnbull
- Steven D'Aprano