Currently, Python only has ~ (tilde) in the context of a unary operation (like `-`, with `__neg__(self)`, and `+`, with `__pos__(self)`). `~` currently calls `__invert__(self)` in the unary context.
I think a binary `~` would be awesome to have in the language, as it would allow modelling along the lines of R that we currently only get with text, e.g.:
smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)
With a binary context for ~, we could write the above string as pure Python, with implications for symbolic evaluation (with SymPy) and statistical modelling (such as with sklearn or statsmodels) - and other use-cases/DSLs.
In LaTeX we call this `\sim` (Wikipedia indicates this is for "similar to"). I'm not too particular, but `__sim__(self, other)` would have the benefits of being both short and consistent with LaTeX.
This is not a fully baked idea; perhaps there's a good reason we haven't added a binary `~`. It seems like I've seen discussion in the past, but I couldn't find such discussion. And as I'm currently taking some statistics courses, I'm getting R-feature-envy again...
What do you think?
Aaron Hall
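For readers less familiar with the operator protocol, here is a minimal sketch of the status quo described in this post: unary `~` dispatches to `__invert__`, while an infix `~` is rejected by the parser. The `Flag` class is a hypothetical illustration, not something from the thread.

```python
class Flag:
    """Hypothetical wrapper used only to show how unary ~ dispatches."""

    def __init__(self, value):
        self.value = value

    def __invert__(self):
        # Called for the unary form: ~flag
        return Flag(not self.value)


flipped = ~Flag(True)
print(flipped.value)

# The infix form is not valid syntax today, so it fails at compile time.
try:
    compile("a ~ b", "<example>", "eval")
    binary_tilde_parses = True
except SyntaxError:
    binary_tilde_parses = False

print(binary_tilde_parses)
```

Because the failure happens in the parser, no dunder method on `a` or `b` can make `a ~ b` work without a grammar change.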
23.02.20 23:51, Aaron Hall via Python-ideas wrote:
This is not a fully baked idea, perhaps there's a good reason we haven't added a binary `~`. It seems like I've seen discussion in the past. But I couldn't find such discussion. And as I'm currently taking some statistics courses, I'm getting R-feature-envy again...
Sorry, but I did not understand what this operator does, except that it has some relation to R.
For the "-" operator, the unary and binary forms are related. For a number x:
-x == 0 - x
Is there a similar relation between unary and binary "~"?
~x == 0 ~ x
I guess "~" is a bitwise NOR operator (or Peirce's arrow):
x ~ y == ~x & ~y
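A quick check of the identities mentioned above, assuming plain Python integers. The helper name `nor` is made up for this sketch; it only spells out the guessed meaning `~x & ~y` and confirms that it matches `~(x | y)` by De Morgan's law.

```python
def nor(x, y):
    """Bitwise NOR written the way the guess above spells it."""
    return ~x & ~y


for x in range(-8, 9):
    # Unary minus relates to binary minus through zero.
    assert -x == 0 - x
    # Unary ~ on integers is defined as -x - 1.
    assert ~x == -x - 1
    for y in range(-8, 9):
        # NOR is the inversion of OR.
        assert nor(x, y) == ~(x | y)

print(nor(5, 3))
```

Note that there is no integer `z` for which `~x == z ~ x` under this reading, since `nor(0, x)` is just `~x` and `nor(-1, x)` is always `0`; only the first happens to resemble the `0 - x` pattern.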
I have no behavior for integers in mind. I would expect high-level libraries to want to implement behavior for it.
- sympy
- pandas, numpy, sklearn, statsmodels
- other mathematically minded libraries (monadic bind or compose?)
To do this we need a name. I like `__sim__`. Then we'll need `__rsim__` and `__isim__` for completeness. We need to make room for it in the grammar. Is it ok to give it the same priority of evaluation as `+` or `-`, or slightly higher?
In the past we've made additions to the language when we've been parsing and evaluating strings. That's what we're currently doing in statsmodels right now because we lack the binary (in the sense of two-arguments) `~`.
Aside from all the other problems, 'sim' (~) in LaTeX and mathematics means something completely different than 'depends on' (~) in R. Trying to overload those meanings makes everything harder.
I would recommend doing what NumPy did for many years for matrix multiply. Use an existing operator. Yes, '*' had a different meaning for matrices vs. arrays, but that was enough to motivate the eventual addition of the '@' operator.
In particular, I don't think this looks at all bad:
Lottery // Literacy + Wealth + Region
If anything, the slash for "ratio" is more intuitive than the tilde in R. I'm wrestling with single versus double slash.
But if you add this to libraries with no language change needed, and it eventually becomes popular, maybe there is a later argument for a separate operator.
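To make the "use an existing operator" suggestion concrete, here is a rough sketch under stated assumptions. The `Term` and `Formula` classes are hypothetical and belong to no existing library. Since `//` binds tighter than `+`, the expression groups as `((Lottery // Literacy) + Wealth) + Region`, so the formula object itself has to absorb the later terms.

```python
class Term:
    """Hypothetical column reference; not part of statsmodels or Patsy."""

    def __init__(self, name):
        self.name = name

    def __floordiv__(self, other):
        # `response // first_predictor` starts a formula.
        return Formula(self.name, [other.name])


class Formula:
    """Hypothetical model description built by operator overloading."""

    def __init__(self, response, predictors):
        self.response = response
        self.predictors = list(predictors)

    def __add__(self, other):
        # `//` binds tighter than `+`, so later terms arrive here one by one.
        return Formula(self.response, self.predictors + [other.name])

    def __str__(self):
        return f"{self.response} ~ {' + '.join(self.predictors)}"


Lottery, Literacy, Wealth, Region = map(
    Term, ["Lottery", "Literacy", "Wealth", "Region"]
)
formula = Lottery // Literacy + Wealth + Region
print(formula)
```

The sketch deliberately handles only the unparenthesized spelling. Supporting `Lottery // (Literacy + Wealth)` would also need `Term.__add__`, which is where the precedence mismatch with R starts to show.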
On Sun, Feb 23, 2020, 7:30 PM Aaron Hall via Python-ideas < python-ideas@python.org> wrote:
I have no behavior for integers in mind. I would expect high-level libraries to want to implement behavior for it.
- sympy
- pandas, numpy, sklearn, statsmodels
- other mathematically minded libraries (monadic bind or compose?)
To do this we need a name. I like `__sim__`. Then we'll need `__rsim__` and `__isim__` for completeness. We need to make room for it in the grammar. Is it ok to give it the same priority of evaluation as `+` or `-`, or slightly higher?
In the past we've made additions to the language when we've been parsing and evaluating strings. That's what we're currently doing in statsmodels right now because we lack the binary (in the sense of two-arguments) `~`.
See: https://www.statsmodels.org/dev/example_formulas.html _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/JWC4HJ... Code of Conduct: http://python.org/psf/codeofconduct/
Thanks for the feedback, David. Sources that demonstrate that "sim" is the wrong semantic would be very much appreciated.
I chose "sim" because it's the name of, and the usual top result for, an infix tilde in LaTeX. And note that there is an implied relationship between the two sides in the context of a regression. Here are my sources:
LaTeX definition: "∼ Similar, in a relation" - https://latexref.xyz/Math-symbols.html#index-_005csim
"depends on" isn't used in the R documentation for `~`, it says: "Tilde Operator Tilde is used to separate the left- and right-hand sides in a model formula." see: - https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/tilde
"An expression of the form `y ~ model` is interpreted as a specification that the response `y` is modelled by a linear predictor specified symbolically by `model`. Such a model consists of a series of terms separated by `+` operators." - https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/formula
My main goal, here, again, is to open up the language to make, what I have encountered in multiple domains, `object1 ~ object2`, possible in Python.
In mathematics, in my recollection, the tilde is used for
1. Unary approximate number
2. Binary equivalence
3. Binary congruence/isomorphism
The last is more formally an equal sign with tilde on top: ≅. I think maybe just simplified for chalk boards where context makes it clear. Those are all akin to "similar"
... But they are all very different from what R does. "Depends on" is my description of R. I may have seen it elsewhere, but I don't know if there is a standard name for the symbol in R (other than 'tilde'). I really don't know of any other domain where this means dependent vs. independent variable. Statsmodels just borrows R because it shares users.
The context for this is statistics, so I'll quote Wolfram on tilde in the context of statistics: http://mathworld.wolfram.com/Tilde.html
"In statistics, the tilde is frequently used to mean 'has the distribution (of),' for instance, X∼N(0,1) means 'the stochastic (random) variable X has the distribution N(0,1) (the standard normal distribution).' If X and Y are stochastic variables then X∼Y means 'X has the same distribution as Y.'"
So X~Y is an assertion of a relationship between X and Y. Sympy has an entire module filled with these distributions. But maybe it's more useful to say `Z = Normal('Z', 0, 1)` instead of `Z = Z ~ Normal(0, 1)` or maybe `Z ~= Normal(0, 1)` (Z example from docs, latter tilde examples are mine).
When we say, in R or Patsy, `y ~ x1 + x2`, we are asserting a relationship between y and the x's. This instantiates a model/formula (in math, usually written y = x1 + x2 + e) that, given the asserted relationship and some assumptions, we can use to find the mathematical relationship between the variables.
I think our biggest concern, which Guido alluded to earlier, is whether the operator precedence is correct. In R (and Patsy) the binding is the weakest. In Python, my first inclination is to make it the strongest, so that we could coalesce the other variables with the `y` object. But maybe this is wrong. Maybe it should be a weak binding. Maybe we can't make it weak because some people think it should have context in integers where it binds strongly. Maybe Patsy is the right way to do it. Maybe this is ultimately a bad idea. But I want a record of the discussion and conclusion, and I want it to be the best reasoned one we can muster.
Aaron Hall wrote:
The context for this is statistics , so I'll quote Wolfram on tilde in the context of statistics: http://mathworld.wolfram.com/Tilde.html "In statistics, the tilde is frequently used to mean "has the distribution (of)," for instance, X∼N(0,1) means "the stochastic (random) variable X has the distribution N(0,1) (the standard normal distribution). If X and Y are stochastic variables then X∼Y means "X has the same distribution as Y."
I think that you have refuted your own idea. You have argued that ~ is a rightful statistical operator. But Python is not a statistical language. Python is a general-purpose programming language while R is a statistical one. They have different domains, so what is useful and right in R is not necessarily useful and right in Python. I cannot see a case for a statistical operator in Python.
Well... also, the meaning in R is quite a bit different from any of the meanings suggested by Wolfram. In fact, although the most common use in R is "depends on", it's technically just a generic delayed evaluation without any inherent semantics at all. Or, that is to say, tilde is just a certain kind of quotation, and we already have quotation in Python.
On Mon, Feb 24, 2020 at 11:00 AM David Mertz mertz@gnosis.cx wrote:
Well... also, the meaning in R is quite a bit different from any of the meanings suggested by Wolfram. In fact, although the most common use in R is "depends on", it's technically just a generic delayed evaluation without any inherent semantics at all. Or, that is to say, tilde is just a certain kind of quotation, and we already have quotation in Python.
Hm, that's actually an interesting take. Can you compare it to the kind of "quoting" that happens in a lambda? Is there some kind of translation of the OP's original example (Lottery ~ Literacy + Wealth + Region) to a lambda involving those words?
24.02.20 22:02, Guido van Rossum wrote:
Hm, that's actually an interesting take. Can you compare it to the kind of "quoting" that happens in a lambda? Is there some kind of translation of the OP's original example (Lottery ~ Literacy + Wealth + Region) to a lambda involving those words?
I think that a named function is more appropriate than a lambda, because we need also the name of the output parameter:
def Lottery(Literacy, Wealth, Region):
    ...
And the most known application of such technique is fixtures in pytest.
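As an illustration of how a library could read such a definition, here is a small sketch using the standard `inspect` module. The helper `formula_from_function` is hypothetical; it only shows that the function name and parameter names are enough to recover the names in the original formula.

```python
import inspect


def Lottery(Literacy, Wealth, Region):
    ...


def formula_from_function(func):
    """Hypothetical helper: rebuild an R-style formula string from a signature."""
    params = inspect.signature(func).parameters
    # Iterating the parameters mapping yields the parameter names in order.
    return f"{func.__name__} ~ {' + '.join(params)}"


print(formula_from_function(Lottery))
```

This is roughly the same trick pytest uses for fixtures: the parameter names themselves carry the meaning, and the body is never needed to discover them.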
I may have led in that direction, and I know R only passingly, not well. But my understanding is that thinking of a data structure that gets parsed by an evaluator, e.g. "do a linear regression with this structure (and a DataFrame)" is better than a lambda.
I'm sure it's possible to describe this with a function, but the Patsy documentation provides something that is probably more helpful:
from patsy import ModelDesc, Term, EvalFactor
form1 = ModelDesc([Term([EvalFactor("y")])],
                  [Term([]),
                   Term([EvalFactor("a")]),
                   Term([EvalFactor("a"), EvalFactor("b")]),
                   Term([EvalFactor("np.log(x)")])])
Compare to what you get from parsing the above formula:
form2 = ModelDesc.from_formula("y ~ a + a:b + np.log(x)")
So given those two equivalent structures, we might call, identically:
decision_tree(form1, data=my_df)
decision_tree(form2, data=my_df)
I think if I were writing a high-level data structure rather than a simple parse tree, I might do something more like:
{'dependent': ['y'], 'independent': ['a', Combine('a', 'b'), np.log(x)]}
But whatever the exact structure, basically the syntax just is a mini-language to say which names go in which structural places for an evaluator.
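A toy version of that idea, assuming only the simplest formulas (names joined by `+`, one `~`). The function `parse_formula` is invented for this sketch and is far less capable than Patsy: it does not understand interactions such as `a:b` or calls such as `np.log(x)`.

```python
def parse_formula(text):
    """Split 'y ~ a + b' into dependent and independent names (sketch only)."""
    left, sep, right = text.partition("~")
    if not sep:
        raise ValueError("formula needs a '~' separator")

    def names(side):
        # Keep non-empty, whitespace-stripped names from one side of the '~'.
        return [part.strip() for part in side.split("+") if part.strip()]

    return {"dependent": names(left), "independent": names(right)}


structure = parse_formula("Lottery ~ Literacy + Wealth + Region")
print(structure)
```

An evaluator would then look up each name as a column of the supplied data, which is the step that needs no new syntax at all.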
I am no expert on R, but R lazily evaluates arguments to functions; see https://cran.r-project.org/doc/manuals/r-devel/R-lang.html#Argument-evaluati... (plus the rest of that page, which is the language spec). Tilde is strictly used for modeling. Also relevant would be the operator precedence https://cran.r-project.org/doc/manuals/r-devel/R-lang.html#Infix-and-prefix-... - note that ~ has low precedence, which makes sense in how it is used.
We have a very straightforward way of lazily evaluating these formulas in Python, not to mention get the correct precedence - encapsulate as a string and use something like Patsy to parse!
The biggest problem remains that 99% of your explanation (and that of others who seem to understand what you want) uses the words of the application domain (statistics, stochastic variables, distributions) in a way that is unhelpful to convey your needs to those who are in a position to implement and support your proposal.
You have had enough opportunities to explain this better, but you've not taken them. That's okay, I don't expect you to give a lecture on stochastic variables and distributions. But it does mean there is a chasm between you (and apparently other users of SymPy and Patsy) and the core developers that seems impossible to bridge.
So I think there's little point in continuing.
On 2/24/2020 2:59 PM, Guido van Rossum wrote:
The biggest problem remains that 99% of your explanation (and that of others who seem to understand what you want) uses the words of the application domain (statistics, stochastic variables, distributions) in a way that is unhelpful to convey your needs to those who are in a position to implement and support your proposal.
My problem with the proposal is that it's basically: "we want an operator to add some application-specific functionality using dunders". That's fine, I'd like that too! But the problem is that in order to add an operator we need to specify the associativity and precedence. These can't be answered without knowing how the operator will fit into the problem domain (see [1] for the same discussion about the '@' operator). So now we're adding an operator that's specific to one problem domain, and the decisions we make might make it not be super useful elsewhere. And who's to say that the given problem domain is even that important? It's not like we've done a survey to figure out what code is being held back for want of a new operator: it's just that this proposal popped up here first.
Eric
[1] https://www.python.org/dev/peps/pep-0465/#precedence-and-associativity
Yes, at this point I think I must concede the battle. I thought I had discussed this with core developers in the past, but I couldn't find the results. So here we have it - making `~` a binary operator as well is a no-go.
Perhaps some better advocate than I will take it up in the future.
Thank you all for your discussion! Cheers!
ACH
On Mon, 24 Feb 2020 at 17:47, Aaron Hall via Python-ideas python-ideas@python.org wrote:
The context for this is statistics , so I'll quote Wolfram on tilde in the context of statistics: http://mathworld.wolfram.com/Tilde.html
"In statistics, the tilde is frequently used to mean "has the distribution (of)," for instance, X∼N(0,1) means "the stochastic (random) variable X has the distribution N(0,1) (the standard normal distribution). If X and Y are stochastic variables then X∼Y means "X has the same distribution as Y."
So X~Y is an assertion of a relationship between X and Y. Sympy has an entire module filled with these distributions. But maybe it's more useful to say `Z = Normal('Z', 0, 1)` instead of `Z = Z ~ Normal(0, 1)` or maybe `Z ~= Normal(0, 1)` (Z example from docs, latter tilde examples are mine).
Speaking as a SymPy contributor it isn't clear to me what you expect tilde would be used for in SymPy (although I'm sure it would get used for lots of things if it existed). The problem with the two examples shown is that they require Z to exist already but if you need to create Z you might as well just create it as Z=Normal(...).
Something that stands out to me in what you've said is that X~Y is a "relationship between X and Y". In all mathematical contexts I know of ~ is a relation rather than an operation. In fact ~ is often used as *the* generic symbol for talking about relations in the abstract e.g.: https://en.wikipedia.org/wiki/Equivalence_relation#Definition
That would suggest ~ has the same precedence class as relational operators like <, >, == etc rather than arithmetic operators like +, -, *, so we should have X + Y ~ C being equivalent to (X + Y) ~ C. Relating that back to SymPy I'd expect `Z ~ Normal(0, 1)` to be a Boolean statement that could be used in some way. As a Boolean statement (X ~ Y) + Z would be meaningless because + is meaningless for Booleans so the other precedence would be more useful.
The other implication of being a relational operator in Python is being usable in chaining like X < Y ~ Z.
-- Oscar
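The precedence and chaining behaviour that existing relational operators already have can be inspected with the standard `ast` module. This sketch only looks at `<` and `==`, since a binary `~` does not parse; it shows what "same precedence class as relational operators" would imply.

```python
import ast

# Arithmetic binds tighter than comparison: (X + Y) < C
compare = ast.parse("X + Y < C", mode="eval").body
print(type(compare).__name__, type(compare.left).__name__)

# Chained comparisons become one Compare node with several operators,
# evaluated like (X < Y) and (Y == Z) with Y computed once.
chain = ast.parse("X < Y == Z", mode="eval").body
print([type(op).__name__ for op in chain.ops])
```

A relational `~` would presumably join the same `Compare` node, so `X < Y ~ Z` would carry the implicit `and` that chaining brings with it.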
I can imagine the hypothetical binary tilde being pretty for some kind of equivalence. This is definitely not enough to motivate me to actually want to add it. But I think this would read OK as equivalent:
numpy.allclose(arr1, arr2)
arr1 ~ arr2
However, the problem is that there are lots of other ways of being equivalent other than having elements that are all close. In the examples shown by Oscar and Aaron, we might ask whether two collections are drawn from the same distribution. That's a useful question. But would this tilde mean the t-test? Unpaired or paired? Welch's t-test? Or maybe it should be a Kolmogorov-Smirnov test. Or a Shapiro-Wilk test? Or Wilcoxon's signed-rank test? Maybe Mann-Whitney's U test?
The R approach with a "formula" is just a quotation of the thing we might test later, by whatever method. But as I've said, we already have quotes. So code like this seems fine to me:
formula = "arr1 ~ arr2"
ks_test(formula)
welch_t(formula)
However, once I've written that, there seems to be little point in not simply passing the collections themselves to the various "equivalence" functions.
24.02.20 02:27, Aaron Hall via Python-ideas wrote:
I have no behavior for integers in mind. I would expect high-level libraries to want to implement behavior for it.
- sympy
- pandas, numpy, sklearn, statsmodels
- other mathematically minded libraries (monadic bind or compose?)
To do this we need a name. I like `__sim__`. Then we'll need `__rsim__` and `__isim__` for completeness. We need to make room for it in the grammar. Is it ok to give it the same priority of evaluation as `+` or `-`, or slightly higher?
In the past we've made additions to the language when we've been parsing and evaluating strings. That's what we're currently doing in statsmodels right now because we lack the binary (in the sense of two-arguments) `~`.
I still have no idea what this operator does.
You can use a function or a method:
sim(Lottery, Literacy + Wealth + Region)
Lottery.sim(Literacy + Wealth + Region)
or
sim(Lottery, Literacy) + Wealth + Region
Lottery.sim(Literacy) + Wealth + Region
Operators are used to simplify the code because virtually everyone knows what "+" and "/" mean. The same operators have similar meanings in different programming languages, and outside programming too. And they are often used many times in a row, like `a + b/(c+d)`.
I don't see what's wrong with the status quo:
smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)
If I understand correctly you want to use instead:
smf.ols(formula=df.Lottery ~ df.Literacy + df.Wealth + df.Region)
Or since some people favor indexing over attribute access for column names:
smf.ols(formula=df['Lottery'] ~ df['Literacy'] + df['Wealth'] + df['Region'])
Both alternatives are much more verbose since you have to repeat the `df` part or even worse the brackets for indexing. In any case you need to type the column names that you would like to include and there's no auto-complete on column names that would help you typing it. So I don't see what's the benefit of the operator version.
In addition this requires Pandas to implement the modeling but there's much more to Pandas than just modeling so perhaps that better remains a separate project.
Hi Aaron, and welcome!
Your proposal would be a lot more interesting to me if I knew what this binary ~ would actually do, without having to go learn R or LaTeX.
You say:
I think it would be awesome to have in the language, as it would allow modelling along the lines of R that we currently only get with text, e.g.:
smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)
With a binary context for ~, we could write the above string as pure Python
I'm confused. Why can't you just write
'Lottery ~ Literacy + Wealth + Region'
as a literal string? That's an exact copy and paste from your example, and it works for me.
I'm not the OP but. . .
In R there is a tilde operator that is used to indicate "depends on" when separating the dependent and independent variables in a statistical model formulation. The example given is how it has to be done in Python. In R you just write `Lottery ~ Literacy + Wealth + Region` (i.e., as code with no quotes).
That said, the way this works in R depends on additional "features" of R whose absence in Python makes it a heavier lift than just adding a tilde. R can magically defer evaluation of names so that you can write something like that tilde expression and pass an additional argument specifying the table whose columns are the given variables (i.e., a table with columns for Lottery, Literacy, etc.), and then it will later evaluate the names by looking them up as columns. This won't work in Python because even if you had the tilde, you couldn't do this:
ols(Lottery ~ Literacy + Wealth + Region, data=df)
Because that model expression is a function argument, Python semantics require it to be evaluated before the call is made, so you can't defer evaluation and later use the names as column names to look up in the provided table.
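The eager-evaluation point can be demonstrated without any new syntax by substituting `+` for the hypothetical tilde. In this sketch `ols` is a made-up stand-in that records whether it was ever entered; the column names are deliberately not defined as Python variables.

```python
calls = []


def ols(formula, data):
    """Hypothetical stand-in for a model-fitting function."""
    calls.append(formula)
    return formula


try:
    # The argument expression is evaluated before `ols` is entered,
    # so the undefined names fail here, not inside the function.
    ols(Lottery + Literacy, data={"Lottery": [1], "Literacy": [2]})
    outcome = "call succeeded"
except NameError as exc:
    outcome = f"NameError before the call: {exc}"

print(outcome)
print(calls)
```

The function body never runs, so it has no chance to look the names up in `data`.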
In order to make it work you'd need something else that I've sometimes wished for, which is a smooth way to create and pass around unevaluated expressions, and then later trigger their evaluation in the context of a given namespace (such as the one where the evaluation is triggered). Right now the only approximation to this is lambda, but lambda closes over variables based on the lexical context where it's defined, not where it's called, so it doesn't really work. In other words, what I'd like is the ability to do something like this:
    def foo():
        expr = deferred(a + b + c)
        bar(expr)

    def bar(expr):
        a, b, c = 1, 2, 3
        # this should return 6
        return expr.evaluate()

If such functionality existed, then a tilde operator could indeed be used to create model definitions using deferred evaluations like in R.
However, I think deferred evaluation is the more important functionality here. If we had deferred evaluation without the tilde, we could still do what R does by using a different operator instead of tilde, at worst perhaps having to parenthesize the dependent-variable expression (in case our alternative "depends" operator had the wrong precedence). But without deferred evaluation, the tilde operator gains little, at least in terms of providing model-evaluation expressions like those in R.
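To make the point about lambda concrete, here is a minimal sketch (the function names are invented for illustration). It shows that a lambda resolves free names in the scope where it was defined, so local variables of the function that later calls it are not visible to it:

```python
def make_expr():
    # The lambda looks up a, b and c where it is *defined*
    # (enclosing scopes, then module globals), not where it is called.
    return lambda: a + b + c


def call_with_local_names(expr):
    a, b, c = 1, 2, 3  # locals of the caller; invisible to the lambda
    try:
        return expr()
    except NameError as exc:
        return f"NameError: {exc}"


result = call_with_local_names(make_expr())
print(result)
```

As long as no module-level `a`, `b`, `c` exist, the call fails with a NameError even though the caller has locals with exactly those names.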
(Apologies for the html email, it was poorly formatted, making my example very difficult to follow. So let me try to give better examples.)
With sympy we would be able to create meaningful behavior for:
```
from sympy import symbols

y, x1, x2 = symbols('y x1 x2')

model = y ~ x1 + x2

model.is_linear()  # -> True/False? (just an example!)
```
or
```
from sympy import symbols

theta, N, mu, sigma = symbols('theta N mu sigma')

distribution = theta ~ N(mu, sigma)
distribution.sample()  # -> a symbolic sampling from the normal distribution
```
and for dataframes, arrays, or matrices, we could do something like:
```
model = df.y ~ df.x1 + df.x2
model.fit()
test_predictions = model.predict(df_test)
```
I would assume strict evaluation and minimal computation done to return whatever "model" object we get back from the operation.
But as it stands, we can't use the operator, as it doesn't support a binary operation. To Serhiy's point (a different reply to my first post), `__sim__(self, other)` could be implemented for integers, but that's not the domain I am particularly interested in. I consider this akin to the adoption of `@` for `__matmul__(self, other)`. We would not implement behavior for it; we would merely provide language support for the `a ~ b` usage.
- The name "sim" is taken from LaTeX. I think it would be beneficial to stick to it.
- The semantic usage comes from R and the field of statistics.
I do not intend to suggest we use R's evaluation model though - names would still need to exist in the namespace and would call the `__sim__` method of the first object (or `__rsim__` on the second). (To be complete, we'd also want `__isim__` for in-place, `a ~= b` but I don't currently have a strong motivating use-case for it.)
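The dispatch described above can be illustrated today with `@`, which already follows the left-then-reflected protocol. This is only a sketch: `__matmul__` and `__rmatmul__` stand in for the proposed `__sim__` and `__rsim__`, and the `Response` class is hypothetical.

```python
class Response:
    """Hypothetical object that knows how to start a formula."""

    def __init__(self, name):
        self.name = name

    def __matmul__(self, other):
        # Plays the role the proposal gives to __sim__(self, other).
        if isinstance(other, str):
            return ("formula", self.name, other)
        return NotImplemented

    def __rmatmul__(self, other):
        # Plays the role of __rsim__: used when the left operand
        # does not implement the operator itself.
        if isinstance(other, str):
            return ("formula", other, self.name)
        return NotImplemented


left = Response("y") @ "x"    # handled by Response.__matmul__
right = "y" @ Response("x")   # str has no __matmul__, so __rmatmul__ runs
print(left, right)
```

Both operands are ordinary, already-evaluated objects here, which matches the strict evaluation assumed in the message.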
On 2020-02-23 15:58, Aaron Hall via Python-ideas wrote:
and for dataframes, arrays, or matrices, we could do something like:
    model = df.y ~ df.x1 + df.x2
    model.fit()
    test_predictions = model.predict(df_test)
The thing is that you can already do that. You just can't do it with the specific character ~ for your operation. But that's not really a big deal is it? You could define this proposed behavior right now, using some other operator like `<<` and then make it so you can do
model = (df.y << df.x1) + df.x2
and so on. Also, since there's typically only one dependent variable, you could just define types that have a method `.depends_on` or `.modeled_by` or whatever, and an accessor attribute `.term` or something, so that you do
model = df.y.modeled_by(df.x1.term + df.x2)
I don't think this is really much worse than R. In fact, I'd rather see something like "modeled_by" rather than the tilde. The problem, again, is not really the lack of the tilde, but the need to repeatedly specify `df`. I'm curious why you see the tilde as the crux of this issue, because it seems to me that the behavior you envision could already be implemented with another operator; it wouldn't look exactly like it does in R, but it could function pretty similarly.
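A minimal sketch of both spellings follows, assuming hypothetical `Column` and `Formula` classes rather than any real pandas or statsmodels type. It only records which names are involved; fitting a model is out of scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Formula:
    """Hypothetical container: one response name plus predictor names."""
    response: str
    predictors: tuple = ()

    def __add__(self, other):
        # Adding a column to a formula appends one more predictor.
        return Formula(self.response, self.predictors + (other.name,))


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a named data column such as df.x1."""
    name: str

    def __lshift__(self, other):
        # `y << x1` starts a formula: y is modeled by x1.
        return Formula(self.name, (other.name,))

    def modeled_by(self, *others):
        # Method spelling of the same idea; no operator needed.
        return Formula(self.name, tuple(o.name for o in others))


y, x1, x2 = Column("y"), Column("x1"), Column("x2")
with_operator = (y << x1) + x2
with_method = y.modeled_by(x1, x2)
print(with_operator == with_method)
```

The parentheses in `(y << x1) + x2` matter because `<<` binds less tightly than `+` in Python.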
Hello,
Can't you use eval()?
This

    return eval(expr)

instead of

    return expr.evaluate()
Best regards,
João Matos
On 2020-02-23 16:00, João Matos wrote:
Hello,
Can't you use eval()?
This

    return eval(expr)

instead of

    return expr.evaluate()
You can use eval if expr is a string, but then you have the same problems that were mentioned in a recent thread about "SQL strings": it's not syntax-highlighted, syntax errors aren't caught until you eval it, etc. You can't use eval for the case I described because there isn't any Python object like the unevaluated expression that I called `expr`.
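The string-based route can be sketched as follows. The helper name `evaluate_in` is made up for this example; `eval` itself is the standard built-in. The second call shows that a malformed expression is only reported when the string is finally evaluated:

```python
def evaluate_in(expr_source, namespace):
    # Evaluate the expression text with names looked up in `namespace`.
    return eval(expr_source, {"__builtins__": {}}, namespace)


total = evaluate_in("a + b + c", {"a": 1, "b": 2, "c": 3})

try:
    evaluate_in("a + b +", {"a": 1, "b": 2})
    error = None
except SyntaxError as exc:
    # The typo in the string surfaces here, not where the string was written.
    error = exc

print(total, type(error).__name__)
```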
I still can't follow this explanation.
Operators in Python have a priority, which determines order of evaluation.
In the example
A ~ B + C
is this equivalent to
A ~ (B + C)
or to
(A ~ B) + C
???
It's not unheard of to add an operator to Python that's primarily important for a certain subcommunity -- in particular, we've done this for @ (__matmul__).
But the motivation as well as the specifics of the proposal must be understandable for people outside that subcommunity. You could look at PEP 465 for some hints on how to do this.
Assuming that the reader is familiar with the example `Lottery ~ Literacy + Wealth + Region` is *not* going to work. I have literally no idea from what field that is taken or what the purpose of the example is. Please don't expect that I can just Google it: I did, found https://www.statsmodels.org/stable/example_formulas.html, and I still have no idea what it's about.
Guido, thank you so much for your kind review.
I think I would prefer `(A ~ B) + C`, as it would first create a coalescing object to which it knows C is added, and this is the usual way it is used.
I believe Sympy could handle it easily either way, but dataframes/arrays less well so (since addition is defined for these objects and would return the sums before the calling of `~`).
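The concern about arrays can be sketched with an existing operator. Here `<<` is used purely as a stand-in for a binary `~` that binds less tightly than `+`, and `Vector` is a hypothetical array-like class, not a NumPy or pandas type:

```python
class Vector:
    """Hypothetical array-like column with elementwise addition."""

    def __init__(self, name, values):
        self.name = name
        self.values = list(values)

    def __add__(self, other):
        summed = [a + b for a, b in zip(self.values, other.values)]
        return Vector(f"({self.name} + {other.name})", summed)

    def __lshift__(self, other):
        # Stand-in for a low-precedence `~`: report what arrives on the right.
        return (self.name, other.name, other.values)


y = Vector("y", [0, 0])
x1 = Vector("x1", [1, 2])
x2 = Vector("x2", [10, 20])

seen = y << x1 + x2  # parsed as y << (x1 + x2)
print(seen)
```

The operator only ever sees the already-summed values, so the individual predictors `x1` and `x2` are lost before the formula object could be built.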
The example cited *could* work so long as Lottery points to an object that understands `~`, and the other names point to objects that Lottery is compatible with.
But this is just the general case from an implementation agnostic point of view.
I would expect the following sorts of usages:
- sympy (for models and distributions)
- pandas, numpy, sklearn, statsmodels (for statistical models and functions)
- other mathematically minded libraries (monadic bind or compose?)
I will look more closely at PEP 465. Should I write up a PEP?
Aaron,
It's too soon to start drafting a PEP. However you need to get at least one core dev to understand your proposal well enough that they will act as a *sponsor* for your proposal. Once you have a willing sponsor you can then put it forward in PEP form. See PEP 1 for PEP sponsorship when the author is not a core dev. (And sorry, no, I'm not going to be your sponsor.)
In the meantime, now that you have explained your preferred priority for `~`, perhaps you can also explain what kind of behavior you would like to use this for? Listing a number of libraries that "could use this" isn't enough. Is there an existing function in one of the libraries you mention that has the desired behavior (in the context of that library)? That would help.
--Guido
Is there an existing function in one of the libraries you mention that has the desired behavior (in the context of that library)? That would help.
Yes, and we are currently parsing and evaluating strings to convey the meaning.
Patsy is currently used by statsmodels to parse strings in the following way:
```
from patsy import ModelDesc, Term, EvalFactor

ModelDesc([Term([EvalFactor("y")])],
          [Term([]),
           Term([EvalFactor("a")]),
           Term([EvalFactor("a"), EvalFactor("b")]),
           Term([EvalFactor("np.log(x)")])
           ])
```

"Compare to what you get from parsing the above formula:"

```
ModelDesc.from_formula("y ~ a + a:b + np.log(x)")
```
In the past when we have been eval'ing strings, we added functionality so users could avoid it (`getattr`, et al.).
Sympy is rather new, but I think they'd appreciate it since they have an entire subpackage for distributions:
https://docs.sympy.org/latest/modules/stats.html
I do envision other usages, but these are the strongest cases I have right now.
Are you a contributor or core dev on one of those packages? That would be useful context to have.
Also, did you see Jim Baker's module? How do you propose to translate Patsy's ':' operator?
I am not yet a contributor to either. I have an invitation to contribute to Sympy, I have contributed to CPython, and I would like to be a contributor on all of these libraries.
(You might know me from my answers on Stack Overflow, and I gave the slots talk at PyCon 2017 where we met - now I'm also in weekly meetings with my employer to open up and green-light FOSS contributions for employees - it's really only fair as we use FOSS extensively...)
My first inclination is to translate Patsy's `a:b` to `interaction(a, b)`, but to me that's a tertiary concern.
My main goal here is to increase the flexibility of Python for various domains where I have used `object0 ~ object1` - and can't yet do so in Python.
On 2020-02-23 19:35, Aaron Hall via Python-ideas wrote:
My main goal here is to increase the flexibility of Python for various domains where I have used `object0 ~ object1` - and can't yet do so in Python.
You've said something like this a couple times in this thread, but personally I don't feel that that is a meaningful goal. Based on your phrasing, my impression is you have seen this particular symbol in other places, and your goal is to get that particular symbol into Python so you can use it. But as far as I can tell that's kind of pointless.
Python has plenty of operators. You haven't given any rationale for why you want this operator, other than that you want it to be shaped like a tilde. That's not much of a justification! There are lots of symbols out there in the world. Python doesn't need to adopt all of them based on their appearance. The question is what *behavior* you're proposing for the symbol, and in fact you haven't proposed any. You've given examples of how it could be used to create formulas akin to those in R, but those formulas can already be created if you just chose an existing operator that doesn't happen to look like a tilde. Why does it matter that it looks like a tilde?
On 2020-02-23 16:32, Guido van Rossum wrote:
Assuming that the reader is familiar with the example `Lottery ~ Literacy + Wealth + Region` is *not* going to work. I have literally no idea from what field that is taken or what the purpose of the example is. Please don't expect that I can just Google it: I did, found https://www.statsmodels.org/stable/example_formulas.html, and I still have no idea what it's about.
Sorry, perhaps I should have given a bit more explanation. As I said, "~" means "depends on". So in R, you do something like:
model = some_statistical_model_function(Lottery ~ Literacy + Wealth + Region, some_data_table)
This means "make a model that predicts the value of Lottery based on the values of Literacy, Wealth and Region", where the names Lottery, Literacy etc. refer to columns in some_data_table, which is a tabular data structure akin to a pandas DataFrame. So, again, `Lottery ~ Literacy + Wealth + Region` means "Lottery depends on Literacy, Wealth, and Region". It doesn't really matter what names we use, we can use "A ~ B + C" just as well; the point is it is defining a relationship between variables whose measurements we have as columns in a tabular structure, and it means that we want a model where the variables on the right of the tilde are the independent variables and the one on the left is the dependent variable. "Y ~ X" means "predict Y using X".
As you mentioned (in a part of your response I snipped) the precedence of the operator is important. In this case we would want the operator to have very low precedence, because we want it to mean `Lottery ~ (Literacy + Wealth + Region)` --- that is, that the dependent variable may depend on some complicated expression involving combinations of the independent variables.
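Since binary `~` does not parse today, the effect of precedence can only be illustrated with existing operators. The sketch below uses the standard `ast` module to print the grouping Python chooses; `|` stands in for a low-precedence operator and `*` for a high-precedence one, and the helper names are made up for this example:

```python
import ast

# Only the operators used in this illustration.
SYMBOLS = {ast.Add: "+", ast.Mult: "*", ast.BitOr: "|"}


def show(node):
    # Render a parsed expression with every grouping made explicit.
    if isinstance(node, ast.BinOp):
        op = SYMBOLS[type(node.op)]
        return f"({show(node.left)} {op} {show(node.right)})"
    return node.id  # a bare name such as A


def grouping(source):
    return show(ast.parse(source, mode="eval").body)


low = grouping("A | B + C")   # operator binds less tightly than +
high = grouping("A * B + C")  # operator binds more tightly than +
print(low, high)
```

A binary `~` with the low precedence described above would group like the `|` case, while the `(A ~ B) + C` reading corresponds to the `*` case.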
It's also worth noting that the tilde here isn't notation for any of the work that the statistical model does. It's just a way of writing a "formula" that relates the independent and dependent variables, but you still have to pass that formula to some function that actually runs the model.
All that said, given that we can already achieve the desired precedence with parentheses, I'll reiterate that I don't think the tilde is a real blocker to doing this kind of model specification with Python expressions, so I don't think I'm in favor of this proposal as it is.
On 24/02/2020 00:59, Brendan Barnwell wrote:
On 2020-02-23 16:32, Guido van Rossum wrote:
Assuming that the reader is familiar with the example `Lottery ~ Literacy + Wealth + Region` is *not* going to work. I have literally no idea from what field that is taken or what the purpose of the example is. Please don't expect that I can just Google it: I did, found https://www.statsmodels.org/stable/example_formulas.html, and I still have no idea what it's about.
Sorry, perhaps I should have given a bit more explanation. As I said, "~" means "depends on". So in R, you do something like:
model = some_statistical_model_function(Lottery ~ Literacy + Wealth + Region, some_data_table)
[snippety snip]
It's also worth noting that the tilde here isn't notation for any of the work that the statistical model does. It's just a way of writing a "formula" that relates the independent and dependent variables, but you still have to pass that formula to some function that actually runs the model.
All that said, given that we can already achieve the desired precedence with parentheses, I'll reiterate that I don't think the tilde is a real blocker to doing this kind of model specification with Python expressions, so I don't think I'm in favor of this proposal as it is.
This seems a lot like trying to shoehorn something in so one can write idiomatic R in Python. That on the whole sounds like a bad idea; a friend of mine used to say he could write FORTRAN in any language but no one else could read it. Wouldn't it be more pythonic (or more accurately anything-other-than-R-ic) to use an interface that was more like
model = model_fn(prediction, seq_of_predictors, data_table)
On 2/24/20 7:24 AM, Rhodri James wrote:
This seems a lot like trying to shoehorn something in so one can write idiomatic R in Python. That on the whole sounds like a bad idea; a friend of mine used to say he could write FORTRAN in any language but no one else could read it. Wouldn't it be more pythonic (or more accurately anything-other-than-R-ic) to use an interface that was more like
I have seen many times someone in a forum for language X say that they have used language Y and it has this nifty feature that they wish language X had, so let's modify language X so it is more like language Y. Quite often language X and Y are very different languages with different purposes. My feeling is that if you really want the features that language Y provides, you probably should be programming in language Y, not some very different language X. Perhaps it makes sense to ask for a better integration between language X and Y, so you can call a procedure written in one from the other (but we need to realize that sometimes this might not be possible as they have very different run time requirements).
I don't know if this comes from external requirements, someone tells the Y programmer to write a program in X, or they find that the language Y isn't well supported in some environment, but language X is.
Either way, making language X look more like Y is rarely the right answer. If the matter is more appearance (you can do it but it looks very different) then perhaps the right answer is to really learn language X and get used to writing in its style (or just keep writing in language Y, but USE language Y). If language X is really missing some capability of language Y, then if the capability really is within the wheelhouse of what language X should be able to do, then adding it to the language might make sense, but it doesn't need to look the same as language Y if that notation doesn't make sense in X. It should also be remembered that not all languages need to be good at all things, and each language has a set of things it does well, and you don't want to hurt that set to add something new. Also, all languages (well, maybe almost all) are Turing complete, so anything that you can do in one is possible in the other; it just might not look pretty or be simple.
On 24.02.20 13:24, Rhodri James wrote:
This seems a lot like trying to shoehorn something in so one can write idiomatic R in Python. That on the whole sounds like a bad idea; a friend of mine used to say he could write FORTRAN in any language but no one else could read it. Wouldn't it be more pythonic (or more accurately anything-other-than-R-ic) to use an interface that was more like
model = model_fn(prediction, seq_of_predictors, data_table)
The problem here is that the `seq_of_predictors` doesn't include a way for specifying their relationship with `prediction`, i.e. one cannot (easily) distinguish
P ~ X + Y + Z
versus
P ~ X * Y + Z
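One way to see the difference is to write out the terms each formula stands for. The sketch below assumes the R convention that `X * Y` means both main effects plus their interaction `X:Y`; the helper names are invented for illustration, and each term is a tuple of factor names:

```python
def main_effects(*names):
    # One single-factor term per name.
    return [(name,) for name in names]


def crossed(a, b):
    # R-style `a * b`: both main effects plus the interaction a:b.
    return [(a,), (b,), (a, b)]


additive = main_effects("X", "Y", "Z")                   # P ~ X + Y + Z
with_interaction = crossed("X", "Y") + main_effects("Z")  # P ~ X * Y + Z
print(additive)
print(with_interaction)
```

A flat sequence of predictor names can express the first model but not the second, unless each entry is allowed to be a multi-factor term like `("X", "Y")`.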
I do not fully understand your proposal. I know nothing about R, and my statistical knowledge faded long ago.
However, I think that we cannot expect Python to accommodate every existing domain. Let me explain: Python has no special features, syntax, or operators to deal with SQL, HTML, ini files, OpenGL, etc. These domains, and others, are supported via libraries, outside of the language core.
~ exists in a bitwise context and, as far as I know, it comes from C (I have never actually used it in Python). It is a unary operator because that is how it works as a bitwise operator.
I cannot see any improvement in turning ~ into a binary operator. I imagine that a binary ~ would have a completely different meaning from a unary ~. I can foresee many problems here.
In my opinion, you should show that binary ~ has a relevant benefit for the whole language, not just for R-style tasks. It should be useful in several different domains and behave consistently, or at least as consistently as possible, in those domains.
Can you, for instance, envision other uses of binary ~ beyond R?
Supporting ~ as a binary operator is an interesting idea, especially given the relatively limited usage of unary ~. However, the big hole in this proposal for formulas is that there is a de facto standard "minilanguage" for writing such formulas in Python, namely what Patsy supports: https://patsy.readthedocs.io/en/latest/formulas.html Patsy is used by statsmodels and other tools to support the same formulas as we see in R (or S).
We immediately see the problem with the interaction operator : (colon), which conflicts with how it is used to support annotations in Python. Given that this formula minilanguage is comprehensive, this seems to be a fatal objection.
Note that deferred evaluation is sort of a red herring - it is straightforward to defer execution in Python's object model, as we see in SymPy, Pandas dataframes where clauses, and SQLAlchemy, among other examples.
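As a rough sketch of how deferral through the object model can look, the classes below build an expression tree via operator overloading and evaluate it later against whatever namespace is supplied. The class names are hypothetical and far simpler than what SymPy or SQLAlchemy do:

```python
import operator


class Expr:
    """Base class: `+` builds a tree node instead of computing a value."""

    def __add__(self, other):
        return BinOp(operator.add, self, other)


class Name(Expr):
    def __init__(self, name):
        self.name = name

    def evaluate(self, namespace):
        # The lookup happens only now, in the namespace given by the caller.
        return namespace[self.name]


class BinOp(Expr):
    def __init__(self, func, left, right):
        self.func, self.left, self.right = func, left, right

    def evaluate(self, namespace):
        return self.func(self.left.evaluate(namespace),
                         self.right.evaluate(namespace))


a, b, c = Name("a"), Name("b"), Name("c")
expr = a + b + c  # nothing is added yet; this is a tree
value = expr.evaluate({"a": 1, "b": 2, "c": 3})
print(value)
```

The cost of this approach is that the names must first be bound to placeholder objects, which is the repeated `df.` prefix (or `symbols(...)` call) mentioned elsewhere in the thread.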
On Sun, Feb 23, 2020 at 5:37 PM jdveiga@gmail.com wrote:
Aaron Hall wrote:
Currently, Python only has ~ (tilde) in the context of a unary operation
(like
-, with __neg__(self), and +, __pos__(self)). ~ currently calls __invert__(self) in the unary context. I think it would be awesome to have in the language, as it would allow
modelling along the
lines of R that we currently only get with text, e.g.: smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df) With a binary context for ~, we could write the above string as pure
Python, with
implications for symbolic evaluation (with SymPy) and statistical
modelling (such as with
sklearn or statsmodels) - and other use-cases/DSLs. In LaTeX we call this \sim (Wikipedia indicates this is for "similar
to").
I'm not too particular, but __sim__(self, other) would have the benefits
of
being both short and consistent with LaTeX. This is not a fully baked idea, perhaps there's a good reason we haven't
added a binary
~. It seems like I've seen discussion in the past. But I couldn't find
such
discussion. And as I'm currently taking some statistics courses, I'm
getting
R-feature-envy again... What do you think? Aaron Hall
I really do not fully understand your proposal. I do not know nothing about R and my statistical knowledge has gone long ago.
However, I think that we cannot expect that Python accommodates every existing domain. Let me explain: Python have not special features, syntax, operators to deal with SQL, HTML, ini files, OpenGL, etc. These domains, and others, are supported via libraries, outside of the language core.
~ exists in bit-wise context and, as long as I know, it comes from C --I have never used it indeed in Python. It is a unary operator because it works in that way as a bitwise operator.
I cannot see any improvement in turning ~ into a binary operator. I imagine that a binary ~ would have a completely different meaning from a unary ~. I can foresee many problems here.
In my opinion, you should show that binary ~ has a relevant benefit for the whole language, not just for R tasks. It should be useful in several different domains and behave consistently, or at least as consistently as possible, in those domains.
Can you, for instance, envision other uses of binary ~ beyond R?
Jim, thanks for your feedback. I didn't intend for this to address the interaction term syntax.
As you can see the R language has several ways of representing the same information:
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html
I would prefer to write, e.g.: `y ~ interaction(a, b) + np.log(x)`
The proposal is to expand unary `~` to binary `~` where appropriate in the context, thereby making the language more flexible, especially for mathematical purposes.
Colons are a third-order concern, and I assert they are rarely used in these circumstances regardless.
This seems like a feature looking for a use-case.
I don't think the use case presented is good enough, since in this thread I've seen it given three incompatible operator priorities (higher than `+`, lower than `+`, and one that seems to be a sort of assignment?) and a handful of loosely related, but different, meanings.
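To make the priority question concrete, here is a small sketch using operators that already exist as stand-ins, since a binary `~` does not parse today. The `grouping` helper is a hypothetical name for this illustration; it only handles plain names joined by `+`, `*` and `|`. The same right-hand side groups differently depending on whether the operator binds tighter or looser than `+`.

```python
import ast

# Map the AST operator node types used below to their source symbols.
SYMBOLS = {ast.Add: "+", ast.Mult: "*", ast.BitOr: "|"}


def grouping(source):
    """Return `source` with explicit parentheses showing how it parses."""

    def show(node):
        if isinstance(node, ast.BinOp):
            symbol = SYMBOLS[type(node.op)]
            return f"({show(node.left)} {symbol} {show(node.right)})"
        return node.id  # this sketch assumes every leaf is a plain name

    return show(ast.parse(source, mode="eval").body)


# `*` binds tighter than `+`, so `y` is grouped with `a` first.
print(grouping("y * a + b"))

# `|` binds looser than `+`, so the whole sum becomes the right operand.
print(grouping("y | a + b"))
```

For an R-style formula, only the second grouping keeps the entire right-hand side together as one operand.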
Perhaps just as troubling is that the in-place variant `~=`:
- ...reads to me like one of the boolean comparison operators ("almost equal"?).
- ...if I understand correctly, has no use in the R semantics that the proposal is based on.
So right out of the gate, it's confusing and broken.
I think something like this needs to be a fully-baked (or at least mostly-baked) idea before discussion can be productive. If a few popular packages plan on using it (not "could" use it; I mean the maintainers of those packages actually come out in support of the proposal), then there's a good use case for it.
But as it stands, I see very little upside for the comparatively large downside.
Brandt