Re: [Python-ideas] Proposal: Query language extension to Python (PythonQL)

On 25 March 2017 at 11:24, Pavel Velikhov <pavel.velikhov at gmail.com> wrote:
No, the current solution is temporary because we just don’t have the manpower to implement the full thing: a real system that will rewrite parts of PythonQL queries and ship them to underlying databases. We need a real query optimizer and smart wrappers for this purpose. But we’ll build one of these for demo purposes soon (either a Spark wrapper or a PostgreSQL wrapper).
One thought: if you're lacking in manpower now, then proposing inclusion into core Python means that the core dev team will be taking on an additional chunk of code that is already under-resourced. That rings alarm bells for me - how would you imagine the work needed to merge PythonQL into the core Python grammar would be resourced? I should say that, in practice, I think the solution is relatively niche and overlaps quite significantly with existing Python features, so I don't really see a compelling case for inclusion. The parallel with C# and LINQ is interesting here - LINQ is a pretty cool technology, but I don't see it in widespread use in general-purpose C# projects (disclaimer: I don't get to see much C# code, so my experience is limited).
I see lots of C# code, but (thankfully) not so much LINQ to SQL. Yes, it is a cool technology. But I sometimes have a problem with the SQL it generates. Since I'm also a SQL developer, I'm sensitive to how queries are constructed, for performance reasons, as well as how they look, for readability and aesthetic reasons. LINQ queries can generate poorly-performing SQL, since LINQ is basically a translator, not an AI. As far as appearances go, LINQ queries can look pretty gnarly, especially if they include subqueries or a few joins. That makes it hard for the SQL dev (me!) to read and understand whether there are performance problems (which there often are, in my experience). So I would tend to code the SQL separately and put it in a SQL view, function or stored procedure. I can still parse the results with LINQ (not LINQ to SQL), which is fine. For similar reasons, I'm not a huge fan of ORMs either. Probably that's my bias towards designing the database first and building up queries to meet the business goals before writing a line of Python, C#, or the language du jour.
-- Gerald Britton, MCSE-DP, MVP LinkedIn Profile: http://ca.linkedin.com/in/geraldbritton

First off, I think PythonQL (and PonyORM before it) is a very interesting piece of technology. However, I think some of the answers so far suggest we may need to discuss a couple of meta-issues around target audiences and available technical options before continuing on. I'm quoting Gerald's post here because it highlights the "target audience" problem, but my comments apply to the thread generally. On 25 March 2017 at 22:51, Gerald Britton <gerald.britton@gmail.com> wrote:
I see lots of C# code, but (thankfully) not so much LINQ to SQL. Yes, it is a cool technology. But I sometimes have a problem with the SQL it generates. Since I'm also a SQL developer, I'm sensitive to how queries are constructed, for performance reasons, as well as how they look, for readability and aesthetic reasons.
LINQ queries can generate poorly-performing SQL, since LINQ is basically a translator, not an AI. As far as appearances go, LINQ queries can look pretty gnarly, especially if they include subqueries or a few joins. That makes it hard for the SQL dev (me!) to read and understand whether there are performance problems (which there often are, in my experience).
So, I would tend to code the SQL separately and put it in a SQL view, function or stored procedure. I can still parse the results with LINQ (not LINQ to SQL), which is fine.
For similar reasons, I'm not a huge fan of ORMs either. Probably that's my bias towards designing the database first and building up queries to meet the business goals before writing a line of Python, C#, or the language du jour.
Right, the target audience here *isn't* folks who already know how to construct their own relational queries in SQL, and it definitely isn't folks that know how to tweak their queries to get optimal performance from the specific database they're using. Rather, it's folks that already know Python's comprehensions, and perhaps some of the itertools features, and the aim is to provide them with a smoother on-ramp into the world of relational data processing.
There's no question that folks dealing with sufficiently large data sets with sufficiently stringent performance requirements are eventually going to want to reach for handcrafted SQL or a distributed computation framework like dask, but that's not really any different from our standard position that when folks are attempting to optimise a hot loop, they're eventually going to have to switch to something that can eliminate the interpreter's default runtime object management overhead (whether that's Cython, PyPy's or Numba's JIT, or writing an extension module in a different language entirely). It isn't an argument against making it easier for folks to postpone the point where they find it necessary to reach for the "something else" that takes them beyond Python's default capabilities.
However, at the same time, PythonQL *is* a DSL for data manipulation operations, and map and filter are far and away the most common of those. Even reduce, which was previously a builtin, was pushed into functools for Python 3.0, with the preferred alternative being to just write a suitably named function that accepts an iterable and returns a single value. And while Python is a very popular tool for data manipulation, it would be a big stretch to assume that that was its primary use case in all contexts.
So it makes sense to review some of the technical options that are available to help make projects like PythonQL more maintainable, without necessarily gating improvements to them on the relatively slow update and rollout cycle of new Python versions.
= Option 1 =
Fully commit to the model of allowing alternate syntactic dialects to run atop Python interpreters. In Hylang and PythonQL we have at least two genuinely interesting examples of that working through the text encoding system, as well as other examples like Cython that work through the extension module system.
So that's an opportunity to take this from "Possible, but a bit hacky" to "Pluggable source code translation is supported at all levels of the interpreter, including debugger source maps, etc" (perhaps by borrowing ideas from other ecosystems like Java, JavaScript, and .NET, where this kind of thing is already a lot more common).
The downside of this approach is that actually making it happen would be getting pretty far afield from the original PythonQL goal of "provide nicer data manipulation abstractions in Python", and it wouldn't actually deliver anything new that can't already be done with existing import and codec system features.
= Option 2 =
Back when f-strings were added for 3.6, I wrote PEP 501 to generalise the idea as "i-strings": exposing the intermediate interpolated form of f-strings, such that you could write code like `myquery = sql(i"SELECT {column} FROM {table};")` where the "sql" function received an "InterpolationTemplate" object that it could render however it wanted, but the "column" and "table" references were just regular Python expressions.
It's currently deferred indefinitely, as I didn't have any concrete use cases that Guido found sufficiently compelling to make the additional complexity worthwhile. However, given optionally delayed rendering of interpolated strings, PythonQL could be used in the form:
result = pyql(i"""
    (x,y)
    for x in {range(1,8)}
    for y in {range(1,7)}
    if x % 2 == 0 and y % 2 != 0 and x > y
""")
I personally like this idea (otherwise I wouldn't have written PEP 501 in the first place), and the necessary technical underpinnings to enable it are all largely already in place to support f-strings. If the PEP were revised to show examples of using it to support relatively seamless calling back and forth between Hylang, PythonQL and regular Python code in the same process, that might be intriguing enough to pique Guido's interest (and I'm open to adding co-authors that are interested in pursuing that).
Option 3:
Go all the way to expanding comprehensions to natively be a full data manipulation DSL.
I'm personally not a fan of that approach, as syntax is really hard to search for help on (keywords are better for that than punctuation, but not by much), while methods and functions get to have docstrings. It also means the query language gets tightly coupled to the Python grammar, which not only makes the query language difficult to update, but also makes Python's base syntax harder for new users to learn.
By contrast, when DSLs are handled as interpolation templates with delayed rendering, then the rendering function gets to provide runtime documentation, and the definition of the DSL is coupled to the update cycle of the rendering function, *not* that of the Python language definition.
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
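A rough sketch of the delayed-rendering idea from Option 2 (the `Template` container, its field layout, and this `sql` renderer are illustrative stand-ins rather than the actual PEP 501 API): the compiler would hand the rendering function the literal text segments and the already-evaluated expression values separately, and the function decides how to combine them.
```
from typing import Any, NamedTuple, Sequence

class Template(NamedTuple):
    # Hypothetical stand-in for PEP 501's interpolation template object:
    # the literal text segments plus the evaluated {...} expression values.
    strings: Sequence[str]   # n + 1 literal segments
    values: Sequence[Any]    # n evaluated expressions

def sql(template):
    """Render to a parameterised query instead of interpolating text."""
    query = "?".join(template.strings)
    return query, list(template.values)

# Roughly what the compiler would hand to sql() for
#   myquery = sql(i"SELECT {column} FROM {table};")
column, table = "name", "users"
print(sql(Template(["SELECT ", " FROM ", ";"], [column, table])))
# ('SELECT ? FROM ?;', ['name', 'users'])
```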

Hi Nick, Thanks for such a detailed response!
On 25 Mar 2017, at 19:40, Nick Coghlan <ncoghlan@gmail.com> wrote:
First off, I think PythonQL (and PonyORM before it) is a very interesting piece of technology. However, I think some of the answers so far suggest we may need to discuss a couple of meta-issues around target audiences and available technical options before continuing on.
I'm quoting Gerald's post here because it highlights the "target audience" problem, but my comments apply to the thread generally.
On 25 March 2017 at 22:51, Gerald Britton <gerald.britton@gmail.com> wrote:
I see lots of C# code, but (thankfully) not so much LINQ to SQL. Yes, it is a cool technology. But I sometimes have a problem with the SQL it generates. Since I'm also a SQL developer, I'm sensitive to how queries are constructed, for performance reasons, as well as how they look, for readability and aesthetic reasons.
LINQ queries can generate poorly-performing SQL, since LINQ is basically a translator, not an AI. As far as appearances go, LINQ queries can look pretty gnarly, especially if they include subqueries or a few joins. That makes it hard for the SQL dev (me!) to read and understand whether there are performance problems (which there often are, in my experience).
So, I would tend to code the SQL separately and put it in a SQL view, function or stored procedure. I can still parse the results with LINQ (not LINQ to SQL), which is fine.
For similar reasons, I'm not a huge fan of ORMs either. Probably that's my bias towards designing the database first and building up queries to meet the business goals before writing a line of Python, C#, or the language du jour.
Right, the target audience here *isn't* folks who already know how to construct their own relational queries in SQL, and it definitely isn't folks that know how to tweak their queries to get optimal performance from the specific database they're using. Rather, it's folks that already know Python's comprehensions, and perhaps some of the itertools features, and helping to provide them with a smoother on-ramp into the world of relational data processing.
Actually I myself am a user of PythonQL, even though I’m an SQL expert. I work in data science, so I do a lot of ad-hoc querying and we always get some new datasets we need to check out and work with. Some things like nested data models are also much better handled by PythonQL, and data like JSON or XML will also be easier to handle. I see even more use-cases coming up, once we get further with smart database wrappers in PythonQL.
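For a sense of the ad-hoc work being described, here is what a typical grouping over nested JSON-style records looks like with plain comprehensions and itertools (standard library only, with invented sample data; PythonQL's own syntax is not shown here):
```
from itertools import groupby

orders = [  # nested JSON-style records
    {"customer": "ada", "items": [{"sku": "a", "qty": 2}, {"sku": "b", "qty": 1}]},
    {"customer": "bob", "items": [{"sku": "a", "qty": 5}]},
    {"customer": "ada", "items": [{"sku": "c", "qty": 3}]},
]

# Flatten, then group-and-aggregate by customer -- the kind of query that
# collapses to a single group-by clause in a query language.
flat = sorted(
    ((o["customer"], item["qty"]) for o in orders for item in o["items"]),
    key=lambda pair: pair[0],
)
totals = {cust: sum(qty for _, qty in grp)
          for cust, grp in groupby(flat, key=lambda pair: pair[0])}
print(totals)   # {'ada': 6, 'bob': 5}
```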
There's no question that folks dealing with sufficiently large data sets with sufficiently stringent performance requirements are eventually going to want to reach for handcrafted SQL or a distributed computation framework like dask, but that's not really any different from our standard position that when folks are attempting to optimise a hot loop, they're eventually going to have to switch to something that can eliminate the interpreter's default runtime object management overhead (whether that's Cython, PyPy's or Numba's JIT, or writing an extension module in a different language entirely). It isn't an argument against making it easier for folks to postpone the point where they find it necessary to reach for the "something else" that takes them beyond Python's default capabilities.
Don’t know - for example, one of the wrappers is going to be an Apache Spark wrapper, so you could quickly hack up a PythonQL query that would be run on a distributed platform.
However, at the same time, PythonQL *is* a DSL for data manipulation operations, and map and filter are far and away the most common of those. Even reduce, which was previously a builtin, was pushed into functools for Python 3.0, with the preferred alternative being to just write a suitably named function that accepts an iterable and returns a single value. And while Python is a very popular tool for data manipulation, it would be a big stretch to assume that that was its primary use case in all contexts.
So it makes sense to review some of the technical options that are available to help make projects like PythonQL more maintainable, without necessarily gating improvements to them on the relatively slow update and rollout cycle of new Python versions.
= Option 1 =
Fully commit to the model of allowing alternate syntactic dialects to run atop Python interpreters. In Hylang and PythonQL we have at least two genuinely interesting examples of that working through the text encoding system, as well as other examples like Cython that work through the extension module system.
So that's an opportunity to take this from "Possible, but a bit hacky" to "Pluggable source code translation is supported at all levels of the interpreter, including debugger source maps, etc" (perhaps by borrowing ideas from other ecosystems like Java, JavaScript, and .NET, where this kind of thing is already a lot more common).
The downside of this approach is that actually making it happen would be getting pretty far afield from the original PythonQL goal of "provide nicer data manipulation abstractions in Python", and it wouldn't actually deliver anything new that can't already be done with existing import and codec system features.
This would be great anyway - if we could rely on some preprocessor directive instead of hacking encodings, that would be nice.
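For reference, the "hacking encodings" approach amounts to registering a source codec that rewrites program text before compilation; a toy sketch follows (the codec name `myql` and the trivial transform are invented, and a real dialect would also provide an incremental decoder for the import machinery):
```
import codecs

def transform(source: str) -> str:
    # Toy "preprocessor": this is where a dialect like PythonQL rewrites its
    # extended syntax into plain Python. Here we just do a substitution.
    return source.replace("select_all", "[x for x in data]")

def decode(raw: bytes, errors: str = "strict"):
    text, length = codecs.utf_8_decode(raw, errors, True)
    return transform(text), length

def search(name: str):
    if name == "myql":   # hypothetical codec name
        return codecs.CodecInfo(
            name="myql",
            encode=codecs.utf_8_encode,   # encoding side stays plain UTF-8
            decode=decode,
        )
    return None

codecs.register(search)

# A module starting with "# -*- coding: myql -*-" would be run through
# transform() at import time; the same machinery can be exercised directly:
print(b"result = select_all".decode("myql"))   # result = [x for x in data]
```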
= Option 2 =
Back when f-strings were added for 3.6, I wrote PEP 501 to generalise the idea as "i-strings": exposing the intermediate interpolated form of f-strings, such that you could write code like `myquery = sql(i"SELECT {column} FROM {table};")` where the "sql" function received an "InterpolationTemplate" object that it could render however it wanted, but the "column" and "table" references were just regular Python expressions.
It's currently deferred indefinitely, as I didn't have any concrete use cases that Guido found sufficiently compelling to make the additional complexity worthwhile. However, given optionally delayed rendering of interpolated strings, PythonQL could be used in the form:
result = pyql(i"""
    (x,y)
    for x in {range(1,8)}
    for y in {range(1,7)}
    if x % 2 == 0 and y % 2 != 0 and x > y
""")
I personally like this idea (otherwise I wouldn't have written PEP 501 in the first place), and the necessary technical underpinnings to enable it are all largely already in place to support f-strings. If the PEP were revised to show examples of using it to support relatively seamless calling back and forth between Hylang, PythonQL and regular Python code in the same process, that might be intriguing enough to pique Guido's interest (and I'm open to adding co-authors that are interested in pursuing that).
What would be the difference between this and just executing a PythonQL string for us, getting local and global variables into PythonQL scope?
Option 3:
Go all the way to expanding comprehensions to natively be a full data manipulation DSL.
I'm personally not a fan of that approach, as syntax is really hard to search for help on (keywords are better for that than punctuation, but not by much), while methods and functions get to have docstrings. It also means the query language gets tightly coupled to the Python grammar, which not only makes the query language difficult to update, but also makes Python's base syntax harder for new users to learn.
By contrast, when DSLs are handled as interpolation templates with delayed rendering, then the rendering function gets to provide runtime documentation, and the definition of the DSL is coupled to the update cycle of the rendering function, *not* that of the Python language definition.
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 26 March 2017 at 21:40, Pavel Velikhov <pavel.velikhov@gmail.com> wrote:
On 25 Mar 2017, at 19:40, Nick Coghlan <ncoghlan@gmail.com> wrote:
Right, the target audience here *isn't* folks who already know how to construct their own relational queries in SQL, and it definitely isn't folks that know how to tweak their queries to get optimal performance from the specific database they're using. Rather, it's folks that already know Python's comprehensions, and perhaps some of the itertools features, and helping to provide them with a smoother on-ramp into the world of relational data processing.
Actually I myself am a user of PythonQL, even though I’m an SQL expert. I work in data science, so I do a lot of ad-hoc querying and we always get some new datasets we need to check out and work with. Some things like nested data models are also much better handled by PythonQL, and data like JSON or XML will also be easier to handle.
So perhaps a better way of framing it would be to say that PythonQL aims to provide a middle ground between interfaces that are fully in "Python mode" (e.g. ORMs, pandas DataFrames), where the primary interface is methods-on-objects, and those that are fully in "data manipulation mode" (e.g. raw SQL, lower level XML and JSON APIs).
At the Python level, success for PythonQL would look like people being able to seamlessly transfer their data manipulation skills from a Django ORM project to an SQL Alchemy project to a pandas analysis project to a distributed data analysis project in dask, without their data manipulation code really having to change - only the backing data structures and the runtime performance characteristics would differ.
At the data manipulation layer, success for PythonQL would look like people being able to easily get "good enough" performance for one-off scripts, regardless of the backing data store, with closer attention to detail only being needed for genuinely large data sets (where efficiency matters even for one-off analyses), or for frequently repeated operations (where wasted CPU hours show up as increased infrastructure expenses).
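As a concrete illustration of the two ends of that spectrum, the same filter in "Python mode" (pandas) and in "data manipulation mode" (handwritten SQL via the stdlib sqlite3 module), with sample data invented for the example:
```
import sqlite3
import pandas as pd

df = pd.DataFrame({"name": ["ada", "bob", "eve"], "score": [91, 67, 88]})

# Python mode: methods on objects.
top_pandas = df[df["score"] > 80]["name"].tolist()

# Data manipulation mode: handwritten SQL against an in-memory table.
conn = sqlite3.connect(":memory:")
df.to_sql("results", conn, index=False)
top_sql = [row[0] for row in
           conn.execute("SELECT name FROM results WHERE score > 80")]

print(top_pandas, top_sql)   # ['ada', 'eve'] ['ada', 'eve']
```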
There's no question that folks dealing with sufficiently large data sets with sufficiently stringent performance requirements are eventually going to want to reach for handcrafted SQL or a distributed computation framework like dask, but that's not really any different from our standard position that when folks are attempting to optimise a hot loop, they're eventually going to have to switch to something that can eliminate the interpreter's default runtime object management overhead (whether that's Cython, PyPy's or Numba's JIT, or writing an extension module in a different language entirely). It isn't an argument against making it easier for folks to postpone the point where they find it necessary to reach for the "something else" that takes them beyond Python's default capabilities.
Don’t know - for example, one of the wrappers is going to be an Apache Spark wrapper, so you could quickly hack up a PythonQL query that would be run on a distributed platform.
Right, I meant this in the same sense that folks using an ORM like SQL Alchemy may eventually hit a point where rather than trying to convince the ORM to emit the SQL they want to run, it's easier to just bypass the ORM layer and write the exact SQL they want. It's worthwhile attempting to reduce the number of cases where folks feel obliged to do that, but at the same time, abstraction layers need to hide at least some lower level details if they're going to actually work properly.
= Option 1 =
Fully commit to the model of allowing alternate syntactic dialects to run atop Python interpreters. In Hylang and PythonQL we have at least two genuinely interesting examples of that working through the text encoding system, as well as other examples like Cython that work through the extension module system.
So that's an opportunity to take this from "Possible, but a bit hacky" to "Pluggable source code translation is supported at all levels of the interpreter, including debugger source maps, etc" (perhaps by borrowing ideas from other ecosystems like Java, JavaScript, and .NET, where this kind of thing is already a lot more common).
The downside of this approach is that actually making it happen would be getting pretty far afield from the original PythonQL goal of "provide nicer data manipulation abstractions in Python", and it wouldn't actually deliver anything new that can't already be done with existing import and codec system features.
This would be great anyway - if we could rely on some preprocessor directive instead of hacking encodings, that would be nice.
Victor Stinner wrote up some ideas about that in PEP 511: https://www.python.org/dev/peps/pep-0511/
Preprocessing is one of the specific use cases considered: https://www.python.org/dev/peps/pep-0511/#usage-2-preprocessor
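PEP 511 itself was never implemented, but the kind of preprocessing pass it describes can be sketched today with the stdlib ast module; the `select_all` rewrite below is a made-up toy standing in for a PythonQL-style transformation:
```
import ast

class SwapSelectAll(ast.NodeTransformer):
    """Toy preprocessing pass: rewrite select_all(xs) calls into a list
    comprehension, standing in for the kind of rewriting a PEP 511 code
    transformer (or a PythonQL preprocessor) would perform."""
    def visit_Call(self, node):
        self.generic_visit(node)
        if isinstance(node.func, ast.Name) and node.func.id == "select_all":
            return ast.ListComp(
                elt=ast.Name(id="_x", ctx=ast.Load()),
                generators=[ast.comprehension(
                    target=ast.Name(id="_x", ctx=ast.Store()),
                    iter=node.args[0], ifs=[], is_async=0)])
        return node

tree = SwapSelectAll().visit(ast.parse("result = select_all(range(3))"))
tree = ast.fix_missing_locations(tree)
ns = {}
exec(compile(tree, "<demo>", "exec"), ns)
print(ns["result"])   # [0, 1, 2]
```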
= Option 2 =
... given optionally delayed rendering of interpolated strings, PythonQL could be used in the form:
result = pyql(i"""
    (x,y)
    for x in {range(1,8)}
    for y in {range(1,7)}
    if x % 2 == 0 and y % 2 != 0 and x > y
""")
I personally like this idea (otherwise I wouldn't have written PEP 501 in the first place), and the necessary technical underpinnings to enable it are all largely already in place to support f-strings. If the PEP were revised to show examples of using it to support relatively seamless calling back and forth between Hylang, PythonQL and regular Python code in the same process, that might be intriguing enough to pique Guido's interest (and I'm open to adding co-authors that are interested in pursuing that).
What would be the difference between this and just executing a PythonQL string for us, getting local and global variables into PythonQL scope?
The big new technical capability that f-strings introduced is that the compiler can see the variable references in the embedded expressions, so f-strings "just work" with closure references, whereas passing locals() and globals() explicitly is:
1. slow (since you have to generate a full locals dict);
2. incompatible with the use of closure variables (since they're not visible in either locals() *or* globals())
The i-strings concept takes that closure-compatible interpolation capability and separates it from the str.format based rendering step.
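A minimal demonstration of point 2 (the function names, values, and the toy query string are all invented for the example):
```
def outer():
    factor = 3  # defined in the enclosing scope

    def with_fstring():
        # The compiler sees {factor}, creates a closure cell, and the
        # reference just works.
        return f"factor doubled is {factor * 2}"

    def with_explicit_scopes():
        # 'factor' only appears inside the query text, so it is in neither
        # this function's locals() nor in globals() -- eval() cannot see it.
        query = "[x * factor for x in range(5)]"
        try:
            return eval(query, globals(), locals())
        except NameError as exc:
            return f"failed: {exc}"

    print(with_fstring())          # factor doubled is 6
    print(with_explicit_scopes())  # failed: name 'factor' is not defined

outer()
```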
From a speed perspective, the interpolation aspects of this approach are so efficient they rival simple string concatenation:
$ python -m perf timeit -s 'first = "Hello"; second = " World!"' 'first + second' ..................... Mean +- std dev: 71.7 ns +- 2.1 ns
$ python -m perf timeit -s 'first = "Hello"; second = " World!"' 'f"{first}{second}"' ..................... Mean +- std dev: 77.8 ns +- 2.5 ns
Something like pyql that did more than just concatenate the text sections with the text values of the embedded expressions would still need some form of regex-style caching strategy to avoid parsing the same query string multiple times, but the Python interpreter would handle the task of breaking up the string into the text sections and the interpolated Python expressions.
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
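A rough sketch of the regex-style caching strategy described above (the `pyql` entry point and the stand-in parser are hypothetical; functools.lru_cache plays the role that re's compiled-pattern cache plays for regular expressions):
```
import functools

def _expensive_parse(query_text: str) -> tuple:
    # Stand-in for a real query parser: pretend splitting into clauses is
    # the costly step we only want to pay once per distinct query string.
    return tuple(clause.strip() for clause in query_text.split("for"))

@functools.lru_cache(maxsize=128)
def _cached_parse(query_text: str) -> tuple:
    return _expensive_parse(query_text)

def pyql(query_text: str, **bindings):
    plan = _cached_parse(query_text)   # parsed once per distinct string
    return plan, bindings              # a real implementation would execute the plan

q = "(x, y) for x in xs for y in ys if x > y"
pyql(q, xs=range(8), ys=range(7))
pyql(q, xs=range(3), ys=range(2))        # second call reuses the parse
print(_cached_parse.cache_info().hits)   # 1
```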

Hi Nick!
On 26 Mar 2017, at 18:02, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 26 March 2017 at 21:40, Pavel Velikhov <pavel.velikhov@gmail.com> wrote:
On 25 Mar 2017, at 19:40, Nick Coghlan <ncoghlan@gmail.com> wrote:
Right, the target audience here *isn't* folks who already know how to construct their own relational queries in SQL, and it definitely isn't folks that know how to tweak their queries to get optimal performance from the specific database they're using. Rather, it's folks that already know Python's comprehensions, and perhaps some of the itertools features, and helping to provide them with a smoother on-ramp into the world of relational data processing.
Actually I myself am a user of PythonQL, even though I’m an SQL expert. I work in data science, so I do a lot of ad-hoc querying and we always get some new datasets we need to check out and work with. Some things like nested data models are also much better handled by PythonQL, and data like JSON or XML will also be easier to handle.
So perhaps a better way of framing it would be to say that PythonQL aims to provide a middle ground between interfaces that are fully in "Python mode" (e.g ORMs, pandas DataFrames), where the primary interface is methods-on-objects, and those that are fully in "data manipulation mode" (e.g. raw SQL, lower level XML and JSON APIs).
At the Python level, success for PythonQL would look like people being able to seamlessly transfer their data manipulation skills from a Django ORM project to an SQL Alchemy project to a pandas analysis project to a distributed data analysis project in dask, without their data manipulation code really having to change - only the backing data structures and the runtime performance characteristics would differ.
At the data manipulation layer, success for PythonQL would look like people being able to easily get "good enough" performance for one-off scripts, regardless of the backing data store, with closer attention to detail only being needed for genuinely large data sets (where efficiency matters even for one-off analyses), or for frequently repeated operations (where wasted CPU hours show up as increased infrastructure expenses).
Yes, more along these lines. It is possible for us to provide decent-looking hints for query optimization, and we are planning a sophisticated optimizer in the future, but especially at the beginning of the project this sounds quite fair.
There's no question that folks dealing with sufficiently large data sets with sufficiently stringent performance requirements are eventually going to want to reach for handcrafted SQL or a distributed computation framework like dask, but that's not really any different from our standard position that when folks are attempting to optimise a hot loop, they're eventually going to have to switch to something that can eliminate the interpreter's default runtime object management overhead (whether that's Cython, PyPy's or Numba's JIT, or writing an extension module in a different language entirely). It isn't an argument against making it easier for folks to postpone the point where they find it necessary to reach for the "something else" that takes them beyond Python's default capabilities.
Don’t know - for example, one of the wrappers is going to be an Apache Spark wrapper, so you could quickly hack up a PythonQL query that would be run on a distributed platform.
Right, I meant this in the same sense that folks using an ORM like SQL Alchemy may eventually hit a point where rather than trying to convince the ORM to emit the SQL they want to run, it's easier to just bypass the ORM layer and write the exact SQL they want.
It's worthwhile attempting to reduce the number of cases where folks feel obliged to do that, but at the same time, abstraction layers need to hide at least some lower level details if they're going to actually work properly.
= Option 1 =
Fully commit to the model of allowing alternate syntactic dialects to run atop Python interpreters. In Hylang and PythonQL we have at least two genuinely interesting examples of that working through the text encoding system, as well as other examples like Cython that work through the extension module system.
So that's an opportunity to take this from "Possible, but a bit hacky" to "Pluggable source code translation is supported at all levels of the interpreter, including debugger source maps, etc" (perhaps by borrowing ideas from other ecosystems like Java, JavaScript, and .NET, where this kind of thing is already a lot more common).
The downside of this approach is that actually making it happen would be getting pretty far afield from the original PythonQL goal of "provide nicer data manipulation abstractions in Python", and it wouldn't actually deliver anything new that can't already be done with existing import and codec system features.
This would be great anyway - if we could rely on some preprocessor directive instead of hacking encodings, that would be nice.
Victor Stinner wrote up some ideas about that in PEP 511: https://www.python.org/dev/peps/pep-0511/
Preprocessing is one of the specific use cases considered: https://www.python.org/dev/peps/pep-0511/#usage-2-preprocessor
= Option 2 =
... given optionally delayed rendering of interpolated strings, PythonQL could be used in the form:
result = pyql(i"""
    (x,y)
    for x in {range(1,8)}
    for y in {range(1,7)}
    if x % 2 == 0 and y % 2 != 0 and x > y
""")
I personally like this idea (otherwise I wouldn't have written PEP 501 in the first place), and the necessary technical underpinnings to enable it are all largely already in place to support f-strings. If the PEP were revised to show examples of using it to support relatively seamless calling back and forth between Hylang, PythonQL and regular Python code in the same process, that might be intriguing enough to pique Guido's interest (and I'm open to adding co-authors that are interested in pursuing that).
What would be the difference between this and just executing a PythonQL string for us, getting local and global variables into PythonQL scope?
The big new technical capability that f-strings introduced is that the compiler can see the variable references in the embedded expressions, so f-strings "just work" with closure references, whereas passing locals() and globals() explicitly is:
1. slow (since you have to generate a full locals dict); 2. incompatible with the use of closure variables (since they're not visible in either locals() *or* globals())
The i-strings concept takes that closure-compatible interpolation capability and separates it from the str.format based rendering step.
From a speed perspective, the interpolation aspects of this approach are so efficient they rival simple string concatenation:
$ python -m perf timeit -s 'first = "Hello"; second = " World!"' 'first + second' ..................... Mean +- std dev: 71.7 ns +- 2.1 ns
$ python -m perf timeit -s 'first = "Hello"; second = " World!"' 'f"{first}{second}"' ..................... Mean +- std dev: 77.8 ns +- 2.5 ns
Something like pyql that did more than just concatenate the text sections with the text values of the embedded expressions would still need some form of regex-style caching strategy to avoid parsing the same query string multiple times, but the Python interpreter would handle the task of breaking up the string into the text sections and the interpolated Python expressions.
Thanks, will start following this proposal!
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sun, Mar 26, 2017 at 10:02 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 26 March 2017 at 21:40, Pavel Velikhov <pavel.velikhov@gmail.com> wrote:
On 25 Mar 2017, at 19:40, Nick Coghlan <ncoghlan@gmail.com> wrote:
Right, the target audience here *isn't* folks who already know how to construct their own relational queries in SQL, and it definitely isn't folks that know how to tweak their queries to get optimal performance from the specific database they're using. Rather, it's folks that already know Python's comprehensions, and perhaps some of the itertools features, and helping to provide them with a smoother on-ramp into the world of relational data processing.
Actually I myself am a user of PythonQL, even though I’m an SQL expert.
I work in data science, so
I do a lot of ad-hoc querying and we always get some new datasets we need to check out and work with. Some things like nested data models are also much better handled by PythonQL, and data like JSON or XML will also be easier to handle.
So perhaps a better way of framing it would be to say that PythonQL aims to provide a middle ground between interfaces that are fully in "Python mode" (e.g ORMs, pandas DataFrames), where the primary interface is methods-on-objects, and those that are fully in "data manipulation mode" (e.g. raw SQL, lower level XML and JSON APIs).
At the Python level, success for PythonQL would look like people being able to seamlessly transfer their data manipulation skills from a Django ORM project to an SQL Alchemy project to a pandas analysis project to a distributed data analysis project in dask, without their data manipulation code really having to change - only the backing data structures and the runtime performance characteristics would differ.
e.g. Django ORM to SQLAlchemy:
- Does this necessarily imply a metamodel for relations?
- Django: GenericForeignKey
- SQLAlchemy: sqlalchemy_utils.generic_relationship
...
At the data manipulation layer, success for PythonQL would look like people being able to easily get "good enough" performance for one-off scripts, regardless of the backing data store, with closer attention to detail only being needed for genuinely large data sets (where efficiency matters even for one-off analyses), or for frequently repeated operations (where wasted CPU hours show up as increased infrastructure expenses).
http://pandas.pydata.org/pandas-docs/stable/ecosystem.html#out-of-core (dask, blaze, odo, ) http://blaze.pydata.org/
- blaze - | Src: https://github.com/blaze/blaze - | Docs: https://blaze.readthedocs.io/en/latest/rosetta-pandas.html - | Docs: https://blaze.readthedocs.io/en/latest/rosetta-sql.html - | Docs: https://blaze.readthedocs.io/en/latest/backends.html - "Python, Pandas, SQLAlchemy, MongoDB, PyTables, and Spark"
- dask - | Src: https://github.com/dask/dask - | Docs: http://dask.pydata.org/en/latest/#familiar-user-interface ** - | Docs: http://dask.pydata.org/en/latest/scheduler-choice.html - http://xarray.pydata.org/en/stable/dask.html
- odo - | Src: https://github.com/blaze/blaze#odo - | Docs: https://odo.readthedocs.io/en/latest/#formats
- zero-copy - https://www.slideshare.net/wesm/memory-interoperability-in-analytics-and-mac... - https://github.com/alex/zero_buffer/blob/master/zero_buffer.py
- ibis - | Src: https://github.com/cloudera/ibis - (a backend may compile to LLVM) - seeAlso: blaze, dask, "bulk synchronous parallel" - | Docs: http://docs.ibis-project.org/sql.html - | Docs: http://docs.ibis-project.org/tutorial.html "Expression tutortial" jupyter notebooks - | Docs: http://docs.ibis-project.org/ - Apache Impala (incubating) - Apache Kudu (incubating) - Hadoop Distributed File System (HDFS) - PostgreSQL (Experimental) - SQLite - [ SQLAlchemy: { ... } ] - | Src: https://github.com/cloudera/ibis/blob/master/ibis/sql/alchemy.py
- apache beam - https://beam.apache.org/documentation/sdks/python/ - https://beam.apache.org/get-started/quickstart-py/ (pip install apache-beam) - https://beam.apache.org/documentation/sdks/pydoc/0.6.0/ - apache_beam.transforms - https://beam.apache.org/documentation/programming-guide/#transforms "Applying transforms"
Somewhere in this list, these become big data tools.
There's no question that folks dealing with sufficiently large data sets with sufficiently stringent performance requirements are eventually going to want to reach for handcrafted SQL or a distributed computation framework like dask, but that's not really any different from our standard position that when folks are attempting to optimise a hot loop, they're eventually going to have to switch to something that can eliminate the interpreter's default runtime object management overhead (whether that's Cython, PyPy's or Numba's JIT, or writing an extension module in a different language entirely). It isn't an argument against making it easier for folks to postpone the point where they find it necessary to reach for the "something else" that takes them beyond Python's default capabilities.
Don’t know - for example, one of the wrappers is going to be an Apache Spark wrapper, so you could quickly hack up a PythonQL query that would be run on a distributed platform.
Right, I meant this in the same sense that folks using an ORM like SQL Alchemy may eventually hit a point where rather than trying to convince the ORM to emit the SQL they want to run, it's easier to just bypass the ORM layer and write the exact SQL they want.
At that point one can either:
- reflect the tables/mappings at devtime
- reflect the tables/mappings at runtime
And then run the raw DBAPI query (using appropriate query interpolation (-> i-strings and scoped configuration state)):
session.execute("SELECT dbapi_version FROM ?", "tbl;name")
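For instance, runtime reflection plus a handwritten statement might look like the following (assumes SQLAlchemy 1.4+ and an existing `users` table with `id` and `name` columns in a local SQLite file; the names are illustrative only):
```
from sqlalchemy import MetaData, Table, create_engine, select

engine = create_engine("sqlite:///example.db")   # assumed existing database
metadata = MetaData()

# Reflect the table definition from the live database at runtime.
users = Table("users", metadata, autoload_with=engine)

with engine.connect() as conn:
    rows = conn.execute(
        select(users.c.name).where(users.c.id > 100)
    ).fetchall()
```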
It's worthwhile attempting to reduce the number of cases where folks feel obliged to do that, but at the same time, abstraction layers need to hide at least some lower level details if they're going to actually work properly.
= Option 1 =
Fully commit to the model of allowing alternate syntactic dialects to run atop Python interpreters. In Hylang and PythonQL we have at least two genuinely interesting examples of that working through the text encoding system, as well as other examples like Cython that work through the extension module system.
So that's an opportunity to take this from "Possible, but a bit hacky" to "Pluggable source code translation is supported at all levels of the interpreter, including debugger source maps, etc" (perhaps by borrowing ideas from other ecosystems like Java, JavaScript, and .NET, where this kind of thing is already a lot more common).
The downside of this approach is that actually making it happen would be getting pretty far afield from the original PythonQL goal of "provide nicer data manipulation abstractions in Python", and it wouldn't actually deliver anything new that can't already be done with existing import and codec system features.
This would be great anyway - if we could rely on some preprocessor directive instead of hacking encodings, that would be nice.
Victor Stinner wrote up some ideas about that in PEP 511: https://www.python.org/dev/peps/pep-0511/
Preprocessing is one of the specific use cases considered: https://www.python.org/dev/peps/pep-0511/#usage-2-preprocessor
= Option 2 =
... given optionally delayed rendering of interpolated strings, PythonQL could be used in the form:
result = pyql(i"""
    (x,y)
    for x in {range(1,8)}
    for y in {range(1,7)}
    if x % 2 == 0 and y % 2 != 0 and x > y
""")
I personally like this idea (otherwise I wouldn't have written PEP 501 in the first place), and the necessary technical underpinnings to enable it are all largely already in place to support f-strings. If the PEP were revised to show examples of using it to support relatively seamless calling back and forth between Hylang, PythonQL and regular Python code in the same process, that might be intriguing enough to pique Guido's interest (and I'm open to adding co-authors that are interested in pursuing that).
What would be the difference between this and just executing a PythonQL string for us, getting local and global variables into PythonQL scope?
The big new technical capability that f-strings introduced is that the compiler can see the variable references in the embedded expressions, so f-strings "just work" with closure references, whereas passing locals() and globals() explicitly is:
1. slow (since you have to generate a full locals dict); 2. incompatible with the use of closure variables (since they're not visible in either locals() *or* globals())
The i-strings concept takes that closure-compatible interpolation capability and separates it from the str.format based rendering step.
From a speed perspective, the interpolation aspects of this approach are so efficient they rival simple string concatenation:
$ python -m perf timeit -s 'first = "Hello"; second = " World!"' 'first + second' ..................... Mean +- std dev: 71.7 ns +- 2.1 ns
$ python -m perf timeit -s 'first = "Hello"; second = " World!"' 'f"{first}{second}"' ..................... Mean +- std dev: 77.8 ns +- 2.5 ns
Something like pyql that did more than just concatenate the text sections with the text values of the embedded expressions would still need some form of regex-style caching strategy to avoid parsing the same query string multiple times, but the Python interpreter would handle the task of breaking up the string into the text sections and the interpolated Python expressions.
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sun, Mar 26, 2017 at 10:42 PM, Wes Turner <wes.turner@gmail.com> wrote:
On Sun, Mar 26, 2017 at 10:02 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 26 March 2017 at 21:40, Pavel Velikhov <pavel.velikhov@gmail.com> wrote:
On 25 Mar 2017, at 19:40, Nick Coghlan <ncoghlan@gmail.com> wrote:
Right, the target audience here *isn't* folks who already know how to construct their own relational queries in SQL, and it definitely isn't folks that know how to tweak their queries to get optimal performance from the specific database they're using. Rather, it's folks that already know Python's comprehensions, and perhaps some of the itertools features, and helping to provide them with a smoother on-ramp into the world of relational data processing.
Actually I myself am a user of PythonQL, even though I’m an SQL expert.
I work in data science, so
I do a lot of ad-hoc querying and we always get some new datasets we need to check out and work with. Some things like nested data models are also much better handled by PythonQL, and data like JSON or XML will also be easier to handle.
So perhaps a better way of framing it would be to say that PythonQL aims to provide a middle ground between interfaces that are fully in "Python mode" (e.g ORMs, pandas DataFrames), where the primary interface is methods-on-objects, and those that are fully in "data manipulation mode" (e.g. raw SQL, lower level XML and JSON APIs).
At the Python level, success for PythonQL would look like people being able to seamlessly transfer their data manipulation skills from a Django ORM project to an SQL Alchemy project to a pandas analysis project to a distributed data analysis project in dask, without their data manipulation code really having to change - only the backing data structures and the runtime performance characteristics would differ.
e.g. Django ORM to SQLAlchemy:
- Does this necessarily imply a metamodel for relations?
- Django: GenericForeignKey
- SQLAlchemy: sqlalchemy_utils.generic_relationship
Does this necessarily imply a metamodel for relations?
Edges are expressed differently in different frameworks; ultimately you're looking at a projection of a graph (a constructed subset of a graph). So solving this in the general case implies solving for graphs (as well as, which includes, tree-based hierarchical data like SQL, arrays, documents, keys and values).
1. Schema ("metamodel")
2. Query language
Q: How can Linked Data help define a metamodel for expressing relations (in order to harmonize search of disparate datasets)?
- It's a graph with schema constraints.
- Use URIs for Classes ("types"), Properties ("columns", "attributes"), and instances with @ids ("rows")
- RDF, RDFS, OWL
- Search n databases asynchronously with SPARQL federation
  - Native SPARQL database (adapt the data)
  - SPARQL facade/frontend (adapt to an interface)
- Define/generate a schema representation for arbitrary data sources which {query language} {implementation A} can use to plan data-local queries and transformations
  - JSONLD @context for data sources
### Property
Relations are expressed as properties of class instances.
- rdf:Property
- schema:Property https://meta.schema.org/Property
  - https://meta.schema.org/inverseOf
- owl:inverseOf https://www.w3.org/TR/owl-ref/#inverseOf-def
Q: > "How can you provide documentation about the columns in a CSV file?" https://www.w3.org/TR/tabular-data-primer/#documentation-columns
A: CSVW as [JSONLD,]
A: https://wrdrd.com/docs/consulting/linkedreproducibility#csv-csvw-and-metadat...
- How do we make these work with various stores?
- How do we include columnar metadata like physical units and precision in databases without support for it?
  - JSON-LD manifest?
AFAIU, these don't handle relations:
- http://datashape.pydata.org/
- https://github.com/apache/arrow/blob/master/format/Metadata.md
Q: "How can you describe the schema for multi-dimensional datasets (with complex relations)?"
A: https://www.w3.org/TR/vocab-data-cube/#data-cubes
The relations are otherwise defined as RDFS/OWL (e.g. as JSON-LD).
## Graph queries
### SPARQL
- SPARQL is a W3C Web Standard query language.
- SPARQL is not the only graph query language.
### Blueprints, Gremlin
Blueprints is a graph traversal/query API.
- There are many blueprints API implementations (e.g. Rexster, Neo4j <https://wrdrd.com/docs/consulting/knowledge-engineering#neo4j>, Blazegraph <https://wrdrd.com/docs/consulting/knowledge-engineering#blazegraph>, Accumulo <https://wrdrd.com/docs/consulting/knowledge-engineering#accumulo>)
Gremlin implements the Blueprints API (also in Python); it's also generic like LINQ (like JDBC for graph databases): https://tinkerpop.apache.org/docs/current/reference/#gremlin-python
### GraphQL
https://github.com/graphql-python/
... supporting relations across ORMs would be cool; with enough abstraction IDK why it wouldn't look like RDFS/OWL.
...
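To ground the SPARQL side in Python terms, a minimal client call might look like the following (assumes the third-party SPARQLWrapper package and the public DBpedia endpoint; the query itself is illustrative only):
```
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?label WHERE {
      <http://dbpedia.org/resource/Python_(programming_language)> rdfs:label ?label .
      FILTER (lang(?label) = "en")
    }
""")
sparql.setReturnFormat(JSON)

# Each binding is a dict of variable name -> {"type": ..., "value": ...}
for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["label"]["value"])
```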
At the data manipulation layer, success for PythonQL would look like people being able to easily get "good enough" performance for one-off scripts, regardless of the backing data store, with closer attention to detail only being needed for genuinely large data sets (where efficiency matters even for one-off analyses), or for frequently repeated operations (where wasted CPU hours show up as increased infrastructure expenses).
http://pandas.pydata.org/pandas-docs/stable/ecosystem.html#out-of-core (dask, blaze, odo, )
- blaze - | Src: https://github.com/blaze/blaze - | Docs: https://blaze.readthedocs.io/en/latest/rosetta-pandas.html - | Docs: https://blaze.readthedocs.io/en/latest/rosetta-sql.html - | Docs: https://blaze.readthedocs.io/en/latest/backends.html - "Python, Pandas, SQLAlchemy, MongoDB, PyTables, and Spark"
*URIs* https://blaze.readthedocs.io/en/latest/uri.html#what-sorts-of-uris-does-blaz...
```
What sorts of URIs does Blaze support?
Paths to files on disk, including the following extensions
- .csv
- .json
- .csv.gz/json.gz
- .hdf5 (uses h5py)
- .hdf5::/datapath
- hdfstore://filename.hdf5 (uses special pandas.HDFStore format)
- .bcolz
- .xls(x)
SQLAlchemy strings like the following
- sqlite:////absolute/path/to/myfile.db::tablename
- sqlite:////absolute/path/to/myfile.db (specify a particular table)
- postgresql://username:password@hostname:port
- impala://hostname (uses impyla)
- anything supported by SQLAlchemy
MongoDB Connection strings of the following form
- mongodb://username:password@hostname:port/database_name::collection_name
Blaze server strings of the following form
- blaze://hostname:port (port defaults to 6363)
In all cases when a location or table name is required in addition to the traditional URI (e.g. a data path within an HDF5 file or a Table/Collection name within a database) *then that information follows on the end of the URI after a separator of two colons ::.*
How it works
Blaze depends on the Odo library to handle URIs. URIs are managed through the resource function which is dispatched based on regular expressions. For example a simple resource function to handle .json files might look like the following (although Blaze’s actual solution is a bit more comprehensive):
from blaze import resource
import json

@resource.register('.+\.json')
def resource_json(uri):
    with open(uri) as f:
        data = json.load(f)
    return data

Can I extend this to my own types?
Absolutely. Import and extend *resource* as shown in the "How it works" section. The rest of Blaze will pick up your change automatically.
```
- dask - | Src: https://github.com/dask/dask - | Docs: http://dask.pydata.org/en/latest/#familiar-user-interface ** - | Docs: http://dask.pydata.org/en/latest/scheduler-choice.html - http://xarray.pydata.org/en/stable/dask.html
- odo - | Src: https://github.com/blaze/blaze#odo - | Docs: https://odo.readthedocs.io/en/latest/#formats
- zero-copy - https://www.slideshare.net/wesm/memory-interoperability- in-analytics-and-machine-learning - https://github.com/alex/zero_buffer/blob/master/zero_buffer.py
- ibis - | Src: https://github.com/cloudera/ibis - (a backend may compile to LLVM) - seeAlso: blaze, dask, "bulk synchronous parallel" - | Docs: http://docs.ibis-project.org/sql.html - | Docs: http://docs.ibis-project.org/tutorial.html "Expression tutortial" jupyter notebooks - | Docs: http://docs.ibis-project.org/ - Apache Impala (incubating) - Apache Kudu (incubating) - Hadoop Distributed File System (HDFS) - PostgreSQL (Experimental) - SQLite - [ SQLAlchemy: { ... } ] - | Src: https://github.com/cloudera/ibis/blob/master/ibis/sql/ alchemy.py
- apache beam - https://beam.apache.org/documentation/sdks/python/ - https://beam.apache.org/get-started/quickstart-py/ (pip install apache-beam) - https://beam.apache.org/documentation/sdks/pydoc/0.6.0/ - apache_beam.transforms - https://beam.apache.org/documentation/programming-guide/#transforms "Applying transforms"
Somewhere in this list, these become big data tools.
There's no question that folks dealing with sufficiently large data sets with sufficiently stringent performance requirements are eventually going to want to reach for handcrafted SQL or a distributed computation framework like dask, but that's not really any different from our standard position that when folks are attempting to optimise a hot loop, they're eventually going to have to switch to something that can eliminate the interpreter's default runtime object management overhead (whether that's Cython, PyPy's or Numba's JIT, or writing an extension module in a different language entirely). It isn't an argument against making it easier for folks to postpone the point where they find it necessary to reach for the "something else" that takes them beyond Python's default capabilities.
Don’t know - for example, one of the wrappers is going to be an Apache Spark wrapper, so you could quickly hack up a PythonQL query that would be run on a distributed platform.
Right, I meant this in the same sense that folks using an ORM like SQL Alchemy may eventually hit a point where rather than trying to convince the ORM to emit the SQL they want to run, it's easier to just bypass the ORM layer and write the exact SQL they want.
At that point one can either: - reflect the tables/mappings at devtime - reflect the tables/mappings at runtime
And then run the raw DBAPI query (using appropriate query interpolation (-> i-strings and scoped configuration state)):
session.execute("SELECT dbapi_version FROM ?", "tbl;name")
It's worthwhile attempting to reduce the number of cases where folks feel obliged to do that, but at the same time, abstraction layers need to hide at least some lower level details if they're going to actually work properly.
= Option 1 =
Fully commit to the model of allowing alternate syntactic dialects to run atop Python interpreters. In Hylang and PythonQL we have at least two genuinely interesting examples of that working through the text encoding system, as well as other examples like Cython that work through the extension module system.
So that's an opportunity to take this from "Possible, but a bit hacky" to "Pluggable source code translation is supported at all levels of the interpreter, including debugger source maps, etc" (perhaps by borrowing ideas from other ecosystems like Java, JavaScript, and .NET, where this kind of thing is already a lot more common).
The downside of this approach is that actually making it happen would be getting pretty far afield from the original PythonQL goal of "provide nicer data manipulation abstractions in Python", and it wouldn't actually deliver anything new that can't already be done with existing import and codec system features.
This would be great anyway - if we could rely on some preprocessor directive instead of hacking encodings, that would be nice.
Victor Stinner wrote up some ideas about that in PEP 511: https://www.python.org/dev/peps/pep-0511/
Preprocessing is one of the specific use cases considered: https://www.python.org/dev/peps/pep-0511/#usage-2-preprocessor
= Option 2 =
... given optionally delayed rendering of interpolated strings, PythonQL could be used in the form:
result = pyql(i"""
    (x,y)
    for x in {range(1,8)}
    for y in {range(1,7)}
    if x % 2 == 0 and y % 2 != 0 and x > y
""")
I personally like this idea (otherwise I wouldn't have written PEP 501 in the first place), and the necessary technical underpinnings to enable it are all largely already in place to support f-strings. If the PEP were revised to show examples of using it to support relatively seamless calling back and forth between Hylang, PythonQL and regular Python code in the same process, that might be intriguing enough to pique Guido's interest (and I'm open to adding co-authors that are interested in pursuing that).
What would be the difference between this and just executing a PythonQL string for us, getting local and global variables into PythonQL scope?
The big new technical capability that f-strings introduced is that the compiler can see the variable references in the embedded expressions, so f-strings "just work" with closure references, whereas passing locals() and globals() explicitly is:
1. slow (since you have to generate a full locals dict); 2. incompatible with the use of closure variables (since they're not visible in either locals() *or* globals())
The i-strings concept takes that closure-compatible interpolation capability and separates it from the str.format based rendering step.
From a speed perspective, the interpolation aspects of this approach are so efficient they rival simple string concatenation:
$ python -m perf timeit -s 'first = "Hello"; second = " World!"' 'first + second' ..................... Mean +- std dev: 71.7 ns +- 2.1 ns
$ python -m perf timeit -s 'first = "Hello"; second = " World!"' 'f"{first}{second}"' ..................... Mean +- std dev: 77.8 ns +- 2.5 ns
Something like pyql that did more than just concatenate the text sections with the text values of the embedded expressions would still need some form of regex-style caching strategy to avoid parsing the same query string multiple times, but the Python interpreter would handle the task of breaking up the string into the text sections and the interpolated Python expressions.
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 25 Mar 2017, at 15:51, Gerald Britton <gerald.britton@gmail.com> wrote:
On 25 March 2017 at 11:24, Pavel Velikhov <pavel.velikhov at gmail.com> wrote:
No, the current solution is temporary because we just don’t have the manpower to implement the full thing: a real system that will rewrite parts of PythonQL queries and ship them to underlying databases. We need a real query optimizer and smart wrappers for this purpose. But we’ll build one of these for demo purposes soon (either a Spark wrapper or a PostgreSQL wrapper).
One thought: if you're lacking in manpower now, then proposing inclusion into core Python means that the core dev team will be taking on an additional chunk of code that is already under-resourced. That rings alarm bells for me - how would you imagine the work needed to merge PythonQL into the core Python grammar would be resourced? I should say that, in practice, I think the solution is relatively niche and overlaps quite significantly with existing Python features, so I don't really see a compelling case for inclusion. The parallel with C# and LINQ is interesting here - LINQ is a pretty cool technology, but I don't see it in widespread use in general-purpose C# projects (disclaimer: I don't get to see much C# code, so my experience is limited).
I see lots of C# code, but (thankfully) not so much LINQ to SQL. Yes, it is a cool technology. But I sometimes have a problem with the SQL it generates. Since I'm also a SQL developer, I'm sensitive to how queries are constructed, for performance reasons, as well as how they look, for readability and aesthetic reasons.
LINQ queries can generate poorly-performing SQL, since LINQ is basically a translator, not an AI. As far as appearances go, LINQ queries can look pretty gnarly, especially if they include subqueries or a few joins. That makes it hard for the SQL dev (me!) to read and understand whether there are performance problems (which there often are, in my experience).
We want to go beyond being a basic translator, especially if the common use-case will be integrating multiple databases. We can also introduce decent-looking hints (maybe not always decent-looking) to generate better plans. Not sure about aesthetics though...
So, I would tend to code the SQL separately and put it in a SQL view, function or stored procedure. I can still parse the results with LINQ (not LINQ to SQL), which is fine.
For similar reasons, I'm not a huge fan of ORMs either. Probably that's my bias towards designing the database first and building up queries to meet the business goals before writing a line of Python, C#, or the language du jour.
This sounds completely reasonable, but this means you’re tied to a specific DBMS (especially if you’re using a lot of built-in functions that are usually very specific to a database). PythonQL (when it has enough functionality) should give you independence.
-- Gerald Britton, MCSE-DP, MVP LinkedIn Profile: http://ca.linkedin.com/in/geraldbritton
participants (4)
- Gerald Britton
- Nick Coghlan
- Pavel Velikhov
- Wes Turner