[Python-ideas] Proposal: Query language extension to Python (PythonQL)

Nick Coghlan ncoghlan at gmail.com
Sun Mar 26 11:02:09 EDT 2017


On 26 March 2017 at 21:40, Pavel Velikhov <pavel.velikhov at gmail.com> wrote:
> On 25 Mar 2017, at 19:40, Nick Coghlan <ncoghlan at gmail.com> wrote:
>> Right, the target audience here *isn't* folks who already know how to
>> construct their own relational queries in SQL, and it definitely isn't
>> folks who know how to tweak their queries to get optimal performance
>> from the specific database they're using. Rather, it's folks who
>> already know Python's comprehensions, and perhaps some of the
>> itertools features, and the aim is to provide them with a smoother
>> on-ramp into the world of relational data processing.
>
> Actually, I'm a user of PythonQL myself, even though I'm an SQL expert. I work in data science, so
> I do a lot of ad-hoc querying, and we're always getting new datasets that we need to check out and work with.
> Some things, like nested data models, are also handled much better by PythonQL, and data like
> JSON or XML will also be easier to work with.

So perhaps a better way of framing it would be to say that PythonQL
aims to provide a middle ground between interfaces that are fully in
"Python mode" (e.g. ORMs, pandas DataFrames), where the primary
interface is methods-on-objects, and those that are fully in "data
manipulation mode" (e.g. raw SQL, lower-level XML and JSON APIs).
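
To make the contrast concrete, here's a minimal, self-contained sketch
of the two ends of that spectrum for the same question ("names of
users older than 30"; the table and data are invented for
illustration):

    import sqlite3

    # "Data manipulation mode": raw SQL, where the query is opaque
    # text as far as Python is concerned.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO users VALUES (?, ?)",
                     [("Ann", 35), ("Bob", 25)])
    names = [row[0] for row in
             conn.execute("SELECT name FROM users WHERE age > 30")]

    # "Python mode": a comprehension over in-memory objects, with no
    # query engine involved at all.
    users = [("Ann", 35), ("Bob", 25)]
    names = [name for (name, age) in users if age > 30]

PythonQL's pitch is the space between those two: comprehension-style
syntax on the surface, query-engine execution underneath.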

At the Python level, success for PythonQL would look like people being
able to seamlessly transfer their data manipulation skills from a
Django ORM project to an SQLAlchemy project to a pandas analysis
project to a distributed data analysis project in dask, without their
data manipulation code really having to change - only the backing data
structures and the runtime performance characteristics would differ.
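
Today, by contrast, even a simple filter is spelled differently in
each of those contexts. A rough sketch (only the pandas version is
runnable as written; Model, F, session, and ddf are assumed to be set
up elsewhere):

    import pandas as pd

    df = pd.DataFrame({"x": [1, 4, 3], "y": [2, 1, 5]})

    rows = df[df.x > df.y]                             # pandas
    # rows = Model.objects.filter(x__gt=F("y"))        # Django ORM
    # rows = session.query(Model).filter(Model.x > Model.y)  # SQLAlchemy
    # rows = ddf[ddf.x > ddf.y]                        # dask (lazy)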

At the data manipulation layer, success for PythonQL would look like
people being able to easily get "good enough" performance for one-off
scripts, regardless of the backing data store, with closer attention
to detail only being needed for genuinely large data sets (where
efficiency matters even for one-off analyses), or for frequently
repeated operations (where wasted CPU hours show up as increased
infrastructure expenses).

>> There's no question that folks dealing with sufficiently large data
>> sets with sufficiently stringent performance requirements are
>> eventually going to want to reach for handcrafted SQL or a distributed
>> computation framework like dask, but that's not really any different
>> from our standard position that when folks are attempting to optimise
>> a hot loop, they're eventually going to have to switch to something
>> that can eliminate the interpreter's default runtime object management
>> overhead (whether that's Cython, PyPy's or Numba's JIT, or writing an
>> extension module in a different language entirely). It isn't an
>> argument against making it easier for folks to postpone the point
>> where they find it necessary to reach for the "something else" that
>> takes them beyond Python's default capabilities.
>
> I don't know; for example, one of the wrappers is going to be an Apache Spark
> wrapper, so you could quickly hack up a PythonQL query that would be run
> on a distributed platform.

Right, I meant this in the same sense that folks using an ORM like
SQLAlchemy may eventually hit a point where, rather than trying to
convince the ORM to emit the SQL they want to run, it's easier to just
bypass the ORM layer and write the exact SQL they want.
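
In SQLAlchemy terms, that escape hatch looks something like the
following (a sketch assuming an already-configured session; note the
bound parameter rather than string interpolation):

    from sqlalchemy import text

    # Bypass the ORM layer and hand the engine the exact SQL directly.
    rows = session.execute(
        text("SELECT name FROM users WHERE age > :min_age"),
        {"min_age": 30},
    )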

It's worthwhile attempting to reduce the number of cases where folks
feel obliged to do that, but at the same time, abstraction layers need
to hide at least some lower level details if they're going to actually
work properly.

>> = Option 1 =
>>
>> Fully commit to the model of allowing alternate syntactic dialects to
>> run atop Python interpreters. In Hylang and PythonQL we have at least
>> two genuinely interesting examples of that working through the text
>> encoding system, as well as other examples like Cython that work
>> through the extension module system.
>>
>> So that's an opportunity to take this from "Possible, but a bit hacky"
>> to "Pluggable source code translation is supported at all levels of
>> the interpreter, including debugger source maps, etc." (perhaps by
>> borrowing ideas from other ecosystems like Java, JavaScript, and .NET,
>> where this kind of thing is already a lot more common).
>>
>> The downside of this approach is that actually making it happen would
>> be getting pretty far afield from the original PythonQL goal of
>> "provide nicer data manipulation abstractions in Python", and it
>> wouldn't actually deliver anything new that can't already be done with
>> existing import and codec system features.
>
> This would be great anyway: being able to rely on some preprocessor
> directive instead of hacking encodings would be nice.

Victor Stinner wrote up some ideas about that in PEP 511:
https://www.python.org/dev/peps/pep-0511/

Preprocessing is one of the specific use cases considered:
https://www.python.org/dev/peps/pep-0511/#usage-2-preprocessor
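
For reference, the "hacking encodings" trick mentioned above works by
registering a fake source codec whose decoder rewrites the text before
the compiler sees it (projects like pyxl use this via a
"# coding: ..." header). A minimal sketch: translate() is just a
stand-in for a real PythonQL-to-Python translator, and a full version
would also need the incremental decoder machinery for module imports
to work:

    import codecs

    def translate(source):
        # Stand-in for a real PythonQL-to-Python translator.
        return source.replace("select", "#select")

    def search(name):
        if name != "pyql":
            return None
        utf8 = codecs.lookup("utf-8")

        def decode(data, errors="strict"):
            text, consumed = utf8.decode(data, errors)
            return translate(text), consumed

        return codecs.CodecInfo(name="pyql", encode=utf8.encode,
                                decode=decode)

    codecs.register(search)
    print(b"select x for x in data".decode("pyql"))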

>> = Option 2 =
>>
>> ... given optionally delayed
>> rendering of interpolated strings, PythonQL could be used in the form:
>>
>>    result = pyql(i"""
>>        (x,y)
>>        for x in {range(1,8)}
>>        for y in {range(1,7)}
>>        if x % 2 == 0 and
>>           y % 2 != 0 and
>>           x > y
>>    """)
>>
>> I personally like this idea (otherwise I wouldn't have written PEP 501
>> in the first place), and the necessary technical underpinnings to
>> enable it are all largely already in place to support f-strings. If
>> the PEP were revised to show examples of using it to support
>> relatively seamless calling back and forth between Hylang, PythonQL
>> and regular Python code in the same process, that might be intriguing
>> enough to pique Guido's interest (and I'm open to adding co-authors
>> that are interested in pursuing that).
>
> What would be the difference between this and just executing a PythonQL
> string for us, pulling local and global variables into PythonQL's scope?

The big new technical capability that f-strings introduced is that the
compiler can see the variable references in the embedded expressions,
so f-strings "just work" with closure references, whereas passing
locals() and globals() explicitly is:

1. slow (since you have to generate a full locals dict); and
2. incompatible with the use of closure variables (since they're not
visible in either locals() *or* globals()), as the sketch below
demonstrates.
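
A short demonstration of that second point (the function and variable
names here are invented):

    def make_query(table):
        def run():
            # `table` only appears inside a string literal, so the
            # compiler never creates a closure cell for it: it shows up
            # in neither locals() nor globals() here, and eval() fails.
            return eval("table + '_archive'", globals(), locals())
        return run

    try:
        make_query("users")()
    except NameError as exc:
        print(exc)                    # name 'table' is not defined

    def make_query_f(table):
        def run():
            # The compiler *does* see `table` inside the f-string
            # expression, so it wires up a real closure reference.
            return f"{table}_archive"
        return run

    print(make_query_f("users")())    # users_archive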

The i-strings concept takes that closure-compatible interpolation
capability and separates it from the str.format-based rendering step.

From a speed perspective, the interpolation aspects of this approach
are so efficient they rival simple string concatenation:

$ python -m perf timeit -s 'first = "Hello"; second = " World!"'
'first + second'
.....................
Mean +- std dev: 71.7 ns +- 2.1 ns

$ python -m perf timeit -s 'first = "Hello"; second = " World!"'
'f"{first}{second}"'
.....................
Mean +- std dev: 77.8 ns +- 2.5 ns

Something like pyql that did more than just concatenate the text
sections with the text values of the embedded expressions would still
need some form of regex-style caching strategy to avoid parsing the
same query string multiple times, but the Python interpreter would
handle the task of breaking up the string into the text sections and
the interpolated Python expressions.
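
That caching could be as simple as memoising the parse step on the raw
template text, much as the re module caches compiled patterns
internally. A hypothetical sketch (parse_query() stands in for a real
PythonQL parser):

    import functools

    @functools.lru_cache(maxsize=512)
    def parse_query(template_text):
        # The expensive PythonQL parse runs only once per distinct
        # query string; later calls are dictionary lookups.
        return ("query plan for", template_text)

    parse_query("(x, y) for x in {xs} if x > 2")   # parsed
    parse_query("(x, y) for x in {xs} if x > 2")   # cache hit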

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

