[Python-ideas] Proposal: Query language extension to Python (PythonQL)

Wes Turner wes.turner at gmail.com
Sun Mar 26 23:42:35 EDT 2017


On Sun, Mar 26, 2017 at 10:02 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:

> On 26 March 2017 at 21:40, Pavel Velikhov <pavel.velikhov at gmail.com>
> wrote:
> > On 25 Mar 2017, at 19:40, Nick Coghlan <ncoghlan at gmail.com> wrote:
> >> Right, the target audience here *isn't* folks who already know how to
> >> construct their own relational queries in SQL, and it definitely isn't
> >> folks that know how to tweak their queries to get optimal performance
> >> from the specific database they're using. Rather, it's folks that
> >> already know Python's comprehensions, and perhaps some of the
> >> itertools features, and helping to provide them with a smoother
> >> on-ramp into the world of relational data processing.
> >
> > Actually I myself am a user of PythonQL, even though I’m an SQL expert.
> > I work in data science, so I do a lot of ad-hoc querying and we always
> > get some new datasets we need to check out and work with. Some things
> > like nested data models are also much better handled by PythonQL, and
> > data like JSON or XML will also be easier to handle.
>
> So perhaps a better way of framing it would be to say that PythonQL
> aims to provide a middle ground between interfaces that are fully in
> "Python mode" (e.g ORMs, pandas DataFrames), where the primary
> interface is methods-on-objects, and those that are fully in "data
> manipulation mode" (e.g. raw SQL, lower level XML and JSON APIs).
>
> At the Python level, success for PythonQL would look like people being
> able to seamlessly transfer their data manipulation skills from a
> Django ORM project to an SQL Alchemy project to a pandas analysis
> project to a distributed data analysis project in dask, without their
> data manipulation code really having to change - only the backing data
> structures and the runtime performance characteristics would differ.
>
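
That "same skills, different backing store" goal can be illustrated without
PythonQL itself (this is not PythonQL's actual API; the pandas spelling is
shown in comments and assumes a pandas install): the same filter/projection
over plain Python records and over a DataFrame differs only in setup, not in
the shape of the query.

```python
# Hypothetical illustration: one data-manipulation idea, two backends.
records = [
    {"name": "a", "score": 10},
    {"name": "b", "score": 30},
    {"name": "c", "score": 20},
]

# Comprehension over in-memory objects:
top = [r["name"] for r in records if r["score"] >= 20]

# The equivalent pandas spelling (requires pandas; shown for comparison):
# import pandas as pd
# top = pd.DataFrame(records).loc[lambda df: df["score"] >= 20, "name"].tolist()

print(top)  # ['b', 'c']
```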

e.g. Django ORM to SQLAlchemy:
- Does this necessarily imply a metamodel for relations?
  - Django: GenericForeignKey
  - SQLAlchemy: sqlalchemy_utils.generic_relationship

...


> At the data manipulation layer, success for PythonQL would look like
> people being able to easily get "good enough" performance for one-off
> scripts, regardless of the backing data store, with closer attention
> to detail only being needed for genuinely large data sets (where
> efficiency matters even for one-off analyses), or for frequently
> repeated operations (where wasted CPU hours show up as increased
> infrastructure expenses).
>

http://pandas.pydata.org/pandas-docs/stable/ecosystem.html#out-of-core
(dask, blaze, odo, ...)

http://blaze.pydata.org/

- blaze
  - | Src: https://github.com/blaze/blaze
  - | Docs: https://blaze.readthedocs.io/en/latest/rosetta-pandas.html
  - | Docs: https://blaze.readthedocs.io/en/latest/rosetta-sql.html
  - | Docs: https://blaze.readthedocs.io/en/latest/backends.html
    - "Python, Pandas, SQLAlchemy, MongoDB, PyTables, and Spark"

- dask
  - | Src: https://github.com/dask/dask
  - | Docs: http://dask.pydata.org/en/latest/#familiar-user-interface **
  - | Docs: http://dask.pydata.org/en/latest/scheduler-choice.html
  - http://xarray.pydata.org/en/stable/dask.html
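
dask's "familiar user interface" point above: the parallel pipeline mirrors
what you would write eagerly. A sketch (the dask.bag spelling is in comments
and requires a dask install; the eager form runs as-is):

```python
# Eager, in-memory pipeline:
data = range(1, 8)
result = sum(x * x for x in data if x % 2 == 0)

# The same pipeline in dask.bag (requires dask), built lazily and
# executed by .compute():
# import dask.bag as db
# result = (db.from_sequence(range(1, 8))
#             .filter(lambda x: x % 2 == 0)
#             .map(lambda x: x * x)
#             .sum()
#             .compute())

print(result)  # 4 + 16 + 36 = 56
```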

- odo
  - | Src: https://github.com/blaze/blaze#odo
  - | Docs: https://odo.readthedocs.io/en/latest/#formats

- zero-copy
  - https://www.slideshare.net/wesm/memory-interoperability-in-analytics-and-machine-learning
  - https://github.com/alex/zero_buffer/blob/master/zero_buffer.py
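
The zero-copy idea those links discuss is visible in miniature in the stdlib:
a memoryview shares a buffer rather than copying it, which is the same
principle zero_buffer and Arrow-style memory interoperability build on.

```python
# A memoryview aliases the underlying buffer -- no bytes are copied.
buf = bytearray(b"hello world")
view = memoryview(buf)[6:]   # zero-copy slice of the last five bytes

buf[6:] = b"WORLD"           # same-size mutation of the shared buffer...
print(bytes(view))           # ...is visible through the view: b'WORLD'
```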


- ibis
  - | Src: https://github.com/cloudera/ibis
  - (a backend may compile to LLVM)
  - seeAlso: blaze, dask, "bulk synchronous parallel"
  - | Docs: http://docs.ibis-project.org/sql.html
  - | Docs: http://docs.ibis-project.org/tutorial.html "Expression tutorial" jupyter notebooks
  - | Docs: http://docs.ibis-project.org/
    - Apache Impala (incubating)
    - Apache Kudu (incubating)
    - Hadoop Distributed File System (HDFS)
    - PostgreSQL (Experimental)
    - SQLite
    - [ SQLAlchemy: { ... } ]
      - | Src: https://github.com/cloudera/ibis/blob/master/ibis/sql/alchemy.py

- apache beam
  - https://beam.apache.org/documentation/sdks/python/
  - https://beam.apache.org/get-started/quickstart-py/ (pip install apache-beam)
  - https://beam.apache.org/documentation/sdks/pydoc/0.6.0/
    - apache_beam.transforms
  - https://beam.apache.org/documentation/programming-guide/#transforms "Applying transforms"

Somewhere in this list, these become big data tools.


> >> There's no question that folks dealing with sufficiently large data
> >> sets with sufficiently stringent performance requirements are
> >> eventually going to want to reach for handcrafted SQL or a distributed
> >> computation framework like dask, but that's not really any different
> >> from our standard position that when folks are attempting to optimise
> >> a hot loop, they're eventually going to have to switch to something
> >> that can eliminate the interpreter's default runtime object management
> >> overhead (whether that's Cython, PyPy's or Numba's JIT, or writing an
> >> extension module in a different language entirely). It isn't an
> >> argument against making it easier for folks to postpone the point
> >> where they find it necessary to reach for the "something else" that
> >> takes them beyond Python's default capabilities.
> >
> > Don’t know, for example one of the wrappers is going to be an Apache
> > Spark wrapper, so you could quickly hack up a PythonQL query that would
> > be run on a distributed platform.
>
> Right, I meant this in the same sense that folks using an ORM like SQL
> Alchemy may eventually hit a point where rather than trying to
> convince the ORM to emit the SQL they want to run, it's easier to just
> bypass the ORM layer and write the exact SQL they want.
>

At that point one can either:
- reflect the tables/mappings at devtime
- reflect the tables/mappings at runtime

And then run the raw DBAPI query
(using appropriate query interpolation
 (-> i-strings and scoped configuration state)).
Note that DBAPI placeholders can bind values but not identifiers
such as table names:

    session.execute("SELECT name FROM tbl WHERE name = :name",
                    {"name": "value"})



> It's worthwhile attempting to reduce the number of cases where folks
> feel obliged to do that, but at the same time, abstraction layers need
> to hide at least some lower level details if they're going to actually
> work properly.
>
> >> = Option 1 =
> >>
> >> Fully commit to the model of allowing alternate syntactic dialects to
> >> run atop Python interpreters. In Hylang and PythonQL we have at least
> >> two genuinely interesting examples of that working through the text
> >> encoding system, as well as other examples like Cython that work
> >> through the extension module system.
> >>
> >> So that's an opportunity to take this from "Possible, but a bit hacky"
> >> to "Pluggable source code translation is supported at all levels of
> >> the interpreter, including debugger source maps, etc" (perhaps by
> >> borrowing ideas from other ecosystems like Java, JavaScript, and .NET,
> >> where this kind of thing is already a lot more common).
> >>
> >> The downside of this approach is that actually making it happen would
> >> be getting pretty far afield from the original PythonQL goal of
> >> "provide nicer data manipulation abstractions in Python", and it
> >> wouldn't actually deliver anything new that can't already be done with
> >> existing import and codec system features.

>
> > This would be great anyways, if we could rely on some preprocessor
> directive,
> > instead of hacking encodings, this could be nice.
>
> Victor Stinner wrote up some ideas about that in PEP 511:
> https://www.python.org/dev/peps/pep-0511/
>
> Preprocessing is one of the specific use cases considered:
> https://www.python.org/dev/peps/pep-0511/#usage-2-preprocessor
>
> >> = Option 2 =
> >>
> >> ... given optionally delayed
> >> rendering of interpolated strings, PythonQL could be used in the form:
> >>
> >>    result = pyql(i"""
> >>        (x,y)
> >>        for x in {range(1,8)}
> >>        for y in {range(1,7)}
> >>        if x % 2 == 0 and
> >>           y % 2 != 0 and
> >>           x > y
> >>    """)
> >>
> >> I personally like this idea (otherwise I wouldn't have written PEP 501
> >> in the first place), and the necessary technical underpinnings to
> >> enable it are all largely already in place to support f-strings. If
> >> the PEP were revised to show examples of using it to support
> >> relatively seamless calling back and forth between Hylang, PythonQL
> >> and regular Python code in the same process, that might be intriguing
> >> enough to pique Guido's interest (and I'm open to adding co-authors
> >> that are interested in pursuing that).
> >
> > What would be the difference between this and just executing a PythonQL
> > string for us, getting local and global variables into PythonQL scope?
>
> The big new technical capability that f-strings introduced is that the
> compiler can see the variable references in the embedded expressions,
> so f-strings "just work" with closure references, whereas passing
> locals() and globals() explicitly is:
>
> 1. slow (since you have to generate a full locals dict);
> 2. incompatible with the use of closure variables (since they're not
> visible in either locals() *or* globals())
>
> The i-strings concept takes that closure-compatible interpolation
> capability and separates it from the str.format based rendering step.
>
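
That closure-visibility difference can be demonstrated with plain f-strings
today: the compiler emits a real reference to the enclosing variable, while
a locals()/globals() merge handed to a template engine never sees it.

```python
def outer():
    secret = "hi"

    def render_f():
        # The compiler sees `secret` and creates a closure cell for it.
        return f"{secret}!"

    def render_ns():
        # A renderer given locals()+globals() can't find `secret`: it is in
        # neither namespace, because this frame never references it directly.
        namespace = {**globals(), **locals()}
        return "secret" in namespace

    return render_f(), render_ns()

f_result, visible = outer()
print(f_result, visible)  # hi! False
```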
> From a speed perspective, the interpolation aspects of this approach
> are so efficient they rival simple string concatenation:
>
> $ python -m perf timeit -s 'first = "Hello"; second = " World!"'
> 'first + second'
> .....................
> Mean +- std dev: 71.7 ns +- 2.1 ns
>
> $ python -m perf timeit -s 'first = "Hello"; second = " World!"'
> 'f"{first}{second}"'
> .....................
> Mean +- std dev: 77.8 ns +- 2.5 ns
>
> Something like pyql that did more than just concatenate the text
> sections with the text values of the embedded expressions would still
> need some form of regex-style caching strategy to avoid parsing the
> same query string multiple times, but the Python interpreter would
> handle the task of breaking up the string into the text sections and
> the interpolated Python expressions.
>
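
The regex-style caching strategy Nick mentions maps directly onto
functools.lru_cache keyed on the raw query text; `compile_query` below is a
stand-in for a real PythonQL parser, not an existing API.

```python
import functools

@functools.lru_cache(maxsize=None)
def compile_query(source):
    # Stand-in for real parsing: any pure function of the query text.
    return tuple(source.split())

q1 = compile_query("for x in xs if x > y")
q2 = compile_query("for x in xs if x > y")
print(q1 is q2)  # True -- the second call returned the cached parse
```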
> Cheers,
> Nick.
>
> --
> Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/
>