[Python-ideas] Proposal: Query language extension to Python (PythonQL)

Wes Turner wes.turner at gmail.com
Mon Mar 27 02:16:28 EDT 2017


On Sun, Mar 26, 2017 at 10:42 PM, Wes Turner <wes.turner at gmail.com> wrote:

>
>
> On Sun, Mar 26, 2017 at 10:02 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:
>
>> On 26 March 2017 at 21:40, Pavel Velikhov <pavel.velikhov at gmail.com>
>> wrote:
>> > On 25 Mar 2017, at 19:40, Nick Coghlan <ncoghlan at gmail.com> wrote:
>> >> Right, the target audience here *isn't* folks who already know how to
>> >> construct their own relational queries in SQL, and it definitely isn't
>> >> folks that know how to tweak their queries to get optimal performance
>> >> from the specific database they're using. Rather, it's folks that
>> >> already know Python's comprehensions, and perhaps some of the
>> >> itertools features, and helping to provide them with a smoother
>> >> on-ramp into the world of relational data processing.
>> >
>> > Actually I myself am a user of PythonQL, even though I’m an SQL
>> > expert. I work in data science, so I do a lot of ad-hoc querying and
>> > we always get some new datasets we need to check out and work with.
>> > Some things like nested data models are also much better handled by
>> > PythonQL, and data like JSON or XML will also be easier to handle.
>>
>> So perhaps a better way of framing it would be to say that PythonQL
>> aims to provide a middle ground between interfaces that are fully in
>> "Python mode" (e.g ORMs, pandas DataFrames), where the primary
>> interface is methods-on-objects, and those that are fully in "data
>> manipulation mode" (e.g. raw SQL, lower level XML and JSON APIs).
>>
>> At the Python level, success for PythonQL would look like people being
>> able to seamlessly transfer their data manipulation skills from a
>> Django ORM project to an SQL Alchemy project to a pandas analysis
>> project to a distributed data analysis project in dask, without their
>> data manipulation code really having to change - only the backing data
>> structures and the runtime performance characteristics would differ.
>>
>
> e.g. Django ORM to SQLAlchemy:
> - Does this necessarily imply a metamodel for relations?
>   - Django: GenericForeignKey
>   - SQLAlchemy: sqlalchemy_utils.generic_relationship
>

> Does this necessarily imply a metamodel for relations?

Edges are expressed differently in different frameworks;
ultimately you're looking at a projection of a graph
(a constructed subset of a graph).
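
For example, here's a rough sketch (hypothetical Author/Book models, and
the Django half assumes a configured project) of the same "wrote" edge
spelled in two frameworks:

```
# Django: the edge is a field on the model
from django.db import models

class Book(models.Model):
    author = models.ForeignKey('Author', on_delete=models.CASCADE,
                               related_name='books')

# SQLAlchemy: the edge is a foreign-key column plus a relationship() mapping
from sqlalchemy import Column, ForeignKey, Integer
from sqlalchemy.orm import relationship
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class SABook(Base):
    __tablename__ = 'book'
    id = Column(Integer, primary_key=True)
    author_id = Column(ForeignKey('author.id'))
    author = relationship('Author')
```

Same edge, two spellings; a general metamodel has to normalize both.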

So solving this in the general case implies solving for graphs,
which includes tree-based hierarchical data: SQL tables, arrays,
documents, keys and values. Two pieces are needed:

1. Schema ("metamodel")
2. Query language


Q: How can Linked Data help define a metamodel for expressing relations (in
order to harmonize search of disparate datasets)?

- It's a graph with schema constraints.
- Use URIs for Classes ("types"), Properties ("columns", "attributes"), and
instances with @ids ("rows")
  - RDF, RDFS, OWL

- Search n databases asynchronously with SPARQL federation
  - Native SPARQL database (adapt the data)
  - SPARQL facade/frontend (adapt to an interface)

- Define/generate a schema representation for arbitrary data sources,
  which a {query language} {implementation A} can use to plan data-local
  queries and transformations (see the sketch below)
  - JSONLD @context for data sources
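
A rough sketch of that with pyld (hypothetical field names; schema.org as
the vocabulary): an @context maps plain JSON keys onto URIs, so the same
keys can be harmonized across sources.

```
from pyld import jsonld

doc = {
    "@context": {
        "name": "http://schema.org/name",
        "homepage": {"@id": "http://schema.org/url", "@type": "@id"},
    },
    "name": "PythonQL",
    "homepage": "http://example.org/pythonql",
}

# Expansion rewrites the context-mapped keys into full property URIs,
# roughly:
# [{'http://schema.org/name': [{'@value': 'PythonQL'}],
#   'http://schema.org/url': [{'@id': 'http://example.org/pythonql'}]}]
print(jsonld.expand(doc))
```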


### Property
Relations are expressed as properties of class instances.

rdf:Property
schema:Property
https://meta.schema.org/Property
- https://meta.schema.org/inverseOf

owl:inverseOf
https://www.w3.org/TR/owl-ref/#inverseOf-def
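
A small rdflib sketch (hypothetical ex:authorOf / ex:hasAuthor pair) of
declaring a property and its inverse:

```
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF

EX = Namespace("http://example.org/")
g = Graph()

# authorOf is a property, and hasAuthor is declared as its inverse
g.add((EX.authorOf, RDF.type, RDF.Property))
g.add((EX.authorOf, OWL.inverseOf, EX.hasAuthor))

# serialize as Turtle (returns bytes on older rdflib releases)
print(g.serialize(format="turtle"))
```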

Q: > "How can you provide documentation about the columns in a CSV file?"
https://www.w3.org/TR/tabular-data-primer/#documentation-columns
A: CSVW as [JSONLD,]
A:
https://wrdrd.com/docs/consulting/linkedreproducibility#csv-csvw-and-metadata-rows
    - How do we make these work with various stores?
      - How do we include columnar metadata like physical units and
precision in databases without support for it?
        - JSON-LD manifest?
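
A rough sketch of such a manifest per the CSVW vocabulary (hypothetical
file/column names; the unit is carried by the title and a property URI):

```
import json

manifest = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "observations.csv",
    "tableSchema": {
        "columns": [
            {"name": "temp",
             "titles": "Temperature (degrees Celsius)",
             "datatype": "decimal",
             "propertyUrl": "http://example.org/prop/temperature_celsius"},
            {"name": "station",
             "titles": "Station ID",
             "datatype": "string"},
        ]
    },
}

# by convention the manifest rides alongside the CSV itself
with open("observations.csv-metadata.json", "w") as f:
    json.dump(manifest, f, indent=2)
```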

AFAIU, these don't handle relations:
- http://datashape.pydata.org/
- https://github.com/apache/arrow/blob/master/format/Metadata.md


Q: "How can you describe the schema for multi-dimensional datasets (with
complex relations)?"
A: https://www.w3.org/TR/vocab-data-cube/#data-cubes
The relations are otherwise defined as RDFS/OWL (e.g. as JSON-LD).


## Graph queries

### SPARQL
- SPARQL is a W3C Web Standard query language.
- SPARQL is not the only graph query language.
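
A rough sketch with SPARQLWrapper (public endpoints; the query itself is
only illustrative) of the SERVICE keyword SPARQL 1.1 uses for the
federation mentioned above:

```
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?s ?o WHERE {
      ?s <http://www.w3.org/2002/07/owl#sameAs> ?o .
      # SERVICE forwards this part of the query to a second endpoint
      SERVICE <https://query.wikidata.org/sparql> {
        ?o ?p ?v .
      }
    } LIMIT 5
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
```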


### Blueprints, Gremlin
Blueprints is a graph traversal/query API.


- There are many Blueprints API implementations, e.g.:
  - Rexster
  - Neo4j <https://wrdrd.com/docs/consulting/knowledge-engineering#neo4j>
  - Blazegraph <https://wrdrd.com/docs/consulting/knowledge-engineering#blazegraph>
  - Accumulo <https://wrdrd.com/docs/consulting/knowledge-engineering#accumulo>

Gremlin is a graph traversal language over the Blueprints API (with a
Python variant); like LINQ it's generic across backends (think JDBC for
graph databases):
https://tinkerpop.apache.org/docs/current/reference/#gremlin-python
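
A rough sketch against a hypothetical Gremlin Server on localhost (per the
gremlin-python docs above):

```
from gremlin_python.structure.graph import Graph
from gremlin_python.driver.driver_remote_connection import \
    DriverRemoteConnection

graph = Graph()
# 'g' names the traversal source configured on the server
g = graph.traversal().withRemote(
    DriverRemoteConnection('ws://localhost:8182/gremlin', 'g'))

# whom does Alice know?
names = g.V().has('person', 'name', 'Alice').out('knows').values('name').toList()
```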

### GraphQL
https://github.com/graphql-python/
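
e.g. a rough graphene sketch (resolver signatures vary across graphene
versions):

```
import graphene

class Query(graphene.ObjectType):
    hello = graphene.String()

    def resolve_hello(self, info):
        return "world"

schema = graphene.Schema(query=Query)
result = schema.execute("{ hello }")
print(result.data)  # {'hello': 'world'}
```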


... supporting relations across ORMs would be cool; with enough
abstraction, I don't see why it wouldn't end up looking like RDFS/OWL.


>
> ...
>
>
>> At the data manipulation layer, success for PythonQL would look like
>> people being able to easily get "good enough" performance for one-off
>> scripts, regardless of the backing data store, with closer attention
>> to detail only being needed for genuinely large data sets (where
>> efficiency matters even for one-off analyses), or for frequently
>> repeated operations (where wasted CPU hours show up as increased
>> infrastructure expenses).
>>
>

>
> http://pandas.pydata.org/pandas-docs/stable/ecosystem.html#out-of-core
> (dask, blaze, odo, )
>
> http://blaze.pydata.org/
>
> - blaze
>   - | Src: https://github.com/blaze/blaze
>   - | Docs: https://blaze.readthedocs.io/en/latest/rosetta-pandas.html
>   - | Docs: https://blaze.readthedocs.io/en/latest/rosetta-sql.html
>   - | Docs: https://blaze.readthedocs.io/en/latest/backends.html
>     - "Python, Pandas, SQLAlchemy, MongoDB, PyTables, and Spark"
>

*URIs*

https://blaze.readthedocs.io/en/latest/uri.html#what-sorts-of-uris-does-blaze-support

```
What sorts of URIs does Blaze support?

Paths to files on disk, including the following extensions
- .csv
- .json
- .csv.gz/json.gz
- .hdf5 (uses h5py)
- .hdf5::/datapath
- hdfstore://filename.hdf5 (uses special pandas.HDFStore format)
- .bcolz
- .xls(x)

SQLAlchemy strings like the following
- sqlite:////absolute/path/to/myfile.db::tablename (specify a particular table)
- sqlite:////absolute/path/to/myfile.db
- postgresql://username:password@hostname:port
- impala://hostname (uses impyla)
- anything supported by SQLAlchemy

MongoDB Connection strings of the following form
- mongodb://username:password@hostname:port/database_name::collection_name

Blaze server strings of the following form
- blaze://hostname:port (port defaults to 6363)

In all cases when a location or table name is required in addition to the
traditional URI (e.g. a data path within an HDF5 file or a Table/Collection
name within a database) *then that information follows on the end of the
URI after a separator of two colons ::.*

How it works
Blaze depends on the Odo library to handle URIs. URIs are managed through
the resource function which is dispatched based on regular expressions. For
example a simple resource function to handle .json files might look like
the following (although Blaze’s actual solution is a bit more
comprehensive):

from blaze import resource
import json

@resource.register(r'.+\.json')
def resource_json(uri):
    # open the file at the given path and parse it as JSON
    with open(uri) as f:
        data = json.load(f)
    return data

Can I extend this to my own types?
Absolutely. Import and extend *resource* as shown in the “How it works”
section. The rest of Blaze will pick up your change automatically.
```
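
So, a rough usage sketch (hypothetical example.db / accounts table; recent
Blaze spells the entry point data, older releases Data):

```
from blaze import data

# the ::accounts suffix selects a table within the SQLite file
t = data('sqlite:///example.db::accounts')

# expressions are lazy and compile down to the backend (SQLAlchemy here)
positive_names = t[t.balance > 0].name
print(positive_names)
```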



>
> - dask
>   - | Src: https://github.com/dask/dask
>   - | Docs: http://dask.pydata.org/en/latest/#familiar-user-interface **
>   - | Docs: http://dask.pydata.org/en/latest/scheduler-choice.html
>   - http://xarray.pydata.org/en/stable/dask.html
>
> - odo
>   - | Src: https://github.com/blaze/blaze#odo
>   - | Docs: https://odo.readthedocs.io/en/latest/#formats
>
> - zero-copy
>   - https://www.slideshare.net/wesm/memory-interoperability-in-analytics-and-machine-learning
>   - https://github.com/alex/zero_buffer/blob/master/zero_buffer.py
>
>
> - ibis
>   - | Src: https://github.com/cloudera/ibis
>   - (a backend may compile to LLVM)
>   - seeAlso: blaze, dask, "bulk synchronous parallel"
>   - | Docs: http://docs.ibis-project.org/sql.html
>   - | Docs: http://docs.ibis-project.org/tutorial.html "Expression
> tutortial" jupyter notebooks
>   - | Docs: http://docs.ibis-project.org/
>     - Apache Impala (incubating)
>     - Apache Kudu (incubating)
>     - Hadoop Distributed File System (HDFS)
>     - PostgreSQL (Experimental)
>     - SQLite
>     - [ SQLAlchemy: { ... } ]
>       - | Src: https://github.com/cloudera/ibis/blob/master/ibis/sql/alchemy.py
>
> - apache beam
>   - https://beam.apache.org/documentation/sdks/python/
>   - https://beam.apache.org/get-started/quickstart-py/ (pip install
> apache-beam)
>   - https://beam.apache.org/documentation/sdks/pydoc/0.6.0/
>     - apache_beam.transforms
>   - https://beam.apache.org/documentation/programming-guide/#transforms
> "Applying transforms"
>
> Somewhere in this list,
> these become big data tools.
>
>
>> >> There's no question that folks dealing with sufficiently large data
>> >> sets with sufficiently stringent performance requirements are
>> >> eventually going to want to reach for handcrafted SQL or a distributed
>> >> computation framework like dask, but that's not really any different
>> >> from our standard position that when folks are attempting to optimise
>> >> a hot loop, they're eventually going to have to switch to something
>> >> that can eliminate the interpreter's default runtime object management
>> >> overhead (whether that's Cython, PyPy's or Numba's JIT, or writing an
>> >> extension module in a different language entirely). It isn't an
>> >> argument against making it easier for folks to postpone the point
>> >> where they find it necessary to reach for the "something else" that
>> >> takes them beyond Python's default capabilities.
>> >
>> > Don’t know, for example one of the wrappers is going to be an Apache
>> > Spark wrapper, so you could quickly hack up a PythonQL query that
>> > would be run on a distributed platform.
>>
>> Right, I meant this in the same sense that folks using an ORM like SQL
>> Alchemy may eventually hit a point where rather than trying to
>> convince the ORM to emit the SQL they want to run, it's easier to just
>> bypass the ORM layer and write the exact SQL they want.
>>
>
> At that point one can either:
> - reflect the tables/mappings at devtime
> - reflect the tables/mappings at runtime
>
> And then run the raw DBAPI query
> (using appropriate query interpolation
>  (-> i-strings and scoped configuration state)):
>
>     session.execute("SELECT dbapi_version FROM ?", "tbl;name")
>
>
>
>> It's worthwhile attempting to reduce the number of cases where folks
>> feel obliged to do that, but at the same time, abstraction layers need
>> to hide at least some lower level details if they're going to actually
>> work properly.
>>
>> >> = Option 1 =
>> >>
>> >> Fully commit to the model of allowing alternate syntactic dialects to
>> >> run atop Python interpreters. In Hylang and PythonQL we have at least
>> >> two genuinely interesting examples of that working through the text
>> >> encoding system, as well as other examples like Cython that work
>> >> through the extension module system.
>> >>
>> >> So that's an opportunity to take this from "Possible, but a bit hacky"
>> >> to "Pluggable source code translation is supported at all levels of
>> >> the interpreter, including debugger source maps, etc" (perhaps by
>> >> borrowing ideas from other ecosystems like Java, JavaScript, and .NET,
>> >> where this kind of thing is already a lot more common).
>> >>
>> >> The downside of this approach is that actually making it happen would
>> >> be getting pretty far afield from the original PythonQL goal of
>> >> "provide nicer data manipulation abstractions in Python", and it
>> >> wouldn't actually deliver anything new that can't already be done with
>> >> existing import and codec system features.
>
> >
>> > This would be great anyway: if we could rely on some preprocessor
>> > directive instead of hacking encodings, this could be nice.
>>
>> Victor Stinner wrote up some ideas about that in PEP 511:
>> https://www.python.org/dev/peps/pep-0511/
>>
>> Preprocessing is one of the specific use cases considered:
>> https://www.python.org/dev/peps/pep-0511/#usage-2-preprocessor
>>
>> >> = Option 2 =
>> >>
>> >> ... given optionally delayed
>> >> rendering of interpolated strings, PythonQL could be used in the form:
>> >>
>> >>    result = pyql(i"""
>> >>        (x,y)
>> >>        for x in {range(1,8)}
>> >>        for y in {range(1,7)}
>> >>        if x % 2 == 0 and
>> >>           y % 2 != 0 and
>> >>           x > y
>> >>    """)
>> >>
>> >> I personally like this idea (otherwise I wouldn't have written PEP 501
>> >> in the first place), and the necessary technical underpinnings to
>> >> enable it are all largely already in place to support f-strings. If
>> >> the PEP were revised to show examples of using it to support
>> >> relatively seamless calling back and forth between Hylang, PythonQL
>> >> and regular Python code in the same process, that might be intriguing
>> >> enough to pique Guido's interest (and I'm open to adding co-authors
>> >> that are interested in pursuing that).
>> >
>> > What would be the difference between this and just executing a PythonQL
>> > string for us, getting local and global variables into PythonQL scope?
>>
>> The big new technical capability that f-strings introduced is that the
>> compiler can see the variable references in the embedded expressions,
>> so f-strings "just work" with closure references, whereas passing
>> locals() and globals() explicitly is:
>>
>> 1. slow (since you have to generate a full locals dict);
>> 2. incompatible with the use of closure variables (since they're not
>> visible in either locals() *or* globals())
>>
>> The i-strings concept takes that closure-compatible interpolation
>> capability and separates it from the str.format based rendering step.
>>
>> From a speed perspective, the interpolation aspects of this approach
>> are so efficient they rival simple string concatenation:
>>
>> $ python -m perf timeit -s 'first = "Hello"; second = " World!"'
>> 'first + second'
>> .....................
>> Mean +- std dev: 71.7 ns +- 2.1 ns
>>
>> $ python -m perf timeit -s 'first = "Hello"; second = " World!"'
>> 'f"{first}{second}"'
>> .....................
>> Mean +- std dev: 77.8 ns +- 2.5 ns
>>
>> Something like pyql that did more than just concatenate the text
>> sections with the text values of the embedded expressions would still
>> need some form of regex-style caching strategy to avoid parsing the
>> same query string multiple times, but the Python interpreter would
>> handle the task of breaking up the string into the text sections and
>> the interpolated Python expressions.
>>
>> Cheers,
>> Nick.
>>
>> --
>> Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
>> _______________________________________________
>> Python-ideas mailing list
>> Python-ideas at python.org
>> https://mail.python.org/mailman/listinfo/python-ideas
>> Code of Conduct: http://python.org/psf/codeofconduct/
>>
>
>