On Sun, Mar 26, 2017 at 10:42 PM, Wes Turner <wes.turner@gmail.com> wrote:


On Sun, Mar 26, 2017 at 10:02 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 26 March 2017 at 21:40, Pavel Velikhov <pavel.velikhov@gmail.com> wrote:
> On 25 Mar 2017, at 19:40, Nick Coghlan <ncoghlan@gmail.com> wrote:
>> Right, the target audience here *isn't* folks who already know how to
>> construct their own relational queries in SQL, and it definitely isn't
>> folks that know how to tweak their queries to get optimal performance
>> from the specific database they're using. Rather, it's folks that
>> already know Python's comprehensions, and perhaps some of the
>> itertools features, and helping to provide them with a smoother
>> on-ramp into the world of relational data processing.
>
> Actually I myself am a user of PythonQL, even though I’m an SQL expert. I work in data science, so
> I do a lot of ad-hoc querying and we always get some new datasets we need to check out and work with.
> Some things like nested data models are also much better handled by PythonQL, and data like
> JSON or XML will also be easier to handle.

So perhaps a better way of framing it would be to say that PythonQL
aims to provide a middle ground between interfaces that are fully in
"Python mode" (e.g ORMs, pandas DataFrames), where the primary
interface is methods-on-objects, and those that are fully in "data
manipulation mode" (e.g. raw SQL, lower level XML and JSON APIs).

At the Python level, success for PythonQL would look like people being
able to seamlessly transfer their data manipulation skills from a
Django ORM project to an SQL Alchemy project to a pandas analysis
project to a distributed data analysis project in dask, without their
data manipulation code really having to change - only the backing data
structures and the runtime performance characteristics would differ.

e.g. Django ORM to SQLAlchemy:
- Does this necessarily imply a metamodel for relations?
  - Django: GenericForeignKey
  - SQLAlchemy: sqlalchemy_utils.generic_relationship

> Does this necessarily imply a metamodel for relations?

Edges are expressed differently in different frameworks;
ultimately you're looking at a projection of a graph
(a constructed subset of a graph).

So solving this in the general case implies solving for graphs
(which includes tree-based hierarchical data like SQL tables, arrays, documents, and key/value data).

1. Schema ("metamodel")
2. Query language


Q: How can Linked Data help define a metamodel for expressing relations (in order to harmonize search of disparate datasets)?

- It's a graph with schema constraints.
- Use URIs for Classes ("types"), Properties ("columns", "attributes"), and instances with @ids ("rows")
  - RDF, RDFS, OWL

- Search n databases asynchronously with SPARQL federation
  - Native SPARQL database (adapt the data)
  - SPARQL facade/frontend (adapt to an interface)

- Define/generate a schema representation for arbitrary data sources which {query language} {implementation A} can use to plan data-local queries and transformations
  - JSON-LD @context for data sources (see the sketch below)
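
For concreteness, here's a hedged sketch of what such a JSON-LD @context could
look like: it maps the ad-hoc keys of one data source onto shared schema.org
terms so records from disparate sources can be merged/queried as one graph.
All of the key names and URIs below are illustrative, not from any real dataset:

```
import json

doc = {
    "@context": {
        "@vocab": "http://schema.org/",
        "xsd": "http://www.w3.org/2001/XMLSchema#",
        "customer_id": "@id",                                # alias the node id
        "full_name": "name",                                 # map to schema:name
        "signup_date": {"@id": "dateCreated", "@type": "xsd:date"},
        "referred_by": {"@id": "knows", "@type": "@id"},     # a relation (edge)
    },
    "customer_id": "http://example.org/customers/42",
    "full_name": "Ada Lovelace",
    "signup_date": "2017-03-26",
    "referred_by": "http://example.org/customers/7",
}
print(json.dumps(doc, indent=2))
```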


### Property
Relations are expressed as properties of class instances.

- rdf:Property
- schema:Property
  - https://meta.schema.org/Property
  - https://meta.schema.org/inverseOf
- owl:inverseOf
  - https://www.w3.org/TR/owl-ref/#inverseOf-def
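
A hedged sketch with rdflib (just to make the inverseOf idea concrete; the
ex: URIs are illustrative): declare two object properties and state that one
is the inverse of the other.

```
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF

EX = Namespace("http://example.org/vocab#")
g = Graph()
g.add((EX.employs, RDF.type, OWL.ObjectProperty))
g.add((EX.worksFor, RDF.type, OWL.ObjectProperty))
g.add((EX.worksFor, OWL.inverseOf, EX.employs))   # worksFor <-> employs

# bytes or str depending on rdflib version
print(g.serialize(format="turtle"))
```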

Q: > "How can you provide documentation about the columns in a CSV file?"
https://www.w3.org/TR/tabular-data-primer/#documentation-columns
A: CSVW as [JSONLD,]
A: https://wrdrd.com/docs/consulting/linkedreproducibility#csv-csvw-and-metadata-rows
    - How do we make these work with various stores?
      - How do we include columnar metadata like physical units and precision in databases without support for it?
        - JSON-LD manifest?
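
A hedged sketch of such a CSVW metadata manifest (JSON-LD) documenting the
columns of a hypothetical observations.csv, with the unit carried in the
column title; file, column, and property names here are made up:

```
import json

metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "observations.csv",
    "dc:description": "Hourly air temperature readings",
    "tableSchema": {
        "columns": [
            {"name": "station", "titles": "Station ID", "datatype": "string"},
            {"name": "temp_c", "titles": "Air temperature (degrees Celsius)",
             "datatype": {"base": "decimal"}},
        ],
        "primaryKey": "station",
    },
}

# Conventionally published alongside the CSV as <file>.csv-metadata.json
with open("observations.csv-metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```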

AFAIU, these don't handle relations:
- http://datashape.pydata.org/
- https://github.com/apache/arrow/blob/master/format/Metadata.md


Q: "How can you describe the schema for multi-dimensional datasets (with complex relations)?"
A: https://www.w3.org/TR/vocab-data-cube/#data-cubes
The relations themselves are otherwise defined in RDFS/OWL (e.g. serialized as JSON-LD); a sketch follows.
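
A hedged, minimal Data Cube sketch as JSON-LD (a Python dict): one dimension,
one measure, and a single observation. All URIs and values are illustrative.

```
cube = {
    "@context": {"qb": "http://purl.org/linked-data/cube#",
                 "ex": "http://example.org/"},
    "@graph": [
        # The schema: which dimensions/measures make up an observation
        {"@id": "ex:dsd", "@type": "qb:DataStructureDefinition",
         "qb:component": [{"qb:dimension": {"@id": "ex:refPeriod"}},
                          {"qb:measure": {"@id": "ex:population"}}]},
        # The dataset, pointing at its structure
        {"@id": "ex:dataset", "@type": "qb:DataSet",
         "qb:structure": {"@id": "ex:dsd"}},
        # One observation in the cube
        {"@id": "ex:obs1", "@type": "qb:Observation",
         "qb:dataSet": {"@id": "ex:dataset"},
         "ex:refPeriod": "2017", "ex:population": 24127159},
    ],
}
```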


## Graph queries

### SPARQL
- SPARQL is a W3C Web Standard query language.
- SPARQL is not the only graph query language.
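
For the federation point above, a hedged sketch with SPARQLWrapper: the
SERVICE clause asks the primary endpoint to evaluate part of the pattern
against a second endpoint. Both endpoint URLs here are illustrative.

```
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://example.org/sparql")
sparql.setQuery("""
    PREFIX schema: <http://schema.org/>
    SELECT ?person ?name ?email WHERE {
      ?person a schema:Person ;
              schema:name ?name .
      SERVICE <https://other.example.org/sparql> {
        ?person schema:email ?email .
      }
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["name"]["value"], row["email"]["value"])
```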


### Blueprints, Gremlin
Blueprints is a graph traversal/query API.

Gremlin implements the Blueprints API (also in Python); like LINQ, it's generic across backends (think JDBC, but for graph databases):
https://tinkerpop.apache.org/docs/current/reference/#gremlin-python
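
A hedged sketch with gremlinpython (exact imports vary a bit by version);
assumes a Gremlin Server such as TinkerPop/JanusGraph at localhost:8182 with
the TinkerPop "modern" sample graph loaded:

```
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Who does "marko" know? Same traversal steps as Gremlin-Groovy/Java.
names = g.V().has("person", "name", "marko").out("knows").values("name").toList()
print(names)
conn.close()
```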

### GraphQL
https://github.com/graphql-python/
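
A hedged, minimal sketch with graphene (one of the graphql-python projects;
resolver signature shown is for recent graphene releases), just to show the
shape of the API:

```
import graphene

class Query(graphene.ObjectType):
    hello = graphene.String()

    def resolve_hello(self, info):
        return "world"

schema = graphene.Schema(query=Query)
result = schema.execute("{ hello }")
print(result.data)   # {'hello': 'world'}
```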


... supporting relations across ORMs would be cool;
with enough abstraction IDK why it wouldn't look like RDFS/OWL.
 

...


At the data manipulation layer, success for PythonQL would look like
people being able to easily get "good enough" performance for one-off
scripts, regardless of the backing data store, with closer attention
to detail only being needed for genuinely large data sets (where
efficiency matters even for one-off analyses), or for frequently
repeated operations (where wasted CPU hours show up as increased
infrastructure expenses).
 

URIs

https://blaze.readthedocs.io/en/latest/uri.html#what-sorts-of-uris-does-blaze-support

```
What sorts of URIs does Blaze support?

Paths to files on disk, including the following extensions
- .csv
- .json
- .csv.gz/json.gz
- .hdf5 (uses h5py)
- .hdf5::/datapath
- hdfstore://filename.hdf5 (uses special pandas.HDFStore format)
- .bcolz
- .xls(x)

SQLAlchemy strings like the following
- sqlite:////absolute/path/to/myfile.db::tablename
- sqlite:////absolute/path/to/myfile.db (specify a particular table)
- postgresql://username:password@hostname:port
- impala://hostname (uses impyla)
- anything supported by SQLAlchemy

MongoDB Connection strings of the following form
- mongodb://username:password@hostname:port/database_name::collection_name

Blaze server strings of the following form
- blaze://hostname:port (port defaults to 6363)

In all cases when a location or table name is required in addition to the traditional URI (e.g. a data path within an HDF5 file or a Table/Collection name within a database) then that information follows on the end of the URI after a separator of two colons ::.

How it works
Blaze depends on the Odo library to handle URIs. URIs are managed through the resource function which is dispatched based on regular expressions. For example a simple resource function to handle .json files might look like the following (although Blaze’s actual solution is a bit more comprehensive):

from blaze import resource
import json

@resource.register(r'.+\.json')
def resource_json(uri):
    with open(uri) as f:
        data = json.load(f)
    return data

Can I extend this to my own types?
Absolutely. Import and extend resource as shown in the “How it works” section. The rest of Blaze will pick up your change automatically.
```
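
A hedged sketch of odo itself (the library Blaze delegates URI handling to):
move the same logical table between backends just by changing the URI. The
file and table names here are made up.

```
import pandas as pd
from odo import odo

# CSV -> SQLite table (note the ::tablename suffix described above)
odo("measurements.csv", "sqlite:///measurements.db::readings")

# SQLite table -> in-memory pandas DataFrame
df = odo("sqlite:///measurements.db::readings", pd.DataFrame)
```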

 

- dask

- odo

- zero-copy


- ibis
  - (a backend may compile to LLVM)
  - seeAlso: blaze, dask, "bulk synchronous parallel"
  - | Docs: http://docs.ibis-project.org/tutorial.html "Expression tutorial" jupyter notebooks
    - Apache Impala (incubating)
    - Apache Kudu (incubating)
    - Hadoop Distributed File System (HDFS)
    - PostgreSQL (Experimental)
    - SQLite
    - [ SQLAlchemy: { ... } ]

- apache beam
    - apache_beam.transforms

Somewhere in this list,
these become big data tools.


>> There's no question that folks dealing with sufficiently large data
>> sets with sufficiently stringent performance requirements are
>> eventually going to want to reach for handcrafted SQL or a distributed
>> computation framework like dask, but that's not really any different
>> from our standard position that when folks are attempting to optimise
>> a hot loop, they're eventually going to have to switch to something
>> that can eliminate the interpreter's default runtime object management
>> overhead (whether that's Cython, PyPy's or Numba's JIT, or writing an
>> extension module in a different language entirely). It isn't an
>> argument against making it easier for folks to postpone the point
>> where they find it necessary to reach for the "something else" that
>> takes them beyond Python's default capabilities.
>
> Don’t know, for example one of the wrappers is going to be an Apache Spark
> wrappers, so you could quickly hack up a PythonQL query that would be run
> on a distributed platform.

Right, I meant this in the same sense that folks using an ORM like SQL
Alchemy may eventually hit a point where rather than trying to
convince the ORM to emit the SQL they want to run, it's easier to just
bypass the ORM layer and write the exact SQL they want.
 
At that point one can either:
- reflect the tables/mappings at devtime
- reflect the tables/mappings at runtime 

And then run the raw DBAPI query with appropriate parameter binding
(which is where i-strings and scoped configuration state could help;
note that bound parameters cover values, not identifiers like table names):

    session.execute("SELECT dbapi_version FROM tbl WHERE name = :name",
                    {"name": "tbl;name"})



It's worthwhile attempting to reduce the number of cases where folks
feel obliged to do that, but at the same time, abstraction layers need
to hide at least some lower level details if they're going to actually
work properly.

>> = Option 1 =
>>
>> Fully commit to the model of allowing alternate syntactic dialects to
>> run atop Python interpreters. In Hylang and PythonQL we have at least
>> two genuinely interesting examples of that working through the text
>> encoding system, as well as other examples like Cython that work
>> through the extension module system.
>>
>> So that's an opportunity to take this from "Possible, but a bit hacky"
>> to "Pluggable source code translation is supported at all levels of
>> the interpreter, including debugger source maps, etc" (perhaps by
>> borrowing ideas from other ecosystems like Java, JavaScript, and .NET,
>> where this kind of thing is already a lot more common).
>>
>> The downside of this approach is that actually making it happen would
>> be getting pretty far afield from the original PythonQL goal of
>> "provide nicer data manipulation abstractions in Python", and it
>> wouldn't actually deliver anything new that can't already be done with
>> existing import and codec system features.
 
>
> This would be great anyways, if we could rely on some preprocessor directive,
> instead of hacking encodings, this could be nice.

Victor Stinner wrote up some ideas about that in PEP 511:
https://www.python.org/dev/peps/pep-0511/

Preprocessing is one of the specific use cases considered:
https://www.python.org/dev/peps/pep-0511/#usage-2-preprocessor

>> = Option 2 =
>>
>> ... given optionally delayed
>> rendering of interpolated strings, PythonQL could be used in the form:
>>
>>    result = pyql(i"""
>>        (x,y)
>>        for x in {range(1,8)}
>>        for y in {range(1,7)}
>>        if x % 2 == 0 and
>>           y % 2 != 0 and
>>           x > y
>>    """)
>>
>> I personally like this idea (otherwise I wouldn't have written PEP 501
>> in the first place), and the necessary technical underpinnings to
>> enable it are all largely already in place to support f-strings. If
>> the PEP were revised to show examples of using it to support
>> relatively seamless calling back and forth between Hylang, PythonQL
>> and regular Python code in the same process, that might be intriguing
>> enough to pique Guido's interest (and I'm open to adding co-authors
>> that are interested in pursuing that).
>
> What would be the difference between this and just executing a PythonQL
> string for us, getting local and global variables into PythonQL scope?

The big new technical capability that f-strings introduced is that the
compiler can see the variable references in the embedded expressions,
so f-strings "just work" with closure references, whereas passing
locals() and globals() explicitly is:

1. slow (since you have to generate a full locals dict);
2. incompatible with the use of closure variables (since they're not
visible in either locals() *or* globals())

The i-strings concept takes that closure-compatible interpolation
capability and separates it from the str.format based rendering step.
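
A small demonstration of the difference: the compiler sees `name` inside the
f-string, so it becomes a closure reference; the str.format version never
references `name` in code, so it shows up in neither locals() nor globals().

```
def make_f_greeter():
    name = "World"
    def greet():
        return f"Hello, {name}!"        # closure reference, just works
    return greet

def make_format_greeter():
    name = "World"
    def greet():
        # `name` is not referenced in this function body, so it is not
        # captured as a closure variable and the lookup below fails.
        return "Hello, {name}!".format_map({**globals(), **locals()})
    return greet

print(make_f_greeter()())               # Hello, World!
try:
    print(make_format_greeter()())
except KeyError as exc:
    print("format_map failed:", exc)    # KeyError: 'name'
```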

From a speed perspective, the interpolation aspects of this approach
are so efficient they rival simple string concatenation:

$ python -m perf timeit -s 'first = "Hello"; second = " World!"'
'first + second'
.....................
Mean +- std dev: 71.7 ns +- 2.1 ns

$ python -m perf timeit -s 'first = "Hello"; second = " World!"'
'f"{first}{second}"'
.....................
Mean +- std dev: 77.8 ns +- 2.5 ns

Something like pyql that did more than just concatenate the text
sections with the text values of the embedded expressions would still
need some form of regex-style caching strategy to avoid parsing the
same query string multiple times, but the Python interpreter would
handle the task of breaking up the string into the text sections and
the interpolated Python expressions.
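
A minimal sketch of that regex-style caching strategy (this is not the real
PythonQL/pyql API; `_parse_query` is just a stand-in parser): cache the parsed
form of each distinct query string, much as the re module caches compiled
patterns, so repeated calls with the same template skip the parsing step.

```
import functools
import re

@functools.lru_cache(maxsize=128)
def _parse_query(raw_template):
    # Stand-in "parser": split the template into literal text and
    # {expression} segments; a real implementation would build a query plan.
    return tuple(re.split(r"(\{[^}]*\})", raw_template))

def pyql(raw_template, bindings):
    segments = _parse_query(raw_template)   # parsed once, then cached
    # A real pyql would evaluate the plan against `bindings` here.
    return segments, bindings
```

With i-strings the interpreter would already hand pyql the text sections and
the evaluated expressions, so only the text sections would need caching.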

Cheers,
Nick.

--
Nick Coghlan   |   ncoghlan@gmail.com   |   Brisbane, Australia
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/