Proposal: Query language extension to Python (PythonQL)

Hi folks!

We started a project to extend Python with a full-blown query language about a year ago. The project is called PythonQL; the links are given below in the References section. We have implemented what is roughly an alpha version now, and gained some experience and insight into why and where this is really useful. So I'd like to share those with you and gather some opinions on whether you think we should try to include these extensions in the Python core.

Intro

What we have done is (mostly) extend Python's comprehensions with group by, order by, let and window clauses, which can come in any order, so that comprehensions become a query language a bit cleaner and more powerful than SQL. We also added a couple of small convenience extensions.

Our Motivations

We have identified three top motivations for folks to use these extensions:

1. This can become a standard for running queries against database systems. Instead of learning a large number of different SQL dialects (the pain point here is the libraries of functions and operators that differ for each vendor), a Python developer only needs to learn PythonQL and can then query any SQL or NoSQL database.

2. A single PythonQL expression can integrate a number of databases/files/memory structures seamlessly, with the PythonQL optimizer figuring out which pieces of the plan to ship to which databases. This is a compelling virtual database integration story that can be very convenient, especially now that a lot of data scientists use Python to wrangle data all day long.

3. Querying data structures inside Python with the full power of SQL (and a bit more) is also really convenient on its own. Usually folks who are well-versed in SQL have to resort to completely different means when they need to run a query in Python on top of some data structures.

Current Status

We have PythonQL running; it is installed via pip and an encoding hack that runs our preprocessor.
We currently compile PythonQL into Python using our executor functions and execute Python subexpressions via eval. We don't do any optimization or rewriting of queries into the languages of the underlying systems, and the query processor is basic too, with naive implementations of operators. But we've built DBMS systems before, so if there is a good amount of support for this project, we'll be able to build a real system here.

Your take on this

Extending Python's grammar is surely a painful thing for the community. We're now convinced that it is well worth it, because of all the wonderful functionality and convenience this extension offers. We'd like to get your feedback on this, and maybe you'll suggest some next steps for us.

References

PythonQL GitHub page: https://github.com/pythonql/pythonql
PythonQL Intro and Tutorial (this is all the user documentation we have right now): https://github.com/pythonql/pythonql/wiki/PythonQL-Intro-and-Tutorial
A use case of querying event logs and doing process mining with PythonQL: https://github.com/pythonql/pythonql/wiki/Event-Log-Querying-and-Process-Min...
PythonQL demo site: http://www.pythonql.org

Best regards,
PythonQL Team

On 3/24/2017 11:10 AM, Pavel Velikhov wrote:
No. PythonQL defines a comprehension-inspired SQL-like domain-specific (specialized) language. Its style of packing programs into expressions is contrary to that of Python. It appears to me that most of the added features duplicate ones already in Python (sorted, itertools, named tuples?). I think it should remain a separate project with its own development group and schedule.

This is not to say that I would never use PQL. I like the idea of a uniform method of accessing in-memory and on-disk data, and like Python's current method of making files an iterable of lines. I believe the current DB API allows something similar.

I think that the misuse of coding cookies, which makes it unusable in code that already has a proper coding cookie, should be replaced by normal imports. PQL expressions should be quoted and passed to the DSL processor, as done with SQL and other DSLs.

-- Terry Jan Reedy

Hi Terry! Thanks for your feedback, I have a couple comments below.
These features do exist separately, but usually they are quite a bit less powerful and convenient than the ones we propose, and can't be put into a single query/expression. For example, namedtuple requires one to define it first, the groupby in itertools doesn't create named tuples, and sorted takes a single expression. I do agree that syncing releases with the overall Python code can become a problem...
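For instance, the plain-Python version of a grouped aggregation shows both friction points mentioned above: the namedtuple must be declared up front, and itertools.groupby needs pre-sorted input (a hypothetical illustration, not taken from the PythonQL docs):

```python
from collections import namedtuple
from itertools import groupby

# The namedtuple has to be declared before it can be used
Row = namedtuple("Row", ["x", "y"])
rows = [Row(4, 5), Row(2, 1), Row(2, 3), Row(4, 1)]

# itertools.groupby only groups *adjacent* items, so sort by the key first
rows.sort(key=lambda r: r.x)
sums = {x: sum(r.y for r in grp)
        for x, grp in groupby(rows, key=lambda r: r.x)}
print(sums)  # {2: 4, 4: 6}
```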
This is not to say that I would never use PQL. I like the idea of a uniform method of accessing in-memory and on-disk data, and like Python's current method of making files an iterable of lines. I believe the current DB API allows something similar.
I think that the misuse of coding cookies, which makes it unusable in code that already has a proper coding cookie, should be replaced by normal imports. PQL expressions should be quoted and passed to the dsl processor, as done with SQL and other DSLs.
So we see a lot of value in having PythonQL as a syntax extension, instead of having to define string queries and then execute them. It would definitely make our lives much simpler if we went with strings. But a lot of developers use ORMs just to avoid having to construct query strings and have them break with a syntax error in various cases. In our case I guess we can avoid a lot of the hassle associated with query strings, since we can access all variables and functions from the context of the query. But when you're doing interactive work in rapid development mode (e.g. doing some data science in a Jupyter notebook), language-integrated queries could be quite convenient. This is something we'll need to think about...

Terry Reedy wrote:
PQL expressions should be quoted and passed to the dsl processor, as done with SQL and other DSLs.
But embedding one language as quoted strings inside another is a horrible way to program. I really like the idea of a data manipulation language that is seamlessly integrated with the host language.

Unfortunately, PQL does not seem to be that. It appears to only work on Python data, and the proposed solution for hooking it up to databases and the like is to use some existing DB interfacing method to get the data into Python, and then use PQL on that. I can see little point in that, since, as Terry points out, most of what PQL does can already be done fairly easily with existing Python facilities.

To be worth extending the language, PQL queries would need to be able to operate directly on data in the database, and that would mean hooking into the semantics somehow so that PQL expressions are evaluated differently from normal Python expressions. I don't see anything there that mentions any such hooks, either existing or planned.

-- Greg

No, the current solution is temporary because we just don’t have the manpower to implement the full thing: a real system that will rewrite parts of PythonQL queries and ship them to underlying databases. We need a real query optimizer and smart wrappers for this purpose. But we’ll build one of these for demo purposes soon (either a Spark wrapper or a PostgreSQL wrapper).
You can solve any problem with basic language facilities, but instead of a simple query expression you will end up with a bunch of for loops (in the case of group by) and numeric indexes into tuples. You can take a look at some of the queries in this use case and see how they would look in pure Python: https://github.com/pythonql/pythonql/wiki/Event-Log-Querying-and-Process-Min...
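To make the point concrete, here is a sketch (mine, not from the wiki) of the kind of rewrite Pavel means: the one-expression query "for even x and odd y with x > y, group by x, keep groups with an odd sum of y" turns into explicit loops and positional tuples:

```python
# Accumulate groups by hand: nested loops plus a dict of lists
groups = {}
for x in range(1, 8):
    for y in range(1, 7):
        if x % 2 == 0 and y % 2 != 0 and x > y:
            groups.setdefault(x, []).append(y)

# Aggregate and filter the groups, ending up with positional tuples
result = []
for x, ys in groups.items():
    sum_y = sum(ys)
    if sum_y % 2 != 0:
        result.append((x, sum_y))  # result[i][1] is sum_y -- numeric indexing
print(sorted(result))  # [(2, 1), (6, 9)]
```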
This is definitely planned; currently PythonQL expressions are evaluated separately because we run them through a preprocessor when you specify the pythonql encoding. We might need to add some hooks like PonyORM does; otherwise we might have to trace iterators to their source through the AST, which can become messy. It is a lot of work, so we're not promising this in the near future. As far as hooks go (if the language is integrated into core Python), we will probably have to define some kind of a 'Datasource' wrapper function that would wrap database cursors, Spark RDDs, etc.

On 25 March 2017 at 11:24, Pavel Velikhov <pavel.velikhov@gmail.com> wrote:
One thought: if you're lacking in manpower now, then proposing inclusion into core Python means that the core dev team will be taking on an additional chunk of code that is already under-resourced. That rings alarm bells for me - how would you imagine the work needed to merge PythonQL into the core Python grammar would be resourced?

I should say that in practice, I think that the solution is relatively niche, and overlaps quite significantly with existing Python features, so I don't really see a compelling case for inclusion. The parallel with C# and LINQ is interesting here - LINQ is a pretty cool technology, but I don't see it in widespread use in general-purpose C# projects (disclaimer: I don't get to see much C# code, so my experience is limited).

Paul

Hi Paul!
An inclusion in core would definitely help us grow the team, but I see your point. If we could get an understanding that we'd be accepted into the core once we do a), b) and c), and if we had a big enough team to be responsive, that could also help us grow.
I’m not sure about the usual crowd of Python developers, but data scientists like the idea a lot, especially the future plans. If we really do end up with millions of data scientists soon, this could become pretty big. We’re also seeing a lot of much more advanced use cases popping up, where PythonQL could really shine. Yes, LINQ didn’t go big, but maybe it was a bit ahead of its time.

Hello!

If I had to provide a unified way of dealing with data, whatever its source is, I would probably go with creating a standardized ORM, probably based on Django's or PonyORM, because:

- it doesn't require any change in the Python language;
- it allows queries for both reading and writing. I didn't see a way to write (UPDATE, CREATE and DELETE equivalents), but maybe I didn't look right. Or maybe I didn't understand the purpose at all!
- it makes it easy (although not fast) to extend to any backend (files, databases, or any other kind of data storage);
- as a pure Python object, any IDE already supports it;
- it can live as a separate module until it is stable enough to be integrated into the standard library (if it should ever be integrated there);
- it is way easier to learn and use. Comprehensions are not the easiest things to work with. Most of the places I worked in forbid their use except for really easy and short cases (single line, single loop, simple tests). You're adding a whole new syntax with new keywords to one of the most complicated and least readable (yet efficient in many cases, don't get me wrong) parts of the language.

I'm not saying there is no need for what you're developing, there probably is if you did it, but maybe the solution you chose isn't the easiest way to have it merged into the core language, and if it was, it would really be a long shot, as there are many new keywords and much new syntax to discuss, implement and test. But I like the idea of a standard API to deal with data, a nice battery to be included.

Side note: I might not have understood what you were doing, so if I'm off-topic, tell me!

-Brice

On 24/03/17 at 16:10, Pavel Velikhov wrote:

On 25 Mar 2017, at 10:58, Brice PARENT <contact@brice.xyz> wrote:
Hello!
Hello!
If I had to provide a unified way of dealing with data, whatever its source is, I would probably go with creating a standardized ORM, probably based on Django's or PonyORM, because:
- it doesn't require any change in the Python language
So we basically want to be just like PonyORM (the part that executes comprehensions against objects and databases), except we find that the current comprehension syntax is not powerful enough to express all the queries we want. So we extended the comprehension syntax, and then our strategy is a lot like PonyORM.
- it allows queries for both reading and writing. I didn't see a way to write (UPDATE, CREATE and DELETE equivalents), but maybe I didn't look right. Or maybe I didn't understand the purpose at all!
Good point! This is a hard one, a single query in PythonQL can go against multiple databases, so doing updates with queries can be a big problem. We can add an insert/update/delete syntax to PythonQL and just track that these operations make sense.
- it makes it easy (although not fast) to extend to any backend (files, databases, or any other kind of data storage)
Going to different backends is not a huge problem, just a bit of work.
- as a pure python object, any IDE already supports it.
- It can live as a separate module until it is stable enough to be integrated into standard library (if it should ever be integrated there).
So we live as an external module via an encoding hack :)
- it is way easier to learn and use. Comprehensions are not the easier things to work with. Most of the places I worked in forbid their use except for really easy and short cases (single line, single loop, simple tests). You're adding a whole new syntax with new keywords to one of the most complicated and less readable (yet efficient in many cases, don't get me wrong) part of the language.
Yes, that’s true. But comprehensions are basically a small subset of SQL; we’re extending them a bit to get the full power of a query language. So we’re catering to folks who already know this language or are willing to learn it for their daily needs.
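A comprehension really does read like a small SQL query already; a side-by-side sketch (illustrative only, the data is invented):

```python
# The comprehension below mirrors the SQL query:
#   SELECT x, y FROM xs, ys WHERE x > y
xs, ys = [1, 2, 3], [1, 2]
pairs = [(x, y) for x in xs for y in ys if x > y]
print(pairs)  # [(2, 1), (3, 1), (3, 2)]
```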
I'm not saying there is no need for what you're developing, there probably is if you did it, but maybe the solution you chose isn't the easier way to have it merged to the core language, and if it was, it would really be a long shot, as there are many new keywords and new syntax to discuss, implement and test.
Yes, we started off by catering to folks like data scientists or similar folks that have to write really complex queries all the time.
Thanks for the feedback! This is very useful.

Pavel,

I like PythonQL. I perform a lot of data transformation, and often find Python's list comprehensions too limiting, leaving me wishing for LINQ-like language features.

As an alternative to extending Python with PythonQL, Terry Reedy suggested interpreting a DSL string, and Pavel Velikhov alluded to using magic method tricks found in ORM libraries. I can see how both of these are not satisfactory. A third alternative could be to encode the query clauses as JSON objects. For example:

    result = [ select (x, sum_y)
               for x in range(1,8),
                   y in range(1,7)
               where x % 2 == 0 and y % 2 != 0 and x > y
               group by x
               let sum_y = sum(y)
               where sum_y % 2 != 0
             ]

    result = pq([
        {"select": ["x", "sum_y"]},
        {"for": {"x": range(1,8), "y": range(1,7)}},
        {"where": lambda x, y: x % 2 == 0 and y % 2 != 0 and x > y},
        {"groupby": "x"},
        {"with": {"sum_y": {"SUM": "y"}}},
        {"where": {"neq": [{"mod": ["sum_y", 2]}, 0]}}
    ])

This representation does look a little lispy, and it may resemble PythonQL's parse tree. I think the benefits are:

1) no Python language change
2) easier to parse
3) better than a string-based DSL for catching syntax errors
4) the {"clause": parameters} format is flexible for handling common query patterns **
5) works in JavaScript too
6) easy to compose with automation (my favorite)

It is probably easy for you to see the drawbacks.

** The `where` clause can accept a native lambda function, or an expression tree.

"If you are writing a loop, you are doing it wrong!" :)

On 2017-03-24 11:10, Pavel Velikhov wrote:
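The clause-list encoding sketched above can in fact be interpreted with a few dozen lines of plain Python. The sketch below implements a simplified variant (predicates take the whole row dict, and a `let` tuple replaces the `with`/expression-tree forms); `pq` here is a toy stand-in for illustration, not a real library:

```python
from itertools import product

def pq(clauses):
    # Toy interpreter for a simplified clause-list query encoding.
    rows = [{}]
    for clause in clauses:
        (op, arg), = clause.items()
        if op == "for":                       # cartesian product of sources
            names = list(arg)
            rows = [dict(r, **dict(zip(names, combo)))
                    for r in rows
                    for combo in product(*(arg[n] for n in names))]
        elif op == "where":                   # filter rows by a predicate
            rows = [r for r in rows if arg(r)]
        elif op == "groupby":                 # one row per key, with a 'group' list
            groups = {}
            for r in rows:
                groups.setdefault(r[arg], []).append(r)
            rows = [{arg: k, "group": g} for k, g in groups.items()]
        elif op == "let":                     # bind a computed column
            name, fn = arg
            rows = [dict(r, **{name: fn(r)}) for r in rows]
        elif op == "select":                  # project to tuples
            rows = [tuple(r[n] for n in arg) for r in rows]
    return rows

result = pq([
    {"for": {"x": range(1, 8), "y": range(1, 7)}},
    {"where": lambda r: r["x"] % 2 == 0 and r["y"] % 2 != 0 and r["x"] > r["y"]},
    {"groupby": "x"},
    {"let": ("sum_y", lambda r: sum(g["y"] for g in r["group"]))},
    {"where": lambda r: r["sum_y"] % 2 != 0},
    {"select": ["x", "sum_y"]},
])
print(sorted(result))  # [(2, 1), (6, 9)]
```

Being ordinary data, such clause lists can be built, inspected and composed programmatically, which is exactly benefit 6 above.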

On 3/25/2017 11:40 AM, Kyle Lahnakoski wrote:
PythonQL version
Someone mentioned the problem of adding multiple new keywords. Even one requires a proposal to meet a high bar; I think we have averaged less than one new keyword per release in the last 20 years.

Searching '\bgroup\b' just in /lib (the 3.6 stdlib on Windows) gets over 300 code hits in about 30 files; re's match.group() accounts for many. I think this makes it ineligible to become a keyword. 'select' has a fair number of code uses also. I also see 'where', 'let', and 'by' in the above.
-- Terry Jan Reedy

Terry,
Yes, we add quite a few keywords. If you look at the window clause we have, there are even more keywords there. This is definitely a huge concern and, in my view, the main reason the community would oppose the change. I’m not too experienced with the Python parser, but could we make all these keywords not be real keywords (interpreted as keywords only inside comprehensions, without breaking any other code)?

On 3/26/2017 7:14 AM, Pavel Velikhov wrote:
It might be possible (or not!) to make the clause-heading words like 'where' or 'groupby' (this would have to be one word) recognized as special only in the context of starting a new comprehension clause. The precedents for 'keyword in context' are 'as', 'async', and 'await'. But these were temporary and a nuisance (both to code and for syntax highlighting) and I would not be in favor of repeating that for this case.

For direct integration with Python, I think you should work on and promote a more generic approach as Nick has suggested. Or work on a 3rd party environment that is not constrained by core Python. Or you could consider making use of IDLE; it would be trivial to run code extracted from a text widget through a preprocessor before submitting it to compile().

-- Terry Jan Reedy

I prefer this a lot to the original syntax, and I really think this has much better chances of being integrated (if such an integration had to be done, and not kept as a separate module). Also, maybe managing this with classes instead of syntax could be done easily (without any change to Python), like this:

    from pyql import PQL, Select, For, Where, GroupBy, Let

    result = PQL(
        Select("x", "sum_y"),
        For("x", range(1, 8)),
        For("y", range(1, 7)),
        Where(lambda x, y: x % 2 == 0 and y % 2 != 0 and x > y, "x", "y"),  # function, *[arguments to pass to the function]
        Where("sum_y", lambda sum_y: sum_y % 2 != 0),
        GroupBy("x"),
        Let("sum_y", lambda y: sum(y), "y")
    )

(to be defined more precisely; I don't really like relying on case to differentiate the "for" keyword and the "For" class, which by the way could be inherited from a more general "From" class, allowing you to get the data from a database, pure Python, a JSON/csv/xml file/object, or anything else.)

With nice lazy evaluation, in the order of the arguments of the constructor, I suppose you could achieve everything you need, and have an easily extendable syntax (create new objects between versions of the module, without ever having to create new keywords). There is no new parsing, and you already have the autocompletion of your IDE if you annotate your code correctly. You could even have this:

    query = PQL(
        Select("x", "sum_y"),
        Where("x", "y", lambda x, y: x % 2 == 0 and y % 2 != 0 and x > y),
        Where("sum_y", lambda sum_y: sum_y % 2 != 0),
        GroupBy("x"),
        Let("sum_y", lambda y: sum(y), "y")  # [name of the new var], function, *[arguments to pass to the function]
    )
    query.execute(x=range(1, 8), y=range(1, 7))

or

    query.execute(PgDatabase(**dbsettings))

-Brice

On 25/03/17 at 16:40, Kyle Lahnakoski wrote:

Hi Brice,
So here’s the deal: small queries will look pretty decent in pretty much all paradigms - ORM, PythonQL, or your proposal. Once they get bigger and combine multiple pain points (say outer joins, grouping and nested data), then unless you have a really clear and minimal language, folks will get confused and lost. We’ve seen a few query languages fail, including XQuery and others, and the main reason was the need to learn a whole new language and a bunch of libraries; nobody wanted to do it. So the main selling point behind PythonQL is: it’s Python that folks hopefully know already, with just a few extensions.

On 27/03/17 at 10:55, Pavel Velikhov wrote:

This syntax is only used in the PyQL sub-language; it's not really Python any more... Also, what I like with what I used is that it is object-based, which allows any part of the query to be reusable or built dynamically. We might also extend such a PQL object's constructor to embed automatically whatever default parameters or database connection we want, or shared behaviours, like:

    class MyPQL(PQL):
        def get_limit(self):
            if self.limit is not None:
                return self.limit
            return 10

        def __init__(self, *args):
            args.append(Let("sum_y", lambda y: sum(y), "y"))
            args.append(GroupBy("x"))
            super().__init__(*args)

    result = MyPQL(
        Select("x", "sum_y"),
        For("x", range(1, 8)),
        For("y", range(1, 7)),
        Where(lambda x, y: x % 2 == 0 and y % 2 != 0 and x > y, "x", "y"),
        Where("sum_y", lambda sum_y: sum_y % 2 != 0)
    )

Big queries, this way, may be split into smaller parts. And it allows you to do the following in a single query, instead of having to write one big query for each condition:

    where_from = [For("x", range(1, 8)), For("y", range(1, 7))]
    where = [Where(lambda x, y: x % 2 == 0 and y % 2 != 0 and x > y, "x", "y")]
    if filter_sum_y:
        where.append(Where("sum_y", lambda sum_y: sum_y % 2 != 0))
    if group_by is not None:
        grouping = [GroupBy("x")]
    result = MyPQL(Select("x", "sum_y"), *where_from, *where, *grouping)

Side note: I'm not a big database user, I mostly use ORMs (Django's and PonyORM depending on the projects) to access PgSQL and SQLite (for unit testing), so I might not even have use cases for what you're trying to solve. I just give my point of view here to explain what I think could be more easily integrated and (re)used. And as I'm a big fan of the DRY mentality, I'm not a fan of the syntax-chaining things (as well as I don't really like big nested comprehensions).

-Brice

On 27 March 2017 at 10:54, Brice PARENT <contact@brice.xyz> wrote:
... which is why I suspect that this discussion would be better expressed as a suggestion that Python provide better support for domain-specific languages like the one PythonQL offers. In that context, the "extended comprehension" format *would* be Python; specifically, it would simply be a DSL embedded in Python using Python's standard features for doing that.

Of course, that's just a re-framing of the perception, and the people who don't like sub-languages will be just as uncomfortable with DSLs. However, it does put this request into the context of DSL support, which is something that many languages provide, to a greater or lesser extent. For Python, Guido has traditionally been against allowing the language to be mutable in the way that DSLs permit, so in the first instance it's likely that the PythonQL proposal will face a lot of resistance. It's possible that PythonQL could provide a use case that shows the benefits of allowing DSLs to such an extent that Guido changes his mind, but that's not yet proven (and it's not really something that's been argued here yet). And it does change the discussion from being about who prefers which syntax to being about where we want the language to go in terms of DSLs.

Personally, I quite like limited DSL support (things like allowing no-parenthesis function calls can make it possible to write code that uses functions as if they were keywords). But it does impose a burden on people supporting the code, because they have to understand the non-standard syntax. So I'm happy with Python's current choice not to go down that route, even though I do find it occasionally limiting.

If I needed PythonQL features, I'd personally find Brice's class-based approach quite readable/acceptable. I find the PythonQL form nice also, but not enough of an advantage to warrant all the extra keywords/syntax etc.

Paul

Hi Pavel,

This is a really impressive body of work. I had looked at this project in the past, but it is great to get back up to speed and see all the progress made.

I use Python + databases almost every day, and the major unanswered question is: what benefit does dedicated language syntax have over using a DBAL/ORM with a builder-style API? It obviously has huge costs (as all syntax changes do), but the benefit is not obvious to me: I have never found myself wanting built-in syntax for writing database queries.

My second thought is that every database layer I've ever used was unavoidably leaky or incomplete. Database functionality (even if we constrain "database" to mean RDBMS) is too diverse to be completely abstracted away. This is why so many different abstractions already exist, e.g. low-level like DBAPI and high-level like SQLAlchemy. You're not going to find much support for cementing an imperfect abstraction right into the Python grammar. In order to make the abstraction relatively complete, you'd need to almost completely merge the ANSI SQL grammar into the Python grammar, which sounds terrifying.

Third thought: is the implementation of a "Python query language" as generic as the name implies? The docs mention support for document databases, but can I run Redis queries? LDAP queries? DNS queries?
Fourth thought: until PythonQL can abstract over a real database, it's far too early to consider putting it into the language itself. These kinds of "big change" projects typically need to stabilize on their own for a long time before anybody will even consider putting them into the core language.

Finally – to end on a positive note – the coolest part of this project, from my point of view, is using SQL as an abstraction over in-memory objects or raw files. I can see how somebody who is comfortable with SQL would prefer this declarative approach. I could see myself using an API like this to search a Pandas dataframe, for example.

Cheers,
Mark

On Fri, Mar 24, 2017 at 11:10 AM, Pavel Velikhov <pavel.velikhov@gmail.com> wrote:

Hi Mark,
We do a lot of work with data every day doing data science. We have to use tools like pandas, and they don’t work in a lot of cases, and in many cases we end up with very cryptic notebooks that only the authors can work with... I actually use PythonQL in daily work and it has simplified a lot of things greatly. The general insight is that a small language can add a lot more value than a huge library, because it’s easy to combine good ideas in a language. If you look at some examples of complex PythonQL, I think you might change your mind: https://github.com/pythonql/pythonql/wiki/Event-Log-Querying-and-Process-Min...
My second thought is that every database layer I've ever used was unavoidably leaky or incomplete. Database functionality (even if we constrain "database" to mean RDBMS) is too diverse to be completely abstracted away. This is why so many different abstractions already exist, e.g. low-level like DBAPI and high-level like SQL Alchemy. You're not going to find much support for cementing an imperfect abstraction right into the Python grammar. In order to make the abstraction relatively complete, you'd need to almost complete merge ANSI SQL grammar into Python grammar, which sounds terrifying.
I don’t see a problem here, except for a performance problem. I.e., you’ll be able to write queries of any complexity in PythonQL, and most of the work will be pushed into the underlying database. Stuff that can’t be pushed will be finished up at the Python layer. We don’t have to guarantee the other direction - i.e. if a DBMS has transitive closure, for instance, we don’t have to support it in PythonQL.
Third thought: is the implementation of a "Python query language" as generic as the name implies? The docs mention support for document databases, but I can run Redis queries? LDAP queries? DNS queries?
We definitely can support Redis. LDAP and DNS - don’t know if we want to go there, I would stop at databases for now.
We haven't built a real SQL database wrapper yet, but in the meanwhile you can use libraries like psycopg2 or SQLAlchemy to get data from the database into an iterator, and then PythonQL can run on top of such an iterator.
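That pattern needs nothing beyond a DB driver's cursor protocol. A minimal sketch with the stdlib sqlite3 module (the schema and data here are made up for the demo):

```python
import sqlite3

# In-memory stand-in for a real database (schema invented for the demo)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 5), ("bob", 3), ("ann", 7)])

# A DBAPI cursor is an iterator of row tuples, so any comprehension-based
# query layer can consume it directly
cursor = conn.execute("SELECT user, amount FROM events")
totals = {}
for user, amount in cursor:
    totals[user] = totals.get(user, 0) + amount
print(totals)  # {'ann': 12, 'bob': 3}
```

PythonQL (or a plain comprehension) could consume `cursor` the same way, since DBAPI cursors are iterables of row tuples.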
Fourth thought: until PythonQL can abstract over a real database, it's far too early to consider putting it into the language itself. These kinds of "big change" projects typically need to stabilize on their own for a long time before anybody will even consider putting them into the core language.
We’re definitely at the start of this, because we have huge plans for PythonQL, including a powerful planner/optimizer and wrappers for most popular DBMSs. If we get the support of the Python community though it would help us to move faster for sure.
Finally – to end on a positive note – the coolest part of this project from my point of view is using SQL as an abstraction over in-memory objects or raw files. I can see how somebody that is comfortable with SQL would prefer this declarative approach. I could see myself using an API like this to search a Pandas dataframe, for example.
I think if we get this right, we might unlock some cool new usages. I really believe that if we simplify integration of multiple data sources sufficiently, a lot of dirty work of data scientists will become much simpler.

I think it's extraordinarily unlikely that a big change in Python syntax to support query syntax will ever happen. Moreover, I would oppose such a change myself. But such a change also really is not necessary.

Pandas already abstracts all the things mentioned using only Python methods. It is true that Pandas sometimes does some black magic within those methods to get there; and it also uses a somewhat non-Pythonic style of long chains of method calls. But it does everything PythonQL does, as well as much, much more. Pandas builds in DataFrame readers for every data source you are likely to encounter, including leveraging all the abstractions provided by RDBMS drivers, etc. It does groupby, join, etc. See, e.g.: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

Now there's one reasonable objection to Pandas: it doesn't handle larger-than-memory datasets well. I don't see that PythonQL is better in that regard. But there is an easy next step for that larger data. Blaze provides generic interfaces to many, many larger-than-memory data sources. It is largely a subset of the Pandas API, although not precisely that. See, e.g.: http://blaze.readthedocs.io/en/latest/rosetta-sql.html

Moreover, within the Blaze "orbit" is Dask. This is a framework for parallel computation, one of whose abstractions is a DataFrame based on Pandas. This gives you 90% of those methods for slicing-and-dicing data that Pandas has, but deals seamlessly with larger-than-memory datasets. See, e.g.: http://dask.pydata.org/en/latest/dataframe.html

So I think your burden is even higher than showing the usefulness of PythonQL. You have to show why it's worth adding new syntax to do somewhat LESS than is available in very widely used 3rd party tools that avoid new syntax.

On Fri, Mar 24, 2017 at 8:10 AM, Pavel Velikhov <pavel.velikhov@gmail.com> wrote:

Hi David
I work daily with pandas, and while it does have the functionality that PythonQL introduces, it's a completely different beast. One of the reasons I started with PythonQL is that pandas is so difficult to master (just like any function-based database API would be). The key benefit of PythonQL is that with minimal grammar extensions you get the power of a real query language. So a Python programmer who knows comprehensions well and has a good idea of SQL or other query languages can start writing complex queries right away. With pandas you need to read the docs all the time, and complex data transformations become incredibly cryptic.

Hello,

I've been following PythonQL with interest. I like the clever hack using the Python encoding machinery. It's definitely not something I would recommend for inclusion in Python, as it hijacks the Python encoding method, which prevents you from... well, choosing an encoding. And it requires having a file. However, I find the idea great for demonstration purposes.

Like LINQ, the strength of your tool is the integrated syntax. I myself found it annoying to import itertools all the time. I eventually wrote a wrapper so I could use slicing on generators, callables in slicing, etc. for this very reason.

However, I have good news. When the debate about f-strings was on this list, the concept was split into several parts: the f-string currently implemented in Python 3.6, and a more advanced type of string interpolation, the i-string from PEP 501 (https://www.python.org/dev/peps/pep-0501/), which is still to be implemented. The idea of the i-string was to allow something like this:

    mycommand = sh(i"cat (unknown)")
    myquery = sql(i"SELECT {column} FROM {table};")
    myresponse = html(i"<html><body>{response.body}</body></html>")

which would then pass an object to sql/sh/html() with the string, the placeholders and the variable context, and then allow it to do whatever you want. Evaluation of the i-string would of course be lazy.

So while I don't think PythonQL can be integrated into Python the way it is, you may want to champion PEP 501. This way you will be able to provide a PQL hook allowing you to do something like:

    pql(i"""select (x, sum_y)
            for x in range(1,8),
                y in {stuff}
            where x % 2 == 0 and y % 2 != 0 and x > y
            group by x
            let sum_y = sum(y)
            where sum_y % 2 != 0
        """)

Granted, this is not as elegant as your DSL, but it would make it easier to adopt anywhere: REPL, IPython notebook, files in Python with a different encoding, embedded Python, alternative Python implementations, compiled Python, etc.
Plus the sooner we start with i-string, the sooner editors will implement syntax highlighting for the popular dialects. This would allow you to spread the popularity of your tool and maybe change the way it's seen on this list.

On 3/24/2017 11:10 AM, Pavel Velikhov wrote:
No. PythonQL defines a comprehension-inspired SQL-like domain-specific (specialized) language. Its style of packing programs into expressions is contrary to that of Python. It appears to me that most of the added features duplicate ones already in Python (sorted, itertools, named tuples?). I think it should remain a separate project with its own development group and schedule. This is not to say that I would never use PQL. I like the idea of a uniform method of accessing in-memory and on-disk data, and like Python's current method of making files an iterable of lines. I believe the current DB API allows something similar. I think that the misuse of coding cookies, which makes it unusable in code that already has a proper coding cookie, should be replaced by normal imports. PQL expressions should be quoted and passed to the dsl processor, as done with SQL and other DSLs. -- Terry Jan Reedy

Hi Terry! Thanks for your feedback, I have a couple comments below.
These features do exist separately, but usually they are quite a bit less powerful and convenient than the ones we propose, and they can’t be combined into a single query/expression. For example, namedtuple requires one to define it first, the groupby in itertools doesn’t create named tuples, and sorted takes a single expression. I do agree that syncing releases with the overall Python code can become a problem...
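To illustrate the friction with plain Python, here is a small sketch (the sample data and names are made up): itertools.groupby only groups consecutive elements, so the input must be sorted first, and named access requires declaring a namedtuple type separately.

```python
from collections import namedtuple
from itertools import groupby
from operator import itemgetter

# itertools.groupby requires the data to be sorted on the grouping key first
rows = [("a", 3), ("b", 1), ("a", 2), ("b", 4)]
rows.sort(key=itemgetter(0))

# named access requires declaring the tuple type up front
Group = namedtuple("Group", ["key", "total"])

groups = [
    Group(key, sum(v for _, v in items))
    for key, items in groupby(rows, key=itemgetter(0))
]
print(groups)  # [Group(key='a', total=5), Group(key='b', total=5)]
```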
This is not to say that I would never use PQL. I like the idea of a uniform method of accessing in-memory and on-disk data, and like Python's current method of making files an iterable of lines. I believe the current DB API allows something similar.
I think that the misuse of coding cookies, which makes it unusable in code that already has a proper coding cookie, should be replaced by normal imports. PQL expressions should be quoted and passed to the dsl processor, as done with SQL and other DSLs.
So we see a lot of value in having PythonQL as a syntax extension, instead of having to define string queries and then execute them. It would definitely make our lives much simpler if we went with strings. But a lot of developers use ORMs just to avoid having to construct query strings and have them break with a syntax error in different cases. In our case I guess we can avoid a lot of the hassle associated with query strings, since we can access all variables and functions from the context of the query. But when you’re doing interactive stuff in rapid development mode (e.g. doing some data science in a Jupyter notebook), language-integrated queries could be quite convenient. This is something we’ll need to think about...

Terry Reedy wrote:
PQL expressions should be quoted and passed to the dsl processor, as done with SQL and other DSLs.
But embedding one language as quoted strings inside another is a horrible way to program. I really like the idea of a data manipulation language that is seamlessly integrated with the host language. Unfortunately, PQL does not seem to be that. It appears to only work on Python data, and their proposed solution for hooking it up to databases and the like is to use some existing DB interfacing method to get the data into Python, and then use PQL on that. I can see little point in that, since as Terry points out, most of what PQL does can already be done fairly easily with existing Python facilities. To be worth extending the language, PQL queries would need to be able to operate directly on data in the database, and that would mean hooking into the semantics somehow so that PQL expressions are evaluated differently from normal Python expressions. I don't see anything there that mentions any such hooks, either existing or planned. -- Greg

No, the current solution is temporary because we just don’t have the manpower to implement the full thing: a real system that will rewrite parts of PythonQL queries and ship them to underlying databases. We need a real query optimizer and smart wrappers for this purpose. But we’ll build one of these for demo purposes soon (either a Spark wrapper or a PostgreSQL wrapper).
You can solve any problem with basic language facilities. But instead of a simple query expression you will end up with a bunch of for loops (in case of groupby) and numeric indexes into tuples. You can take a look at some of the queries in this use-case and see how they would look in pure Python: https://github.com/pythonql/pythonql/wiki/Event-Log-Querying-and-Process-Min... <https://github.com/pythonql/pythonql/wiki/Event-Log-Querying-and-Process-Min...>
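For comparison, here is the group-by example query used elsewhere in this thread (pairs of x and y, grouped by x, keeping groups with an odd sum) written with plain loops and a dict, which is roughly what the comprehension-free version looks like:

```python
# plain-Python version of:
#   select (x, sum_y) for x in range(1,8), y in range(1,7)
#   where x % 2 == 0 and y % 2 != 0 and x > y
#   group by x let sum_y = sum(y) where sum_y % 2 != 0
groups = {}
for x in range(1, 8):
    for y in range(1, 7):
        if x % 2 == 0 and y % 2 != 0 and x > y:
            groups.setdefault(x, []).append(y)

result = []
for x, ys in groups.items():
    sum_y = sum(ys)
    if sum_y % 2 != 0:
        result.append((x, sum_y))
print(result)  # [(2, 1), (6, 9)]
```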
This is definitely planned; currently PythonQL expressions are evaluated separately because we run them through a pre-processor when you specify the pythonql encoding. We might need to add some hooks like PonyORM does; otherwise we might have to trace iterators to their source through the AST, which can become messy. It is a lot of work, so we’re not promising this in the near future. As far as hooks go (if the language is integrated into core Python), we will probably have to define some kind of a ‘Datasource’ wrapper function that would wrap database cursors, Spark RDDs, etc.

On 25 March 2017 at 11:24, Pavel Velikhov <pavel.velikhov@gmail.com> wrote:
One thought, if you're lacking in manpower now, then proposing inclusion into core Python means that the core dev team will be taking on an additional chunk of code that is already under-resourced. That rings alarm bells for me - how would you imagine the work needed to merge PythonQL into the core Python grammar would be resourced? I should say that in practice, I think that the solution is relatively niche, and overlaps quite significantly with existing Python features, so I don't really see a compelling case for inclusion. The parallel with C# and LINQ is interesting here - LINQ is a pretty cool technology, but I don't see it in widespread use in general-purpose C# projects (disclaimer: I don't get to see much C# code, so my experience is limited). Paul

Hi Paul!
An inclusion in core would definitely help us to grow the team, but I see your point. If we could get an indication that we’d be in the core if we did a), b), c), and had a big enough team to be responsive, that could also help us grow.
I’m not sure about the usual crowd of Python developers, but data scientists like the idea a lot, especially the future plans. If we really do get millions of data scientists soon, this could become pretty big. We’re also seeing a lot of much more advanced use-cases popping up, where PythonQL could really shine. Yes, LINQ didn’t go big, but maybe it was a bit ahead of its time.

Hello! If I had to provide a unified way of dealing with data, whatever its source is, I would probably go with creating a standardized ORM, probably based on Django's or PonyORM, because:
- it doesn't require any change in the Python language
- it allows queries for both reading and writing. I didn't see a way to write (UPDATE, CREATE and DELETE equivalents), but maybe I didn't look right. Or maybe I didn't understand the purpose at all!
- it makes it easy (although not fast) to extend to any backend (files, databases, or any other kind of data storage)
- as a pure Python object, any IDE already supports it.
- it can live as a separate module until it is stable enough to be integrated into the standard library (if it should ever be integrated there).
- it is way easier to learn and use. Comprehensions are not the easiest things to work with. Most of the places I worked in forbid their use except for really easy and short cases (single line, single loop, simple tests). You're adding a whole new syntax with new keywords to one of the most complicated and least readable (yet efficient in many cases, don't get me wrong) parts of the language.
I'm not saying there is no need for what you're developing, there probably is if you did it, but maybe the solution you chose isn't the easiest way to have it merged into the core language, and if it were, it would really be a long shot, as there are many new keywords and much new syntax to discuss, implement and test. But I like the idea of a standard API to deal with data, a nice battery to be included.
Side note: I might not have understood what you were doing, so if I'm off-topic, tell me!
-Brice
On 24/03/17 at 16:10, Pavel Velikhov wrote:

On 25 Mar 2017, at 10:58, Brice PARENT <contact@brice.xyz> wrote:
Hello!
Hello!
If I had to provide a unified way of dealing with data, whatever its source is, I would probably go with creating a standardized ORM, probably based on Django's or PonyORM, because:
- it doesn't require any change in the Python language
So we basically want to be just like PonyORM (the part that executes comprehensions against objects and databases), except we find that the current comprehension syntax is not powerful enough to express all the queries we want. So we extended the comprehension syntax, and then our strategy is a lot like PonyORM.
- it allows queries for both reading and writing. I didn't see a way to write (UPDATE, CREATE and DELETE equivalents), but maybe I didn't look right. Or maybe I didn't understand the purpose at all!
Good point! This is a hard one, a single query in PythonQL can go against multiple databases, so doing updates with queries can be a big problem. We can add an insert/update/delete syntax to PythonQL and just track that these operations make sense.
- it makes it easy (although not fast) to extend to any backend (files, databases, or any other kind of data storage)
Going to different backends is not a huge problem, just a bit of work.
- as a pure python object, any IDE already supports it.
- It can live as a separate module until it is stable enough to be integrated into standard library (if it should ever be integrated there).
So we live as an external module via an encoding hack :)
- it is way easier to learn and use. Comprehensions are not the easiest things to work with. Most of the places I worked in forbid their use except for really easy and short cases (single line, single loop, simple tests). You're adding a whole new syntax with new keywords to one of the most complicated and least readable (yet efficient in many cases, don't get me wrong) parts of the language.
Yes, that’s true. But comprehensions are basically a small subset of SQL, and we’re extending them a bit to get the full power of a query language. So we’re catering to folks that already know this language or are willing to learn it for their daily needs.
I'm not saying there is no need for what you're developing, there probably is if you did it, but maybe the solution you chose isn't the easiest way to have it merged into the core language, and if it were, it would really be a long shot, as there are many new keywords and much new syntax to discuss, implement and test.
Yes, we started off by catering to data scientists and similar folks that have to write really complex queries all the time.
Thanks for the feedback! This is very useful.

Pavel, I like PythonQL. I perform a lot of data transformation, and often find Python's list comprehensions too limiting, leaving me wishing for LINQ-like language features. As an alternative to extending Python with PythonQL, Terry Reedy suggested interpreting a DSL string, and Pavel Velikhov alluded to using magic method tricks found in ORM libraries. I can see how both of these are not satisfactory. A third alternative could be to encode the query clauses as JSON objects. For example:

result = [ select (x, sum_y)
           for x in range(1,8), y in range(1,7)
           where x % 2 == 0 and y % 2 != 0 and x > y
           group by x
           let sum_y = sum(y)
           where sum_y % 2 != 0 ]

result = pq([
    {"select": ["x", "sum_y"]},
    {"for": {"x": range(1,8), "y": range(1,7)}},
    {"where": lambda x, y: x % 2 == 0 and y % 2 != 0 and x > y},
    {"groupby": "x"},
    {"with": {"sum_y": {"SUM": "y"}}},
    {"where": {"neq": [{"mod": ["sum_y", 2]}, 0]}}
])

This representation does look a little lispy, and it may resemble PythonQL's parse tree. I think the benefits are: 1) no Python language change 2) easier to parse 3) better than a string-based DSL for catching syntax errors 4) the {"clause": parameters} format is flexible for handling common query patterns ** 5) works in JavaScript too 6) easy to compose with automation (my favorite). It is probably easy for you to see the drawbacks. ** The `where` clause can accept a native lambda function, or an expression tree. "If you are writing a loop, you are doing it wrong!" :) On 2017-03-24 11:10, Pavel Velikhov wrote:
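A toy evaluator for this clause-list encoding is small; the sketch below uses a hypothetical `pq` function, simplified so that every lambda receives the row dict, the `with` lambda receives the group's rows, and `select` is applied last:

```python
from itertools import product

def pq(clauses):
    """Tiny illustrative evaluator for a clause-list query (hypothetical API)."""
    rows = [{}]
    for clause in clauses:
        (kind, arg), = clause.items()
        if kind == "for":
            # cross product of all bound iterables, one dict per row
            names = list(arg)
            rows = [dict(r, **dict(zip(names, combo)))
                    for r in rows
                    for combo in product(*arg.values())]
        elif kind == "where":
            rows = [r for r in rows if arg(r)]
        elif kind == "groupby":
            grouped = {}
            for r in rows:
                grouped.setdefault(r[arg], []).append(r)
            rows = [{arg: key, "_group": members}
                    for key, members in grouped.items()]
        elif kind == "with":
            # computed columns over each group's member rows
            for name, fn in arg.items():
                for r in rows:
                    r[name] = fn(r["_group"])
        elif kind == "select":
            rows = [tuple(r[name] for name in arg) for r in rows]
    return rows

result = pq([
    {"for": {"x": range(1, 8), "y": range(1, 7)}},
    {"where": lambda r: r["x"] % 2 == 0 and r["y"] % 2 != 0 and r["x"] > r["y"]},
    {"groupby": "x"},
    {"with": {"sum_y": lambda group: sum(r["y"] for r in group)}},
    {"where": lambda r: r["sum_y"] % 2 != 0},
    {"select": ["x", "sum_y"]},
])
print(result)  # [(2, 1), (6, 9)]
```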

On 3/25/2017 11:40 AM, Kyle Lahnakoski wrote:
PythonQL version
Someone mentioned the problem of adding multiple new keywords. Even one requires a proposal to meet a high bar; I think we have averaged less than one new keyword per release over the last 20 years. Searching '\bgroup\b' just in /Lib (the 3.6 stdlib on Windows) gets over 300 code hits in about 30 files; re's match.group() accounts for many. I think this makes it ineligible to become a keyword. 'select' has a fair number of code uses also. I also see 'where', 'let', and 'by' in the above.
-- Terry Jan Reedy

Terry,
Yes, we add quite a few keywords. If you look at the window clause we have, there are even more keywords there. This is definitely a huge concern and, in my view, the main reason the community would oppose the change. I’m not too experienced with the Python parser, but could we make all these keywords not be real keywords (interpreted as keywords only inside comprehensions, without breaking any other code)?

On 3/26/2017 7:14 AM, Pavel Velikhov wrote:
It might be possible (or not!) to make the clause-heading words like 'where' or 'groupby' (this would have to be one word) recognized as special only in the context of starting a new comprehension clause. The precedents for 'keyword in context' are 'as', 'async', and 'await'. But these were temporary and a nuisance (both to code and for syntax highlighting) and I would not be in favor of repeating for this case. For direct integration with Python, I think you should work on and promote a more generic approach as Nick has suggested. Or work on a 3rd party environment that is not constrained by core python. Or you could consider making use of IDLE; it would be trivial to run code extracted from a text widget through a preprocessor before submitting it to compile(). -- Terry Jan Reedy

I prefer this a lot to the original syntax, and I really think this has much better chances to be integrated (if such an integration had to be done, and not kept as a separate module). Also, maybe managing this with classes instead of syntax could also be done easily (without any change to Python), like this:

from pyql import PQL, Select, For, Where, GroupBy, Let

result = PQL(
    Select("x", "sum_y"),
    For("x", range(1,8)),
    For("y", range(1,7)),
    Where(lambda x, y: x % 2 == 0 and y % 2 != 0 and x > y, "x", "y"),  # function, *[arguments to pass to the function]
    Where("sum_y", lambda sum_y: sum_y % 2 != 0),
    GroupBy("x"),
    Let("sum_y", lambda y: sum(y), "y")
)

(to be defined more precisely; I don't really like relying on case to differentiate the "for" keyword and the "For" class, which by the way could be inherited from a more general "From" class, allowing you to get the data from a database, pure Python, a JSON/csv/xml file/object, or anything else.) With nice lazy evaluation, in the order of the arguments of the constructor, I suppose you could achieve everything you need, and have an easily extendable syntax (create new objects between versions of the module, without ever having to create new keywords). There is no new parsing, and you already have the autocompletion of your IDE if you annotate your code correctly. You could even have this:

query = PQL(
    Select("x", "sum_y"),
    Where(lambda x, y: x % 2 == 0 and y % 2 != 0 and x > y, "x", "y"),
    Where("sum_y", lambda sum_y: sum_y % 2 != 0),
    GroupBy("x"),
    Let("sum_y", lambda y: sum(y), "y")  # [name of the new var], function, *[arguments to pass to the function]
)
query.execute(x=range(1,8), y=range(1,7))

or

query.execute(PgDatabase(**dbsettings))

-Brice

On 25/03/17 at 16:40, Kyle Lahnakoski wrote:
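For what it's worth, a minimal hypothetical implementation of these classes fits in a few dozen lines. The conventions here are my assumptions: Where takes the function first and then the parameter names, and clause order in the constructor is the evaluation order (Select may appear anywhere):

```python
from itertools import product

# hypothetical minimal implementation of the class-based sketch above
class For:
    def __init__(self, name, iterable):
        self.name, self.iterable = name, iterable

class Where:
    def __init__(self, fn, *argnames):
        self.fn, self.argnames = fn, argnames

class GroupBy:
    def __init__(self, name):
        self.name = name

class Let:
    def __init__(self, name, fn, *argnames):
        self.name, self.fn, self.argnames = name, fn, argnames

class Select:
    def __init__(self, *names):
        self.names = names

def PQL(*clauses):
    select = next(c for c in clauses if isinstance(c, Select))
    fors = [c for c in clauses if isinstance(c, For)]
    # cross product of all For clauses, one dict per row
    rows = [dict(zip([f.name for f in fors], combo))
            for combo in product(*[f.iterable for f in fors])]
    for c in clauses:
        if isinstance(c, Where):
            rows = [r for r in rows if c.fn(*[r[a] for a in c.argnames])]
        elif isinstance(c, GroupBy):
            grouped = {}
            for r in rows:
                grouped.setdefault(r[c.name], []).append(r)
            rows = [{c.name: k, "_rows": g} for k, g in grouped.items()]
        elif isinstance(c, Let):
            # bind each argname to the list of its values within the group
            for r in rows:
                r[c.name] = c.fn(*[[m[a] for m in r["_rows"]]
                                   for a in c.argnames])
    return [tuple(r[n] for n in select.names) for r in rows]

result = PQL(
    Select("x", "sum_y"),
    For("x", range(1, 8)),
    For("y", range(1, 7)),
    Where(lambda x, y: x % 2 == 0 and y % 2 != 0 and x > y, "x", "y"),
    GroupBy("x"),
    Let("sum_y", lambda y: sum(y), "y"),
    Where(lambda sum_y: sum_y % 2 != 0, "sum_y"),
)
print(result)  # [(2, 1), (6, 9)]
```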

Hi Brice,
So here’s the deal: small queries will look pretty decent in pretty much all paradigms: ORM, PythonQL, or your proposal. Once they get bigger and combine multiple pain points (say outer joins, grouping and nested data), then unless you have a really clear and minimal language, folks will get confused and lost. We’ve gone through a few query languages that failed, including XQuery and others, and the main reason was the need to learn a whole new language and a bunch of libraries; nobody wanted to do it. So the main selling point behind PythonQL is: it’s Python that folks hopefully know already, with just a few extensions.

On 27/03/17 at 10:55, Pavel Velikhov wrote:
This syntax is only used in the PyQL sub-language; it's not really Python any more... Also, what I like with what I used is that it is object-based, which allows any part of the query to be reusable or built dynamically. We might also extend such a PQL object's constructor to embed automatically whatever default parameters or database connection we want, or shared behaviours, like:

class MyPQL(PQL):
    def get_limit(self):
        if self.limit is not None:
            return self.limit
        return 10

    def __init__(self, *args):
        args = list(args)
        args.append(Let("sum_y", lambda y: sum(y), "y"))
        args.append(GroupBy("x"))
        super().__init__(*args)

result = MyPQL(
    Select("x", "sum_y"),
    For("x", range(1,8)),
    For("y", range(1,7)),
    Where(lambda x, y: x % 2 == 0 and y % 2 != 0 and x > y, "x", "y"),
    Where("sum_y", lambda sum_y: sum_y % 2 != 0)
)

Big queries, this way, may be split into smaller parts. And it allows you to do the following in a single query, instead of having to write one big query for each condition:

where_from = [For("x", range(1,8)), For("y", range(1,7))]
where = [Where(lambda x, y: x % 2 == 0 and y % 2 != 0 and x > y, "x", "y")]
if filter_sum_y:
    where.append(Where("sum_y", lambda sum_y: sum_y % 2 != 0))
grouping = []
if group_by is not None:
    grouping = [GroupBy("x")]
result = MyPQL(Select("x", "sum_y"), *where_from, *where, *grouping)

Side note: I'm not a big database user, I mostly use ORMs (Django's and PonyORM depending on the projects) to access PgSQL and SQLite (for unit testing), so I might not even have use cases for what you're trying to solve. I just give my point of view here to explain what I think could be more easily integrated and (re)used. And as I'm a big fan of the DRY mentality, I'm not a fan of the syntax-chaining things (as well as I don't really like big nested comprehensions). -Brice

On 27 March 2017 at 10:54, Brice PARENT <contact@brice.xyz> wrote:
... which is why I suspect that this discussion would be better expressed as a suggestion that Python provide better support for domain specific languages like the one PythonQL offers. In that context, the "extended comprehension" format *would* be Python, specifically it would simply be a DSL embedded in Python using Python's standard features for doing that. Of course, that's just a re-framing of the perception, and the people who don't like sub-languages will be just as uncomfortable with DSLs. However, it does put this request into the context of DSL support, which is something that many languages provide, to a greater or lesser extent. For Python, Guido's traditionally been against allowing the language to be mutable in the way that DSLs permit, so in the first instance it's likely that the PythonQL proposal will face a lot of resistance. It's possible that PythonQL could provide a use case that shows the benefits of allowing DSLs to such an extent that Guido changes his mind, but that's not yet proven (and it's not really something that's been argued here yet). And it does change the discussion from being about who prefers which syntax, to being about where we want the language to go in terms of DSLs. Personally, I quite like limited DSL support (things like allowing no-parenthesis function calls can make it possible to write code that uses functions as if they were keywords). But it does impose a burden on people supporting the code because they have to understand the non-standard syntax. So I'm happy with Python's current choice to not go down that route, even though I do find it occasionally limiting. If I needed PythonQL features, I'd personally find Brice's class-based approach quite readable/acceptable. I find PythonQL form nice also, but not enough of an advantage to warrant all the extra keywords/syntax etc. Paul

Hi Pavel, This is a really impressive body of work. I had looked at this project in the past but it is great to get back up to speed and see all the progress made. I use Python + databases almost every day, and the major unanswered question is what benefit dedicated language syntax has over using a DBAL/ORM with a Builder-style API? It obviously has huge costs (as all syntax changes do) but the benefit is not obvious to me: I have never found myself wanting built-in syntax for writing database queries. My second thought is that every database layer I've ever used was unavoidably leaky or incomplete. Database functionality (even if we constrain "database" to mean RDBMS) is too diverse to be completely abstracted away. This is why so many different abstractions already exist, e.g. low-level like DBAPI and high-level like SQLAlchemy. You're not going to find much support for cementing an imperfect abstraction right into the Python grammar. In order to make the abstraction relatively complete, you'd need to almost completely merge the ANSI SQL grammar into the Python grammar, which sounds terrifying. Third thought: is the implementation of a "Python query language" as generic as the name implies? The docs mention support for document databases, but can I run Redis queries? LDAP queries? DNS queries?
Fourth thought: until PythonQL can abstract over a real database, it's far too early to consider putting it into the language itself. These kinds of "big change" projects typically need to stabilize on their own for a long time before anybody will even consider putting them into the core language. Finally – to end on a positive note – the coolest part of this project from my point of view is using SQL as an abstraction over in-memory objects or raw files. I can see how somebody that is comfortable with SQL would prefer this declarative approach. I could see myself using an API like this to search a Pandas dataframe, for example. Cheers, Mark On Fri, Mar 24, 2017 at 11:10 AM, Pavel Velikhov <pavel.velikhov@gmail.com> wrote:

Hi Mark,
We do a lot of work with data every day doing data science. We have to use tools like pandas, and they don’t work in a lot of cases; we often end up with very cryptic notebooks that only the authors can work with... I actually use PythonQL in daily work and it has simplified a lot of things greatly. The general insight is that a small language can add a lot more value than a huge library, because it’s easy to combine good ideas in a language. If you look at some examples of complex PythonQL, I think you might change your mind: https://github.com/pythonql/pythonql/wiki/Event-Log-Querying-and-Process-Min... <https://github.com/pythonql/pythonql/wiki/Event-Log-Querying-and-Process-Min...>
My second thought is that every database layer I've ever used was unavoidably leaky or incomplete. Database functionality (even if we constrain "database" to mean RDBMS) is too diverse to be completely abstracted away. This is why so many different abstractions already exist, e.g. low-level like DBAPI and high-level like SQLAlchemy. You're not going to find much support for cementing an imperfect abstraction right into the Python grammar. In order to make the abstraction relatively complete, you'd need to almost completely merge the ANSI SQL grammar into the Python grammar, which sounds terrifying.
I don’t see a problem here, except for a performance problem. I.e. you’ll be able to write queries of any complexity in PythonQL, and most of the work will be pushed into the underlying database. Stuff that can’t be pushed will be finished up at the Python layer. We don’t have to guarantee the other direction - i.e. if a DBMS has transitive closure, for instance, we don’t have to support it in PythonQL.
Third thought: is the implementation of a "Python query language" as generic as the name implies? The docs mention support for document databases, but can I run Redis queries? LDAP queries? DNS queries?
We definitely can support Redis. LDAP and DNS - don’t know if we want to go there, I would stop at databases for now.
We haven't built a real SQL database wrapper yet, but in the meanwhile you can use libraries like psycopg2 or SQLAlchemy to get data from the database into an iterator, and then PythonQL can run on top of such an iterator.
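For example, with the stdlib's sqlite3 (the table and query here are purely illustrative), a database cursor is already an iterator of row tuples that any comprehension-based query layer can consume directly:

```python
import sqlite3

# build a throwaway in-memory table for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("alice", 10), ("bob", 5), ("alice", 7)])

# a cursor is already an iterator of row tuples, so a comprehension
# (or an extended comprehension) can run directly on top of it
cursor = conn.execute("SELECT user, amount FROM events")
big_spenders = [(user, amount) for user, amount in cursor if amount > 6]
print(big_spenders)  # [('alice', 10), ('alice', 7)]
```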
Fourth thought: until PythonQL can abstract over a real database, it's far too early to consider putting it into the language itself. These kinds of "big change" projects typically need to stabilize on their own for a long time before anybody will even consider putting them into the core language.
We’re definitely at the start of this, because we have huge plans for PythonQL, including a powerful planner/optimizer and wrappers for most popular DBMSs. If we get the support of the Python community though it would help us to move faster for sure.
Finally – to end on a positive note – the coolest part of this project from my point of view is using SQL as an abstraction over in-memory objects or raw files. I can see how somebody that is comfortable with SQL would prefer this declarative approach. I could see myself using an API like this to search a Pandas dataframe, for example.
I think if we get this right, we might unlock some cool new usages. I really believe that if we simplify integration of multiple data sources sufficiently, a lot of dirty work of data scientists will become much simpler.

I think it's extraordinarily unlikely that a big change in Python syntax to support query syntax will ever happen. Moreover, I would oppose such a change myself. But such a change also really is not necessary. Pandas already abstracts all the things mentioned using only Python methods. It is true that Pandas sometimes does some black magic within those methods to get there; and it also uses a somewhat non-Pythonic style of long chains of method calls. But it does everything PythonQL does, as well as much, much more. Pandas builds in DataFrame readers for every data source you are likely to encounter, including leveraging all the abstractions provided by RDBMS drivers, etc. It does groupby, join, etc. See, e.g.: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html Now there's one reasonable objection to Pandas: it doesn't handle larger-than-memory datasets well. I don't see that PythonQL is better in that regard. But there is an easy next step for that larger data. Blaze provides generic interfaces to many, many larger-than-memory data sources. It is largely a subset of the Pandas API, although not precisely that. See, e.g.: http://blaze.readthedocs.io/en/latest/rosetta-sql.html Moreover, within the Blaze "orbit" is Dask. This is a framework for parallel computation, one of whose abstractions is a DataFrame based on Pandas. This gives you 90% of those methods for slicing-and-dicing data that Pandas does, but deals seamlessly with larger-than-memory datasets. See, e.g.: http://dask.pydata.org/en/latest/dataframe.html So I think your burden is even higher than showing the usefulness of PythonQL. You have to show why it's worth adding new syntax to do somewhat LESS than is available in very widely used 3rd party tools that avoid new syntax. On Fri, Mar 24, 2017 at 8:10 AM, Pavel Velikhov <pavel.velikhov@gmail.com> wrote:

Hi David
I work daily with pandas, and while it does have the functionality that PythonQL introduces, it’s a completely different beast. One of the reasons I started with PythonQL is because pandas is so difficult to master (just like any function-based database API would be). The key benefit of PythonQL is that with minimal grammar extensions you get the power of a real query language. So a Python programmer who knows comprehensions well and has a good idea of SQL or other query languages can start writing complex queries right away. With pandas you need to read the docs all the time, and complex data transformations become incredibly cryptic.

Hello, I've been following PythonQL with interest. I like the clever hack using Python encodings. It's definitely not something I would recommend for inclusion in Python, as it hijacks Python's encoding mechanism, which prevents you from... well, choosing an encoding. And it requires having a file. However, I find the idea great for demonstration purposes. Like LINQ, the strength of your tool is the integrated syntax. I myself found it annoying to import itertools all the time. I eventually wrote a wrapper so I could use slicing on generators, callables in slicing, etc. for this very reason. However, I have good news. When the debate about f-strings was on this list, the concept was split into several parts: the f-string currently implemented in Python 3.6, and a more advanced type of string interpolation, the i-string from PEP 501 (https://www.python.org/dev/peps/pep-0501/), which is still to be implemented. The idea of the i-string was to allow something like this:

mycommand = sh(i"cat (unknown)")
myquery = sql(i"SELECT {column} FROM {table};")
myresponse = html(i"<html><body>{response.body}</body></html>")

which would then pass an object to sql/sh/html() with the string, the placeholders and the variable context, then allow it to do whatever you want. Evaluation of the i-string would of course be lazy. So while I don't think PythonQL can be integrated in Python the way it is, you may want to champion PEP 501. This way you would be able to provide a PQL hook allowing you to do something like:

pql(i"""select (x, sum_y)
        for x in range(1,8), y in {stuff}
        where x % 2 == 0 and y % 2 != 0 and x > y
        group by x
        let sum_y = sum(y)
        where sum_y % 2 != 0""")

Granted, this is not as elegant as your DSL, but it would make the tool easier to adopt anywhere: REPL, IPython notebook, files in Python with a different encoding, embedded Python, alternative Python implementations, compiled Python, etc.
Plus the sooner we start with i-string, the sooner editors will implement syntax highlighting for the popular dialects. This would allow you to spread the popularity of your tool and maybe change the way it's seen on this list.
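In the meantime, the hook pattern can be approximated by passing the template and its values explicitly; the `pql` below is a hypothetical stand-in that only does placeholder substitution with string.Template, rather than receiving a lazy i-string object:

```python
import string

def pql(template, **context):
    # stand-in for a PEP 501-style hook: it receives the raw template
    # and the values for its placeholders, instead of a rendered string,
    # so it could in principle parse, rewrite, or ship the query elsewhere
    return string.Template(template).substitute(context)

query = pql("SELECT $column FROM $table", column="name", table="users")
print(query)  # SELECT name FROM users
```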
participants (11)
- Brice PARENT
- Chris Angelico
- David Mertz
- Greg Ewing
- Kyle Lahnakoski
- Mark E. Haase
- Michel Desmoulin
- Paul Moore
- Pavel Velikhov
- Pavol Lisy
- Terry Reedy