Decoupling type stubs for the public API from the pandas distribution
I discovered this feature of typing: https://www.python.org/dev/peps/pep-0561/#stub-only-packages
The idea is that for a package like pandas, we can have a separate package "pandas-stubs" that would contain the type stubs for pandas. We wouldn't have to worry about including a `py.typed` file or `.pyi` files in our standard pandas distribution - all typing for the public API would be in the separate package. That would allow pandas typing for the public API to be maintained separately (different GitHub repo). We could start by just copying over what Microsoft created at https://github.com/microsoft/python-type-stubs/tree/main/pandas and then we maintain it as a separate repo, which could be installed via pip and conda.
Any thoughts on whether we should consider doing this?
-Irv
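For readers unfamiliar with PEP 561, here is a minimal sketch of what a stub-only package tree might look like on disk. The `pandas-stubs` directory name follows the `<package>-stubs` naming scheme that PEP 561 requires; the stub contents and the helper name `write_stub_only_package` are hypothetical and are not taken from the Microsoft stubs or from pandas.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

# Hypothetical, heavily simplified stub text for illustration only;
# it is not the real pandas API surface.
STUB_SOURCE = '''\
class DataFrame:
    def head(self, n: int = ...) -> DataFrame: ...
'''


def write_stub_only_package(root: Path) -> list[str]:
    """Create a minimal stub-only package tree under root and list its files."""
    # PEP 561: a stub-only package for "pandas" is named "pandas-stubs".
    package_dir = root / "pandas-stubs"
    package_dir.mkdir(parents=True)
    # Stubs are .pyi files; no py.typed marker is needed in a stub-only package.
    (package_dir / "__init__.pyi").write_text(STUB_SOURCE)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    layout = write_stub_only_package(Path(tmp))

print(layout)
```

A type checker that supports PEP 561 would pick up such a package once it is installed next to pandas; the pandas distribution itself would not need to change.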
Hi Irv,
I am not very familiar with the typing space, so some questions below.
Can you explain a bit more what the consequence would be for the type annotations in pandas itself? I suppose we wouldn't remove those? (We also have type annotations for non-public APIs.) Or how would those be kept in sync?
Another question: what is the main advantage of doing so? I suppose this doesn't necessarily make it easier for the user, but is the goal to make the type stubs more maintainable? Would the type-stubs package be for a specific pandas version (and get somewhat synced releases)?
Joris
On Tue, 23 Nov 2021 at 17:22, Irv Lustig <irv@princeton.com> wrote:
> I discovered this feature of typing: https://www.python.org/dev/peps/pep-0561/#stub-only-packages
> [...]
_______________________________________________ Pandas-dev mailing list Pandas-dev@python.org https://mail.python.org/mailman/listinfo/pandas-dev
On Tue, Dec 7, 2021 at 11:30 AM Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
> Can you explain a bit more what the consequence would be for the type annotations in pandas itself?
We would keep the type annotations in pandas for maintaining the pandas code (i.e., type checking the code that is written by pandas developers), but we would not have to worry about typing the public API in conjunction with maintaining the internal typing. The two could evolve separately, if needed.
> I suppose we wouldn't remove those? (We also have type annotations for non-public APIs.) Or how would those be kept in sync?
That's not entirely clear to me, but I would say that whenever the public API changes, the pandas-stubs project would get updated.
> Another question: what is the main advantage of doing so? I suppose this doesn't necessarily make it easier for the user, but is the goal to make the type stubs more maintainable?
To me, the advantages are:
1. Maintainability - we only have to publish stubs for the public API and not for any internal routines, and in some sense the published stubs are a check on that API.
2. Tests - we can develop a set of tests for the type stubs, independent of all the other tests we do.
3. Reconciling issues - with a separate project, any issues with the type stubs for the public API would live in a different GitHub project, which people who consume the API could contribute to without having to deal with the full pandas code base, set up a dev environment, etc.
4. Faster release schedule - because the type stubs code base would be small, it could be released on a more regular basis as issues/PRs are reconciled, rather than waiting for a full pandas release.
Regarding my comments (3) and (4) - I have been regularly contributing PRs to the Microsoft stubs that are included with Visual Studio Code https://github.com/microsoft/python-type-stubs/tree/main/pandas when I find issues with code, written by me or by members of my team, that doesn't pass the basic pyright type checks in VS Code. Being able to do so without waiting for a full pandas release is very helpful! Since Pylance in VS Code gets updated every week or two, any changes to the type stubs that were approved by the maintainers end up getting released pretty quickly (and automatically updated).
> Would the type-stubs package be for a specific pandas version (and get somewhat synced releases)?
I think we would sync it with minor releases, but not patch releases, since the public API shouldn't change in a patch release.
I'd like to discuss this in the pandas dev meeting. Marco also pointed me to another set of stubs at https://github.com/VirtusLab/pandas-stubs . That project has a nice blog post about how they created their stubs: https://medium.com/virtuslab/pandas-stubs-how-we-enhanced-pandas-with-type-a...
There is also https://github.com/predictive-analytics-lab/data-science-types/tree/master/p...
-Irv
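The release policy Irv describes (stubs track pandas minor releases, not patch releases) could be checked mechanically. The sketch below assumes plain `major.minor.patch` version strings for both packages. The helper names, and the idea that the stubs would reuse the pandas major and minor numbers, are assumptions for illustration and not a scheme the thread settled on.

```python
from __future__ import annotations


def major_minor(version: str) -> tuple[int, int]:
    """Return the (major, minor) part of a 'major.minor.patch' version string."""
    major, minor, *_rest = version.split(".")
    return int(major), int(minor)


def stubs_match_pandas(pandas_version: str, stubs_version: str) -> bool:
    """True when both versions share major and minor; patch numbers are ignored."""
    return major_minor(pandas_version) == major_minor(stubs_version)


# Patch releases should not change the public API, so these still match.
print(stubs_match_pandas("1.3.4", "1.3.0"))
# A new minor release may change the public API, so the stubs need an update.
print(stubs_match_pandas("1.4.0", "1.3.0"))
```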
Hi all,
Just my two cents.
On 12/7/21 17:59, Irv Lustig wrote:
> > Can you explain a bit more what the consequence would be for the type annotations in pandas itself?
> We would keep the type annotations in pandas for maintaining the pandas code (i.e., type checking the code that is written by pandas developers), but we would not have to worry about typing the public API in conjunction with maintaining the internal typing. The two could evolve separately, if needed.
Meaningful type checking of internals typically requires a well-annotated public API: every method that is not properly annotated, especially on the return side, introduces quickly escalating gaps in coverage. Unfortunately, there is no easy and robust way to combine multiple sources of annotations, so it is likely you'll still have to keep the "public" API annotated alongside the "internal" parts. That introduces another problem: if the annotations for the public API diverge from the external stubs, it is likely to be a source of confusion for end users.
> > I suppose we wouldn't remove those? (We also have type annotations for non-public APIs.) Or how would those be kept in sync?
> That's not entirely clear to me, but I would say that whenever the public API changes, the pandas-stubs project would get updated.
Keeping things in sync in the long run is quite hard (speaking as a long-term maintainer of PySpark stubs). Some parts can be automated (e.g., checking for changes in automatically extracted signatures), but it is easy to miss changes that are not immediately visible in the signatures (e.g., subtle changes in the types of accepted arguments and the return type). Furthermore (this observation is based mostly on some proprietary work), the relationship between annotated code and annotations is not unidirectional: how we annotate (and the same code can be annotated in different, but still valid, ways) affects how you design your APIs. It is also not hard to create functions with signatures that are impossible to annotate.
> > Another question: what is the main advantage of doing so? I suppose this doesn't necessarily make it easier for the user, but is the goal to make the type stubs more maintainable?
> To me, the advantages are:
> 1. Maintainability - we only have to publish stubs for the public API and not for any internal routines, and in some sense the published stubs are a check on that API.
> 2. Tests - we can develop a set of tests for the type stubs, independent of all the other tests we do.
> 3. Reconciling issues - with a separate project, any issues with the type stubs for the public API would live in a different GitHub project, which people who consume the API could contribute to without having to deal with the full pandas code base, set up a dev environment, etc.
> 4. Faster release schedule - because the type stubs code base would be small, it could be released on a more regular basis as issues/PRs are reconciled, rather than waiting for a full pandas release.
These are really good points, especially when the annotation effort is new. However, once annotations mature, there is really not much added value here. What's worse, if the upstream API is evolving, you are likely to face a versioning problem: which version of the stubs matches which version of the upstream package. In the worst case, that might require parallel versioning with version branches.
--
Best regards,
Maciej Szymkiewicz
Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC
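The automatable part Maciej mentions, checking for changes in automatically extracted signatures, can be sketched with the standard library `inspect` module. `ApiV1` and `ApiV2` are hypothetical stand-ins for two releases of a library, and the helper names are made up for this example. As his message points out, a check like this only sees changes that show up in the signature; a change in the accepted argument types or in the return type would go unnoticed.

```python
from __future__ import annotations

import inspect


def signature_snapshot(obj: object) -> dict[str, str]:
    """Map each public callable attribute of obj to its signature string."""
    snapshot = {}
    for name, member in inspect.getmembers(obj, callable):
        if name.startswith("_"):
            continue  # skip private and dunder members
        try:
            snapshot[name] = str(inspect.signature(member))
        except (TypeError, ValueError):
            continue  # some callables expose no introspectable signature
    return snapshot


def diff_snapshots(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Report names that were added, removed, or whose signature text changed."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


class ApiV1:  # hypothetical "old release"
    def head(self, n=5): ...
    def tail(self, n=5): ...


class ApiV2:  # hypothetical "new release"
    def head(self, n=5, *, copy=False): ...
    def sample(self, frac): ...


report = diff_snapshots(signature_snapshot(ApiV1), signature_snapshot(ApiV2))
print(report)
```

A stubs project could store such a snapshot per upstream release and flag any non-empty report for manual review.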
participants (3)
- Irv Lustig
- Joris Van den Bossche
- Maciej