Tool to search in the source code of PyPI top 5000 projects
Hi,

I wrote two scripts based on INADA-san's work to (1) download the source code of the PyPI top 5000 projects and (2) search for a regex in these projects (compressed source archives).

You can use these tools if you work on an incompatible Python or C API change to estimate how many projects are impacted.

The HPy project created a Git repository for a similar need (latest update in June 2021):
https://github.com/hpyproject/top4000-pypi-packages

There are also online services for code search:

* GitHub: https://github.com/search
* https://grep.app/ (I haven't tried it yet)
* Debian: https://codesearch.debian.net/

(1) Download

Script:
https://github.com/vstinner/misc/blob/main/cpython/download_pypi_top.py

Usage: download_pypi_top.py PATH

It uses this JSON file:
https://hugovk.github.io/top-pypi-packages/top-pypi-packages-30-days.min.jso...

From this service:
https://hugovk.github.io/top-pypi-packages/

As of December 1, out of 5000 projects, it only downloads 4760 tarball and ZIP archives: I guess that 240 projects don't provide a source archive. It takes around 5.2 GB of disk space.

(2) Code search

First, I used the fast and nice "ripgrep" tool with the command "rg -zl REGEX path/*.{zip,gz,bz2,tgz}" (-z searches in ZIP and tarball archives). But it doesn't show the path inside the archive, and it searches in files generated by Cython, whereas I wanted to ignore these files.

So I wrote a short Python script which decompresses tarball and ZIP archives in memory and looks for a regex:
https://github.com/vstinner/misc/blob/main/cpython/search_pypi_top.py

Usage: search_pypi_top.py "REGEX" output_filename

The command line parsing and pypi_dir = "PYPI-2021-12-01-TOP-5000" are hardcoded :-D

It ignores files generated by Cython and .so binary files (Linux dynamic libraries).

While "rg" is very fast, my script is very slow. But I don't care: once the regex is written, I only need to run the search once, and I can wait 10-15 min ;-) I prefer to wait longer and get a more accurate result. Also, there is room for improvement, like running multiple jobs in different processes or threads.

Victor
-- Night gathers, and now my watch begins. It shall not end until my death.
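For readers who want a feel for the search step described in the message above, here is a minimal sketch of scanning one archive in memory. It is not the code of search_pypi_top.py: the names `iter_archive_members` and `search_archive` are hypothetical, and recognising Cython output by the "Generated by Cython" comment near the top of the file is an assumption about how such files can be skipped.

```python
import re
import tarfile
import zipfile

# Assumption: Cython-generated C files carry this marker in their first line.
CYTHON_MARKER = b"Generated by Cython"


def iter_archive_members(archive_path):
    """Yield (member_name, content_bytes) for each regular file, read in memory."""
    if tarfile.is_tarfile(archive_path):
        # tarfile.open() detects gzip/bz2/xz compression automatically.
        with tarfile.open(archive_path) as tf:
            for member in tf:
                if member.isfile():
                    yield member.name, tf.extractfile(member).read()
    elif zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    yield info.filename, zf.read(info)


def search_archive(archive_path, pattern):
    """Yield (member_name, line_number, line) for lines matching a bytes regex."""
    regex = re.compile(pattern)
    for name, data in iter_archive_members(archive_path):
        if name.endswith(".so"):
            continue  # binary shared library
        if CYTHON_MARKER in data[:200]:
            continue  # file generated by Cython
        for lineno, line in enumerate(data.splitlines(), start=1):
            if regex.search(line):
                yield name, lineno, line.decode("utf-8", "replace")
```

Unlike "rg -zl", each hit carries the path inside the archive. Since archives are independent of each other, a call like `search_archive(path, rb"REGEX")` per archive could be spread over a `concurrent.futures.ProcessPoolExecutor`, which is one way to approach the "multiple jobs" idea mentioned above.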
On Fri, 2021-12-03 at 00:44 +0100, Victor Stinner wrote:
I wrote two scripts based on INADA-san's work to (1) download the source code of the PyPI top 5000 projects and (2) search for a regex in these projects (compressed source archives).
You can use these tools if you work on an incompatible Python or C API change to estimate how many projects are impacted.
Am I correct that this script downloads only the newest version of each package? It might be worth adding a disclaimer: since many Python packages pin their dependencies to old versions, you are quite likely to miss the impact on projects that use the deprecated API in old versions which are still in use because of their reverse dependencies.

-- Best regards, Michał Górny
Hi,

You're correct that the download_pypi_top.py script only downloads the latest version.

I'm looking for projects impacted by incompatible changes. If the latest version is fine, a project just has to update its dependencies. If the latest version has an issue, it's very likely that old versions are also affected.

Victor
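The thread does not show how download_pypi_top.py selects the latest version, so the following is only a sketch of one plausible approach. It assumes PyPI's JSON API at https://pypi.org/pypi/PROJECT/json, whose "urls" field lists the files of the newest release; the function names `pick_sdist` and `download_latest_sdist` are hypothetical.

```python
import json
import os
import urllib.request

# Assumption: PyPI JSON API endpoint; "urls" describes the latest release only.
PYPI_JSON_URL = "https://pypi.org/pypi/{project}/json"


def pick_sdist(metadata):
    """Return (filename, url) of the latest release's source archive, or None."""
    for file_info in metadata.get("urls", []):
        if file_info.get("packagetype") == "sdist":
            return file_info["filename"], file_info["url"]
    return None  # e.g. a project that only publishes wheels


def download_latest_sdist(project, dest_dir):
    """Download the newest source archive of a project; return its path or None."""
    with urllib.request.urlopen(PYPI_JSON_URL.format(project=project)) as resp:
        metadata = json.load(resp)
    sdist = pick_sdist(metadata)
    if sdist is None:
        return None
    filename, url = sdist
    path = os.path.join(dest_dir, filename)
    with urllib.request.urlopen(url) as resp, open(path, "wb") as out:
        out.write(resp.read())
    return path
```

With this approach, older releases are never fetched, which is the limitation raised above. A `None` result would correspond to a project without a source archive.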
FTR, I don't consider the top projects on PyPI to be representative of our user base, and *especially* not representative of people compiling native modules.

This is not a good way to evaluate the impact of breaking changes.

It would be far safer to assume that every change is going to break someone and evaluate:

* how will they find out that upgrading Python will cause them to break
* how will they find out where that break occurs
* how will they find out how to fix it
* how will they manage that fix across multiple releases
* how will we explain that upgrading and fixing breaks is better for *them* than staying on the older version

This last one is particularly important, as many large organisations (anecdotally) seem to have settled on Python 3.7 for a while now. Inevitably, this means they're all going to be faced with a painful time when it comes to an upgrade, and every little change we add on is going to hurt more. Every extra thing that needs fixing is motivation to just rewrite in a new language with more hype (and the promise of better compatibility... which I won't comment specifically on, but I suspect they won't manage it any better than us ;) ).

This is not the case for the top PyPI projects. They incrementally update and crowdsource fixes, often from us. The pain is distributed to the level of permanent background noise, which sucks in its own way, but is ultimately not representative of much of our user base.

So by all means, use this tool for checking stuff. But it's not a substitute for justifying every incompatible change in its own right.

/rant

Cheers,
Steve
Hi Steve,

I completely agree with all you said ;-) I will not debate here whether incompatible changes are worth it or not: this topic was discussed recently in another thread.

On Fri, Dec 3, 2021 at 2:56 PM Steve Dower <steve.dower@python.org> wrote:
FTR, I don't consider the top projects on PyPI to be representative of our user base, and *especially* not representative of people compiling native modules.
This is not a good way to evaluate the impact of breaking changes.
I do not pretend that a code search on the PyPI top 5000 projects is the only way, or an exhaustive way, to measure the impact of incompatible changes. I'm only trying to advertise that *there is one practical tool* which is better than nothing.

In recent years, I saw many incompatible changes introduced in Python without:

* estimating how many projects are impacted: "release Python and pray", in the hope that only a minority is impacted
* documenting the change at all, or with a note that only says it's now broken: practical instructions explaining how to port code and how to keep support for old Python versions were rare.

I have seen a clear improvement recently: better documentation, core devs proactively fixing impacted projects, better communication to announce incompatible changes in advance, and practical instructions to port code without losing support for old Python versions.
It would be far safer to assume that every change is going to break someone and evaluate:

* how will they find out that upgrading Python will cause them to break
* how will they find out where that break occurs
* how will they find out how to fix it
* how will they manage that fix across multiple releases
* how will we explain that upgrading and fixing breaks is better for *them* than staying on the older version
In PEP 674, I wrote an explicit section "Port C extensions to Python 3.11":
https://www.python.org/dev/peps/pep-0674/#port-c-extensions-to-python-3-11

It doesn't cover all your questions, but it tries to answer most of them. I'm open to suggestions to improve this section ;-)

IMO it's good practice for a PEP introducing incompatible changes to explain how to port existing code, and this practice should become more common ;-)
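As an illustration of the kind of regex a code search could use for PEP 674, which concerns macros used as l-values (for example `Py_TYPE(obj) = type`, where `Py_SET_TYPE()` is the replacement), here is a rough sketch. The pattern and the helper name `uses_macro_as_lvalue` are assumptions made for this example, not the regex used in the PEP, and the pattern only covers Py_TYPE and Py_SIZE on a single line.

```python
import re

# Heuristic, line-based pattern: "Py_TYPE(...) =" or "Py_SIZE(...) =",
# but not a comparison with "==". Bytes pattern, since archive members
# are read as bytes.
LVALUE_MACRO = re.compile(rb"\bPy_(?:TYPE|SIZE)\s*\(.*\)\s*=(?!=)")


def uses_macro_as_lvalue(line):
    """Return True if a line of C code seems to assign to Py_TYPE() or Py_SIZE()."""
    return LVALUE_MACRO.search(line) is not None


# Lines that would need porting:
print(uses_macro_as_lvalue(b"Py_TYPE(obj) = &MyType;"))
print(uses_macro_as_lvalue(b"Py_SIZE(self) = n;"))
# Lines that are fine:
print(uses_macro_as_lvalue(b"if (Py_TYPE(a) == Py_TYPE(b)) {"))
print(uses_macro_as_lvalue(b"Py_SET_TYPE(obj, &MyType);"))
```

A regex like this trades precision for simplicity: it can miss assignments split across lines, so the hit count is an estimate rather than an exact measure.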
* how will we explain that upgrading and fixing breaks is better for *them* than staying on the older version
This part is always the hardest :-( Staying on an old Python version is usually cheaper: no further development is needed. There are still companies using Python 2 nowadays. Don't underestimate the technical debt and the cost of upgrading ;-)

For PEP 674, the promise is that updated C extensions should work better with HPy and GraalPython. Not sure if it's enough to motivate developers to port their code.

IMO one important thing is the cost of upgrading a C extension. For PEP 674, all you need to do is to run a single command once!

=> ./upgrade_pythoncapi.py path/to/project/

Done!

It reminds me of the Python 2 to Python 3 migration before 2to3 and six were usable and popular. The migration was super painful, so nobody wanted to do it, because everybody still wanted to keep support for Python 2. Only adding Python 3 support didn't bring any benefit in the short term (Python 3 only features couldn't be used). People didn't migrate because migrating code was dangerous, painful and complicated.

I'm now in favor of limiting the number of incompatible changes per Python release and never again doing a Python 4 "break the world" release. I prefer to have a small batch of incompatible changes in each Python release :-)
This last one is particularly important, as many large organisations (anecdotally) seem to have settled on Python 3.7 for a while now. Inevitably, this means they're all going to be faced with a painful time when it comes to an upgrade, and every little change we add on is going to hurt more. Every extra thing that needs fixing is motivation to just rewrite in a new language with more hype (and the promise of better compatibility... which I won't comment specifically on, but I suspect they won't manage it any better than us ;) ).
IMO we need to invest more time in developing tools to ease the migration to newer Python versions, like:

* Python: https://github.com/asottile/pyupgrade
* C code: https://github.com/pythoncapi/pythoncapi_compat

Victor
It's really great to see data being gathered on the impact of changes. As we've already seen in this thread, there are many suggestions for how to gather more data and thoughts on how the methodology might be enhanced -- and these suggestions are great -- but just having a means to gather some important data is an excellent step. Anecdotes and people's mental models of the Python ecosystem are certainly valuable, but by themselves they don't provide a way to improve our joint view of the consequences of particular changes. As with unit tests and static analysis we should not expect such data gathering to provide complete proof that a change is okay to make, but having *some* quantitative data and the idea that we should pay attention to this data are definitely a big step forward. - Simon
participants (4): Michał Górny, Simon Cross, Steve Dower, Victor Stinner