FTR, I don't consider the top projects on PyPI to be representative of our user base, and *especially* not representative of people compiling native modules. This is not a good way to evaluate the impact of breaking changes.

It would be far safer to assume that every change is going to break someone and evaluate:

* how will they find out that upgrading Python will cause them to break
* how will they find out where that break occurs
* how will they find out how to fix it
* how will they manage that fix across multiple releases
* how will we explain that upgrading and fixing breaks is better for *them* than staying on the older version

This last one is particularly important, as many large organisations (anecdotally) seem to have settled on Python 3.7 for a while now. Inevitably, this means they're all going to be faced with a painful time when it comes to an upgrade, and every little change we add on is going to hurt more. Every extra thing that needs fixing is motivation to just rewrite in a new language with more hype (and the promise of better compatibility... which I won't comment specifically on, but I suspect they won't manage it any better than us ;) ).

This is not the case for the top PyPI projects. They incrementally update and crowdsource fixes, often from us. The pain is distributed to the level of permanent background noise, which sucks in its own way, but is ultimately not representative of much of our user base.

So by all means, use this tool for checking stuff. But it's not a substitute for justifying every incompatible change in its own right.

/rant

Cheers,
Steve

On 12/2/2021 11:44 PM, Victor Stinner wrote:
Hi,
I wrote two scripts based on INADA-san's work to (1) download the source code of the PyPI top 5000 projects and (2) search for a regex in these projects (compressed source archives).
You can use these tools if you work on an incompatible Python or C API change to estimate how many projects are impacted.
The HPy project created a Git repository for a similar need (latest update in June 2021): https://github.com/hpyproject/top4000-pypi-packages
There are also online services for code search:
* GitHub: https://github.com/search
* https://grep.app/ (I didn't try it yet)
* Debian: https://codesearch.debian.net/
(1) Download
Script: https://github.com/vstinner/misc/blob/main/cpython/download_pypi_top.py
Usage: download_pypi_top.py PATH
It uses this JSON file: https://hugovk.github.io/top-pypi-packages/top-pypi-packages-30-days.min.jso...
From this service: https://hugovk.github.io/top-pypi-packages/
As of December 1, out of the 5000 projects, it only downloads 4760 tarball and ZIP archives: I guess that the remaining 240 projects don't provide a source archive. The download takes around 5.2 GB of disk space.
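Roughly, the download step works like the sketch below (simplified and illustrative, not the real script; it assumes the JSON layout of the top-pypi-packages service, a "rows" list with "project" entries, and uses the PyPI JSON API to find each project's sdist):

    # Minimal sketch: download sdists of top PyPI projects (illustrative only).
    # Assumes the JSON layout of hugovk's service: {"rows": [{"project": "..."}, ...]}.
    import json
    import os
    import sys
    import urllib.request

    TOP_JSON = "https://hugovk.github.io/top-pypi-packages/top-pypi-packages-30-days.min.json"

    def sdist_url(project):
        # The PyPI JSON API lists release files; keep only the source distribution.
        with urllib.request.urlopen(f"https://pypi.org/pypi/{project}/json") as resp:
            data = json.load(resp)
        for file_info in data["urls"]:
            if file_info["packagetype"] == "sdist":
                return file_info["url"], file_info["filename"]
        return None, None  # no source archive published

    def main():
        dest_dir = sys.argv[1]
        os.makedirs(dest_dir, exist_ok=True)
        with urllib.request.urlopen(TOP_JSON) as resp:
            projects = [row["project"] for row in json.load(resp)["rows"]]
        for project in projects:
            url, filename = sdist_url(project)
            if url is None:
                print(f"skip {project}: no sdist")
                continue
            path = os.path.join(dest_dir, filename)
            if not os.path.exists(path):
                urllib.request.urlretrieve(url, path)
                print(f"downloaded {filename}")

    if __name__ == "__main__":
        main()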
(2) Code search
First, I used the fast and nice "ripgrep" tool with the command "rg -zl REGEX path/*.{zip,gz,bz2,tgz}" (-z searches inside ZIP and tarball archives). But it doesn't show the path inside the archive, and it searches files generated by Cython, which I wanted to ignore.
So I wrote a short Python script which decompresses tarball and ZIP archives in memory and looks for a regex: https://github.com/vstinner/misc/blob/main/cpython/search_pypi_top.py
Usage: search_pypi_top.py "REGEX" output_filename
The command line parsing is minimal and the directory pypi_dir = "PYPI-2021-12-01-TOP-5000" is hardcoded :-D
It ignores files generated by Cython and .so binary files (Linux dynamic libraries).
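The core of the approach looks roughly like the sketch below (a simplified, self-contained version, not the actual search_pypi_top.py; the skip rules, in particular the Cython marker, are illustrative):

    # Rough sketch of the in-memory archive search (illustrative; the real script
    # is search_pypi_top.py linked above). Skip rules and error handling are simplified.
    import re
    import sys
    import tarfile
    import zipfile

    # Assumption: Cython-generated C files start with this header comment.
    CYTHON_MARKER = b"/* Generated by Cython"

    def iter_archive(path):
        # Yield (member_name, file_bytes) for every regular file in a tarball or ZIP.
        if path.endswith(".zip"):
            with zipfile.ZipFile(path) as zf:
                for name in zf.namelist():
                    if not name.endswith("/"):
                        yield name, zf.read(name)
        else:
            with tarfile.open(path) as tf:
                for member in tf.getmembers():
                    if member.isfile():
                        fp = tf.extractfile(member)
                        if fp is not None:
                            yield member.name, fp.read()

    def search(regex, archives):
        pattern = re.compile(regex.encode())
        for archive in archives:
            for name, data in iter_archive(archive):
                if name.endswith(".so") or CYTHON_MARKER in data[:2048]:
                    continue  # ignore binaries and Cython-generated C files
                if pattern.search(data):
                    print(f"{archive}: {name}")

    if __name__ == "__main__":
        search(sys.argv[1], sys.argv[2:])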
While "rg" is very fast, my script is very slow. But I don't care, once the regex is written, I only need to search for the regex once, I can wait 10-15 min ;-) I prefer to wait longer and have a more accurate result. Also, there is room for enhancement, like running multiple jobs in different processes or threads.
Victor