FTR, I don't consider the top projects on PyPI to be representative of our user base, and *especially* not representative of people compiling native modules. This is not a good way to evaluate the impact of breaking changes.

It would be far safer to assume that every change is going to break someone and evaluate:

* how will they find out that upgrading Python will cause them to break
* how will they find out where that break occurs
* how will they find out how to fix it
* how will they manage that fix across multiple releases
* how will we explain that upgrading and fixing breaks is better for *them* than staying on the older version

This last one is particularly important, as many large organisations (anecdotally) seem to have settled on Python 3.7 for a while now. Inevitably, this means they're all going to be faced with a painful time when it comes to an upgrade, and every little change we add on is going to hurt more. Every extra thing that needs fixing is motivation to just rewrite in a new language with more hype (and the promise of better compatibility... which I won't comment specifically on, but I suspect they won't manage it any better than us ;) ).

This is not the case for the top PyPI projects. They incrementally update and crowdsource fixes, often from us. The pain is distributed to the level of permanent background noise, which sucks in its own way, but is ultimately not representative of much of our user base.

So by all means, use this tool for checking stuff. But it's not a substitute for justifying every incompatible change in its own right.

/rant

Cheers,
Steve

On 12/2/2021 11:44 PM, Victor Stinner wrote:
Hi,
I wrote two scripts based on INADA-san's work to (1) download the source code of the PyPI top 5000 projects and (2) search for a regex in these projects (compressed source archives).
You can use these tools if you work on an incompatible Python or C API change to estimate how many projects are impacted.
The HPy project created a Git repository for a similar need (latest update in June 2021): https://github.com/hpyproject/top4000-pypi-packages
There are also online services for code search:
* GitHub: https://github.com/search
* https://grep.app/ (I didn't try it yet)
* Debian: https://codesearch.debian.net/
(1) Download
Script: https://github.com/vstinner/misc/blob/main/cpython/download_pypi_top.py
Usage: download_pypi_top.py PATH
It uses this JSON file: https://hugovk.github.io/top-pypi-packages/top-pypi-packages-30-days.min.jso...
From this service: https://hugovk.github.io/top-pypi-packages/
As of December 1, out of the 5000 projects, it only downloads 4760 tarball and ZIP archives: I guess that the remaining 240 projects don't provide a source archive. The download takes around 5.2 GB of disk space.
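Roughly, the download step works like the sketch below (simplified and illustrative, not the real script; it assumes the JSON layout of the top-pypi-packages service, a "rows" list with "project" entries, and uses the PyPI JSON API to find each project's sdist):

    # Minimal sketch: download sdists of top PyPI projects (illustrative only).
    # Assumes the JSON layout of hugovk's service: {"rows": [{"project": "..."}, ...]}.
    import json
    import os
    import sys
    import urllib.request

    TOP_JSON = "https://hugovk.github.io/top-pypi-packages/top-pypi-packages-30-days.min.json"

    def sdist_url(project):
        # The PyPI JSON API lists release files; keep only the source distribution.
        with urllib.request.urlopen(f"https://pypi.org/pypi/{project}/json") as resp:
            data = json.load(resp)
        for file_info in data["urls"]:
            if file_info["packagetype"] == "sdist":
                return file_info["url"], file_info["filename"]
        return None, None  # no source archive published

    def main():
        dest_dir = sys.argv[1]
        os.makedirs(dest_dir, exist_ok=True)
        with urllib.request.urlopen(TOP_JSON) as resp:
            projects = [row["project"] for row in json.load(resp)["rows"]]
        for project in projects:
            url, filename = sdist_url(project)
            if url is None:
                print(f"skip {project}: no sdist")
                continue
            path = os.path.join(dest_dir, filename)
            if not os.path.exists(path):
                urllib.request.urlretrieve(url, path)
                print(f"downloaded {filename}")

    if __name__ == "__main__":
        main()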
(2) Code search
First, I used the fast and nice "ripgrep" tool with the command "rg -zl REGEX path/*.{zip,gz,bz2,tgz}" (-z searches inside ZIP and tarball archives). But it doesn't show the path inside the archive, and it searches files generated by Cython, which I wanted to ignore.
So I wrote a short Python script which decompresses tarball and ZIP archives in memory and looks for a regex: https://github.com/vstinner/misc/blob/main/cpython/search_pypi_top.py
Usage: search_pypi_top.py "REGEX" output_filename
The command line parsing is minimal and the directory pypi_dir = "PYPI-2021-12-01-TOP-5000" is hardcoded :-D
It ignores files generated by Cython and .so binary files (Linux dynamic libraries).
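The core of the approach looks roughly like the sketch below (a simplified, self-contained version, not the actual search_pypi_top.py; the skip rules, in particular the Cython marker, are illustrative):

    # Rough sketch of the in-memory archive search (illustrative; the real script
    # is search_pypi_top.py linked above). Skip rules and error handling are simplified.
    import re
    import sys
    import tarfile
    import zipfile

    # Assumption: Cython-generated C files start with this header comment.
    CYTHON_MARKER = b"/* Generated by Cython"

    def iter_archive(path):
        # Yield (member_name, file_bytes) for every regular file in a tarball or ZIP.
        if path.endswith(".zip"):
            with zipfile.ZipFile(path) as zf:
                for name in zf.namelist():
                    if not name.endswith("/"):
                        yield name, zf.read(name)
        else:
            with tarfile.open(path) as tf:
                for member in tf.getmembers():
                    if member.isfile():
                        fp = tf.extractfile(member)
                        if fp is not None:
                            yield member.name, fp.read()

    def search(regex, archives):
        pattern = re.compile(regex.encode())
        for archive in archives:
            for name, data in iter_archive(archive):
                if name.endswith(".so") or CYTHON_MARKER in data[:2048]:
                    continue  # ignore binaries and Cython-generated C files
                if pattern.search(data):
                    print(f"{archive}: {name}")

    if __name__ == "__main__":
        search(sys.argv[1], sys.argv[2:])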
While "rg" is very fast, my script is very slow. But I don't care, once the regex is written, I only need to search for the regex once, I can wait 10-15 min ;-) I prefer to wait longer and have a more accurate result. Also, there is room for enhancement, like running multiple jobs in different processes or threads.
Victor