On 2018-09-04 11:40:17 -0500 (-0500), Dustin Ingram wrote:
> On Tue, Sep 4, 2018 at 11:33 AM Jeremy Stanley <fungi@yuggoth.org> wrote:
> > Yes. If you haven't tried running a mirror of PyPI lately you're likely not to have noticed, but the various nightly builds for tensorflow seem to be the majority of the data on PyPI now. I'm sure it's a very neat and useful tool, but we basically had to switch from mirroring PyPI in our CI system to using a caching proxy because of this.
> Side note: PyPI now provides a list of the largest packages by total filesize: https://pypi.org/stats/
> Depending on what mirror you're using, you may be able to exclude these packages from your mirror if you don't need them, e.g. for bandersnatch: https://github.com/pypa/bandersnatch/blob/master/docs/filtering_configuratio...
We played whack-a-mole for a while, blacklisting some of the largest offenders in our bandersnatch config, but we really needed to rebuild the mirror from scratch: there's no easy way to go back and delete now-blacklisted packages that were mirrored before the blacklist entries were added, and bootstrapping a new mirror is a week+ effort these days. In the end we just switched to a caching proxy we already had on hand, because it got us most of the benefit of mirroring with a tiny fraction of the disk space, given that we use fewer than 1000 packaged Python library dependencies across our CI jobs anyway.
-- 
Jeremy Stanley
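
For anyone wanting to try the blacklisting approach discussed in this thread, here is a rough sketch of how the "largest offenders" list might be generated rather than maintained by hand. It is only an illustration under stated assumptions: the size figures are made up, the helper names largest_offenders and render_filter_section are hypothetical, and the [blacklist] section with the blacklist_project plugin reflects my understanding of bandersnatch's filtering config around the time of this thread. Newer releases may use different names, so check the filtering documentation for your version before relying on it.

```python
from typing import Dict, List


def largest_offenders(sizes: Dict[str, int], threshold_bytes: int) -> List[str]:
    """Return project names whose total size is at least threshold_bytes, largest first."""
    over = [(name, size) for name, size in sizes.items() if size >= threshold_bytes]
    # Sort by size descending, then by name so that ties are deterministic.
    over.sort(key=lambda item: (-item[1], item[0]))
    return [name for name, _ in over]


def render_filter_section(
    packages: List[str],
    section: str = "blacklist",          # assumed section name; verify for your version
    plugin: str = "blacklist_project",   # assumed plugin name; verify for your version
) -> str:
    """Render an INI-style filter section listing the packages to exclude."""
    lines = [f"[{section}]", "plugins =", f"    {plugin}", "packages ="]
    lines += [f"    {name}" for name in packages]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up sizes in bytes, for illustration only; these are not real PyPI figures.
    example_sizes = {
        "huge-nightly": 500 * 1024**3,
        "medium-lib": 2 * 1024**3,
        "tiny-lib": 10 * 1024**2,
    }
    names = largest_offenders(example_sizes, threshold_bytes=100 * 1024**3)
    print(render_filter_section(names), end="")
```

As noted above, a filter like this only affects future syncs. Packages that were mirrored before the entries were added stay on disk, so it does not remove the need to rebuild the mirror or clean it up by other means.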