[Distutils] PEP 527 - Removing Un(der)used file types/extensions on PyPI

Donald Stufft donald at stufft.io
Tue Aug 23 12:46:51 EDT 2016


Since it seemed like there was enough here for a proper PEP I went ahead and
wrote one up, which is now PEP 527. The tl;dr of it is that:

* Everything but sdist, bdist_wheel, and bdist_egg gets deprecated.
* The only allowed extension for sdist is ``.tar.gz``.
* Phased in Deprecation.

I've included the text below, or you can see it online at
https://www.python.org/dev/peps/pep-0527/ once the PEP website is updated.


---------------------------------------------


Abstract
========

This PEP recommends deprecating, and ultimately removing, support for uploading
certain unused or underused file types and extensions to PyPI. In particular
it recommends disallowing further uploads of any files of the types
``bdist_dumb``, ``bdist_rpm``, ``bdist_dmg``, ``bdist_msi``, and
``bdist_wininst``, leaving PyPI to only accept new uploads of the ``sdist``,
``bdist_wheel``, and ``bdist_egg`` file types.

In addition, this PEP proposes removing support for new uploads of sdists using
the ``.tar``, ``.tar.bz2``, ``.tar.xz``, ``.zip``, ``.tar.Z``, ``.tgz``,
``.tbz``, and any other extension besides ``.tar.gz``.



Rationale
=========

File Formats
------------

Currently PyPI supports the following file types:

* ``sdist``
* ``bdist_wheel``
* ``bdist_egg``
* ``bdist_wininst``
* ``bdist_msi``
* ``bdist_dmg``
* ``bdist_rpm``
* ``bdist_dumb``

However, these different types of files have varying amounts of usefulness or
general use in the ecosystem. Continuing to support them adds a maintenance
burden on PyPI as well as tool authors and incurs a cost in both bandwidth and
disk space not only on PyPI itself, but also on any mirrors of PyPI.

bdist_dumb
~~~~~~~~~~

As its name implies, ``bdist_dumb`` is not a very complex format; however, it
is so simple as to be worthless for actual usage.

For instance, if you're using something like pyenv on macOS and you're building
a library using Python 3.5, then ``bdist_dumb`` will produce a ``.tar.gz`` file
named something like ``exampleproject-1.0.macosx-10.11-x86_64.tar.gz``. Right
off the bat this file name is somewhat difficult to differentiate from an
``sdist`` since they both use the same file extension (and with the legacy
pre-PEP 440 versions, ``1.0-macosx-10.11-x86_64`` is a valid, although quite silly,
version number). However, once you open up the created ``.tar.gz``, you'd find
that there is no metadata inside that could be used for things like dependency
discovery and in fact, it is quite simply a tarball containing hardcoded paths
to wherever files would have been installed on the computer creating the
``bdist_dumb``. Going back to our pyenv on macOS example, this means that if I
created it, it would contain files like:

``Users/dstufft/.pyenv/versions/3.5.2/lib/python3.5/site-packages/example.py``
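To make that layout concrete, here is a minimal sketch that builds a tiny
in-memory tarball shaped the way the text above describes a ``bdist_dumb``
file, then lists its members. The archive contents and the helper names
(``build_fake_bdist_dumb``, ``has_metadata``) are illustrative assumptions and
not the output of a real build. The metadata file names it looks for
(``PKG-INFO`` and ``METADATA``) are the ones used by sdists and wheels.

```python
import io
import tarfile

# Hypothetical member path, mirroring the example path given in the text.
DUMB_MEMBER = (
    "Users/dstufft/.pyenv/versions/3.5.2/lib/python3.5/site-packages/example.py"
)


def build_fake_bdist_dumb():
    """Return the bytes of a .tar.gz laid out as the text describes bdist_dumb."""
    payload = b"print('hello')\n"
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name=DUMB_MEMBER)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def member_names(archive_bytes):
    """List the member paths stored in a gzip-compressed tarball."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        return tar.getnames()


def has_metadata(names):
    """Assumption: metadata means a member whose basename is PKG-INFO or METADATA."""
    return any(name.rsplit("/", 1)[-1] in {"PKG-INFO", "METADATA"} for name in names)


names = member_names(build_fake_bdist_dumb())
print(names)
print("metadata present:", has_metadata(names))
```

The only thing an installer could learn from such an archive is a machine
specific install path, which is the problem the paragraph above points out.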


bdist_rpm
~~~~~~~~~

The ``bdist_rpm`` format on PyPI allows people to upload ``.rpm`` files for
end users to download and then install by hand.
However, the common usage of ``rpm`` is with a specially designed repository
that allows automatic installation of dependencies, upgrades, etc., which PyPI
does not provide. Thus, it is a type of file that is barely being used on PyPI
with only ~460 files of this type having been uploaded to PyPI (out of a total
of 662,544).

In addition, services like `COPR <https://copr.fedorainfracloud.org/>`_ provide
a better supported mechanism for publishing and using RPM files than we're ever
likely to get on PyPI.


bdist_dmg, bdist_msi, and bdist_wininst
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``bdist_dmg``, ``bdist_msi``, and ``bdist_wininst`` formats are similar in
that they are an OS specific installer that will only install a library into an
environment and are not designed for real user facing installs of applications
(which would require things like bundling a Python interpreter and the like).

Out of these three, the usage for ``bdist_dmg`` and ``bdist_msi`` is very low,
with only ~500 ``bdist_msi`` files and ~50 ``bdist_dmg`` files having been
uploaded to PyPI. The ``bdist_wininst`` format has more use, with ~14,000 files
having ever been uploaded to PyPI.

It's quite easy to look at the low usage of ``bdist_dmg`` and ``bdist_msi`` and
conclude that removing them will be fairly low impact, however
``bdist_wininst`` has one to two orders of magnitude more usage. This is somewhat
misleading though, because although it has more people *uploading* those files
the actual usage of those uploaded files is fairly low. Taking a look at the
previous 30 days, we can see that 90% of all downloads of ``bdist_wininst``
files from PyPI were generated by the mirroring infrastructure and 7% of them
were generated by setuptools (which can currently be better covered by
``bdist_egg`` files).

Given the small number of files uploaded for ``bdist_dmg`` and ``bdist_msi``,
and given that ``bdist_wininst`` files mostly either consume bandwidth and
disk space via the mirroring infrastructure *or* could be trivially replaced
with ``bdist_egg``, this PEP proposes to include these three formats in the
list of those to be disallowed.


File Extensions
---------------

Currently ``sdist`` supports a wide variety of file extensions like ``.tar.gz``,
``.tar``, ``.tar.bz2``, ``.tar.xz``, ``.zip``, ``.tar.Z``, ``.tgz``, and
``.tbz``. However, of those the only extensions which get anything more than
negligible usage are ``.tar.gz`` with 444,338 sdists currently, ``.zip`` with
58,774 sdists currently, and ``.tar.bz2`` with 3,265 sdists currently.

Having multiple formats accepted requires tooling both within PyPI and outside
of PyPI to handle all of the various extensions that *might* be used (even if
nobody is currently using them). This doesn't only affect PyPI, but ripples out
throughout the ecosystem. In addition, the different formats all have different
requirements for what optional C libraries Python was linked against and
different requirements for what versions of Python they support. In addition,
multiple formats also create a weird situation where there may be two
``sdist`` files for a particular project/release with subtly different content.
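As a rough illustration of the optional C library point, the sketch below maps
each extension to the standard-library module needed to read it and checks
whether that module can be imported in the running interpreter. The mapping
and the ``can_read`` helper are assumptions made for this example, not part of
PyPI or setuptools. ``.tar.Z`` is left out because the standard library's
``tarfile`` module does not handle that compression at all.

```python
import importlib

# Standard-library module each sdist extension needs in order to be read.
# None means a plain tar archive, which needs no compression module.
# Note: zipfile can read uncompressed archives without zlib, but the usual
# deflate-compressed .zip files do need it.
REQUIRED_MODULE = {
    ".tar.gz": "zlib",
    ".tgz": "zlib",
    ".zip": "zlib",
    ".tar.bz2": "bz2",
    ".tbz": "bz2",
    ".tar.xz": "lzma",
    ".tar": None,
}


def can_read(extension):
    """Return True if this interpreter has the module the extension needs."""
    module = REQUIRED_MODULE[extension]
    if module is None:
        return True
    try:
        importlib.import_module(module)
    except ImportError:
        return False
    return True


support = {ext: can_read(ext) for ext in REQUIRED_MODULE}
for ext, ok in sorted(support.items()):
    print(ext, "readable" if ok else "NOT readable")
```

Every extra extension PyPI accepts adds another row that tools would have to
handle, including on interpreters built without the matching library.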

It's easy to advocate that anything outside of ``.tar.gz``, ``.zip``, and
``.tar.bz2`` should be disallowed. Outside of a tiny handful, nobody has
actively been uploading these other types of files in the ~15 years of PyPI's
existence so they've obviously not been particularly useful. In addition, while
``.tar.xz`` is theoretically a nicer format than the other ``.tar.*`` formats
due to the better compression ratio achieved by LZMA, it is only available in
Python 3.3+ and has an optional dependency on the lzma C library.

Looking at the three extensions we *do* have in current use, it's also fairly
easy to conclude that ``.tar.bz2`` can be disallowed as well. It has a fairly
small number of files ever uploaded with it and it requires an additional
optional C library to handle the bzip2 compression.

Finally we get down to ``.tar.gz`` and ``.zip``. Looking at the pure numbers
for these two, we can see that ``.tar.gz`` is by far the most uploaded format,
with 444,338 total uploaded compared to ``.zip``'s 58,774 and on POSIX
operating systems ``.tar.gz`` is also the default produced by all currently
released versions of Python and setuptools. In addition, these two file types
both use the same C library (``zlib``) which is also required for
``bdist_wheel`` and ``bdist_egg``. The two wrinkles with deciding between
``.tar.gz`` and ``.zip`` are that while on POSIX operating systems ``.tar.gz``
is the default, on Windows ``.zip`` is the default and the ``bdist_wheel``
format also uses zip.

This PEP proposes that we drop the use of ``.zip`` extensions for sdists on
PyPI and standardize around ``.tar.gz``. For both extensions there are going to
be automation designed by end users which are making assumptions about what the
file extension produced by the ``sdist`` command will be. Changing either
default will break some number of those, so by changing the default of ``.zip``
to ``.tar.gz`` we minimize the amount of breakage by taking the smaller number
of users and making them match the larger number. In addition, it's more likely
to see Windows users upgrade their setuptools and Python releases on a faster
timescale than POSIX users. POSIX users often get their Python and setuptools
from their OS vendor and are discouraged or actively prevented from upgrading
them outside of complete OS upgrades while Windows users *must* install Python
and setuptools on their own, and thus are more able to upgrade those pieces
without triggering a complete OS upgrade.

While it is true that switching to ``.zip`` would align ``sdist`` with
``bdist_wheel`` in terms of format, this is not a very large benefit because
both formats are able to be manipulated with the Python standard library just
as easily and both require the same C library (``zlib``). It is also true that
Windows has support for ``.zip`` files out of the box but requires third party
software for ``.tar.gz``, however only 0.6% of downloads for sdists on PyPI are
initiated by browsers and we can assume that only a fraction of those 0.6% are
Windows users who want to manually extract the file and do not have a means of
extracting a ``.tar.gz``, particularly since Python itself can be used to
extract a ``.tar.gz`` via the command line since version 3.4. In addition, the
use of ``.tar.gz`` will result in smaller sdists which will reduce the amount
of bandwidth and disk space consumed by ``sdist`` files.
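The following sketch illustrates both claims using only the standard library:
it writes the same files to a ``.tar.gz`` and to a ``.zip`` in memory, reads
them back, and compares the archive sizes. The file names and contents are
made up for the example, so the exact sizes say nothing about real sdists. The
size gap comes from gzip compressing the whole tar stream at once, whereas zip
compresses each member separately.

```python
import io
import tarfile
import zipfile

# Hypothetical project contents: many small, similar source files.
FILES = {
    "example-1.0/pkg/module_%02d.py" % i: b"def f():\n    return 42\n" * 20
    for i in range(50)
}


def make_tar_gz(files):
    """Pack a {name: bytes} mapping into gzip-compressed tar bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def make_zip(files):
    """Pack a {name: bytes} mapping into deflate-compressed zip bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


def read_tar_gz(archive):
    """Read every regular file back out of gzip-compressed tar bytes."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        return {
            member.name: tar.extractfile(member).read()
            for member in tar.getmembers()
            if member.isfile()
        }


def read_zip(archive):
    """Read every member back out of zip bytes."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}


tar_gz_bytes = make_tar_gz(FILES)
zip_bytes = make_zip(FILES)
print("tar.gz bytes:", len(tar_gz_bytes))
print("zip bytes:   ", len(zip_bytes))
```

For manual extraction without third party tools, the command line interface
mentioned above is ``python -m tarfile -e archive.tar.gz``.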


Removal Process
===============

This PEP does **NOT** propose removing any existing files from PyPI, only
disallowing new ones from being uploaded. This restriction will be phased in on
a per-project basis to allow projects to adjust to the new restrictions where
applicable.

First, any *existing* projects will be flagged to allow legacy file types to be
uploaded, and any project without that flag (i.e. new projects) will not be
able to upload anything but ``sdist`` with a ``.tar.gz`` extension,
``bdist_wheel``, and ``bdist_egg``. Then, any existing projects that have never
uploaded a file that requires the legacy file type flag will have that flag
removed, also making them fall under the new restrictions. Next, an email
will be sent to the maintainers of all projects still given the legacy
flag, which will inform them of the upcoming new restrictions on uploads and
tell them that these restrictions will be applied to future uploads to their
projects starting in 1 month. This email should also contain workarounds for
older versions of Python/setuptools on Windows, to get a ``.tar.gz`` by
default. Finally, after 1 month all projects will have the legacy file type
flag removed, and support for uploading these types of files will cease to
exist on PyPI.
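The upload rule during the phase-in period could be sketched roughly as
follows. This is not PyPI's implementation: the function name
``upload_allowed`` and the ``legacy_flag`` parameter are hypothetical names
chosen to model the per-project flag described above.

```python
# File types that stay accepted under this PEP.
ALLOWED_FILETYPES = {"sdist", "bdist_wheel", "bdist_egg"}

# File types that only projects carrying the legacy flag may still upload.
LEGACY_FILETYPES = {
    "bdist_dumb",
    "bdist_rpm",
    "bdist_dmg",
    "bdist_msi",
    "bdist_wininst",
}


def upload_allowed(filetype, filename, legacy_flag=False):
    """Hypothetical check modelling the phased-in restriction."""
    if filetype not in ALLOWED_FILETYPES | LEGACY_FILETYPES:
        # Never a recognised file type, with or without the flag.
        return False
    if legacy_flag:
        # Flagged projects keep the old behaviour until the flag is removed.
        return True
    if filetype not in ALLOWED_FILETYPES:
        return False
    if filetype == "sdist":
        # New rule: the only sdist extension accepted is .tar.gz.
        return filename.endswith(".tar.gz")
    return True


print(upload_allowed("sdist", "example-1.0.tar.gz"))
print(upload_allowed("sdist", "example-1.0.zip"))
print(upload_allowed("sdist", "example-1.0.zip", legacy_flag=True))
```

Once the final step removes the flag from every project, the ``legacy_flag``
branch disappears and only the new rules remain.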

This plan should provide minimal disruption since it does not remove any
existing files, and the types of files it does prevent from being uploaded are
either not particularly useful (or used) types of files *or* can be replaced by
a similar type of file with a slight change to the uploader's process.


—
Donald Stufft




