[Distutils] GSoC 2017 - Request for Comments on Proposal

Pradyun Gedam pradyunsg at gmail.com
Sat Mar 25 11:44:19 EDT 2017


Hello Everyone!

I had previously sent a mail on this list, stating that I would like to
work on pip's dependency resolution for my GSoC 2017. I now have drafted a
proposal for the same; with help from my mentors - Donald Stufft and Justin
Cappos. I'd also take this opportunity to thank them for agreeing to be my
mentors for this GSoC.

I would like to request for comments on my proposal for GSoC - it is hosted
at https://gist.github.com/pradyunsg/5cf4a35b81f08b6432f280aba6f511eb.

Please find trailing a MarkDown version of the proposal.

Thanks,
Pradyun Gedam

-----

# Adding Proper Dependency Resolution to pip

- **Name:** Pradyun S. Gedam
- **Email:** [pradyunsg at gmail.com][mailto-email]
- **Github:** [pradyunsg][github-profile]
- **University:** [VIT University, Vellore, India][vit-homepage]
- **Course:** Bachelor of Technology in Computer Science and Engineering
- **Course Term:** 2016/17 - 2019/20 (4 Year)
- **Timezone:** IST (GMT +5:30)
- **GSoC Blog RSS Feed URL:** <
https://pradyunsg.github.io/gsoc-2017/feed.xml>

[github-profile]: http://github.com/pradyunsg/
[vit-homepage]: http://vit.ac.in/
[mailto-email]: mailto:pradyunsg at gmail.com

## About Me

I was introduced to Python about five years ago, through "Core Python
Programming" by Weasley J Chun. After the initial struggle with Python 2/3,
the
ball was set rolling and hasn't stopped since. I have fiddled around with
Game Programming (PyGame), Computer Vision (OpenCV), Data Analytics (Pandas,
SciPy, NumPy), transcompilers (Py2C) and more.

As a high school student, I did internship at Enthought in 2013 and 2014.
The two summers that I spent at Enthought were a great learning experience
and
I thoroughly enjoyed the environment there.

Other than Python, I have also used C, C++, Web Technologies (JavaScript,
HTML,
CSS) and preprocessors (Pug, TypeScript, LESS/SASS/SCSS), Java and Bash/Zsh
for
some other projects.

Curious to understand how pip works, I began looking into pip's source code.
I started contributing to pip in May 2016. I am now fairly familiar with the
design of pip and have a fair understanding of how it works, due to the
various
contributions I have made to pip in the past year.

## Mentors

- Donald Stufft (@dstufft on GitHub)
- Justin Cappos (@JustinCappos on GitHub)

Communication with the mentors will be done mostly on issues and pull
requests
on pip's GitHub repository. If at any point in time, a real time discussion
is
to be done with the mentors, it would be done over IRC or Skype. Email can
also
be used if needed.

## Proposal

This project is regarding improving dependency resolution performed within
pip
by implementing a dependency resolver within it.

### Abstract

Currently, pip does not resolve dependencies correctly when there are
conflicting requirements. The lack of dependency resolution has caused
hard-to-debug bugs/failures due to the installation of incompatible
packages.
The lack of a dependency resolver is also a blocker for various other
features -
adding an upgrade-all functionality to pip and properly determining
build-time
dependencies for packages are two such features.

### Deliverables

At the end of this project, pip will have the ability to:

- detect requirement conflicts
- resolve conflicting dependency requirements where possible

### Implementation

The implementation language will be Python. The code will maintain the
compatibility requirements of pip - the same source code will support the
multiple Python implementations and version, including but not limited to,
CPython 2.7, CPython 3.3+, PyPy 2, PyPy3.

New Tests for the functionality introduced will be added to pip's current
test
suite.

User documentation would not be a major part of this project. The only
changes
would be to mention pip can now resolve dependencies properly. There are
certain
sections that might need updating.

#### Overview

The project will be composed of the following stages:

1. Refactor the dependency resolution logic of pip into a separate module.
1. Implement dependency information caching in pip.
1. Implement a backtracking dependency resolver to resolve the dependencies.

Every stage depends on the previous ones being completed. This step-wise
approach would make incremental improvements so that there is a constant
feedback on the work being done as well as making it easy to change course
without losing existing work; if needed for some unforeseen reason.

#### Discussion

There is a class in pip - `RequirementSet`, which is currently a God class.
It
is responsible for the following:

  1. Holding a list of things to be installed
  1. Downloading Files
  1. Dependency Resolution
  1. Unpacking downloaded files (preparation for installation)
  1. Uninstalling packages
  1. Installing packages

This is clearly a bad situation since this is most of the heavy lifting
involved in pip. These responsibilities need to be separated and moved out
into
their independent modules/classes, to allow for simplification of
`RequirementSet` while providing a clean platform for the remaining work.
This is the most tricky portion of this project, given the complexity of
`RequirementSet` as it stands today.

There are two kinds of distributions that may be used to install a package -
wheels (binary) and sdists (source). When installing a package, pip builds a
wheel from an sdist and then proceeds to install the wheel. The difference
between the two formats of distribution relevant to this project is - wheels
store the information about dependencies within them statically; sdists do
not.

Determining the dependencies of a wheel distribution is merely the matter of
fetching the information from a METADATA file within the `.whl` file. The
dependency information for an sdist, on the other hand, can only be
determined
after running its `setup.py` file on the target system. This means that
dependencies of an sdist depend on how its `setup.py` behaves which can
change
due to variations on target systems or could even contain through random
choices.

Computation of an sdist's dependencies on the target system is a
time-consuming
task since it potentially involves fetching a package from PyPI and
executing
it's setup.py to get the dependency information. In order to improve
performance, once an sdist's dependencies are computed, they would be
stored in
a cache so that during dependency resolution, the dependencies of an sdist
are
not computed every time they are needed. Further, pip caches wheels it
downloads or builds meaning that any installed package or downloaded wheel's
dependency information would available statically, without the need to go
through the sdist dependency cache.

Like the wheel cache, sdist-dependency-cache will be a file system based
cache.
The sdist-dependency-cache would only be used if the corresponding sdist is
being used.

Since sdist dependencies are non-deterministic, the cached dependency
information is potentially incorrect - in certain corner cases such as using
random choices in setup.py files. Such uses are not seen as important
enough to
cater to, compared the benefits of having a cache. Further, this is already
the
case with the information in the wheel cache.

SAT solver based resolution is not feasible for pip since a SAT solver needs
the entire set of packages and their dependencies to compute a solution,
which
cannot be generated due to the aforementioned non-deterministic behaviour of
setup.py file. Computing dependency information for all of PyPI on a target
system for installing a package is simply not feasible.

The most reasonable approach is using a backtracking solver. Such a solver
can
be incrementally provided information about the dependencies of a package
and
would only need dependency information about packages in the dependency
graph
of the current system.

There is a need to keep a cache of visited packages during dependency
resolution. A certain package-version combination may be reached via
multiple
paths and it is an inefficient use of computation time to re-compute that
whether it is indeed going to satisfy the requirements or not. By storing
information about which package-version combinations have been visited and
do
(or do not) satisfy the constraints, it is possible to speedup the
resolution.

Consider the following example:

```
A-1.0 (depends: B)
A-2.0 (depends: B and E-1.0)
B-1.0 (depends: C and D)
C-1.0 (depends: D)
D-1.0
D-1.1 (depends: E-2.0)
E-1.0
E-2.0
```

If an installation of A is required, either A-2.0 or D-1.1 should not be
installed because they have a conflicting requirement - E. While there are
multiple possible solutions to this situation, the "most obvious" one us to
use
the D-1.0, instead of not installing A-2.0. Further, as multiple packages
depend on D, the algorithm would "reach it" multiple times. By maintaining a
cache for the visited packages, it is possible to achieve a speedup in such
a
scenario.

#### Details

Pull requests would be made on a regular basis during the project to ensure
that the feedback loop is quick. This also reduces the possibilities of
conflicts due to unrelated changes in pip.

All the code will be tested within pip's existing testing infrastructure. It
has everything needed to write tests related to all the changes to be made.
Every PR made to pip as a part of this project will contain new tests or
modifications to existing ones as needed.

##### Stage 1

Initially, some abstractions will be introduced to the pip codebase to
improve
the reuse of certain common patterns within the codebase. This includes
cleaner
temporary directory management through a `TempDirectory` class.

`RequirementSet.prepare_files` and `RequirementSet._prepare_file` are
downloading, unpacking packages as well as doing what pip does as dependency
resolution. Taking these functions apart neatly is going to be a tricky
task.

The following is a listing of the final modules that will be responsible for
the various tasks that are currently being performed by `RequirementSet`:

- `pip.resolve` - Dependency Resolution
- `pip.download` - Downloading Files & Unpacking downloaded files
- `pip.req.req_set` - Holding a list of things to be installed / uninstalled
- `pip.operations.install` - Installing Packages (preparation for
installation)
- `pip.operations.uninstall` - Uninstalling Packages

To be able to proceed to the next step, only the dependency resolution
related
code needs to be refactored into a separate module. Other portions of
`RequirementSet` are not required to be refactored but the same will be
tackled
as optional deliverables. In other words, only `pip.resolve` needs to be
completed to be able to proceed to the next stage in this project. This is
needed since in Stage 3, the resolver would be written in `pip.resolve`,
independent of the rest of the codebase.

##### Stage 2

A new module `pip.cache` will be created. Within this module, all the
caching
will be handled. Thus, the code for the current wheel cache would be moved.
The new code for a dependency cache for sdists would also be written here.

The new cache would hold all of an sdist's egg-info. The information will be
stored on the file-system in a sub directory structure much like that of the
wheel cache, in a directory structure based on hash of the sdist file
holding
the egg-info at the end. This will be done in a class named `EggInfoCache`.

`EggInfoCache` cache will be used only if a corresponding wheel for an
sdist is
not available. Installing an sdist results in the creation of a wheel which
contains the dependency information, which would be used over the
information
available in the `EggInfoCache`.

To be able to proceed to the next step, it is required that `EggInfoCache`
is
implemented.

##### Stage 3

The module `pip.resolve` will be modified and a class named
`BacktrackingResolver` will be added to it. The class does what you expect
it
to do - it would resolve dependencies with recursive backtracking. As
described
above, there will be some global state stored about the packages that have
been
explored. Other than the maintenance of global state, in the form of the
cache,
the rest of the algorithm will essentially follow the same structure as any
backtracking algorithm.

The project would be complete when the aforementioned resolver is
implemented.

#### Existing Work

There is existing work directly related to dependency resolution in pip,
done by
multiple individuals.

- Robert Collins (un-merged closed pull request on pypa/pip)

  This has an incomplete backtracking dependency resolver and dependency
  caching.

- Sebastien Awwad (branch on a separate fork)

  This was used for the depresolve project, to investigate the state of
  Dependency Resolution in PyPI/pip ecosystem.

- `pip-compile` (separate project)

  This has a backtracking dependency resolver implemented to overcome pip's
  inablity to resolve dependencies.

Their work would be used for reference, where appropriate, during the
course of
the project. Further, there are many package managers which implement
dependency
resolution, which would also be looked into.

### Tentative Timeline

- Community Bonding Period: **5 May - 29 May**

  - Clean up and finalize my existing pull requests.
  - Read existing literature regarding dependency resolution.
  - Inspect existing implementations of dependency resolvers.

  GOAL: Be ready for the work coming up.

- Week 1: **30 May - 5 June**

  - Introduce abstractions across pip's codebase to make refactoring
    `RequirementSet` easier.
  - Begin breaking down `RequirementSet.prepare_file`.

- Week 2: **6 June - 12 June**

  - Continue working on the refactor of `RequirementSet`.

- Week 3: **13 June - 19 June**

  - Continue working on the refactor of `RequirementSet`.
  - Finish and polish `pip.resolve`.

  GOAL: `pip.resolve` module will be ready, using the current resolution
        strategy.

- Week 4: **20 June - 26 June**

  - Finish and polish all work on the refactor of `RequirementSet`.

- Week 5: **27 June - 3 July**

  - Move code for `WheelCache` into a new `pip.cache` module.
  - Write tests for `pip.cache.EggInfoCache`, based on `WheelCache`.
  - Begin implementation of `pip.cache.EggInfoCache`.

- Week 6: **4 July - 10 July**

  - Finish and polish `pip.cache.EggInfoCache`.

  GOAL: A cache for storing dependency information of sdists would be ready
to
        add to pip.

- Week 7: **10 July - 16 July**

  - Create a comprehensive set of tests for the dependency resolver.
    (in `tests/unit/test_resolve.py`)
  - Begin implementation on the backtracking algorithm.

  GOAL: A comprehensive test bed is ready for testing the dependency
resolver.

- Week 8: **17 July - 24 July**

  - Complete a rough implementation of the backtracking algorithm

  GOAL: An implementation of a dependency resolver to begin running tests on
  and work on improving.

- Week 9: **25 July - 31 July**

  - Fixing bugs in dependency resolver

- Week 10: **1 August - 6 August**

  - Finish and polish work on dependency resolver

  GOAL: A ready-to-merge PR adding a backtracking dependency resolver for
pip.

- Week 11: **6 August - 13 August**

  Buffer Week.

- Week 12: **13 August - 21 August**

  Buffer Week. Finalization of project for evaluation.

If the deliverable is achieved ahead of schedule, the remaining time will be
utilized to resolve open issues on pip's repository in the order of
priority as
determined under the guidance of the mentors.

#### Other Commitments

I expect to not be able to put in 40 hour/week for at most 3 weeks
throughout
the working period, due to the schedule of my university.

I will have semester-end examinations - from 10th May 2017 to 24th May 2017
-
during the Community Bonding Period. My university will re-open for my third
semester on 12 July 2017. I expect mid-semester examinations to be held in
my
University around 20th August 2017. During these times, I would not be able
to
put in full 40 hour weeks due to the academic workload.

I might take a 3-4 day break during this period, regarding which I would be
informing my mentor around a week in advance.

I will be completely free from 30th May 2017 to 11 July 2017.

### Future Work

There seems to be some interest in being able to reuse the above dependency
resolution algorithm in other packaging related tools, specifically from the
buildout project. I intend to eventually move the dependency resolution code
that would come out of this project into a separate library to allow for
reuse
by installer projects - pip, buildout and other tools.

## Previous Contributions to pip

(As on 12th March 2017)

### Issues

Authored:

- #3785 - Prefering wheel-based installation over source-based installation
(Open)
- #3786 - Make install command upgrade packages by default (Closed)
- #3787 - Check if pip broke the dependency graph and warn the user (Open)
- #3807 - Tests fail since change on PyPI (Closed)
- #3809 - Switch to TOML for configuration files (Open)
- #3871 - Provide a way to perform non-eager upgrades (Closed)
- #4198 - Travis CI - pypy broken dues to dependency change in pycrypto
(Closed)
- #4282 - What's the release schedule? (Closed)

Participated:

- #59 - Add "upgrade" and "upgrade-all" commands (Open)
- #988 - Pip needs a dependency resolver (Open)
- #1056 - pip install -U does not remember whether a package was installed
with --user (Open)
- #1511 - ssl certificate hostname mismatch errors presented badly (Open)
- #1668 - Default to --user  (Open)
- #1736 - Create a command to make it easy to access the configuration file
(Open)
- #1737 - Don't tell the user what they meant, just do what they meant
(Open)
- #2313 - Automated the Creation and Upload of Release Artifacts (Open)
- #2732 - pip install hangs with interactive setup.py setups (Open)
- #3549 - pip -U pip fails (Open)
- #3580 - Update requests/urllib3 (Closed)
- #3610 - pip install from package from github, with github dependencies
(Open)
- #3788 - `pip` version suggested is older than the version which is
installed (Closed)
- #3789 - Error installing Mayavi in Mac OS X  (Closed)
- #3798 - On using python -m pip install -upgrade pip.. its throwing an
error like the one below (Closed)
- #3811 - no matching distribution found for install (Closed)
- #3814 - pip could not find a version that satisfies the requirement
oslo.context (Closed)
- #3876 - support git refs in @ syntax (Open)
- #4021 - Will you accept PRs with pep484 type hints?  (Open)
- #4087 - pip list produces error (Closed)
- #4149 - Exception thrown when binary is already linked to /usr/local/bin
(Open)
- #4160 - Pip does not seem to be handling deep requirements correctly
(Open)
- #4162 - Let --find-links be context aware to support github, gitlab, etc.
links (Open)
- #4170 - pip list |head raises BrokenPipeError (Open)
- #4182 - pip install should install packages in order to avoid ABI
incompatibilities in compiled  (Open)
- #4186 - IOError: [Errno 13] Permission denied: '/usr/local/bin/pip' (Open)
- #4206 - Where on Windows 10 is pip.conf or pip.ini located? (Closed)
- #4221 - Feature request: Check if user has permissions before downloading
files (Closed)
- #4229 - "pip uninstall" is too noisy (Open)

#### Pull Requests

Authored:

- #3806 - Change install command's default behaviour to upgrade packages by
default (Closed, superseded by #3972)
- #3808 - Fix Tests (Merged)
- #3818 - Improve UX and tests of check command (Merged)
- #3972 - Add an upgrade-strategy option (Merged)
- #3974 - [minor] An aesthetic change to wheel command source (Merged)
- #4192 - Move out all the config code to a separate module (Merged)
- #4193 - Add the ability to autocorrect a user's command (Open)
- #4199 - Fix Tests for Travis CI (Merged)
- #4200 - Reduce the API exposed by the configuration class (Merged)
- #4232 - Update documentation to mention upgrade-strategy (Merged)
- #4233 - Nicer permissions error message (Open)
- #4240 - [WIP] Add a configuration command (Open)

Participated:

- #2716 - Issue #988: new resolver for pip (Closed)
- #2975 - Different mock dependencies based on Python version (Merged)
- #3744 - Add a "Upgrade all local installed packages" (Open)
- #3750 - Add a `pip check` command. (Merged)
- #3751 - tox.ini: Add "cover" target (Open)
- #3794 - Use the canonicalize_name function for finding .dist-info (Merged)
- #4142 - Optionally load C dependencies based on platform (Open)
- #4144 - Install build dependencies as specified in PEP 518 (Open)
- #4150 - Clarify default for --upgrade-strategy is eager (Merged)
- #4194 - Allow passing a --no-progress-bar to the install script to
surpress progress bar (Merged)
- #4201 - Register req_to_install for cleanup sooner (Merged)
- #4202 - Switch to 3.6.0 final as our latest 3.x release (Merged)
- #4211 - improve message when installing requirements file (#4127) (Merged)
- #4241 - Python 3.6 is supported (Merged)

## References

1. [pypa/pip#988](https://github.com/pypa/pip/issues/988)

   Tracking issue for adding a proper dependency resolver in pip. Contains
   links to various useful resources.

1. [pypa/pip#2716](https://github.com/pypa/pip/issues/2716)

   Prior work by Robert Collins for adding a proper dependency resolver in
pip.

1. [Python Dependency Resolution](
https://docs.google.com/document/d/1x_VrNtXCup75qA3glDd2fQOB2TakldwjKZ6pXaAjAfg/edit?usp=sharing
)

   A writeup by Sebastian Awwad on the current state of dependency
resolution
   in pip and PyPI in general.

1. [PSF Application Template](
https://wiki.python.org/moin/SummerOfCode/ApplicationTemplate2017)

   For guidance on how to write the application and what information is
needed.

1. [Stork: Secure Package Management for VM Environments](
http://isis.poly.edu/~jcappos/papers/cappos_stork_dissertation_08.pdf)

   A Paper by Justin Cappos about Stork, used for reference regarding
   Backtracking Resolution
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/distutils-sig/attachments/20170325/997ffbcb/attachment-0001.html>


More information about the Distutils-SIG mailing list