[Python-checkins] peps: PEP 426: analyse the PyPI metrics correctly

nick.coghlan python-checkins at python.org
Thu Feb 21 14:37:39 CET 2013


http://hg.python.org/peps/rev/516b67ed1a2d
changeset:   4757:516b67ed1a2d
user:        Nick Coghlan <ncoghlan at gmail.com>
date:        Thu Feb 21 23:37:00 2013 +1000
summary:
  PEP 426: analyse the PyPI metrics correctly

files:
  pep-0426.txt        |  145 +++++++++++++-----
  pep-0426/pepsort.py |  249 ++++++++++++++++++-------------
  2 files changed, 249 insertions(+), 145 deletions(-)


diff --git a/pep-0426.txt b/pep-0426.txt
--- a/pep-0426.txt
+++ b/pep-0426.txt
@@ -44,8 +44,9 @@
 distribution.
 
 This format is parseable by the ``email`` module with an appropriate
-``email.policy.Policy()``.  When ``metadata`` is a Unicode string,
-```email.parser.Parser().parsestr(metadata)`` is a serviceable parser.
+``email.policy.Policy()`` (see `Appendix A`_).  When ``metadata`` is a
+Unicode string, ``email.parser.Parser().parsestr(metadata)`` is a
+serviceable parser.
 
 There are three standard locations for these metadata files:
 
@@ -1358,25 +1359,41 @@
 
 Finally, as the version scheme in use is dependent on the metadata
 version, it was deemed simpler to merge the scheme definition directly into
-this PEP rather than continuing to maintain it as a separate PEP. This will
-also allow all of the distutils-specific elements of PEP 386 to finally be
-formally rejected.
+this PEP rather than continuing to maintain it as a separate PEP.
 
-The following statistics provide an analysis of the compatibility of existing
-projects on PyPI with the specified versioning scheme (as of 16th February,
-2013).
+`Appendix B`_ shows detailed results of an analysis of PyPI distribution
+version information, as collected on 19th February, 2013. This analysis
+compares the behaviour of the explicitly ordered version schemes defined in
+this PEP and PEP 386 with the de facto standard defined by the behaviour
+of setuptools. These metrics are useful, as the intent of both PEPs is to
+follow existing setuptools behaviour as closely as is feasible, while
+still throwing exceptions for unorderable versions (rather than trying
+to guess an appropriate order as setuptools does).
 
-* Total number of distributions analysed: 28088
-* Distributions with no releases: 248 / 28088 (0.88 %)
-* Fully compatible distributions: 24142 / 28088 (85.95 %)
-* Compatible distributions after translation: 2830 / 28088 (10.08 %)
-* Compatible distributions after filtering: 511 / 28088 (1.82 %)
-* Distributions sorted differently after translation: 38 / 28088 (0.14 %)
-* Distributions sorted differently without translation: 2 / 28088 (0.01 %)
-* Distributions with no compatible releases: 317 / 28088 (1.13 %)
+Overall, the percentage of compatible distributions improves from 97.7%
+with PEP 386 to 98.7% with this PEP. While the number of projects affected
+in practice was small, some of the affected projects are in widespread use
+(such as Pinax and selenium). The surprising ordering discrepancy also
+concerned developers and acted as an unnecessary barrier to adoption of
+the new metadata standard.
+
+The data also shows that the pre-release sorting discrepancies are seen
+only when analysing *all* versions from PyPI, rather than when analysing
+public versions. This is largely due to the fact that PyPI normally reports
+only the most recent version for each project (unless the maintainers
+explicitly configure it to display additional versions). However,
+installers that need to satisfy detailed version constraints often need
+to look at all available versions, as they may need to retrieve an older
+release.
+
+Even this PEP doesn't completely eliminate the sorting differences relative
+to setuptools:
+
+* Sorts differently (after translations): 38 / 28194 (0.13 %)
+* Sorts differently (no translations): 2 / 28194 (0.01 %)
 
 The two remaining sort order discrepancies picked up by the analysis are due
-to a pair of projects which have published releases ending with a carriage
+to a pair of projects which have PyPI releases ending with a carriage
 return, alongside releases with the same version number, only *without* the
 trailing carriage return.
 
@@ -1390,26 +1407,6 @@
 standard scheme will normalize both representations to ".devN" and sort
 them by the numeric component.
 
-For comparison, here are the corresponding analysis results for PEP 386:
-
-* Total number of distributions analysed: 28088
-* Distributions with no releases: 248 / 28088 (0.88 %)
-* Fully compatible distributions: 23874 / 28088 (85.00 %)
-* Compatible distributions after translation: 2786 / 28088 (9.92 %)
-* Compatible distributions after filtering: 527 / 28088 (1.88 %)
-* Distributions sorted differently after translation: 96 / 28088 (0.34 %)
-* Distributions sorted differently without translation: 14 / 28088 (0.05 %)
-* Distributions with no compatible releases: 543 / 28088 (1.93 %)
-
-These figures make it clear that only a relatively small number of current
-projects are affected by these changes. However, some of the affected
-projects are in widespread use (such as Pinax and selenium). The
-changes also serve to bring the standard scheme more into line with
-developer's expectations, which is an important element in encouraging
-adoption of the new metadata version.
-
-The script used for the above analysis is available at [3]_.
-
 
 A more opinionated description of the versioning scheme
 -------------------------------------------------------
@@ -1550,8 +1547,10 @@
 .. [3] Version compatibility analysis script:
    http://hg.python.org/peps/file/default/pep-0426/pepsort.py
 
-Appendix
-========
+Appendix A
+==========
+
+The script used for this analysis is available at [3]_.
 
 Parsing and generating the Metadata 2.0 serialization format using
 Python 3.3::
@@ -1610,6 +1609,74 @@
         # Correct if sys.stdout.encoding == 'UTF-8':
         Generator(sys.stdout, maxheaderlen=0).flatten(m)
 
+Appendix B
+==========
+
+Metadata v2.0 guidelines versus setuptools::
+
+    $ ./pepsort.py
+    Comparing PEP 426 version sort to setuptools.
+
+    Analysing release versions
+      Compatible: 24477 / 28194 (86.82 %)
+      Compatible with translation: 247 / 28194 (0.88 %)
+      Compatible with filtering: 84 / 28194 (0.30 %)
+      No compatible versions: 420 / 28194 (1.49 %)
+      Sorts differently (after translations): 0 / 28194 (0.00 %)
+      Sorts differently (no translations): 0 / 28194 (0.00 %)
+      No applicable versions: 2966 / 28194 (10.52 %)
+
+    Analysing public versions
+      Compatible: 25600 / 28194 (90.80 %)
+      Compatible with translation: 1505 / 28194 (5.34 %)
+      Compatible with filtering: 13 / 28194 (0.05 %)
+      No compatible versions: 420 / 28194 (1.49 %)
+      Sorts differently (after translations): 0 / 28194 (0.00 %)
+      Sorts differently (no translations): 0 / 28194 (0.00 %)
+      No applicable versions: 656 / 28194 (2.33 %)
+
+    Analysing all versions
+      Compatible: 24239 / 28194 (85.97 %)
+      Compatible with translation: 2833 / 28194 (10.05 %)
+      Compatible with filtering: 513 / 28194 (1.82 %)
+      No compatible versions: 320 / 28194 (1.13 %)
+      Sorts differently (after translations): 38 / 28194 (0.13 %)
+      Sorts differently (no translations): 2 / 28194 (0.01 %)
+      No applicable versions: 249 / 28194 (0.88 %)
+
+Metadata v1.2 guidelines versus setuptools::
+
+    $ ./pepsort.py 386
+    Comparing PEP 386 version sort to setuptools.
+
+    Analysing release versions
+      Compatible: 24244 / 28194 (85.99 %)
+      Compatible with translation: 247 / 28194 (0.88 %)
+      Compatible with filtering: 84 / 28194 (0.30 %)
+      No compatible versions: 648 / 28194 (2.30 %)
+      Sorts differently (after translations): 0 / 28194 (0.00 %)
+      Sorts differently (no translations): 0 / 28194 (0.00 %)
+      No applicable versions: 2971 / 28194 (10.54 %)
+
+    Analysing public versions
+      Compatible: 25371 / 28194 (89.99 %)
+      Compatible with translation: 1507 / 28194 (5.35 %)
+      Compatible with filtering: 12 / 28194 (0.04 %)
+      No compatible versions: 648 / 28194 (2.30 %)
+      Sorts differently (after translations): 0 / 28194 (0.00 %)
+      Sorts differently (no translations): 0 / 28194 (0.00 %)
+      No applicable versions: 656 / 28194 (2.33 %)
+
+    Analysing all versions
+      Compatible: 23969 / 28194 (85.01 %)
+      Compatible with translation: 2789 / 28194 (9.89 %)
+      Compatible with filtering: 530 / 28194 (1.88 %)
+      No compatible versions: 547 / 28194 (1.94 %)
+      Sorts differently (after translations): 96 / 28194 (0.34 %)
+      Sorts differently (no translations): 14 / 28194 (0.05 %)
+      No applicable versions: 249 / 28194 (0.88 %)
+
+
 Copyright
 =========
 
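The headline figures quoted in the new rationale text (97.7% for PEP 386,
98.7% for PEP 426) appear to follow from the "all versions" tables in
Appendix B, if "compatible" is taken to mean every distribution that falls
outside the "No compatible versions" and both "Sorts differently"
categories. That reading is an assumption, as the changeset does not spell
out the formula. A minimal sketch of the arithmetic, with purely
illustrative names:

```python
# Cross-check of the 97.7% / 98.7% figures against the Appendix B
# "all versions" tables. ASSUMPTION: "compatible" means everything that
# is neither "No compatible versions" nor "Sorts differently".
TOTAL = 28194

# (no compatible versions,
#  sorts differently after translations,
#  sorts differently with no translations)
PROBLEM_COUNTS = {
    "386": (547, 96, 14),
    "426": (320, 38, 2),
}

def compatible_percentage(pepno):
    """Return the share of distributions without ordering problems."""
    problems = sum(PROBLEM_COUNTS[pepno])
    return 100.0 * (TOTAL - problems) / TOTAL

for pepno in sorted(PROBLEM_COUNTS):
    print("PEP %s: %.1f %%" % (pepno, compatible_percentage(pepno)))
```

Under that assumption the rounded results match the percentages in the
text, which suggests the "No applicable versions" bucket is counted on the
compatible side.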
diff --git a/pep-0426/pepsort.py b/pep-0426/pepsort.py
--- a/pep-0426/pepsort.py
+++ b/pep-0426/pepsort.py
@@ -20,6 +20,8 @@
 PEP426_VERSION_RE = re.compile('^(\d+(\.\d+)*)((a|b|c|rc)(\d+))?'
                                '(\.(post)(\d+))?(\.(dev)(\d+))?$')
 
+PEP426_PRERELEASE_RE = re.compile('(a|b|c|rc|dev)\d+')
+
 def pep426_key(s):
     s = s.strip()
     m = PEP426_VERSION_RE.match(s)
@@ -60,23 +62,28 @@
 
     return nums, pre, post, dev
 
+def is_release_version(s):
+    return not bool(PEP426_PRERELEASE_RE.search(s))
+
 def cache_projects(cache_name):
     logger.info("Retrieving package data from PyPI")
     client = xmlrpclib.ServerProxy('http://python.org/pypi')
     projects = dict.fromkeys(client.list_packages())
+    public = projects.copy()
     failed = []
     for pname in projects:
-        time.sleep(0.1)
+        time.sleep(0.01)
         logger.debug("Retrieving versions for %s", pname)
         try:
             projects[pname] = list(client.package_releases(pname, True))
+            public[pname] = list(client.package_releases(pname))
         except:
             failed.append(pname)
     logger.warn("Error retrieving versions for %s", failed)
     with open(cache_name, 'w') as f:
-        json.dump(projects, f, sort_keys=True,
+        json.dump([projects, public], f, sort_keys=True,
                   indent=2, separators=(',', ': '))
-    return projects
+    return projects, public
 
 def get_projects(cache_name):
     try:
@@ -84,11 +91,11 @@
     except IOError as exc:
         if exc.errno != errno.ENOENT:
             raise
-        projects = cache_projects(cache_name);
+        projects, public = cache_projects(cache_name)
     else:
         with f:
-            projects = json.load(f)
-    return projects
+            projects, public = json.load(f)
+    return projects, public
 
 
 VERSION_CACHE = "pepsort_cache.json"
@@ -112,109 +119,139 @@
     "426": pep426_key,
 }
 
+class Analysis:
+
+    def __init__(self, title, projects, releases_only=False):
+        self.title = title
+        self.projects = projects
+
+        num_projects = len(projects)
+
+        compatible_projects = Category("Compatible", num_projects)
+        translated_projects = Category("Compatible with translation", num_projects)
+        filtered_projects = Category("Compatible with filtering", num_projects)
+        incompatible_projects = Category("No compatible versions", num_projects)
+        sort_error_translated_projects = Category("Sorts differently (after translations)", num_projects)
+        sort_error_compatible_projects = Category("Sorts differently (no translations)", num_projects)
+        null_projects = Category("No applicable versions", num_projects)
+
+        self.categories = [
+            compatible_projects,
+            translated_projects,
+            filtered_projects,
+            incompatible_projects,
+            sort_error_translated_projects,
+            sort_error_compatible_projects,
+            null_projects,
+        ]
+
+        sort_key = SORT_KEYS[pepno]
+        sort_failures = 0
+        for i, (pname, versions) in enumerate(projects.items()):
+            if i % 100 == 0:
+                sys.stderr.write('%s / %s\r' % (i, num_projects))
+                sys.stderr.flush()
+            if not versions:
+                logger.debug('%-15.15s has no versions', pname)
+                null_projects.add(pname)
+                continue
+            # list_legacy and list_pep will contain 2-tuples
+            # comprising a sortable representation according to either
+            # the setuptools (legacy) algorithm or the PEP algorithm.
+            # followed by the original version string
+            # Go through the PEP 386/426 stuff one by one, since
+            # we might get failures
+            list_pep = []
+            release_versions = set()
+            prerelease_versions = set()
+            excluded_versions = set()
+            translated_versions = set()
+            for v in versions:
+                s = v
+                try:
+                    k = sort_key(v)
+                except Exception:
+                    s = suggest_normalized_version(v)
+                    if not s:
+                        good = False
+                        logger.debug('%-15.15s failed for %r, no suggestions', pname, v)
+                        excluded_versions.add(v)
+                        continue
+                    else:
+                        try:
+                            k = sort_key(s)
+                        except ValueError:
+                            logger.error('%-15.15s failed for %r, with suggestion %r',
+                                         pname, v, s)
+                            excluded_versions.add(v)
+                            continue
+                    logger.debug('%-15.15s translated %r to %r', pname, v, s)
+                    translated_versions.add(v)
+                if is_release_version(s):
+                    release_versions.add(v)
+                else:
+                    prerelease_versions.add(v)
+                    if releases_only:
+                        logger.debug('%-15.15s ignoring pre-release %r', pname, s)
+                        continue
+                list_pep.append((k, v))
+            if releases_only and prerelease_versions and not release_versions:
+                logger.debug('%-15.15s has no release versions', pname)
+                null_projects.add(pname)
+                continue
+            if not list_pep:
+                logger.debug('%-15.15s has no compatible versions', pname)
+                incompatible_projects.add(pname)
+                continue
+            # The legacy approach doesn't refuse the temptation to guess,
+            # so it *always* gives some kind of answer
+            if releases_only:
+                excluded_versions |= prerelease_versions
+            accepted_versions = set(versions) - excluded_versions
+            list_legacy = [(legacy_key(v), v) for v in accepted_versions]
+            assert len(list_legacy) == len(list_pep)
+            sorted_legacy = sorted(list_legacy)
+            sorted_pep = sorted(list_pep)
+            sv_legacy = [t[1] for t in sorted_legacy]
+            sv_pep = [t[1] for t in sorted_pep]
+            if sv_legacy != sv_pep:
+                if translated_versions:
+                     logger.debug('%-15.15s translation creates sort differences', pname)
+                     sort_error_translated_projects.add(pname)
+                else:
+                     logger.debug('%-15.15s incompatible due to sort errors', pname)
+                     sort_error_compatible_projects.add(pname)
+                logger.debug('%-15.15s unequal: legacy: %s', pname, sv_legacy)
+                logger.debug('%-15.15s unequal: pep%s: %s', pname, pepno, sv_pep)
+                continue
+            # The project is compatible to some degree,
+            if excluded_versions:
+                logger.debug('%-15.15s has some compatible versions', pname)
+                filtered_projects.add(pname)
+                continue
+            if translated_versions:
+                logger.debug('%-15.15s is compatible after translation', pname)
+                translated_projects.add(pname)
+                continue
+            logger.debug('%-15.15s is fully compatible', pname)
+            compatible_projects.add(pname)
+
+    def print_report(self):
+        print("Analysing {}".format(self.title))
+        for category in self.categories:
+            print(" ", category)
+
+
 def main(pepno = '426'):
-    sort_key = SORT_KEYS[pepno]
     print('Comparing PEP %s version sort to setuptools.' % pepno)
 
-    projects = get_projects(VERSION_CACHE)
-    num_projects = len(projects)
-
-    null_projects = Category("No releases", num_projects)
-    compatible_projects = Category("Compatible", num_projects)
-    translated_projects = Category("Compatible with translation", num_projects)
-    filtered_projects = Category("Compatible with filtering", num_projects)
-    sort_error_translated_projects = Category("Translations sort differently", num_projects)
-    sort_error_compatible_projects = Category("Incompatible due to sorting errors", num_projects)
-    incompatible_projects = Category("Incompatible", num_projects)
-
-    categories = [
-        null_projects,
-        compatible_projects,
-        translated_projects,
-        filtered_projects,
-        sort_error_translated_projects,
-        sort_error_compatible_projects,
-        incompatible_projects,
-    ]
-
-    sort_failures = 0
-    for i, (pname, versions) in enumerate(projects.items()):
-        if i % 100 == 0:
-            sys.stderr.write('%s / %s\r' % (i, num_projects))
-            sys.stderr.flush()
-        if not versions:
-            logger.debug('%-15.15s has no releases', pname)
-            null_projects.add(pname)
-            continue
-        # list_legacy and list_pep will contain 2-tuples
-        # comprising a sortable representation according to either
-        # the setuptools (legacy) algorithm or the PEP algorithm.
-        # followed by the original version string
-        list_legacy = [(legacy_key(v), v) for v in versions]
-        # Go through the PEP 386/426 stuff one by one, since
-        # we might get failures
-        list_pep = []
-        excluded_versions = set()
-        translated_versions = set()
-        for v in versions:
-            try:
-                k = sort_key(v)
-            except Exception:
-                s = suggest_normalized_version(v)
-                if not s:
-                    good = False
-                    logger.debug('%-15.15s failed for %r, no suggestions', pname, v)
-                    excluded_versions.add(v)
-                    continue
-                else:
-                    try:
-                        k = sort_key(s)
-                    except ValueError:
-                        logger.error('%-15.15s failed for %r, with suggestion %r',
-                                     pname, v, s)
-                        excluded_versions.add(v)
-                        continue
-                logger.debug('%-15.15s translated %r to %r', pname, v, s)
-                translated_versions.add(v)
-            list_pep.append((k, v))
-        if not list_pep:
-            logger.debug('%-15.15s has no compatible releases', pname)
-            incompatible_projects.add(pname)
-            continue
-        # Now check the versions sort as expected
-        if excluded_versions:
-            list_legacy = [(k, v) for k, v in list_legacy
-                                              if v not in excluded_versions]
-        assert len(list_legacy) == len(list_pep)
-        sorted_legacy = sorted(list_legacy)
-        sorted_pep = sorted(list_pep)
-        sv_legacy = [t[1] for t in sorted_legacy]
-        sv_pep = [t[1] for t in sorted_pep]
-        if sv_legacy != sv_pep:
-            if translated_versions:
-                 logger.debug('%-15.15s translation creates sort differences', pname)
-                 sort_error_translated_projects.add(pname)
-            else:
-                 logger.debug('%-15.15s incompatible due to sort errors', pname)
-                 sort_error_compatible_projects.add(pname)
-            logger.debug('%-15.15s unequal: legacy: %s', pname, sv_legacy)
-            logger.debug('%-15.15s unequal: pep%s: %s', pname, pepno, sv_pep)
-            continue
-        # The project is compatible to some degree,
-        if excluded_versions:
-            logger.debug('%-15.15s has some compatible releases', pname)
-            filtered_projects.add(pname)
-            continue
-        if translated_versions:
-            logger.debug('%-15.15s is compatible after translation', pname)
-            translated_projects.add(pname)
-            continue
-        logger.debug('%-15.15s is fully compatible', pname)
-        compatible_projects.add(pname)
-
-    for category in categories:
-        print(category)
-
+    projects, public = get_projects(VERSION_CACHE)
+    print()
+    Analysis("release versions", public, releases_only=True).print_report()
+    print()
+    Analysis("public versions", public).print_report()
+    print()
+    Analysis("all versions", projects).print_report()
     # Uncomment the line below to explore differences in details
     # import pdb; pdb.set_trace()
     # Grepping the log files is also informative
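
The new ``is_release_version`` helper decides which versions the "release
versions" pass keeps. The sketch below copies the pattern and the function
from the diff so they can be tried in isolation. The only deviation is a
raw string for the pattern, and the sample version strings are made up for
illustration rather than taken from the PyPI data:

```python
import re

# Same pattern as PEP426_PRERELEASE_RE in the diff, written as a raw string.
PEP426_PRERELEASE_RE = re.compile(r'(a|b|c|rc|dev)\d+')

def is_release_version(s):
    # search() matches anywhere in the string, so any a/b/c/rc/dev
    # segment followed by a digit marks the version as a pre-release.
    return not bool(PEP426_PRERELEASE_RE.search(s))

# Illustrative inputs only (not taken from the PyPI data set).
for version in ["1.0", "1.0.post1", "1.0a1", "1.0rc2", "1.0.dev3"]:
    kind = "release" if is_release_version(version) else "pre-release"
    print("%-10s %s" % (version, kind))
```

Post-releases such as ``1.0.post1`` count as releases under this check,
since ``post`` is not part of the pattern.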

-- 
Repository URL: http://hg.python.org/peps

