Argument Clinic "converters" specify how to convert an individual
argument to the function you're defining. Although a converter could
theoretically represent any sort of conversion, most of the time they
directly represent types like "int" or "double" or "str".
Because there's such variety in argument parsing, the converters are
customizable with parameters. Many of these are common enough that
Argument Clinic suggests some standard names. Examples: "zeroes=True"
for strings and buffers means "permit internal \0 characters", and
"bitwise=True" for unsigned integers means "copy the bits over, even if
there's overflow/underflow, and even if the original is negative".
A third example is "nullable=True", which means "also accept None for
this parameter". This was originally intended for use with strings
(compare the "s" and "z" format units for PyArg_ParseTuple); however, it
looks like we'll have a use for "nullable ints" in the ongoing Argument
Clinic conversion work.
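To make that concrete, a hypothetical Clinic declaration might look like
this (the module, function, and parameter names are made up; the converter
parameters are the ones described above):

    /*[clinic input]
    mymodule.wait
        timeout: int(nullable=True) = -1
            Timeout in seconds; None means "wait forever".
        data: str(zeroes=True)
            A string which may contain embedded \0 characters.
    [clinic start generated code]*/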
Several people have said they found the name "nullable" surprising,
suggesting I use another name like "allow_none" or "noneable". I, in
turn, find their surprise surprising; "nullable" is a term long
associated with exactly this concept. It's used in C# and SQL, and the
term even has its own Wikipedia page:
http://en.wikipedia.org/wiki/Nullable_type
Most amusingly, Vala *used* to have an annotation called "(allow-none)",
but they've broken it out into two annotations, "(nullable)" and
"(optional)".
http://blogs.gnome.org/desrt/2014/05/27/allow-none-is-dead-long-live-nullab…
Before you say "the term 'nullable' will confuse end users", let me
remind you: this is not user-facing. This is a parameter for an
Argument Clinic converter, and will only ever be seen by CPython core
developers. A group which I hope is not so easily confused.
It's my contention that "nullable" is the correct name. But I've been
asked to bring up the topic for discussion, to see if a consensus forms
around this or around some other name.
Let the bike-shedding begin,
//arry/
After reading http://bugs.python.org/issue23085 and remembering the
struggle of getting our own patches into cpython's copy of libffi (but not
into libffi itself), I wonder: is there any reason anymore for libffi
being included in CPython?
Cheers,
fijal
Here is some proposed wording. Since it is more of a clarification of what
it takes to garner support -- which is just a new section -- rather than a
complete rewrite, I'm including just the diff to make it easier to read the
changes.
diff -r 49d18bb47ebc pep-0011.txt
--- a/pep-0011.txt Wed May 14 11:18:22 2014 -0400
+++ b/pep-0011.txt Fri May 16 13:48:30 2014 -0400
@@ -2,22 +2,21 @@
Title: Removing support for little used platforms
Version: $Revision$
Last-Modified: $Date$
-Author: martin@v.loewis.de (Martin von Löwis)
+Author: Martin von Löwis <martin@v.loewis.de>,
+        Brett Cannon <brett@python.org>
Status: Active
Type: Process
Content-Type: text/x-rst
Created: 07-Jul-2002
Post-History: 18-Aug-2007
+              16-May-2014
Abstract
--------
-This PEP documents operating systems (platforms) which are not
-supported in Python anymore. For some of these systems,
-supporting code might be still part of Python, but will be removed
-in a future release - unless somebody steps forward as a volunteer
-to maintain this code.
+This PEP documents how an operating system (platform) garners
+support in Python as well as documenting past support.
Rationale
@@ -37,16 +36,53 @@
change to the Python source code will work on all supported
platforms.
-To reduce this risk, this PEP proposes a procedure to remove code
-for platforms with no Python users.
+To reduce this risk, this PEP specifies what is required for a
+platform to be considered supported by Python as well as providing a
+procedure to remove code for platforms with little or no Python
+users.
+Supporting platforms
+--------------------
+
+Gaining official platform support requires two things. First, a core
+developer needs to volunteer to maintain platform-specific code. This
+core developer can either already be a member of the Python
+development team or be given contributor rights on the basis of
+maintaining platform support (it is at the discretion of the Python
+development team to decide if a person is ready to have such rights
+even if it is just for supporting a specific platform).
+
+Second, a stable buildbot must be provided [2]_. This guarantees that
+platform support will not be accidentally broken by a Python core
+developer who does not have personal access to the platform. For a
+buildbot to be considered stable it requires that the machine be
+reliably up and functioning (but it is up to the Python core
+developers to decide whether to promote a buildbot to being
+considered stable).
+
+This policy does not disqualify supporting other platforms
+indirectly. Patches which are not platform-specific but still done to
+add platform support will be considered for inclusion. For example,
+if platform-independent changes to the configure script were necessary
+and motivated by support for a specific platform, those changes would
+be accepted. Patches which add platform-specific code such as the
+name of a specific platform to the configure script will generally
+not be accepted without the platform having official support.
+
+CPU architecture and compiler support are viewed in a similar manner
+as platforms. For example, to consider the ARM architecture supported
+a buildbot running on ARM would be required along with support from
+the Python development team. In general it is not required to have
+a CPU architecture run under every possible platform in order to be
+considered supported.
Unsupporting platforms
----------------------
-If a certain platform that currently has special code in it is
-deemed to be without Python users, a note must be posted in this
-PEP that this platform is no longer actively supported. This
+If a certain platform that currently has special code in Python is
+deemed to be without Python users or lacks proper support from the
+Python development team and/or a buildbot, a note must be posted in
+this PEP that this platform is no longer actively supported. This
note must include:
- the name of the system
@@ -69,8 +105,8 @@
forward and offer maintenance.
-Resupporting platforms
-----------------------
+Re-supporting platforms
+-----------------------
If a user of a platform wants to see this platform supported
again, he may volunteer to maintain the platform support. Such an
@@ -101,7 +137,7 @@
release is made. Developers of extension modules will generally need
to use the same Visual Studio release; they are concerned both with
the availability of the versions they need to use, and with keeping
-the zoo of versions small. The Python source tree will keep
+the zoo of versions small. The Python source tree will keep
unmaintained build files for older Visual Studio releases, for which
patches will be accepted. Such build files will be removed from the
source tree 3 years after the extended support for the compiler has
@@ -223,6 +259,7 @@
----------
.. [1] http://support.microsoft.com/lifecycle/
+.. [2] http://buildbot.python.org/3.x.stable/
Copyright
---------
Hi David,
I noticed you run the "Builder x86 Ubuntu Shared" buildbot. It seems
it's running a very old version of Ubuntu. Is there any chance of
getting that updated?
Regards,
Benjamin
As promised in the "Move selected documentation repos to PSF BitBucket
account?" thread, I've written up a PEP for moving selected repositories from
hg.python.org to Github.
You can see this PEP online at: https://www.python.org/dev/peps/pep-0481/
I've also reproduced the PEP below for inline discussion.
-----------------------
Abstract
========
This PEP proposes migrating to Git and Github for certain supporting
repositories (such as the repository for Python Enhancement Proposals) in a way
that is more accessible to new contributors, and easier to manage for core
developers. This is offered as an alternative to PEP 474, which aims to achieve
the same overall benefits while continuing to use the Mercurial DVCS and
without relying on a commercial entity.
In particular this PEP proposes changes to the following repositories:
* https://hg.python.org/devguide/
* https://hg.python.org/devinabox/
* https://hg.python.org/peps/
This PEP does not propose any changes to the core development workflow for
CPython itself.
Rationale
=========
As PEP 474 mentions, there are currently a number of repositories hosted on
hg.python.org which are not directly used for the development of CPython but
instead are supporting or ancillary repositories. These supporting repositories
do not typically have complex workflows, and often have no branches at all other
than the primary integration branch. This simplicity makes them very good
targets for the "Pull Request" workflow that is commonly found on sites like Github.
However, where PEP 474 wants to continue to use Mercurial, wishes to use an
OSS and self-hosted solution, and therefore restricts itself to only those
options, this PEP expands the scope to include migrating to Git and using Github.
The existing method of contributing to these repositories generally includes
generating a patch and either uploading it to bugs.python.org or emailing
it to peps@python.org. This process is unfriendly towards non-committer
contributors, as well as making the process harder than it needs to be for
committers to accept the patches sent by users. In addition to the benefits
of the pull request workflow itself, this style of workflow also enables
non-technical contributors, especially those who do not know their way around
the DVCS of choice, to contribute using the web-based editor. On the committer
side, Pull Requests enable them to tell, before merging, whether or not
a particular Pull Request will break anything. It also enables them to do a
simple "push button" merge which does not require them to check out the
changes locally. Another such feature, useful in particular for docs,
is the ability to view a "prose" diff. This Github-specific feature enables
a committer to view a diff of the rendered output, which hides things like
reformatting a paragraph and shows what the actual "meat" of the change is.
Why Git?
--------
Looking at the variety of DVCSs which are available today, it becomes fairly
clear that git has gained the vast majority of mindshare among people who are
currently using a DVCS. The Open Hub (previously Ohloh) statistics [#openhub-stats]_
show that currently 37% of the repositories Open Hub is indexing are using git,
second only to SVN (which has 48%), while Mercurial has just 2% of the indexed
repositories (beating only Bazaar, which has 1%). In addition to the Open Hub
statistics, a look at the top 100 projects on PyPI (ordered by total download
counts) shows us that within the Python space itself a majority of
projects use git:
=== ========= ========== ====== === ====
Git Mercurial Subversion Bazaar CVS None
=== ========= ========== ====== === ====
62  22        7          4      1   1
=== ========= ========== ====== === ====
Choosing a DVCS which has the larger mindshare will make it more likely that any
particular person who has experience with a DVCS at all will be able to
meaningfully use the DVCS that we have chosen without having to learn a new
tool.
In addition to simply making it more likely that any individual will already
know how to use git, the number of projects and people using it means that the
resources for learning the tool are likely to be more fully fleshed out, and
when you run into problems the likelihood that someone else had that problem,
posted a question, and received an answer is also far higher.
Thirdly, by using a more popular tool you also increase your options for tooling
*around* the DVCS itself. Looking at the various options for hosting
repositories, it's extremely rare to find a hosting solution (whether OSS or
commercial) that supports Mercurial but does not support Git; on the flip side,
there are a number of tools which support Git but do not support Mercurial.
Therefore the popularity of git increases the flexibility of our options going
into the future for what toolchain these projects use.
Also, by moving to the more popular DVCS we increase the likelihood that the
knowledge a person gains by contributing to these support
repositories will transfer to projects outside of the immediate CPython project,
such as to the larger Python community which is primarily using Git hosted on
Github.
In previous years there was concern about how well supported git was on Windows
in comparison to Mercurial. However, git has grown to support Windows as a first
class citizen. In addition to that, for Windows users who are not well acquainted
with the Windows command line there are GUI options as well.
On a technical level git and Mercurial are fairly similar; however, the git
branching model is significantly better than Mercurial "Named Branches" for
non-committer contributors. Mercurial does have a "Bookmarks" extension, however
this isn't quite as good as git's branching model. All bookmarks live in the
same namespace, so it requires individual users to namespace
the branch names themselves lest they risk collisions. It also is an extension
which requires new users to first discover they need an extension at all and
then figure out what they need to do in order to enable that extension. Since
it is an extension, it also means that in general support for bookmarks outside of
Mercurial core is going to be less than 100% in comparison to git, where the
feature is built in and core to using git at all. Finally, users who are not
used to Mercurial are unlikely to discover bookmarks on their own; instead they
will likely attempt to use Mercurial's "Named Branches" which, given that
they live "forever", are not often what a project wants their contributors to
use.
Why Github?
-----------
There are a number of software projects or web services which offer
functionality similar to that of Github. These range from commercial web
services such as Bitbucket to self-hosted OSS solutions such as Kallithea or
Gitlab. This PEP proposes that we move these repositories to Github.
There are two primary reasons for selecting Github: Popularity and
Quality/Polish.
Github is currently the most popular repository hosting service according to
Alexa, where it currently has a global rank of 121. Much like for Git itself, by
choosing the most popular tool we gain benefits in increasing the likelihood
that a new contributor will have already experienced the toolchain, the quality
and availability of help, more and better tooling being built around it, and
the knowledge transfer to other projects. A look again at the top 100 projects
by download counts on PyPI shows the following hosting locations:
====== ========= =========== ========= =========== ==========
GitHub BitBucket Google Code Launchpad SourceForge Other/Self
====== ========= =========== ========= =========== ==========
62     18        6           4         3           7
====== ========= =========== ========= =========== ==========
In addition to all of those reasons, Github also has the benefit that while
many of the options have similar features when you look at them in a feature
matrix, the Github version of each of those features tends to work better and be
far more polished. This is hard to quantify objectively; however, it is a fairly
common sentiment if you go around and ask people who use these services
often.
Finally, a reason to choose a web service at all over something that is
self-hosted is to be able to more efficiently use volunteer time and donated
resources. Every additional service hosted on the PSF infrastructure by the
PSF infrastructure team further spreads out the amount of time that the
volunteers on that team have to spend, and uses some chunk of resources that
could potentially be used for something where there is no free or affordable
hosted solution available.
One concern that people do have with using a hosted service is that there is a
lack of control and that at some point in the future the service may no longer
be suitable. It is the opinion of this PEP that Github does not currently and
has not in the past engaged in any attempts to lock people into their platform
and that if at some point in the future Github is no longer suitable for one
reason or another, then at that point we can look at migrating away from Github
onto a different solution. In other words, we'll cross that bridge if and when
we come to it.
Example: Scientific Python
--------------------------
One of the key ideas behind the move to both git and Github is that a feature
of a DVCS, the repository hosting, and the workflow used is the social network
and size of the community using said tools. We can see this is true by looking
at an example from a sub-community of the Python community: The Scientific
Python community. They have already migrated most of the key pieces of the
SciPy stack onto Github using the Pull Request based workflow, starting with
IPython, and as more projects moved over it became a natural default for new
projects.
They claim to have seen a great benefit from this move, where it enables casual
contributors to easily move between different projects within their
sub-community without having to learn a special, bespoke workflow and a
different toolchain for each project. They've found that when people can spend
their limited time on actually contributing instead of learning different
tools and workflows, not only do they contribute more to one project,
they also branch out and contribute to other projects. This move is also
credited with making it commonplace for members of that community to go so far
as to publish their research and educational materials on Github as well.
This showcases the real power behind moving to a highly popular toolchain and
workflow, as each variance introduces yet another hurdle for new and casual
contributors to get past and it makes the time spent learning that workflow
less reusable with other projects.
Migration
=========
Through the use of hg-git [#hg-git]_ we can easily convert a Mercurial
repository to a Git repository by simply pushing the Mercurial repository to
the Git repository. People who wish to continue to use Mercurial locally can
then use hg-git going into the future with the new Github URL; however, they
will need to re-clone their repositories, as using Git as the server seems to
trigger a one-time change of the changeset IDs.
As none of the selected repositories have any tags, branches, or bookmarks
other than the ``default`` branch, the migration will simply map the ``default``
branch in Mercurial to the ``master`` branch in git.
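As a sketch, the conversion of one repository could look something like this
(assumes the hg-git extension is installed and enabled, and that an empty
GitHub repository already exists; the GitHub URL is illustrative):

    $ hg clone https://hg.python.org/peps
    $ cd peps
    $ hg bookmark -r default master    # hg-git publishes bookmarks as git branches
    $ hg push git+ssh://git@github.com/python/peps.git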
In addition, since none of the selected projects have any great need of a
complex bug tracker, they will also migrate their issue handling to
GitHub issues.
In addition to the migration of the repository hosting itself, there are a
number of locations for each particular repository which will require updating.
The bulk of these will simply be changing commands from the hg equivalent to
the git equivalent.
In particular this will include:
* Updating www.python.org to generate PEPs using a git clone and link to
  Github.
* Updating docs.python.org to pull from Github instead of hg.python.org for the
  devguide.
* Enabling the ability to send an email to python-checkins@python.org for each
  push.
* Enabling the ability to send an IRC message to #python-dev on Freenode for
  each push.
* Migrating any issues for these projects to their respective bug tracker on
  Github.
This will restore these repositories to functionality similar to what they
currently have. In addition to this, the migration will also include enabling testing for
each pull request using Travis CI [#travisci]_ where possible to ensure that
a new PR does not break the ability to render the documentation or PEPs.
User Access
===========
Moving to Github would involve adding an additional user account that will need
to be managed; however, it also offers finer-grained control, allowing the
ability to grant someone access to only one particular repository instead of
the coarser-grained ACLs available on hg.python.org.
References
==========
.. [#openhub-stats] `Open Hub Statistics <https://www.openhub.net/repositories/compare>`_
.. [#hg-git] `hg-git <https://hg-git.github.io/>`_
.. [#travisci] `Travis CI <https://travis-ci.org/>`_
---
Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
Newbie first post on this list; apologies if what follows is out of context ...
Hi all,
I'm struggling with the issue per the subject; I've read different threads and
issue http://bugs.python.org/issue15443, which started in 2012 and is still
open as of today.
Isn't there a legitimate case for nanosecond support? It's all over the
place in 'struct timespec', and, maybe wrongly, I always found Python and C
were best neighbors. That's for the notional aspect.
More practically, aren't we close enough with current hardware, PTP
and the like, that this deserves more consideration?
Maybe this has been mentioned before, but the limiting factor isn't just
getting nanoseconds: anything sub-microsecond won't work with the
current format. OPC UA, which I was looking at just now, has 0.1 us
(100 ns) resolution, so it really cares about 100 ns, but datetime's 1 us
resolution simply won't cut it.
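A quick illustration of the gap (a sketch; the timestamp value is made up):

    from datetime import datetime, timezone

    ns = 1438387200123456789       # nanoseconds since the epoch
    dt = datetime.fromtimestamp(ns / 1e9, tz=timezone.utc)
    print(dt.microsecond)          # microseconds at best; the trailing
                                   # nanoseconds are gone (and float rounding
                                   # already bites before that)
    print(datetime.resolution)     # 0:00:00.000001, i.e. 1 microsecond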
Regards,
Matthieu
This is a bit long, as I wrote it as if it were a blog post to try and give
background info on my thinking, etc. The TL;DR folks should start at the
"Ideal Scenario" section and read to the end.
P.S.: This is in Markdown and I have put it up at
https://gist.github.com/brettcannon/a9c9a5989dc383ed73b4 if you want a
nicer formatted version for reading.
# History lesson
Since I signed up for the python-dev mailing list way back in June 2002,
there seems to be a cycle where we as a group come to a realization that
our current software development process has not kept up with modern
practices and could stand for an update. For me this was first shown when
we moved from SourceForge to our own infrastructure, then again when we
moved from Subversion to Mercurial (I led both of these initiatives, so
it's somewhat a tradition/curse I find myself in this position yet again).
And so we again find ourselves at the point of realizing that we are not
keeping up with current practices and thus need to evaluate how we can
improve our situation.
# Where we are now
Now it should be realized that we have two sets of users of our development
process: contributors and core developers (the latter of whom can play both
roles). If you take a rough outline of our current, recommended process it
goes something like this:
1. Contributor clones a repository from hg.python.org
2. Contributor makes desired changes
3. Contributor generates a patch
4. Contributor creates account on bugs.python.org and signs the
[contributor agreement](https://www.python.org/psf/contrib/contrib-form/)
5. Contributor creates an issue on bugs.python.org (if one does not already
exist) and uploads a patch
6. Core developer evaluates patch, possibly leaving comments through our
[custom version of Rietveld](http://bugs.python.org/review/)
7. Contributor revises patch based on feedback and uploads new patch
8. Core developer downloads patch and applies it to a clean clone
9. Core developer runs the tests
10. Core developer does one last `hg pull -u` and then commits the changes
to various branches
I think we can all agree it works to some extent, but isn't exactly smooth.
There are multiple steps in there -- in full or partially -- that can be
automated. There is room to improve everyone's lives.
And we can't forget the people who help keep all of this running as well.
There are those that manage the SSH keys, the issue tracker, the review
tool, hg.python.org, and the email system that lets us know when stuff
happens on any of these other systems. The impact on them needs to also be
considered.
## Contributors
I see two scenarios for contributors to optimize for. There's the simple
spelling mistake patches and then there's the code change patches. The
former is the kind of thing that you can do in a browser without much
effort and should be a no-brainer commit/reject decision for a core
developer. This is the problem the GitHub/Bitbucket camps have been promoting
their solutions as solving, while leaving the cpython repo alone.
Unfortunately the bulk of our documentation is in the Doc/ directory of
cpython. While it's nice to think about moving the devguide, peps, and even
breaking out the tutorial to repos hosted on Bitbucket/GitHub, everything
else is in Doc/ (language reference, howtos, stdlib, C API, etc.). So
unless we want to completely break all of Doc/ out of the cpython repo and
have core developers willing to edit two separate repos when making changes
that impact code **and** docs, moving only a subset of docs feels like a
band-aid solution that ignores the elephant in the room: the
cpython repo, which the bulk of patches target.
For the code change patches, contributors need an easy way to get a hold of
the code and get their changes to the core developers. After that it's
things like letting contributors know that their patch doesn't apply
cleanly, doesn't pass tests, etc. As of right now getting the patch into
the issue tracker is a bit manual but nothing crazy. The real issue in this
scenario is core developer response time.
## Core developers
There is a finite amount of time that core developers get to contribute to
Python and it fluctuates greatly. This means that if a process can be found
which allows core developers to spend less time doing mechanical work and
more time doing things that can't be automated -- namely code reviews --
then the throughput of patches being accepted/rejected will increase. This
also impacts any increased patch submission rate that comes from improving
the situation for contributors because if the throughput doesn't change
then there will simply be more patches sitting in the issue tracker and
that doesn't benefit anyone.
# My ideal scenario
If I had an infinite amount of resources (money, volunteers, time, etc.),
this would be my ideal scenario:
1. Contributor gets code from wherever; easiest to just say "fork on GitHub
or Bitbucket" as they would be official mirrors of hg.python.org and are
updated after every commit, but could clone hg.python.org/cpython if they
wanted
2. Contributor makes edits; if they cloned on Bitbucket or GitHub then they
have browser edit access already
3. Contributor creates an account at bugs.python.org and signs the CLA
4. The contributor creates an issue at bugs.python.org (probably the one
piece of infrastructure we all agree is better than the other options,
although its workflow could use an update)
5. If the contributor used Bitbucket or GitHub, they send a pull request
with the issue # in the PR message
6. bugs.python.org notices the PR, grabs a patch for it, and puts it on
bugs.python.org for code review (a rough sketch of this step follows the list)
7. CI runs on the patch based on what Python versions are specified in the
issue tracker, letting everyone know if it applied cleanly, passed tests on
the OSs that would be affected, and also got a test coverage report
8. Core developer does a code review
9. Contributor updates their code based on the code review and the updated
patch gets pulled by bugs.python.org automatically and CI runs again
10. Once the patch is acceptable and assuming the patch applies cleanly to
all versions to commit to, the core developer clicks a "Commit" button,
fills in a commit message and NEWS entry, and everything gets committed (if
the patch can't apply cleanly then the core developer does it the
old-fashioned way, or maybe auto-generate a new PR which can be manually
touched up so it does apply cleanly?)
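A minimal sketch of step 6 above (every name here is hypothetical; the one
real piece is that GitHub serves a plain-text patch for any pull request at
a predictable URL):

    import re
    import urllib.request

    def fetch_pr_patch(owner, repo, number):
        # GitHub exposes every pull request as a plain-text patch.
        url = "https://github.com/%s/%s/pull/%d.patch" % (owner, repo, number)
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode("utf-8")

    def issue_from_message(message):
        # Assumes the contributor put the tracker issue number
        # (e.g. "#12345") in the PR message, per step 5.
        match = re.search(r"#(\d+)", message)
        return int(match.group(1)) if match else None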
Basically the ideal scenario lets contributors use whatever tools and
platforms that they want and provides as much automated support as possible
to make sure their code is tip-top before and during code review while core
developers can review and commit patches so easily that they can do their
job from a beach with a tablet and some WiFi.
## Where the current proposed solutions seem to fall short
### GitHub/Bitbucket
Basically GitHub/Bitbucket is a win for contributors but doesn't buy core
developers that much. GitHub/Bitbucket gives contributors the easy cloning,
drive-by patches, CI, and PRs. Core developers get a code review tool --
I'm counting Rietveld as deprecated after Guido's comments about the code's
maintenance issues -- and push-button commits **only for single branch
changes**. But for any patch that crosses branches we don't really gain
anything. At best core developers tell a contributor "please send your PR
against 3.4", push-button merge it, update a local clone, merge from 3.4 to
default, do the usual stuff, commit, and then push; that still keeps me off
the beach, though, so that doesn't get us the whole way. You could force
people to submit two PRs, but I don't see that flying. Maybe some tool
could be written that automatically handles the merge/commit across
branches once the initial PR is in? Or automatically create a PR that core
developers can touch up as necessary and then accept that as well?
Regardless, some solution is necessary to handle branch-crossing PRs.
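For reference, the manual cross-branch dance described above is roughly the
following (a sketch; the issue number and patch filename are made up):

    $ hg up 3.4                       # start on the maintenance branch
    $ hg import --no-commit pr.diff   # apply the contributor's change
    $ hg commit -m "Issue #12345: fix such-and-such."
    $ hg up default
    $ hg merge 3.4                    # carry the fix forward
    $ hg commit -m "Merge with 3.4 (issue #12345)."
    $ hg push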
As for GitHub vs. Bitbucket, I personally don't care. I like GitHub's
interface more, but that's personal taste. I like hg more than git, but
that's also personal taste (and I consider a transition from hg to git a
hassle but not a deal-breaker but also not a win). It is unfortunate,
though, that under this scenario we would have to choose only one platform.
It's also unfortunate both are closed-source, but that's not a
deal-breaker, just a knock against if the decision is close.
### Our own infrastructure
The shortcoming here is the need for developers, developers, developers!
Everything outlined in the ideal scenario is totally doable on our own
infrastructure with enough code and time (donated/paid-for infrastructure
shouldn't be an issue). But historically that code and time has not
materialized. Our code review tool is a fork that probably should be
replaced as only Martin von Löwis can maintain it. Basically Ezio Melotti
maintains the issue tracker's code. We don't exactly have a ton of people
constantly going "I'm so bored because everything for Python's development
infrastructure gets sorted so quickly!" A perfect example is that R. David
Murray came up with a nice update for our workflow after PyCon but then ran
out of time after mostly defining it and nothing ever became of it (maybe
we can rectify that at PyCon?). Eric Snow has pointed out how he has
written similar code for pulling PRs from I think GitHub to another code
review tool, but that doesn't magically make it work in our infrastructure
or get someone to write it and help maintain it (no offense, Eric).
IOW our infrastructure can do anything, but it can't run on hopes and
dreams. Commitments from many people to making this happen by a certain
deadline will be needed so as to not allow it to drag on forever. People
would also have to commit to continued maintenance to make this viable
long-term.
# Next steps
I'm thinking first draft PEPs by February 1 to know who's all-in (8 weeks
away), all details worked out in final PEPs and whatever is required to
prove to me it will work by the PyCon language summit (4 months away). I
make a decision by May 1, and
then implementation aims to be done by the time 3.5.0 is cut so we can
switch over shortly thereafter (9 months away). Sound like a reasonable
timeline?
The current memory layout for dictionaries is
unnecessarily inefficient. It has a sparse table of
24-byte entries containing the hash value, key pointer,
and value pointer.
Instead, the 24-byte entries should be stored in a
dense table referenced by a sparse table of indices.
For example, the dictionary:
d = {'timmy': 'red', 'barry': 'green', 'guido': 'blue'}
is currently stored as:
entries = [['--', '--', '--'],
           [-8522787127447073495, 'barry', 'green'],
           ['--', '--', '--'],
           ['--', '--', '--'],
           ['--', '--', '--'],
           [-9092791511155847987, 'timmy', 'red'],
           ['--', '--', '--'],
           [-6480567542315338377, 'guido', 'blue']]
Instead, the data should be organized as follows:
indices = [None, 1, None, None, None, 0, None, 2]
entries = [[-9092791511155847987, 'timmy', 'red'],
           [-8522787127447073495, 'barry', 'green'],
           [-6480567542315338377, 'guido', 'blue']]
Only the data layout needs to change. The hash table
algorithms would stay the same. All of the current
optimizations would be kept, including key-sharing
dicts and custom lookup functions for string-only
dicts. There is no change to the hash functions, the
table search order, or collision statistics.
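A rough pure-Python sketch of a lookup over the two-table layout (linear
probing stands in here for the real perturbed probe sequence; see the
proof-of-concept recipe linked below for the full version):

FREE = None

def lookup(indices, entries, key):
    mask = len(indices) - 1
    h = hash(key)
    i = h & mask
    while indices[i] is not FREE:
        entry_hash, entry_key, entry_value = entries[indices[i]]
        if entry_hash == h and entry_key == key:
            return entry_value
        i = (i + 1) & mask     # probe the next slot in the sparse table
    raise KeyError(key)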
The memory savings are significant (from 30% to 95%
compression depending on how full the table is).
Small dicts (size 0, 1, or 2) get the most benefit.
For a sparse table of size t with n entries, the sizes are:
curr_size = 24 * t
new_size = 24 * n + sizeof(index) * t
In the above timmy/barry/guido example, the current
size is 192 bytes (eight 24-byte entries) and the new
size is 80 bytes (three 24-byte entries plus eight
1-byte indices). That gives 58% compression.
Note, the sizeof(index) can be as small as a single
byte for small dicts, two bytes for bigger dicts, and
up to sizeof(Py_ssize_t) for huge dicts.
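Spelling out the arithmetic for the example above (with
the 64-bit assumptions noted at the end of this post):

t, n, index_size = 8, 3, 1            # slots, live entries, bytes per index
curr_size = 24 * t                    # 192 bytes
new_size = 24 * n + index_size * t    # 72 + 8 = 80 bytes
print(1 - new_size / curr_size)       # ~0.58, the 58% compression above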
In addition to space savings, the new memory layout
makes iteration faster. Currently, keys(), values(), and
items() loop over the sparse table, skipping over free
slots in the hash table. Now, keys/values/items can
loop directly over the dense table, using fewer memory
accesses.
Another benefit is that resizing is faster and
touches fewer pieces of memory. Currently, every
hash/key/value entry is moved or copied during a
resize. In the new layout, only the indices are
updated. For the most part, the hash/key/value entries
never move (except for an occasional swap to fill a
hole left by a deletion).
With the reduced memory footprint, we can also expect
better cache utilization.
For those wanting to experiment with the design,
there is a pure Python proof-of-concept here:
http://code.activestate.com/recipes/578375
YMMV: Keep in mind that the above size statistics assume a
build with 64-bit Py_ssize_t and 64-bit pointers. The
space savings percentages are a bit different on other
builds. Also, note that in many applications, the size
of the data dominates the size of the container (i.e.
the weight of a bucket of water is mostly the water,
not the bucket).
Raymond