The current memory layout for dictionaries is
unnecessarily inefficient. It has a sparse table of
24-byte entries containing the hash value, key pointer,
and value pointer.
Instead, the 24-byte entries should be stored in a
dense table referenced by a sparse table of indices.
For example, the dictionary:
d = {'timmy': 'red', 'barry': 'green', 'guido': 'blue'}
is currently stored as:
entries = [['--', '--', '--'],
           [-8522787127447073495, 'barry', 'green'],
           ['--', '--', '--'],
           ['--', '--', '--'],
           ['--', '--', '--'],
           [-9092791511155847987, 'timmy', 'red'],
           ['--', '--', '--'],
           [-6480567542315338377, 'guido', 'blue']]
Instead, the data should be organized as follows:
indices = [None, 1, None, None, None, 0, None, 2]
entries = [[-9092791511155847987, 'timmy', 'red'],
           [-8522787127447073495, 'barry', 'green'],
           [-6480567542315338377, 'guido', 'blue']]
Only the data layout needs to change. The hash table
algorithms would stay the same. All of the current
optimizations would be kept, including key-sharing
dicts and custom lookup functions for string-only
dicts. There is no change to the hash functions, the
table search order, or collision statistics.
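A minimal pure-Python sketch of the proposed two-table lookup may make this concrete (hypothetical helper names; linear probing is used for brevity, whereas CPython's actual probe sequence is more elaborate but unchanged by this proposal):

```python
# Sketch: a sparse table of small indices pointing into a dense table
# of [hash, key, value] entries.  None marks an empty slot.

def insert(indices, entries, key, value):
    """Add or replace key; entries stays dense, only indices are sparse."""
    h = hash(key)
    n = len(indices)
    i = h % n
    while indices[i] is not None:
        idx = indices[i]
        if entries[idx][0] == h and entries[idx][1] == key:
            entries[idx][2] = value          # replace in place
            return
        i = (i + 1) % n                      # probe the next slot
    indices[i] = len(entries)                # point at the new dense slot
    entries.append([h, key, value])

def lookup(indices, entries, key):
    """Return the value for key, or raise KeyError."""
    h = hash(key)
    n = len(indices)
    i = h % n
    while True:
        idx = indices[i]
        if idx is None:                      # empty slot: key is absent
            raise KeyError(key)
        entry_hash, entry_key, value = entries[idx]
        if entry_hash == h and entry_key == key:
            return value
        i = (i + 1) % n
```

Note that iteration order falls out of the dense table for free: entries appear in insertion order regardless of where their indices land in the sparse table.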
The memory savings are significant (from 30% to 95%
compression, depending on how full the table is).
Small dicts (size 0, 1, or 2) get the most benefit.
For a sparse table of size t with n entries, the sizes are:
curr_size = 24 * t
new_size = 24 * n + sizeof(index) * t
In the above timmy/barry/guido example, the current
size is 192 bytes (eight 24-byte entries) and the new
size is 80 bytes (three 24-byte entries plus eight
1-byte indices). That gives 58% compression.
Note, the sizeof(index) can be as small as a single
byte for small dicts, two bytes for bigger dicts, and
up to sizeof(Py_ssize_t) for huge dicts.
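The sizing arithmetic above can be sketched as follows (illustrative only; a real C implementation would also need to reserve sentinel values for empty and dummy slots, so the exact thresholds would differ slightly):

```python
def index_bytes(t):
    """Bytes per index for a sparse table of size t (sketch)."""
    if t <= 0xff:
        return 1          # one byte suffices for small tables
    if t <= 0xffff:
        return 2
    if t <= 0xffffffff:
        return 4
    return 8              # up to sizeof(Py_ssize_t) on 64-bit builds

def sizes(t, n, entry_size=24):
    """Current vs. proposed table sizes for t slots holding n entries."""
    curr = entry_size * t
    new = entry_size * n + index_bytes(t) * t
    return curr, new
```

For the timmy/barry/guido example, `sizes(8, 3)` gives 192 bytes for the current layout and 80 for the proposed one.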
In addition to space savings, the new memory layout
makes iteration faster. Currently, keys(), values(), and
items() loop over the sparse table, skipping over free
slots in the hash table. Now, keys/values/items can
loop directly over the dense table, using fewer memory
accesses.
Another benefit is that resizing is faster and
touches fewer pieces of memory. Currently, every
hash/key/value entry is moved or copied during a
resize. In the new layout, only the indices are
updated. For the most part, the hash/key/value entries
never move (except for an occasional swap to fill a
hole left by a deletion).
With the reduced memory footprint, we can also expect
better cache utilization.
For those wanting to experiment with the design,
there is a pure Python proof-of-concept here:
http://code.activestate.com/recipes/578375
YMMV: Keep in mind that the above size statistics assume a
build with 64-bit Py_ssize_t and 64-bit pointers. The
space savings percentages are a bit different on other
builds. Also, note that in many applications, the size
of the data dominates the size of the container (i.e.
the weight of a bucket of water is mostly the water,
not the bucket).
Raymond
Hi all,
https://github.com/python/cpython is now live as a semi-official, *read
only* Github mirror of the CPython Mercurial repository. Let me know if you
have any problems/concerns.
I still haven't decided how often to update it (considering either just N
times a day, or maybe using an Hg hook for batching). Suggestions are welcome.
The methodology I used to create it is via hg-fast-export. I also tried to
pack and gc the git repo as much as possible before the initial Github push
- it went down from almost ~2GB to ~200MB (so this is the size of a fresh
clone right now).
Eli
P.S. thanks Jesse for the keys to https://github.com/python
I've received some enthusiastic emails from someone who wants to
revive restricted mode. He started out with a bunch of patches to the
CPython runtime using ctypes, which he attached to an App Engine bug:
http://code.google.com/p/googleappengine/issues/detail?id=671
Based on his code (the file secure.py is all you need, included in
secure.tar.gz) it seems he believes the only security leaks are
__subclasses__, gi_frame and gi_code. (I have since convinced him that
if we add "restricted" guards to these attributes, he doesn't need the
functions added to sys.)
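For context, the classic __subclasses__ leak lets sandboxed code walk from any harmless literal back to every class loaded in the interpreter:

```python
# Starting from a plain tuple literal, reach `object` and enumerate
# every currently loaded class -- the starting point of many sandbox
# escapes, since the list typically includes file-like and code-related
# classes that restricted code was never meant to see.
base = ().__class__.__bases__[0]          # <class 'object'>
all_classes = base.__subclasses__()       # every loaded class
```

Guarding the `__subclasses__` attribute is meant to cut off exactly this traversal.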
I don't recall the exploits that Samuele once posted that caused the
death of rexec.py -- does anyone recall, or have a pointer to the
threads?
--
--Guido van Rossum (home page: http://www.python.org/~guido/)
It appears there's no obvious link from bugs.python.org to the contributor
agreement - you need to go via the unintuitive link Foundation ->
Contribution Forms (and from what I've read, you're prompted when you add a
patch to the tracker).
I'd suggest that if the "Contributor Form Received" field is "No" in user
details, there be a link to http://www.python.org/psf/contrib/.
Tim Delaney
Hello.
I would like to discuss on the language summit a potential inclusion
of cffi[1] into stdlib. This is a project Armin Rigo has been working
for a while, with some input from other developers. It seems that the
main reason why people would prefer ctypes over cffi these days is
"because it's included in stdlib", which is not generally the reason I
would like to hear. Our calls to not use C extensions and to use an
FFI instead have seen very limited success with ctypes and quite a lot
more since cffi was released. The API is fairly stable right now, with
minor changes going in, and it'll definitely stabilize before the
Python 3.4 release. Notable projects using it:
* pypycore - gevent main loop ported to cffi
* pgsql2cffi
* sdl-cffi bindings
* tls-cffi bindings
* lxml-cffi port
* pyzmq
* cairo-cffi
* a bunch of others
So that's relatively a lot, given that the project is not even a year
old (it got its 0.1 release in June). As per the documentation, the
advantages over ctypes:
* The goal is to call C code from Python. You should be able to do so
without learning a 3rd language: every alternative requires you to
learn their own language (Cython, SWIG) or API (ctypes). So we tried
to assume that you know Python and C and minimize the extra bits of
API that you need to learn.
* Keep all the Python-related logic in Python so that you don’t need
to write much C code (unlike CPython native C extensions).
* Work either at the level of the ABI (Application Binary Interface)
or the API (Application Programming Interface). Usually, C libraries
have a specified C API but often not an ABI (e.g. they may document a
“struct” as having at least these fields, but maybe more). (ctypes
works at the ABI level, whereas Cython and native C extensions work at
the API level.)
* We try to be complete. For now some C99 constructs are not
supported, but all C89 should be, including macros (and including
macro “abuses”, which you can manually wrap in saner-looking C
functions).
* We attempt to support both PyPy and CPython, with a reasonable path
for other Python implementations like IronPython and Jython.
* Note that this project is not about embedding executable C code in
Python, unlike Weave. This is about calling existing C libraries from
Python.
so among other things, making a cffi extension gives you the same
level of security as writing C (unlike ctypes), and brings quite a
bit more flexibility (the API vs. ABI issue) that lets you wrap
arbitrary libraries, even those full of macros.
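To make the ABI-vs-API distinction concrete using only stdlib tools (this is ctypes, not cffi code): in ABI mode the caller, not a header file, asserts the function's signature, and any mistake is a silent binary mismatch rather than a compile-time error.

```python
# ABI-level binding with ctypes: we declare sqrt()'s signature ourselves.
# If these declarations disagree with the real libm ABI, nothing checks
# them -- the hazard that cffi's API mode avoids by parsing actual C
# declarations.
import ctypes
import ctypes.util

# Fall back to a common soname if find_library comes up empty.
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")
libm.sqrt.restype = ctypes.c_double       # we assert the return type...
libm.sqrt.argtypes = [ctypes.c_double]    # ...and the argument types
```

With cffi's API mode, the equivalent declaration would be checked against the library's real headers at build time.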
Cheers,
fijal
.. [1]: http://cffi.readthedocs.org/en/release-0.5/
Hello,
Following the python-dev discussion, I've written a PEP to recap the
proposal and the various arguments. It's inlined below, and it will
probably appear soon at http://www.python.org/dev/peps/pep-0455/, too.
Regards
Antoine.
PEP: 455
Title: Adding a key-transforming dictionary to collections
Version: $Revision$
Last-Modified: $Date$
Author: Antoine Pitrou <solipsis(a)pitrou.net>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 13-Sep-2013
Python-Version: 3.4
Post-History:
Abstract
========
This PEP proposes a new data structure for the ``collections`` module,
called "TransformDict" in this PEP. This structure is a mutable mapping
which transforms the key using a given function when doing a lookup, but
retains the original key when reading.
Rationale
=========
Numerous specialized versions of this pattern exist. The most common
is a case-insensitive case-preserving dict, i.e. a dict-like container
which matches keys in a case-insensitive fashion but retains the
original casing. It is a very common need in network programming, as
many protocols feature some arrays of "key / value" properties in their
messages, where the keys are textual strings whose casing isn't
relevant.
Another common request is an identity dict, where keys are matched
according to their respective id()s instead of normal matching.
Both are instances of a more general pattern, where a given
transformation function is applied to keys when looking them up: that
function being ``str.lower`` in the former example and the built-in
``id`` function in the latter.
(it can be said that the pattern *projects* keys from the user-visible
set onto the internal lookup set, hence this PEP's title)
Semantics
=========
TransformDict is a ``MutableMapping`` implementation: it faithfully
implements the well-known API of mutable mappings, as ``dict`` itself
and other dict-like classes in the standard library. Therefore, this
PEP won't rehash the semantics of most TransformDict methods.
The transformation function needn't be bijective; it can be strictly
surjective, as in the case-insensitive example::
>>> d = TransformDict(str.lower)
>>> d['SomeKey'] = 5
>>> d['somekey']
5
>>> d['SOMEKEY']
5
TransformDict retains the first key used when creating an entry::
>>> d = TransformDict(str.lower)
>>> d['SomeKey'] = 1
>>> d['somekey'] = 2
>>> list(d.items())
[('SomeKey', 2)]
The original keys needn't be hashable, as long as the transformation
function returns a hashable one::
>>> d = TransformDict(id)
>>> l = [None]
>>> d[l] = 5
>>> l in d
True
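A minimal pure-Python sketch of these semantics (illustrative only; this is not the implementation tracked on the bug tracker):

```python
from collections.abc import MutableMapping

class TransformDict(MutableMapping):
    """Sketch: store transform(key) -> (original_key, value)."""

    def __init__(self, transform, *args, **kwargs):
        self._transform = transform
        self._data = {}                # transformed key -> (key, value)
        self.update(*args, **kwargs)

    def __getitem__(self, key):
        return self._data[self._transform(key)][1]

    def __setitem__(self, key, value):
        tkey = self._transform(key)
        if tkey in self._data:         # retain the first original key
            key = self._data[tkey][0]
        self._data[tkey] = (key, value)

    def __delitem__(self, key):
        del self._data[self._transform(key)]

    def __iter__(self):
        return (key for key, value in self._data.values())

    def __len__(self):
        return len(self._data)
```

Only the transformed key needs to be hashable, which is why the ``id`` example above works with an unhashable list as the user-visible key.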
Constructor
-----------
As shown in the examples above, creating a TransformDict requires
passing the key transformation function as the first argument (much
like creating a ``defaultdict`` requires passing the factory function
as first argument).
The constructor also takes other optional arguments which can be used
to initialize the TransformDict with certain key-value pairs. Those
optional arguments are the same as in the ``dict`` and ``defaultdict``
constructors::
>>> d = TransformDict(str.lower, [('Foo', 1)], Bar=2)
>>> sorted(d.items())
[('Bar', 2), ('Foo', 1)]
Alternative proposals and questions
===================================
Retaining the last original key
-------------------------------
Most python-dev respondents found retaining the first user-supplied key
more intuitive than retaining the last. Also, it matches the dict
object's own behaviour when using different but equal keys::
>>> d = {}
>>> d[1] = 'hello'
>>> d[1.0] = 'world'
>>> d
{1: 'world'}
Furthermore, explicitly retaining the last key in a first-key-retaining
scheme is still possible using the following approach::
d.pop(key, None)
d[key] = value
while the converse (retaining the first key in a last-key-retaining
scheme) doesn't look possible without rewriting part of the container's
code.
Using an encoder / decoder pair
-------------------------------
Using a function pair isn't necessary, since the original key is
retained by the container. Moreover, an encoder / decoder pair would
require the transformation to be bijective, which prevents important
use cases like case-insensitive matching.
Providing a transformation function for values
----------------------------------------------
Dictionary values are not used for lookup; their semantics are totally
irrelevant to the container's operation. Therefore, there is no point
in having both an "original" and a "transformed" value: the transformed
value wouldn't be used for anything.
Providing a specialized container, not generic
----------------------------------------------
It was asked why we would provide the generic TransformDict construct
rather than a specialized case-insensitive dict variant. The answer
is that it's nearly as cheap (code-wise and performance-wise) to provide
the generic construct, and it can fill more use cases.
Implementation
==============
A patch for the collections module is tracked on the bug tracker at
http://bugs.python.org/issue18986.
Existing work
=============
Case-insensitive dicts are a popular request:
*
http://twistedmatrix.com/documents/current/api/twisted.python.util.Insensit…
* https://mail.python.org/pipermail/python-list/2013-May/647243.html
* https://mail.python.org/pipermail/python-list/2005-April/296208.html
* https://mail.python.org/pipermail/python-list/2004-June/241748.html
* http://bugs.python.org/msg197376
* http://stackoverflow.com/a/2082169
* http://stackoverflow.com/a/3296782
* http://code.activestate.com/recipes/66315-case-insensitive-dictionary/
* https://gist.github.com/babakness/3901174
* http://www.wikier.org/blog/key-insensitive-dictionary-in-python
* http://en.sharejs.com/python/14534
* http://www.voidspace.org.uk/python/archive.shtml#caseless
Identity dicts have been requested too:
* https://mail.python.org/pipermail/python-ideas/2010-May/007235.html
* http://www.gossamer-threads.com/lists/python/python/209527
Python's own pickle module uses identity lookups for object
memoization:
http://hg.python.org/cpython/file/0e70bf1f32a3/Lib/pickle.py#l234
Copyright
=========
This document has been placed in the public domain.
..
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End:
hi, everyone:
I want to compile Python 3.3 with bz2 support on RedHat 5.5 but failed to do so. Here is what I did:
1. download bzip2 and compile it (make, make -f Makefile_libbz2_so, make install)
2. change to the Python 3.3 source directory: ./configure --with-bz2=/usr/local/include
3. make
4. make install
after installation complete, I test it:
[root@localhost Python-3.3.0]# python3 -c "import bz2"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/local/lib/python3.3/bz2.py", line 21, in <module>
from _bz2 import BZ2Compressor, BZ2Decompressor
ImportError: No module named '_bz2'
By the way, RedHat 5.5 has a built-in python 2.4.3. Would it be a problem?
In the 'PEP453 ready for pronouncement' thread, Donald said
>> Because reality is that new users are still likely to be using Python 2.7.
>> Python 3 is just now starting to be really usable, however there's a huge
>> corpus of existing tutorials, course work, books etc for Python 2.7. As
>> Python 3 becomes more usable that existing corpus of material will be
>> ported over to Python 3 but in the interim there is still a pretty large
>> hurdle for new users to get over.
And Guido replied
> Based on my day-to-day experience this is still very true. (And yes, I'm
> slowly turning the tide. But it will take a long time and I am committed to
> giving users the choice.)
Widely-used and linked web resources tend to persist for a very
long time, so we shouldn't use the prevalence of Python 2 resources
as a reason for excessive caution. The key question is how much
good material is available based on Python 3 - and this has improved
significantly over the past couple of years. The classic "How to
Think Like a Computer Scientist" has an excellent Python 3 version
available at http://interactivepython.org, for example.
Things are changing with print media, too. Pragmatic Programmers are
about to publish the 2nd edition of "Practical Programming", based
on Python 3. Most of the major academic publishers have released
Python 3 books in the last 12 months. The tide is definitely
turning, perhaps has already turned.
Encouraging the continued use of 2.7 for existing programmers is
entirely justifiable, but for *newcomers* to programming I think it
is now much harder to justify. A stronger case could have been made
a couple of years ago, when many important packages were not yet
available for Python 3, but things have changed. Even big frameworks
like Django are now usable with Python 3. If we aren't yet at the
point where package availability stops being an adoption barrier
for beginners, we are surely very close.
I've been teaching Python as a first language to university students
for many years now, initially with Python 2 and for the last few
years with Python 3. In my experience, they encounter fewer problems
with Python 3 (just as Guido intended, no doubt :) The one stumbling
block in the past has been package availability for project work,
but I don't expect that to be a problem this year. All of the web and
GUI development work that they'll be doing with me, for example,
will be done entirely in Python 3.
Nick
On Mon, 30 Sep 2013 17:08:00 +0200 (CEST)
christian.heimes <python-checkins(a)python.org> wrote:
> +
> +* It should return ``0`` for zero length input. (Note: This can be handled as
> + special case, too.)
What is this required for? The main requirement is that equal stuff
hashes equal, but there's no reason for zero-length input to hash to
zero, AFAIK.
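For reference, current CPython does already behave this way, which may be where the requirement comes from (the open question being whether the spec should mandate it rather than merely permit it):

```python
# In current CPython, zero-length strings and bytes hash to 0 even with
# hash randomization enabled -- zero-length input bypasses the seed.
print(hash(""))    # 0
print(hash(b""))   # 0
```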
Regards
Antoine.