The current memory layout for dictionaries is
unnecessarily inefficient. It has a sparse table of
24-byte entries containing the hash value, key pointer,
and value pointer.
Instead, the 24-byte entries should be stored in a
dense table referenced by a sparse table of indices.
For example, the dictionary:
d = {'timmy': 'red', 'barry': 'green', 'guido': 'blue'}
is currently stored as:
entries = [['--', '--', '--'],
           [-8522787127447073495, 'barry', 'green'],
           ['--', '--', '--'],
           ['--', '--', '--'],
           ['--', '--', '--'],
           [-9092791511155847987, 'timmy', 'red'],
           ['--', '--', '--'],
           [-6480567542315338377, 'guido', 'blue']]
Instead, the data should be organized as follows:
indices = [None, 1, None, None, None, 0, None, 2]
entries = [[-9092791511155847987, 'timmy', 'red'],
           [-8522787127447073495, 'barry', 'green'],
           [-6480567542315338377, 'guido', 'blue']]
Only the data layout needs to change. The hash table
algorithms would stay the same. All of the current
optimizations would be kept, including key-sharing
dicts and custom lookup functions for string-only
dicts. There is no change to the hash functions, the
table search order, or collision statistics.
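The layout change can be sketched in a few lines of Python (a
simplified model, not the actual C implementation; the naive linear
probing and the absence of dummy/resize handling are deliberate
simplifications):

```python
FREE = -1  # sentinel for an empty slot in the sparse index table

class CompactDict:
    """Toy model of the proposed two-table layout."""

    def __init__(self, size=8):
        self.indices = [FREE] * size   # sparse table of small integers
        self.entries = []              # dense [hash, key, value] rows

    def _probe(self, key):
        # Simplified linear probing; CPython uses a perturbed sequence.
        mask = len(self.indices) - 1
        i = hash(key) & mask
        while True:
            ix = self.indices[i]
            if ix == FREE or self.entries[ix][1] == key:
                return i
            i = (i + 1) & mask

    def __setitem__(self, key, value):
        i = self._probe(key)
        if self.indices[i] == FREE:
            # New key: append to the dense table, record its position.
            self.indices[i] = len(self.entries)
            self.entries.append([hash(key), key, value])
        else:
            self.entries[self.indices[i]][2] = value

    def __getitem__(self, key):
        ix = self.indices[self._probe(key)]
        if ix == FREE:
            raise KeyError(key)
        return self.entries[ix][2]

d = CompactDict()
d['timmy'] = 'red'
d['barry'] = 'green'
d['guido'] = 'blue'
```

Note that d.entries preserves insertion order as a side effect of the
dense layout, while d.indices plays the role of the old sparse table.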
The memory savings are significant (from 30% to 95%
compression depending on how full the table is).
Small dicts (size 0, 1, or 2) get the most benefit.
For a sparse table of size t with n entries, the sizes are:
curr_size = 24 * t
new_size = 24 * n + sizeof(index) * t
In the above timmy/barry/guido example, the current
size is 192 bytes (eight 24-byte entries) and the new
size is 80 bytes (three 24-byte entries plus eight
1-byte indices). That gives 58% compression.
Note, the sizeof(index) can be as small as a single
byte for small dicts, two bytes for bigger dicts, and
up to sizeof(Py_ssize_t) for huge dicts.
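The arithmetic above can be checked with a short script (the exact
width thresholds shown here are illustrative assumptions, not the
cutoffs a real implementation would necessarily use):

```python
def index_width(t):
    """Smallest index size (in bytes) that can address a table of t slots.

    Illustrative thresholds: 1 byte up to 256 slots, 2 bytes up to
    65536, then 4, then 8 (sizeof(Py_ssize_t) on 64-bit builds).
    """
    if t <= 256:
        return 1
    elif t <= 65536:
        return 2
    elif t <= 2**32:
        return 4
    return 8

def sizes(t, n):
    """Return (current, proposed) sizes for a sparse table of t slots
    holding n entries, assuming 24-byte hash/key/value entries."""
    curr_size = 24 * t
    new_size = 24 * n + index_width(t) * t
    return curr_size, new_size

curr, new = sizes(t=8, n=3)   # the timmy/barry/guido example
# curr == 192, new == 80  ->  (192 - 80) / 192 is about 58% savings
```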
In addition to space savings, the new memory layout
makes iteration faster. Currently, keys(), values(),
and items() loop over the sparse table, skipping over
free slots in the hash table. Now, keys/values/items can
loop directly over the dense table, using fewer memory
accesses.
Another benefit is that resizing is faster and
touches fewer pieces of memory. Currently, every
hash/key/value entry is moved or copied during a
resize. In the new layout, only the indices are
updated. For the most part, the hash/key/value entries
never move (except for an occasional swap to fill a
hole left by a deletion).
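A sketch of that cheaper resize in the proposed layout (simplified
linear probing; the FREE sentinel and table sizes are illustrative):

```python
FREE = -1  # sentinel for an empty slot in the sparse index table

def rebuild_indices(entries, new_size):
    """Rebuild only the sparse index table after growing.

    The dense (hash, key, value) rows stay exactly where they are;
    we just re-probe each stored hash into a larger index table.
    """
    indices = [FREE] * new_size
    mask = new_size - 1
    for pos, (h, key, value) in enumerate(entries):
        i = h & mask
        while indices[i] != FREE:   # simplified linear probing
            i = (i + 1) & mask
        indices[i] = pos
    return indices

entries = [(-9092791511155847987, 'timmy', 'red'),
           (-8522787127447073495, 'barry', 'green'),
           (-6480567542315338377, 'guido', 'blue')]
indices = rebuild_indices(entries, 16)   # grow from 8 to 16 slots
```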
With the reduced memory footprint, we can also expect
better cache utilization.
For those wanting to experiment with the design,
there is a pure Python proof-of-concept here:
http://code.activestate.com/recipes/578375
YMMV: Keep in mind that the above size statistics assume a
build with 64-bit Py_ssize_t and 64-bit pointers. The
space savings percentages are a bit different on other
builds. Also, note that in many applications, the size
of the data dominates the size of the container (i.e.
the weight of a bucket of water is mostly the water,
not the bucket).
Raymond
Hi all,
https://github.com/python/cpython is now live as a semi-official, *read
only* Github mirror of the CPython Mercurial repository. Let me know if you
have any problems/concerns.
I still haven't decided how often to update it (considering either just N
times a day, or maybe use a Hg hook for batching). Suggestions are welcome.
The methodology I used to create it is via hg-fast-export. I also tried to
pack and gc the git repo as much as possible before the initial Github push
- it went down from almost ~2GB to ~200MB (so this is the size of a fresh
clone right now).
Eli
P.S. thanks Jesse for the keys to https://github.com/python
I've received some enthusiastic emails from someone who wants to
revive restricted mode. He started out with a bunch of patches to the
CPython runtime using ctypes, which he attached to an App Engine bug:
http://code.google.com/p/googleappengine/issues/detail?id=671
Based on his code (the file secure.py is all you need, included in
secure.tar.gz) it seems he believes the only security leaks are
__subclasses__, gi_frame and gi_code. (I have since convinced him that
if we add "restricted" guards to these attributes, he doesn't need the
functions added to sys.)
I don't recall the exploits that Samuele once posted that caused the
death of rexec.py -- does anyone recall, or have a pointer to the
threads?
--
--Guido van Rossum (home page: http://www.python.org/~guido/)
Hi,
On core-mentorship someone asked about PEP 3145 - Asynchronous I/O for
subprocess.Popen. I answered that asyncio now has subprocess support
(including non-blocking I/O on the three standard stream pipes), so
it's not obvious anything else is needed.
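For reference, a minimal example of asyncio's subprocess support
(shown here with the modern asyncio.run() entry point):

```python
import asyncio
import sys

async def run_upper(data: bytes) -> bytes:
    # Spawn a child process and talk to its stdin/stdout pipes
    # without blocking the event loop.
    proc = await asyncio.create_subprocess_exec(
        sys.executable, '-c',
        'import sys; sys.stdout.write(sys.stdin.read().upper())',
        stdin=asyncio.subprocess.PIPE,
        stdout=asyncio.subprocess.PIPE)
    out, _ = await proc.communicate(data)   # feed stdin, collect stdout
    return out

result = asyncio.run(run_upper(b'hello'))
```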
Should we change the PEP's status to Rejected or Superseded?
Regards
Antoine.
Hi,
I noticed configparser behaves in a surprising way when a key
has special meaning in the INI format.
Consider this example:
>>> cp = configparser.ConfigParser()
>>> cp.read_dict({'DEFAULT': {';foo': 'bar'}})
>>> cp.write(sys.stdout)
[DEFAULT]
;foo = bar
Now when reading this again, ";foo = bar" will be a comment and
ignored. There's probably similar behaviour in other corner cases
(like if you'd use "[foo]" as key for example).
While it seems ConfigParser doesn't do any escaping at all, I'm
thinking it should at least raise some exception when such a value
is being set.
I'd expect writing something and then reading it back via the same
configparser to *always* result in the same data, as long as writing
worked without error.
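A self-contained reproduction of the round-trip failure (writing
succeeds silently, reading back drops the key):

```python
import configparser
import io

cp = configparser.ConfigParser()
cp.read_dict({'DEFAULT': {';foo': 'bar'}})

buf = io.StringIO()
cp.write(buf)            # emits ";foo = bar" with no escaping

cp2 = configparser.ConfigParser()
cp2.read_string(buf.getvalue())   # ";foo = bar" is parsed as a comment

# cp has the key, cp2 silently lost it.
```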
Thoughts? Should I submit a bug report?
Florian
--
() ascii ribbon campaign - stop html mail www.asciiribbon.org
/\ www.the-compiler.org | I love long mails http://email.is-not-s.ms/
To give happiness is to deserve happiness.
Gentlemen,
I'd like to politely ask for a second pair of eyes on [issue6839]. It's
been dragging for a very long time, and the fix is really a change from a
raise() to a debugging print.
Thanks in advance,
Adam Polkosnik
Hello,
It's a more or less known fact (ref: Google) that you can't inherit
from multiple generic builtin (as in "coded in C") types:
class C(dict, list): pass
TypeError: multiple bases have instance lay-out conflict
However, more detailed googling led me to this page of a book:
http://books.google.com.ua/books?id=JnR9hQA3SncC&pg=PA104&lpg=PA104 ,
which suggests that there might be ad hoc support for some native types
to serve as multiple bases together, but it doesn't provide an example.
The book is apparently focused on Python 2.
I tried to look at typeobject.c:best_base(), which throws the quoted
error above, and the only special rule I could quickly decipher from
it is that it's possible to multi-inherit from a class and its
subclass. Intuitively, it can be understood - it makes no sense to do
that on the same inheritance level, but it can happen with recursive
inheritance, and then there's no need for conflict - both classes can
be "merged" into the subclass.
Indeed, following works:
import _collections
class Foo(_collections.defaultdict, dict): pass
(opposite order gives MRO conflict)
So, is that it, or disjoint native types are supported as bases
somehow? Also, would someone know if a class-subclass case happens for
example in stdlib?
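For completeness, both cases can be demonstrated with the public
collections module:

```python
import collections

# Disjoint C-level instance layouts are rejected at class-creation
# time (this is the check in typeobject.c:best_base()):
try:
    class C(dict, list):
        pass
    error = ''
except TypeError as e:
    error = str(e)   # "multiple bases have instance lay-out conflict"

# A class together with one of its own base classes works, as the
# defaultdict example above shows:
class Foo(collections.defaultdict, dict):
    pass

f = Foo(int)
f['x'] += 1
```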
As the previous question,
https://mail.python.org/pipermail/python-dev/2014-April/134342.html
this one is to help set adequate requirements for implementing multiple
inheritance in MicroPython.
Thanks,
Paul mailto:pmiscml@gmail.com
(1) Should fixes to a docstring go in with a patch, even if they
aren't related to the functionality being changed?
Bytestring compilation has several orthogonal parameters. Most -- but
not all -- are documented in the docstring. (Specifically, there is
no explanation of the rx parameter which acts as a filter, and no
mention that symbolic links to directories are ignored.)
It is best if a commit changes one small thing at a time.
On the other hand, Nick recently posted that the minimal overhead of a
patch commit is about half an hour.
Is that overhead enough to override the one-issue-per-patch guideline?
(2) The patch adds new functionality to use multiple processes in
parallel. The normal parameter values are integers indicating how
many processes to use. The parameter also needs two special values --
one to indicate "use os.cpu_count", and the other to indicate "don't
use multiprocessing at all".
(A) Is there a Best Practices for this situation, with two odd cases?
(B) Claudiu originally copied the API from a similar API for
regrtest. What is the barrier for "do it sensibly" vs "stick with
precedent elsewhere"? (Apparently regrtest treats any negative number
as a request for the cpu_count calculation; I suspect that "-5" is
more likely to be an escaping error for "5" than it is to be a real
request for auto-calculation that just happened to choose -5 instead
of -1.)
(C) How important is it to keep the API consistent between a
top-level CLI command and the internal implementation? At the
moment, the request for cpu_count is handled in the CLI wrapper, and
not available to interactive callers. On the other hand, interactive
callers could just call cpu_count themselves...
(D) How important is it to maintain consistency with other uses of
the same tool -- multiprocessing has its own way of requesting
auto-calculation. (So someone used to multiprocessing might assume
that "None" meant to auto-calculate, as opposed to "don't use
multiprocessing at all.")
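For discussion purposes, here is one possible convention sketched out
(the function name and the choice of sentinel values are hypothetical,
not an existing API):

```python
import os

def normalize_workers(workers):
    """Map a user-facing 'workers' value to a concrete process count.

    Hypothetical convention, one of several discussed above:
      None -> no multiprocessing (serial execution)
      0    -> auto-detect via os.cpu_count()
      n>0  -> use exactly n processes
    Negative values are rejected rather than silently treated as a
    request for auto-detection (avoiding the regrtest-style "-5" trap).
    """
    if workers is None:
        return 1                     # serial: no process pool at all
    if workers < 0:
        raise ValueError("workers must be >= 0 or None")
    if workers == 0:
        return os.cpu_count() or 1   # cpu_count() may return None
    return workers
```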
-jJ
Greetings,
I've just woken up and noticed Python 2.7.7 is on track to be released,
and, in a rather unique event, contains a few security enhancements
in addition to the usual fixes:
http://legacy.python.org/dev/peps/pep-0466/
I thought this might be a good time to make a final plea to fix a
long-standing security issue in the installer on Windows. By default it
installs Python to the root folder, thereby bypassing filesystem permissions:
http://bugs.python.org/issue1284316
The main rationale given (for not using the standard %ProgramFiles%) has been
that the full path to python is too long to type, and ease of use is more
important than the security benefits given by following Windows conventions.
However, adding python to the PATH variable is an alternative solution that
would result in even fewer keystrokes needing to be typed at a console, while
maintaining system security.
As this is an installer setting and not a code change, I gather that the
opportunities for code breakage should be fewer. Please consider updating
this setting to a more secure and friendly default, for 2.7.7 and 3.5+ as well.
-Mike