The current memory layout for dictionaries is
unnecessarily inefficient. It has a sparse table of
24-byte entries containing the hash value, key pointer,
and value pointer.
Instead, the 24-byte entries should be stored in a
dense table referenced by a sparse table of indices.
For example, the dictionary:
d = {'timmy': 'red', 'barry': 'green', 'guido': 'blue'}
is currently stored as:
entries = [['--', '--', '--'],
           [-8522787127447073495, 'barry', 'green'],
           ['--', '--', '--'],
           ['--', '--', '--'],
           ['--', '--', '--'],
           [-9092791511155847987, 'timmy', 'red'],
           ['--', '--', '--'],
           [-6480567542315338377, 'guido', 'blue']]
Instead, the data should be organized as follows:
indices = [None, 1, None, None, None, 0, None, 2]
entries = [[-9092791511155847987, 'timmy', 'red'],
           [-8522787127447073495, 'barry', 'green'],
           [-6480567542315338377, 'guido', 'blue']]
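To make the two-table layout concrete, here is a minimal lookup
sketch in the spirit of the pure Python proof-of-concept linked
below (linear probing is a simplification; CPython uses a perturbed
probe sequence):

    FREE = None  # marks an unused slot in the sparse index table

    def lookup(indices, entries, key):
        # Probe the small sparse table; only hits touch the dense table.
        mask = len(indices) - 1
        h = hash(key)
        i = h & mask
        while True:
            idx = indices[i]
            if idx is FREE:
                raise KeyError(key)
            entry_hash, entry_key, entry_value = entries[idx]
            if entry_hash == h and entry_key == key:
                return entry_value
            i = (i + 1) & mask  # simplified collision probe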
Only the data layout needs to change. The hash table
algorithms would stay the same. All of the current
optimizations would be kept, including key-sharing
dicts and custom lookup functions for string-only
dicts. There is no change to the hash functions, the
table search order, or collision statistics.
The memory savings are significant (from 30% to 95%
compression depending on how full the table is).
Small dicts (size 0, 1, or 2) get the most benefit.
For a sparse table of size t with n entries, the sizes are:
curr_size = 24 * t
new_size = 24 * n + sizeof(index) * t
In the above timmy/barry/guido example, the current
size is 192 bytes (eight 24-byte entries) and the new
size is 80 bytes (three 24-byte entries plus eight
1-byte indices). That gives 58% compression.
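A throwaway helper to check the arithmetic (not part of the proposal):

    def sizes(t, n, index_size):
        curr_size = 24 * t
        new_size = 24 * n + index_size * t
        return curr_size, new_size

    print(sizes(8, 3, 1))   # (192, 80) for the timmy/barry/guido example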
Note that sizeof(index) can be as small as a single
byte for small dicts, two bytes for bigger dicts, and
up to sizeof(Py_ssize_t) for huge dicts.
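A sketch of how the index width could scale with table size (the
thresholds here are just the natural unsigned-integer boundaries;
treat the exact choice as illustrative):

    def index_size(t):
        # Width in bytes of one slot in the sparse index table.
        if t <= 0xFF:
            return 1   # small dicts: 1-byte indices
        if t <= 0xFFFF:
            return 2   # bigger dicts: 2-byte indices
        if t <= 0xFFFFFFFF:
            return 4
        return 8       # huge dicts: sizeof(Py_ssize_t) on 64-bit builds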
In addition to space savings, the new memory layout
makes iteration faster. Currently, keys(), values(),
and items() loop over the sparse table, skipping over
free slots in the hash table. Now, keys/values/items can
loop directly over the dense table, using fewer memory
accesses.
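In the sketch's terms, iteration reduces to a plain scan of the
dense table:

    def iter_items(entries):
        # No free slots to skip: every row in the dense table is live.
        for entry_hash, key, value in entries:
            yield key, value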
Another benefit is that resizing is faster and
touches fewer pieces of memory. Currently, every
hash/key/value entry is moved or copied during a
resize. In the new layout, only the indices are
updated. For the most part, the hash/key/value entries
never move (except for an occasional swap to fill a
hole left by a deletion).
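Reusing FREE from the lookup sketch above, a resize could look like
this (again with simplified linear probing; note that only the index
table is rebuilt):

    def resize(entries, new_t):
        indices = [FREE] * new_t
        mask = new_t - 1
        for idx, (entry_hash, key, value) in enumerate(entries):
            i = entry_hash & mask
            while indices[i] is not FREE:
                i = (i + 1) & mask
            indices[i] = idx
        return indices  # the dense entries list is untouched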
With the reduced memory footprint, we can also expect
better cache utilization.
For those wanting to experiment with the design,
there is a pure Python proof-of-concept here:
http://code.activestate.com/recipes/578375
YMMV: Keep in mind that the above size statistics assume a
build with 64-bit Py_ssize_t and 64-bit pointers. The
space savings percentages are a bit different on other
builds. Also, note that in many applications, the size
of the data dominates the size of the container (i.e.
the weight of a bucket of water is mostly the water,
not the bucket).
Raymond
Hi all,
https://github.com/python/cpython is now live as a semi-official, *read
only* Github mirror of the CPython Mercurial repository. Let me know if you
have any problems/concerns.
I still haven't decided how often to update it (considering either just N
times a day, or maybe using a Hg hook for batching). Suggestions are welcome.
The methodology I used to create it is via hg-fast-export. I also tried to
pack and gc the git repo as much as possible before the initial Github push
- it went down from almost ~2GB to ~200MB (so this is the size of a fresh
clone right now).
Eli
P.S. thanks Jesse for the keys to https://github.com/python
I've received some enthusiastic emails from someone who wants to
revive restricted mode. He started out with a bunch of patches to the
CPython runtime using ctypes, which he attached to an App Engine bug:
http://code.google.com/p/googleappengine/issues/detail?id=671
Based on his code (the file secure.py is all you need, included in
secure.tar.gz) it seems he believes the only security leaks are
__subclasses__, gi_frame and gi_code. (I have since convinced him that
if we add "restricted" guards to these attributes, he doesn't need the
functions added to sys.)
I don't recall the exploits that Samuele once posted that caused the
death of rexec.py -- does anyone recall, or have a pointer to the
threads?
--
--Guido van Rossum (home page: http://www.python.org/~guido/)
It appears there's no obvious link from bugs.python.org to the contributor
agreement - you need to go via the unintuitive link Foundation ->
Contribution Forms (and from what I've read, you're prompted when you add a
patch to the tracker).
I'd suggest that if the "Contributor Form Received" field is "No" in user
details, there be a link to http://www.python.org/psf/contrib/.
Tim Delaney
Hello.
I would like to discuss on the language summit a potential inclusion
of cffi[1] into stdlib. This is a project Armin Rigo has been working
on for a while, with some input from other developers. It seems that the
main reason why people would prefer ctypes over cffi these days is
"because it's included in stdlib", which is not generally the reason I
would like to hear. Our calls to not use C extensions and to use an
FFI instead have seen very limited success with ctypes and quite a lot
more since cffi got released. The API is fairly stable right now with
minor changes going in, and it'll definitely stabilize before the Python 3.4
release. Notable projects using it:
* pypycore - gevent main loop ported to cffi
* pgsql2cffi
* sdl-cffi bindings
* tls-cffi bindings
* lxml-cffi port
* pyzmq
* cairo-cffi
* a bunch of others
So, relatively a lot, given that the project is not even a year old (it
got its 0.1 release in June). As per documentation, the advantages over
ctypes:
* The goal is to call C code from Python. You should be able to do so
without learning a 3rd language: every alternative requires you to
learn their own language (Cython, SWIG) or API (ctypes). So we tried
to assume that you know Python and C and minimize the extra bits of
API that you need to learn.
* Keep all the Python-related logic in Python so that you don’t need
to write much C code (unlike CPython native C extensions).
* Work either at the level of the ABI (Application Binary Interface)
or the API (Application Programming Interface). Usually, C libraries
have a specified C API but often not an ABI (e.g. they may document a
“struct” as having at least these fields, but maybe more). (ctypes
works at the ABI level, whereas Cython and native C extensions work at
the API level.)
* We try to be complete. For now some C99 constructs are not
supported, but all C89 should be, including macros (and including
macro “abuses”, which you can manually wrap in saner-looking C
functions).
* We attempt to support both PyPy and CPython, with a reasonable path
for other Python implementations like IronPython and Jython.
* Note that this project is not about embedding executable C code in
Python, unlike Weave. This is about calling existing C libraries from
Python.
So, among other things, making a cffi extension gives you the same
level of security as writing C (unlike ctypes) and brings quite a
bit more flexibility (the API vs. ABI issue) that lets you wrap arbitrary
libraries, even those full of macros.
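For a taste, here is a minimal ABI-level example along the lines of
the cffi documentation (calling printf from the standard C library;
no compiler and no generated wrapper involved):

    from cffi import FFI

    ffi = FFI()
    ffi.cdef("int printf(const char *format, ...);")  # declare the C signature
    C = ffi.dlopen(None)                 # load the standard C library (POSIX)
    arg = ffi.new("char[]", b"world")    # owned C string, freed automatically
    C.printf(b"hi there, %s!\n", arg)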
Cheers,
fijal
.. [1]: http://cffi.readthedocs.org/en/release-0.5/
http://bugs.python.org/issue19562
proposes to change the first assert in Lib/datetime.py
assert 1 <= month <= 12, month
to
assert 1 <= month <= 12, 'month must be in 1..12'
to match the next two asserts out of the *53* in the file. I think that
is the wrong direction of change, but that is not my question here.
Should stdlib code use assert at all?
If user input can trigger an assert, then the code should raise a normal
exception that will not disappear with -OO.
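For example (an illustrative rewrite, not the actual datetime.py code):

    # assert checks are stripped under python -O / -OO, so this can vanish:
    assert 1 <= month <= 12, 'month must be in 1..12'

    # An explicit raise survives any optimization level:
    if not 1 <= month <= 12:
        raise ValueError('month must be in 1..12')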
If the assert is testing program logic, then it seems that the test
belongs in the test file, in this case, test/test_datetime.py. For
example, consider the following (backwards) code.
_DI4Y = _days_before_year(5)
# A 4-year cycle has an extra leap day over what we'd get from pasting
# together 4 single years.
assert _DI4Y == 4 * 365 + 1
To me, the constant should be directly set to its known value.
_DI4Y = 4 * 365 + 1
The function should then be tested in test_datetime.
self.assertEqual(dt._days_before_year(5), dt._DI4Y)
Is there any policy on use of assert in stdlib production code?
--
Terry Jan Reedy
Howdy friends,
According to PEP 404, there will never be an official Python 2.8.
The migration path is from 2.7 to 3.x.
I agree with this strategy in almost all consequences but this one:
Many customers are forced to stick with Python 2.X because of other
products, but they require a Python 2.X version which can be compiled
using Visual Studio 2010 or better.
This is considered an improvement and not a bug fix, which is where I disagree.
So I see many Python 2.7 repositories on BB/GH which are hacked to support
VS 2010, but they are not converted with the quality
that fits our standards.
My question
-----------
I have created a very clean Python 2.7.6+ based CPython with the Stackless
additions, that compiles with VS 2010, using the adapted project structure
of Python 3.3.X, and I want to publish that on the Stackless website as the
official "Stackless Python 2.8". If you consider Stackless as official ;-) .
This compiler change is currently the only deviation from CPython 2.7,
but we may support a few easy backports on customer demand. We don't add
any random stuff, of course.
My question is whether that is safe to do:
Can I rely on PEP 404 that the "Python 2.8" namespace will never clash
with CPython?
And if not, what do you suggest then?
It will be submitted by the end of November; thanks for your quick responses!
all the best -- Chris
--
Christian Tismer :^) <mailto:tismer@stackless.com>
Software Consulting : Have a break! Take a ride on Python's
Karl-Liebknecht-Str. 121 : *Starship* http://starship.python.net/
14482 Potsdam : PGP key -> http://pgp.uni-mainz.de
phone +49 173 24 18 776 fax +49 (30) 700143-0023
PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04
whom do you want to sponsor today? http://www.stackless.com/
Hi,
Larry has granted me a special pardon to add an outstanding fix for SSL,
http://bugs.python.org/issue19509 . Right now most stdlib modules
(ftplib, imaplib, nntplib, poplib, smtplib) neither support server name
indication (SNI) nor check the subject name of the peer's certificate
properly. The second issue is a major loophole because it allows
man-in-the-middle attacks despite CERT_REQUIRED.
With CERT_REQUIRED OpenSSL verifies that the peer's certificate is
directly or indirectly signed by a trusted root certification authority.
With Python 3.4 the ssl module is able to use/load the system's trusted
root certs on all major systems (Linux, Mac, BSD, Windows). On Linux and
BSD it requires a properly configured system openssl to locate the root
certs. This usually works out of the box. On Mac Apple's openssl build
is able to use the keychain API of OSX. I have added code for Windows'
system store.
SSL socket code usually looks like this:
context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
context.verify_mode = ssl.CERT_REQUIRED
# new, by default it loads certs trusted for Purpose.SERVER_AUTH
context.load_default_certs()
sock = socket.create_connection(("example.net", 443))
sslsock = context.wrap_socket(sock)
SSLContext.wrap_socket() wraps an ordinary socket into an SSLSocket. With
verify_mode = CERT_REQUIRED OpenSSL ensures that the peer's SSL
certificate is signed by a trusted root CA. In this example one very
important step is missing. The peer may return *ANY* signed certificate
for *ANY* hostname. These lines do NOT check that the certificate's
information matches "example.net". An attacker can use an arbitrary
certificate (e.g. for "www.evil.net"), get it signed and abuse it for
MitM attacks on "mail.python.org".
http://docs.python.org/3/library/ssl.html#ssl.match_hostname must be
used to verify the cert. It's easy to forget it...
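For reference, the forgotten step is just these lines after the
handshake (match_hostname() raises ssl.CertificateError on mismatch):

    # Match the peer's certificate against the hostname we intended to reach.
    cert = sslsock.getpeercert()
    ssl.match_hostname(cert, "example.net")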
I have thought about multiple ways to fix the issue. At first I added a
new argument "check_hostname" to all affected modules and implemented
the check manually. For every module I had to modify several places for
SSL and STARTTLS and add / change about 10 lines. The extra lines are
required to properly shutdown and close the connection when the cert
doesn't match the hostname. I don't like the solution because it's
tedious. Every 3rd party author has to copy the same code, too.
Then I came up with a better solution:
context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
context.verify_mode = ssl.CERT_REQUIRED
context.load_default_certs()
context.check_hostname = True # <-- NEW
sock = socket.create_connection(("example.net", 443))
# server_hostname is already used for SNI
sslsock = context.wrap_socket(sock, server_hostname="example.net")
This fix requires only a new SSLContext attribute and a small
modification to SSLSocket.do_handshake():
if self.context.check_hostname:
    try:
        match_hostname(self.getpeercert(), self.server_hostname)
    except Exception:
        self.shutdown(_SHUT_RDWR)
        self.close()
        raise
Pros:
* match_hostname() is done in one central place
* the cert is matched as early as possible
* no extra arguments for APIs, a context object is enough
* library developers just have to add server_hostname to get SNI and
hostname checks at the same time
* users of libraries can configure cert verification and checking on the
same object
* missing checks will not pass silently
Cons:
* Doesn't work with OpenSSL < 0.9.8f (released 2007) because older
versions lack SNI support. The ssl module raises an exception for
server_hostname if SNI is not supported.
The default settings for all stdlib modules will still be verify_mode =
CERT_NONE and check_hostname = False for maximum backward compatibility.
Python 3.4 comes with a new function ssl.create_default_context() that
returns a new context with best practice settings and loaded root CA
certs. The settings are TLS 1.0, no weak and insecure ciphers (no MD5,
no RC4), no compression (CRIME attack), CERT_REQUIRED and check_hostname
= True (for client side only).
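With that helper, the hardened client boilerplate from above shrinks
to a few lines (a sketch based on the patch in the issue):

    context = ssl.create_default_context()
    sock = socket.create_connection(("example.net", 443))
    # The handshake now fails if the cert does not match "example.net".
    sslsock = context.wrap_socket(sock, server_hostname="example.net")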
http://bugs.python.org/issue19509 has a working patch for ftplib.
Comments?
Christian
I wanted to help people who are trying to find out more
about the PEP submission process by providing relevant
info (or a pointer) in the README.rst located at the
root of the PEPs repository. You can see it here.
https://bitbucket.org/rirror/peps
I filed this issue on b.p.o:
http://bugs.python.org/issue19822
However, as you may see, I met some resistance,
which directed me to discuss the matter here. So, I'd like
to discuss two things:
1. Is my README.rst enhancement request wrong?
2. What makes people close tickets without providing
arguments, offering only words like "should"?
This is not the first thing that I personally find discouraging.
It may be attributed to my reputation, but what's the
point in doing this?
The only logical explanation I find for resistance in this
particular case is that people in core Python
development don't take usability and user experience
matters seriously. Most of core development is about
systems, not about people, so everybody is scratching
their own itches. I wish that usability and UX were given at
least some coverage in Python development regulations.
--
anatoly t.
Hi,
I'm trying to write an example of usage of the new
tracemalloc.get_object_traceback() function. Last month I proposed to
use it to give the traceback where a file/socket was allocated on
ResourceWarning:
https://mail.python.org/pipermail/python-dev/2013-October/129923.html
I found a working solution using monkey patching of the socket, _pyio
and builtins modules:
https://bitbucket.org/haypo/misc/src/tip/python/res_warn.py
This hack uses tracemalloc to retrieve the traceback where the
file/socket was allocated, but you can do the same using
traceback.extract_stack() for older Python versions. So it's just an
example to show how tracemalloc can be used.
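The core of the trick is only a few lines (a simplified sketch;
res_warn.py linked above is the full version):

    import tracemalloc
    tracemalloc.start(25)   # must run early; keep up to 25 frames per traceback

    def print_allocation_traceback(obj):
        # Where was this object allocated?
        tb = tracemalloc.get_object_traceback(obj)
        if tb is not None:
            print("Allocation traceback (most recent first):")
            for line in tb.format():
                print(line)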
Example with test_poplib (without res_warn.py):
[1/1] test_poplib
/home/haypo/prog/python/default/Lib/test/support/__init__.py:1331:
ResourceWarning: unclosed <ssl.SSLSocket fd=5,
family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('127.0.0.1',
45747), raddr=('127.0.0.1', 48933)>
gc.collect()
OK, you now know that the socket was destroyed by the garbage collector...
But which test created this socket???
Example of res_warn.py with test_poplib:
[1/1] test_poplib
/home/haypo/prog/python/default/Lib/test/support/__init__.py:1331:
ResourceWarning: unclosed <ssl.SSLSocket fd=5,
family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('127.0.0.1',
49715), raddr=('127.0.0.1', 43904)>
gc.collect()
Allocation traceback (most recent first):
File '/home/haypo/prog/python/default/Lib/ssl.py', line 340
_context=self)
File '/home/haypo/prog/python/default/Lib/poplib.py', line 390
self.sock = context.wrap_socket(self.sock)
File '/home/haypo/prog/python/default/Lib/test/test_poplib.py', line 407
self.client.stls()
File '/home/haypo/prog/python/default/Lib/unittest/case.py', line 567
self.setUp()
File '/home/haypo/prog/python/default/Lib/unittest/case.py', line 610
return self.run(*args, **kwds)
(...)
You can see that the socket was created by test_poplib at line 407
(in TestPOP3_TLSClass.setUp). It's more useful than the previous warning
:-)
I found two issues in _pyio and tracemalloc modules. These issues
should be fixed to use res_warn.py on text or buffered files and use
it during Python shutdown:
http://bugs.python.org/issue19831
http://bugs.python.org/issue19829
Victor