The current memory layout for dictionaries is
unnecessarily inefficient. It has a sparse table of
24-byte entries containing the hash value, key pointer,
and value pointer.
Instead, the 24-byte entries should be stored in a
dense table referenced by a sparse table of indices.
For example, the dictionary:
d = {'timmy': 'red', 'barry': 'green', 'guido': 'blue'}
is currently stored as:
entries = [['--', '--', '--'],
           [-8522787127447073495, 'barry', 'green'],
           ['--', '--', '--'],
           ['--', '--', '--'],
           ['--', '--', '--'],
           [-9092791511155847987, 'timmy', 'red'],
           ['--', '--', '--'],
           [-6480567542315338377, 'guido', 'blue']]
Instead, the data should be organized as follows:
indices = [None, 1, None, None, None, 0, None, 2]
entries = [[-9092791511155847987, 'timmy', 'red'],
           [-8522787127447073495, 'barry', 'green'],
           [-6480567542315338377, 'guido', 'blue']]
Only the data layout needs to change. The hash table
algorithms would stay the same. All of the current
optimizations would be kept, including key-sharing
dicts and custom lookup functions for string-only
dicts. There is no change to the hash functions, the
table search order, or collision statistics.
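To make that concrete, here is a minimal sketch of a lookup walking the
index table (not the actual implementation: probing is simplified to
linear probing and deleted-slot markers are omitted, whereas CPython
uses a perturb-based probe sequence):

    def lookup(indices, entries, key):
        # Probe the small sparse index table, then make a single
        # access into the dense table for the hash/key/value triple.
        mask = len(indices) - 1          # table size is a power of two
        i = hash(key) & mask
        while True:
            idx = indices[i]
            if idx is None:              # never-used slot: key is absent
                raise KeyError(key)
            h, k, v = entries[idx]
            if k is key or k == key:
                return v
            i = (i + 1) & mask           # collision: probe the next slot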
The memory savings are significant (from 30% to 95%
compression depending on how full the table is).
Small dicts (size 0, 1, or 2) get the most benefit.
For a sparse table of size t with n entries, the sizes are:
curr_size = 24 * t
new_size = 24 * n + sizeof(index) * t
In the above timmy/barry/guido example, the current
size is 192 bytes (eight 24-byte entries) and the new
size is 80 bytes (three 24-byte entries plus eight
1-byte indices). That gives 58% compression.
Note that sizeof(index) can be as small as a single
byte for small dicts, two bytes for bigger dicts, and
up to sizeof(Py_ssize_t) for huge dicts.
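To make the arithmetic concrete, a rough sketch of the size computation
(the index-width cutoffs below are illustrative, not the exact values an
implementation would use):

    def index_width(n):
        # Illustrative cutoffs: 1, 2, 4 or 8 bytes per index slot.
        if n < 2**7:
            return 1
        if n < 2**15:
            return 2
        if n < 2**31:
            return 4
        return 8                              # sizeof(Py_ssize_t)

    def sizes(t, n):
        curr_size = 24 * t                    # sparse table of 24-byte entries
        new_size = 24 * n + index_width(n) * t
        return curr_size, new_size

    # timmy/barry/guido example: t=8 slots, n=3 entries
    print(sizes(8, 3))                        # -> (192, 80), about 58% savings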
In addition to space savings, the new memory layout
makes iteration faster. Currently, keys(), values(), and
items() loop over the sparse table, skipping over free
slots in the hash table. Now, keys/values/items can
loop directly over the dense table, using fewer memory
accesses.
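As a sketch, iteration then degenerates into a plain loop over the dense
list, with no free-slot checks:

    def iter_items(entries):
        # Every row in the dense table is a live entry.
        for h, key, value in entries:
            yield key, value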
Another benefit is that resizing is faster and
touches fewer pieces of memory. Currently, every
hash/key/value entry is moved or copied during a
resize. In the new layout, only the indices are
updated. For the most part, the hash/key/value entries
never move (except for an occasional swap to fill a
hole left by a deletion).
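A sketch of a resize under the new layout (again with simplified linear
probing, ignoring the occasional hole left by deletions): only the small
index table is rebuilt, while the dense entries are left where they are:

    def rebuild_indices(entries, new_size):
        # Reallocate and refill the index table only; the hash/key/value
        # triples in `entries` are neither moved nor copied.
        indices = [None] * new_size
        mask = new_size - 1                  # new_size is a power of two
        for pos, (h, key, value) in enumerate(entries):
            i = h & mask
            while indices[i] is not None:
                i = (i + 1) & mask           # simplified linear probing
            indices[i] = pos
        return indices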
With the reduced memory footprint, we can also expect
better cache utilization.
For those wanting to experiment with the design,
there is a pure Python proof-of-concept here:
http://code.activestate.com/recipes/578375
YMMV: Keep in mind that the above size statistics assume a
build with 64-bit Py_ssize_t and 64-bit pointers. The
space savings percentages are a bit different on other
builds. Also, note that in many applications, the size
of the data dominates the size of the container (i.e.
the weight of a bucket of water is mostly the water,
not the bucket).
Raymond
I've received some enthusiastic emails from someone who wants to
revive restricted mode. He started out with a bunch of patches to the
CPython runtime using ctypes, which he attached to an App Engine bug:
http://code.google.com/p/googleappengine/issues/detail?id=671
Based on his code (the file secure.py is all you need, included in
secure.tar.gz) it seems he believes the only security leaks are
__subclasses__, gi_frame and gi_code. (I have since convinced him that
if we add "restricted" guards to these attributes, he doesn't need the
functions added to sys.)
I don't recall the exploits that Samuele once posted that caused the
death of rexec.py -- does anyone recall, or have a pointer to the
threads?
--
--Guido van Rossum (home page: http://www.python.org/~guido/)
Hello.
I would like to discuss at the language summit a potential inclusion
of cffi[1] into the stdlib. This is a project Armin Rigo has been working
on for a while, with some input from other developers. It seems that the
main reason people would prefer ctypes over cffi these days is
"because it's included in stdlib", which is not generally the kind of
reason I would like to hear. Our calls to not use C extensions and to use
an FFI instead have seen very limited success with ctypes and quite a lot
more since cffi got released. The API is fairly stable right now, with
minor changes going in, and it will definitely stabilize before the
Python 3.4 release. Notable projects using it:
* pypycore - gevent main loop ported to cffi
* pgsql2cffi
* sdl-cffi bindings
* tls-cffi bindings
* lxml-cffi port
* pyzmq
* cairo-cffi
* a bunch of others
So, relatively a lot, given that the project is not even a year old (it
got its 0.1 release in June). As per the documentation, the advantages over
ctypes are:
* The goal is to call C code from Python. You should be able to do so
without learning a 3rd language: every alternative requires you to
learn their own language (Cython, SWIG) or API (ctypes). So we tried
to assume that you know Python and C and minimize the extra bits of
API that you need to learn.
* Keep all the Python-related logic in Python so that you don’t need
to write much C code (unlike CPython native C extensions).
* Work either at the level of the ABI (Application Binary Interface)
or the API (Application Programming Interface). Usually, C libraries
have a specified C API but often not an ABI (e.g. they may document a
“struct” as having at least these fields, but maybe more). (ctypes
works at the ABI level, whereas Cython and native C extensions work at
the API level.)
* We try to be complete. For now some C99 constructs are not
supported, but all C89 should be, including macros (and including
macro “abuses”, which you can manually wrap in saner-looking C
functions).
* We attempt to support both PyPy and CPython, with a reasonable path
for other Python implementations like IronPython and Jython.
* Note that this project is not about embedding executable C code in
Python, unlike Weave. This is about calling existing C libraries from
Python.
So, among other things, making a cffi extension gives you the same
level of security as writing C (unlike ctypes) and brings quite a
bit more flexibility (the API vs ABI issue) that lets you wrap arbitrary
libraries, even those full of macros.
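For anyone who hasn't seen it, this is roughly what ABI-level use of cffi
looks like (the library name and declaration below are chosen just for
illustration):

    import cffi

    ffi = cffi.FFI()
    # Paste the relevant C declarations; no third language to learn.
    ffi.cdef("double sqrt(double x);")

    # ABI level: open the shared library directly, as ctypes would.
    libm = ffi.dlopen("libm.so.6")
    print(libm.sqrt(2.0))               # -> 1.4142135623730951

At the API level, cffi instead compiles a small C wrapper against the
library's headers (ffi.verify() in the current 0.x releases), which is the
part ctypes cannot offer.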
Cheers,
fijal
.. [1]: http://cffi.readthedocs.org/en/release-0.5/
hi, everyone:
I want to compile Python 3.3 with bz2 support on RedHat 5.5 but failed to do so. Here is what I did:
1. download bzip2 and compile it (make, make -f Makefile_libbz2_so, make install)
2. change to the Python 3.3 source directory: ./configure --with-bz2=/usr/local/include
3. make
4. make install
After the installation completed, I tested it:
[root@localhost Python-3.3.0]# python3 -c "import bz2"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/local/lib/python3.3/bz2.py", line 21, in <module>
from _bz2 import BZ2Compressor, BZ2Decompressor
ImportError: No module named '_bz2'
By the way, RedHat 5.5 has a built-in Python 2.4.3. Could that be a problem?
Hi,
I suspect that this will be put into a proper PEP at some point, but I'd
like to bring this up for discussion first. This came out of issues 13429
and 16392.
http://bugs.python.org/issue13429
http://bugs.python.org/issue16392
Stefan
The problem
===========
Python modules and extension modules are not set up in the same way.
For Python modules, the module is created and set up first, then the module
code is executed. For extensions, i.e. shared libraries, the module
init function is executed straight away and does both the creation and the
initialisation. This means that it knows neither the __file__ it is being
loaded from nor its package (i.e. its FQMN, the fully qualified module
name). This hinders relative imports and resource loading. In Py3, it is
also not added to sys.modules, which means that a (potentially transitive)
re-import of the module will really try to reimport it and thus run into an
infinite loop when it executes the module init function again. And without
the FQMN, it's not trivial to correctly add the module to sys.modules either.
We specifically run into this for Cython generated modules, for which it's
not uncommon that the module init code has the same level of complexity as
that of any 'regular' Python module. Also, the lack of a FQMN and correct
file path hinders the compilation of __init__.py modules, i.e. packages,
especially when relative imports are being used at module init time.
The proposal
============
I propose to split the extension module initialisation into two steps in
Python 3.4, in a backwards compatible way.
Step 1: The current module init function can be reduced to just creating
the module instance and returning it (and potentially doing some simple C
level setup). Optionally, after creating the module (and this is the new
part), the module init code can register a C callback function that will be
called after setting up the module.
Step 2: The shared library importer receives the module instance from the
module init function, adds __file__, __path__, __package__ and friends to
the module dict, and then checks for the callback. If non-NULL, it calls it
to continue the module initialisation by user code.
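In rough Python pseudo-code, the importer side of these two steps would
look something like this (the helper names are made up for illustration
and are not part of the proposal):

    import sys

    def load_extension_module(fullname, path):
        # Step 1: the module init function only creates and returns the module.
        module = _call_pyinit_function(fullname, path)   # hypothetical helper

        # The importer, not the module init code, fills in the metadata.
        module.__file__ = path
        module.__package__ = fullname.rpartition('.')[0]
        sys.modules[fullname] = module

        # Step 2: if a callback was registered via PyModule_SetInitFunction,
        # call it now that the module has been fully set up.
        callback = _get_registered_callback(module)      # hypothetical helper
        if callback is not None:
            callback(module)
        return module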
The callback
============
The callback is defined as follows::
    int (*PyModule_init_callback)(PyObject* the_module,
                                  PyModuleInitContext* context)
"PyModuleInitContext" is a struct that is meant mostly for making the
callback more future proof by allowing additional parameters to be passed
in. For now, I can see a use case for the following fields::
    struct PyModuleInitContext {
        char* module_name;
        char* qualified_module_name;
    };
Both names are encoded in UTF-8. As for the file path, I consider it best
to retrieve it from the module's __file__ attribute as a Python string
object to reduce filename encoding problems.
Note that this struct argument is not strictly required, but given that
this proposal would have been much simpler if the module init function had
accepted such an argument in the first place, I consider it a good idea not
to let this chance pass by again.
The registration of the callback uses a new C-API function:
    int PyModule_SetInitFunction(PyObject* module,
                                 PyModule_init_callback callback)
The function name uses "Set" instead of "Register" to make it clear that
there is only one such function per module.
An alternative would be a new module creation function "PyModule_Create3()"
that takes the callback as third argument, in addition to what
"PyModule_Create2()" accepts. This would require users to explicitly pass
in the (second) version argument, which might be considered only a minor issue.
Implementation
==============
The implementation requires local changes to the extension module importer
and a new C-API function. In order to store the callback, it should use a
new field in the module object struct.
Open questions
==============
It is not clear how to handle extensions that register more than one
module in their module init function, e.g. compiled packages. One
possibility would be to leave the setup to the user, who would have to know
all FQMNs anyway in this case, although not the import file path.
Alternatively, the import machinery could use a stack to remember for which
modules a callback was registered during the last init function call, set
up all of them and then call their callbacks. It's not clear if this meets
the intention of the user.
Alternatives
============
1) It would be possible to make extension modules optionally export another
symbol, e.g. "PyInit2_modulename", that the shared library loader would
call in addition to the required function "PyInit_modulename". This would
remove the need for a new API that registers the above callback. The
drawback is that it also makes it easier to write broken code because a
Python version or implementation that does not support this second symbol
would simply not call it, without error. The new C-API function would let
the build fail instead if it is not supported.
2) The callback could be made available as a Python function in the module
dict, thus also removing the need for an explicit registration API.
However, this approach would add overhead to both sides, the importer code
and the user provided module init code, as it would require additional
dictionary handling and the implementation of a one-time Python function in
user code. It would also suffer from the problem that missing support in
the runtime would pass silently.
3) The callback could be registered statically in the PyModuleDef struct by
adding a new field. This is not trivial to do in a backwards compatible way
because the struct would grow longer without explicit initialisation by
existing user code. Extending PyModuleDef_HEAD_INIT might be possible but
would still break at least binary compatibility.
4) Pass a new context argument into the module init function that contains
all information necessary to properly and completely set up the module at
creation time. This would provide a much simpler and cleaner solution than
the one proposed above. However, it will not be possible before Python 4, as
it breaks backwards compatibility with all existing extension modules at
both the source and binary level.
>
> From: Eli Bendersky <eliben(a)gmail.com>
>
> I'll be the first one to admit that pycparser is almost certainly not
> generally useful enough to be exposed in the stdlib. So just using it as an
> implementation detail is absolutely fine. PLY is a more interesting
> question, however, since PLY is somewhat more generally useful. That said,
> I see all this as implementation details that shouldn't distract us from
> the main point of whether cffi should be added.
>
Regarding the inclusion of PLY or some subcomponent of it in the standard library, it's not an entirely crazy idea in my opinion. LALR(1) parsers have been around for a long time, are generally known to anyone who's used yacc/bison, and would be useful outside the context of cffi or pycparser. PLY has also been around for about 12 years and is what I would call stable. It gets an update about every year or two, but that's about it. PLY is also relatively small--just two files and about 4300 lines of code (much of which could probably be scaled down a bit).
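For those who haven't used it, here is a minimal made-up example of the yacc-style interface PLY exposes (the grammar is purely illustrative):

    import ply.lex as lex
    import ply.yacc as yacc

    # --- lexer ---
    tokens = ('NUMBER', 'PLUS')
    t_PLUS = r'\+'
    t_ignore = ' \t'

    def t_NUMBER(t):
        r'\d+'
        t.value = int(t.value)
        return t

    def t_error(t):
        t.lexer.skip(1)

    # --- parser: the LALR(1) tables are generated from the docstrings ---
    precedence = (('left', 'PLUS'),)

    def p_expr_plus(p):
        'expr : expr PLUS expr'
        p[0] = p[1] + p[3]

    def p_expr_number(p):
        'expr : NUMBER'
        p[0] = p[1]

    def p_error(p):
        print("syntax error")

    lexer = lex.lex()
    parser = yacc.yacc()
    print(parser.parse("1 + 2 + 3"))   # -> 6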
The only downside to including PLY might be the fact that there are very few people walking around who've actually had to *implement* an LALR(1) parser generator. Some of the code for that is extremely hairy and mathematical. At this time, I don't think there are any bugs in it, but it's not the sort of thing that one wants to wander into casually. Also, there are some horrible hacks in PLY that I'd really like to get rid of, but am currently stuck with due to backwards compatibility issues.
Alex Gaynor has been working on a PLY variant (RPLY) geared at RPython and which has a slightly different programming interface. I'd say if we were to go down this route, he and I should work together to put together some kind of more general "parsing.lalr" package (or similar) that cleans it up and makes it more suitable as a library for building different kinds of parsing tools on top of.
Cheers,
Dave
[No, I'm not interested in the port myself]
Patches for a mingw32 port are floating around on the web and in the Python bug
tracker, although most of them come as a set of patches in one issue addressing
several things, and may be outdated for the trunk. At least for me,
re-reading a big patch in a new version is more work than having the patches in
different issues. So I propose to break down the patches into independent ones,
dealing with:
- mingw32 support (excluding any cross-build support), tested with
a native build with srcdir == builddir. The changes for distutils
mingw32 support should be a separate patch. Who could review these?
I asked Martin, but didn't get a reply yet.
- patches to cross-build for mingw32.
- patches to deal with a srcdir != builddir configuration, where the
srcdir is read-only (please don't mix with the cross-build support).
All work should be done on the tip/trunk.
So is it ok to close issue16526, issue3871 and issue3754, and suggest in the
reports to start over with more granular changes?
Matthias
As discussed here, the Python 2.5 binary distributed by Apple on Mountain
Lion is broken. Could someone file an official complaint? This is really
bad...
--Guido
--
--Guido van Rossum (python.org/~guido)
Hello all,
PyCon, and the Python Language Summit, is nearly upon us. We have a good number of people confirmed to attend. If you are intending to come to the language summit but haven't let me know, please do so.
The agenda of topics for discussion so far includes the following:
* A report on pypy status - Maciej and Armin
* Jython and IronPython status reports - Dino / Frank
* Packaging (Doug Hellmann and Monty Taylor at least)
* Cleaning up interpreter initialisation (both in hopes of finding areas
to rationalise and hence speed things up, as well as making things
more embedding friendly) - Nick Coghlan
* Adding new async capabilities to the standard library (Guido)
* cffi and the standard library - Maciej
* flufl.enum and the standard library - Barry Warsaw
* The argument clinic - Larry Hastings
If you have other items you'd like to discuss please let me know and I can add them to the agenda.
All the best,
Michael Foord
--
http://www.voidspace.org.uk/
May you do good and not evil
May you find forgiveness for yourself and forgive others
May you share freely, never taking more than you give.
-- the sqlite blessing
http://www.sqlite.org/different.html