The current memory layout for dictionaries is
unnecessarily inefficient. It has a sparse table of
24-byte entries containing the hash value, key pointer,
and value pointer.
Instead, the 24-byte entries should be stored in a
dense table referenced by a sparse table of indices.
For example, the dictionary:
d = {'timmy': 'red', 'barry': 'green', 'guido': 'blue'}
is currently stored as:
entries = [['--', '--', '--'],
           [-8522787127447073495, 'barry', 'green'],
           ['--', '--', '--'],
           ['--', '--', '--'],
           ['--', '--', '--'],
           [-9092791511155847987, 'timmy', 'red'],
           ['--', '--', '--'],
           [-6480567542315338377, 'guido', 'blue']]
Instead, the data should be organized as follows:
indices = [None, 1, None, None, None, 0, None, 2]
entries = [[-9092791511155847987, 'timmy', 'red'],
           [-8522787127447073495, 'barry', 'green'],
           [-6480567542315338377, 'guido', 'blue']]
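To make the two tables concrete, here is an illustrative lookup
(a sketch only: it uses naive linear probing, whereas the real table
would keep the existing probe sequence unchanged):

    def lookup(indices, entries, key):
        # Probe the sparse table of indices; each slot is either None
        # (empty) or a position in the dense entries table.
        h = hash(key)
        mask = len(indices) - 1
        i = h & mask
        while indices[i] is not None:
            stored_hash, stored_key, value = entries[indices[i]]
            if stored_hash == h and stored_key == key:
                return value
            i = (i + 1) & mask    # simplified probing for the sketch
        raise KeyError(key)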
Only the data layout needs to change. The hash table
algorithms would stay the same. All of the current
optimizations would be kept, including key-sharing
dicts and custom lookup functions for string-only
dicts. There is no change to the hash functions, the
table search order, or collision statistics.
The memory savings are significant (from 30% to 95%
compression depending on how full the table is).
Small dicts (size 0, 1, or 2) get the most benefit.
For a sparse table of size t with n entries, the sizes are:
curr_size = 24 * t
new_size = 24 * n + sizeof(index) * t
In the above timmy/barry/guido example, the current
size is 192 bytes (eight 24-byte entries) and the new
size is 80 bytes (three 24-byte entries plus eight
1-byte indices). That gives 58% compression.
Note that sizeof(index) can be as small as a single
byte for small dicts, two bytes for bigger dicts, and
up to sizeof(Py_ssize_t) for huge dicts.
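For illustration, the formula can be computed like this (the exact
cut-over points for the index width here are illustrative, not part
of the proposal):

    def index_width(t):
        # 1, 2, 4 or 8 bytes per index, depending on the table size.
        if t <= 0xff:
            return 1
        if t <= 0xffff:
            return 2
        if t <= 0xffffffff:
            return 4
        return 8    # sizeof(Py_ssize_t) on a 64-bit build

    def sizes(t, n):
        curr_size = 24 * t
        new_size = 24 * n + index_width(t) * t
        return curr_size, new_size

    print(sizes(8, 3))    # (192, 80) for the timmy/barry/guido example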
In addition to space savings, the new memory layout
makes iteration faster. Currently, keys(), values(), and
items() loop over the sparse table, skipping over free
slots in the hash table. Now, keys/values/items can
loop directly over the dense table, using fewer memory
accesses.
Another benefit is that resizing is faster and
touches fewer pieces of memory. Currently, every
hash/key/value entry is moved or copied during a
resize. In the new layout, only the indices are
updated. For the most part, the hash/key/value entries
never move (except for an occasional swap to fill a
hole left by a deletion).
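A sketch of that resize step (again with simplified probing), just to
show that only the small index table gets rebuilt:

    def rebuild_indices(entries, new_table_size):
        # The dense entries list stays exactly where it is;
        # only the index table is recomputed.
        indices = [None] * new_table_size
        mask = new_table_size - 1
        for pos, (h, key, value) in enumerate(entries):
            i = h & mask
            while indices[i] is not None:    # find a free slot
                i = (i + 1) & mask
            indices[i] = pos
        return indices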
With the reduced memory footprint, we can also expect
better cache utilization.
For those wanting to experiment with the design,
there is a pure Python proof-of-concept here:
http://code.activestate.com/recipes/578375
YMMV: Keep in mind that the above size statistics assume a
build with 64-bit Py_ssize_t and 64-bit pointers. The
space savings percentages are a bit different on other
builds. Also, note that in many applications, the size
of the data dominates the size of the container (i.e.
the weight of a bucket of water is mostly the water,
not the bucket).
Raymond
I've received some enthusiastic emails from someone who wants to
revive restricted mode. He started out with a bunch of patches to the
CPython runtime using ctypes, which he attached to an App Engine bug:
http://code.google.com/p/googleappengine/issues/detail?id=671
Based on his code (the file secure.py is all you need, included in
secure.tar.gz) it seems he believes the only security leaks are
__subclasses__, gi_frame and gi_code. (I have since convinced him that
if we add "restricted" guards to these attributes, he doesn't need the
functions added to sys.)
I don't recall the exploits that Samuele once posted that caused the
death of rexec.py -- does anyone recall, or have a pointer to the
threads?
--
--Guido van Rossum (home page: http://www.python.org/~guido/)
Hello.
I would like to discuss at the language summit a potential inclusion
of cffi[1] into the stdlib. This is a project Armin Rigo has been
working on for a while, with some input from other developers. It
seems that the main reason people prefer ctypes over cffi these days
is "because it's included in the stdlib", which is not generally the
reason I would like to hear. Our calls to not write C extensions and
to use an FFI instead have seen very limited success with ctypes and
quite a lot more since cffi got released. The API is fairly stable
right now, with minor changes going in, and it will definitely
stabilize before the Python 3.4 release. Notable projects using it:
* pypycore - gevent main loop ported to cffi
* pgsql2cffi
* sdl-cffi bindings
* tls-cffi bindings
* lxml-cffi port
* pyzmq
* cairo-cffi
* a bunch of others
That's relatively a lot, given that the project is not even a year
old (it got its 0.1 release in June). As per the documentation, the
advantages over ctypes are:
* The goal is to call C code from Python. You should be able to do so
without learning a 3rd language: every alternative requires you to
learn their own language (Cython, SWIG) or API (ctypes). So we tried
to assume that you know Python and C and minimize the extra bits of
API that you need to learn.
* Keep all the Python-related logic in Python so that you don’t need
to write much C code (unlike CPython native C extensions).
* Work either at the level of the ABI (Application Binary Interface)
or the API (Application Programming Interface). Usually, C libraries
have a specified C API but often not an ABI (e.g. they may document a
“struct” as having at least these fields, but maybe more). (ctypes
works at the ABI level, whereas Cython and native C extensions work at
the API level.)
* We try to be complete. For now some C99 constructs are not
supported, but all C89 should be, including macros (and including
macro “abuses”, which you can manually wrap in saner-looking C
functions).
* We attempt to support both PyPy and CPython, with a reasonable path
for other Python implementations like IronPython and Jython.
* Note that this project is not about embedding executable C code in
Python, unlike Weave. This is about calling existing C libraries from
Python.
So, among other things, making a cffi extension gives you the same
level of security as writing C (unlike ctypes), and brings quite a
bit more flexibility (the API vs. ABI issue) that lets you wrap
arbitrary libraries, even those full of macros.
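For a flavour of ABI-level use, here is essentially the introductory
example from the cffi documentation:

    from cffi import FFI

    ffi = FFI()
    ffi.cdef("int printf(const char *format, ...);")  # plain C declaration
    C = ffi.dlopen(None)               # load the standard C library
    arg = ffi.new("char[]", b"world")  # owned C data, freed automatically
    C.printf(b"hi there, %s.\n", arg)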
Cheers,
fijal
.. [1]: http://cffi.readthedocs.org/en/release-0.5/
hi, everyone:
I want to compile Python 3.3 with bz2 support on RedHat 5.5 but failed to do so. Here is how I did it:
1. download bzip2 and compile it (make, make -f Makefile_libbz2_so, make install)
2. change to the Python 3.3 source directory: ./configure --with-bz2=/usr/local/include
3. make
4. make install
after the installation completed, I tested it:
[root@localhost Python-3.3.0]# python3 -c "import bz2"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/local/lib/python3.3/bz2.py", line 21, in <module>
from _bz2 import BZ2Compressor, BZ2Decompressor
ImportError: No module named '_bz2'
By the way, RedHat 5.5 has a built-in python 2.4.3. Would it be a problem?
Hi,
I suspect that this will be put into a proper PEP at some point, but I'd
like to bring this up for discussion first. This came out of issues 13429
and 16392.
http://bugs.python.org/issue13429
http://bugs.python.org/issue16392
Stefan
The problem
===========
Python modules and extension modules are not being set up in the same way.
For Python modules, the module is created and set up first, then the module
code is executed. For extensions, i.e. shared libraries, the module
init function is executed straight away and does both the creation and
initialisation. This means that it knows neither the __file__ it is being
loaded from nor its package (i.e. its FQMN). This hinders relative imports
and resource loading. In Py3, it's also not being added to sys.modules,
which means that a (potentially transitive) re-import of the module will
really try to reimport it and thus run into an infinite loop when it
executes the module init function again. And without the FQMN, it's not
trivial to correctly add the module to sys.modules either.
We specifically run into this for Cython generated modules, for which it's
not uncommon that the module init code has the same level of complexity as
that of any 'regular' Python module. Also, the lack of a FQMN and correct
file path hinders the compilation of __init__.py modules, i.e. packages,
especially when relative imports are being used at module init time.
The proposal
============
I propose to split the extension module initialisation into two steps in
Python 3.4, in a backwards compatible way.
Step 1: The current module init function can be reduced to just creating
the module instance and returning it (and potentially doing some simple C
level setup). Optionally, after creating the module (and this is the new
part), the module init code can register a C callback function that will be
called after setting up the module.
Step 2: The shared library importer receives the module instance from the
module init function, adds __file__, __path__, __package__ and friends to
the module dict, and then checks for the callback. If non-NULL, it calls it
to continue the module initialisation by user code.
The callback
============
The callback is defined as follows::

    int (*PyModule_init_callback)(PyObject* the_module,
                                  PyModuleInitContext* context)

"PyModuleInitContext" is a struct that is meant mostly for making the
callback more future proof by allowing additional parameters to be passed
in. For now, I can see a use case for the following fields::

    struct PyModuleInitContext {
        char* module_name;
        char* qualified_module_name;
    };
Both names are encoded in UTF-8. As for the file path, I consider it best
to retrieve it from the module's __file__ attribute as a Python string
object to reduce filename encoding problems.
Note that this struct argument is not strictly required, but given that
this proposal would have been much simpler if the module init function had
accepted such an argument in the first place, I consider it a good idea not
to let this chance pass by again.
The registration of the callback uses a new C-API function::

    int PyModule_SetInitFunction(PyObject* module,
                                 PyModule_init_callback callback)
The function name uses "Set" instead of "Register" to make it clear that
there is only one such function per module.
An alternative would be a new module creation function "PyModule_Create3()"
that takes the callback as third argument, in addition to what
"PyModule_Create2()" accepts. This would require users to explicitly pass
in the (second) version argument, which might be considered only a minor issue.
Implementation
==============
The implementation requires local changes to the extension module importer
and a new C-API function. In order to store the callback, it should use a
new field in the module object struct.
Open questions
==============
It is not clear how extensions should be handled that register more than
one module in their module init function, e.g. compiled packages. One
possibility would be to leave the setup to the user, who would have to know
all FQMNs anyway in this case, although not the import file path.
Alternatively, the import machinery could use a stack to remember for which
modules a callback was registered during the last init function call, set
up all of them and then call their callbacks. It's not clear if this meets
the intention of the user.
Alternatives
============
1) It would be possible to make extension modules optionally export another
symbol, e.g. "PyInit2_modulename", that the shared library loader would
call in addition to the required function "PyInit_modulename". This would
remove the need for a new API that registers the above callback. The
drawback is that it also makes it easier to write broken code because a
Python version or implementation that does not support this second symbol
would simply not call it, without error. The new C-API function would let
the build fail instead if it is not supported.
2) The callback could be made available as a Python function in the module
dict, thus also removing the need for an explicit registration API.
However, this approach would add overhead to both sides, the importer code
and the user provided module init code, as it would require additional
dictionary handling and the implementation of a one-time Python function in
user code. It would also suffer from the problem that missing support in
the runtime would pass silently.
3) The callback could be registered statically in the PyModuleDef struct by
adding a new field. This is not trivial to do in a backwards compatible way
because the struct would grow longer without explicit initialisation by
existing user code. Extending PyModuleDef_HEAD_INIT might be possible but
would still break at least binary compatibility.
4) Pass a new context argument into the module init function that contains
all information necessary to properly and completely set up the module at
creation time. This would provide a much simpler and cleaner solution than
the proposed solution. However, it will not be possible before Python 4 as
it breaks backwards compatibility with all existing extension modules at
both the source and binary level.
>
> From: Eli Bendersky <eliben(a)gmail.com>
>
> I'll be the first one to admit that pycparser is almost certainly not
> generally useful enough to be exposed in the stdlib. So just using it as an
> implementation detail is absolutely fine. PLY is a more interesting
> question, however, since PLY is somewhat more generally useful. That said,
> I see all this as implementation details that shouldn't distract us from
> the main point of whether cffi should be added.
>
Regarding the inclusion of PLY or some subcomponent of it in the standard library, it's not an entirely crazy idea in my opinion. LALR(1) parsers have been around for a long time, are generally known to anyone who's used yacc/bison, and would be useful outside the context of cffi or pycparser. PLY has also been around for about 12 years and is what I would call stable. It gets an update about every year or two, but that's about it. PLY is also relatively small--just two files and about 4300 lines of code (much of which could probably be scaled down a bit).
The only downside to including PLY might be the fact that there are very few people walking around who've actually had to *implement* an LALR(1) parser generator. Some of the code for that is extremely hairy and mathematical. At this time, I don't think there are any bugs in it, but it's not the sort of thing that one wants to wander into casually. Also, there are some horrible hacks in PLY that I'd really like to get rid of, but am currently stuck with due to backwards compatibility issues.
Alex Gaynor has been working on a PLY variant (RPLY) geared at RPython and which has a slightly different programming interface. I'd say if we were to go down this route, he and I should work together to put together some kind of more general "parsing.lalr" package (or similar) that cleans it up and makes it more suitable as a library for building different kinds of parsing tools on top of.
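For those who haven't used PLY, a rough sketch of its current interface
(token rules and grammar rules picked up from module-level definitions)
looks something like this:

    import ply.lex as lex
    import ply.yacc as yacc

    tokens = ('NUMBER', 'PLUS')
    t_PLUS = r'\+'
    t_ignore = ' '

    def t_NUMBER(t):
        r'\d+'
        t.value = int(t.value)
        return t

    def t_error(t):
        t.lexer.skip(1)

    def p_expr_plus(p):
        'expr : expr PLUS term'
        p[0] = p[1] + p[3]

    def p_expr_term(p):
        'expr : term'
        p[0] = p[1]

    def p_term_number(p):
        'term : NUMBER'
        p[0] = p[1]

    def p_error(p):
        raise SyntaxError(p)

    lexer = lex.lex()
    parser = yacc.yacc()      # builds the LALR(1) tables
    print(parser.parse('1 + 2 + 3'))    # -> 6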
Cheers,
Dave
The original impetus for Argument Clinic was adding introspection
information for builtins--it seemed like any manual approach I came up
with would push the builtins maintenance burden beyond the pale.
Assuming that we have Argument Clinic or something like it, we don't
need to optimize for ease of use from the API end--we can optimize for
data size. So the approach writ large: store a blob of data associated
with each entry point, as small as possible. Reconstitute the
appropriate inspect.Signature on demand by reading that blob.
Where to store the data? PyMethodDef is the obvious spot, but I think
that structure is part of the stable ABI. So we'd need a new
PyMethodDefEx and that'd be a little tiresome. Less violent to the ABI
would be defining a new array of pointers-to-introspection-blobs,
parallel to the PyMethodDef array, passed in via a new entry point.
On to the representation. Consider the function
def foo(arg, b=3, *, kwonly='a'):
    pass
I considered four approaches, each listed below along with its total
size if it was stored as C static data.
1. A specialized bytecode format, something like pickle, like this:
bytes([ PARAMETER_START_LENGTH_3, 'a', 'r', 'g',
        PARAMETER_START_LENGTH_1, 'b', PARAMETER_DEFAULT_LENGTH_1, '3',
        KEYWORD_ONLY,
        PARAMETER_START_LENGTH_6, 'k', 'w', 'o', 'n', 'l', 'y',
        PARAMETER_DEFAULT_LENGTH_3, '\'', 'a', '\'',
        END
      ])
Length: 20 bytes.
2. Just use pickle--pickle the result of inspect.signature() run on a
mocked-up signature, and store that. Length: 130 bytes. (Assume a
two-byte size stored next to it.)
3. Store a string that, if eval'd, would produce the inspect.Signature.
Length: 231 bytes. (This could be made smaller if we could assume "from
inspect import *" or "p = inspect.Parameter" or something, but it'd
still be easily the heaviest.)
4. Store a string that looks like the Python declaration of the
signature, and parse it (Nick's suggestion). For foo above, this would
be "(arg,b=3,*,kwonly='a')". Length: 23 bytes.
Of those, Nick's suggestion seems best. It's slightly bigger than the
specialized bytecode format, but it's human-readable (and
human-writable!), and it'd be the easiest to implement.
My first idea for implementation: add a "def x" to the front and ":
pass" to the end, then run it through ast.parse. Iterate over the tree,
converting parameters into inspect.Parameters and handling the return
annotation if present. Default values and annotations would be turned
into values by ast.literal_eval. (It wouldn't surprise me if there's a
cleaner way to do it than the fake function definition; I'm not familiar
with the ast module.)
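A rough sketch of that idea (names and structure are mine and purely
illustrative, not a proposed API):

    import ast
    import inspect

    def signature_from_string(text):
        # Wrap the declaration in a fake function so ast.parse accepts it.
        node = ast.parse("def x" + text + ": pass").body[0]
        a = node.args
        params = []
        # Positional-or-keyword parameters; defaults align with the tail.
        padding = [None] * (len(a.args) - len(a.defaults))
        for arg, default in zip(a.args, padding + list(a.defaults)):
            value = (inspect.Parameter.empty if default is None
                     else ast.literal_eval(default))
            params.append(inspect.Parameter(
                arg.arg, inspect.Parameter.POSITIONAL_OR_KEYWORD,
                default=value))
        # Keyword-only parameters; kw_defaults uses None for "no default".
        for arg, default in zip(a.kwonlyargs, a.kw_defaults):
            value = (inspect.Parameter.empty if default is None
                     else ast.literal_eval(default))
            params.append(inspect.Parameter(
                arg.arg, inspect.Parameter.KEYWORD_ONLY, default=value))
        return inspect.Signature(params)

    print(signature_from_string("(arg, b=3, *, kwonly='a')"))
    # prints: (arg, b=3, *, kwonly='a')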
We'd want one more mild hack: the DSL will support positional
parameters, and inspect.Signature supports positional parameters, so
it'd be nice to render that information. But we can't represent that in
Python syntax (or at least not yet!), so we can't let ast.parse see it.
My suggestion: run it through ast.parse, and if it throws a SyntaxError
see if the problem was a slash. If it was, remove the slash, reprocess
through ast.parse, and remember that all parameters are positional-only
(and barf if there are kwonly, args, or kwargs).
Thoughts?
//arry/
https://docs.google.com/document/d/1MKXgPzhWD5wIUpoSQX7dxmqgTZVO6l9iZZis8dn…
PEP: 4XX
Title: Improving Python ZIP Application Support
Author: Daniel Holth <dholth(a)gmail.com>
Status: Draft
Type: Standards Track
Python-Version: 3.4
Created: 30 March 2013
Post-History: 30 March 2013, 1 April 2013
Improving Python ZIP Application Support
Python has had the ability to execute directories or ZIP-format
archives as scripts since version 2.6. When invoked with a ZIP file or
directory as its first argument, the interpreter adds that file or
directory to sys.path and executes the __main__ module. These archives provide a
great way to publish software that needs to be distributed as a single
file script but is complex enough to need to be written as a
collection of modules.
This feature is not as popular as it should be, mainly because no
one’s heard of it (it wasn’t promoted as part of Python 2.6), but
also because Windows users don’t have a file extension (other than
.py) to associate with the launcher.
This PEP proposes to fix these problems by re-publicising the feature,
defining the .pyz and .pyzw extensions as “Python ZIP Applications”
and “Windowed Python ZIP Applications”, and providing some simple
tooling to manage the format.
A New Python ZIP Application Extension
The Python 3.4 installer will associate .pyz and .pyzw “Python ZIP
Applications” with the platform launcher so they can be executed. A
.pyz archive is a console application and a .pyzw archive is a
windowed application, indicating whether the console should appear
when running the app.
Why not use .zip or .py? Users expect a .zip file would be opened with
an archive tool, and users expect .py to be opened with a text editor.
Both would be confusing for this use case.
For UNIX users, .pyz applications should be prefixed with a #! line
pointing to the correct Python interpreter and an optional
explanation.
#!/usr/bin/env python3
# This is a Python application stored in a ZIP archive.
(binary contents of archive)
As background, ZIP archives are defined with a footer containing
relative offsets from the end of the file. They remain valid when
concatenated to the end of any other file. This feature is completely
standard and is how self-extracting ZIP archives and the bdist_wininst
installer format work.
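As an illustration of that property (and independent of the tooling
proposed below), a runnable archive can be built today with nothing
but the zipfile module:

    import os
    import stat
    import zipfile

    # Write the #! line first, then append a ZIP archive after it;
    # zipfile's "a" mode appends a new archive to a non-ZIP file.
    with open("hello.pyz", "wb") as f:
        f.write(b"#!/usr/bin/env python3\n")
    with zipfile.ZipFile("hello.pyz", "a") as zf:
        zf.writestr("__main__.py", "print('hello from a .pyz archive')\n")
    # Mark it executable so ./hello.pyz works on UNIX.
    os.chmod("hello.pyz", os.stat("hello.pyz").st_mode | stat.S_IEXEC)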
Minimal Tooling: The pyzaa Module
This PEP also proposes including a simple application for working with
these archives: The Python Zip Application Archiver “pyzaa” (rhymes
with “huzzah” or “pizza”). “pyzaa” can archive or extract these files,
compile bytecode, and can write the __main__ module if it is not
present.
Usage
python -m pyzaa (pack | compile)
python -m pyzaa pack [-o path/name] [-m module.submodule:callable]
                     [-c] [-w] [-p interpreter] directory:

    ZIP the contents of directory as directory.pyz or [-w]
    directory.pyzw. Adds the executable flag to the archive.

    -c              compile .pyc files and add them to the archive
    -p interpreter  include #!interpreter as the first line of the archive
    -o path/name    archive is written to path/name.pyz[w] instead of
                    dirname. The extension is added if not specified.
    -m module.submodule:callable
                    __main__.py is written as “import
                    module.submodule; module.submodule.callable()”

    pyzaa pack will warn if the directory contains C extensions or if
    it doesn’t contain __main__.py.

python -m pyzaa compile arcname.pyz[w]

    The Python files in arcname.pyz[w] are compiled and appended to
    the ZIP file.
A standard ZIP utility or Python’s zipfile module can unpack the archives.
FAQ
Q. Isn’t pyzaa just a very thin wrapper over zipfile and compileall?
A. Yes.
Q. How does this compete with existing sdist/bdist formats?
A. There is some overlap, but .pyz files are especially interesting as
a way to distribute an installer. They may also prove useful as a way
to deliver applications when users shouldn’t be asked to perform
virtualenv + “pip install”.
References
[1] http://bugs.python.org/issue1739468 “Allow interpreter to execute
a zip file”
[2] http://bugs.python.org/issue17359 “Feature is not documented”
Copyright
This document has been placed into the public domain.
On 2013-02-26, 16:25 GMT, Terry Reedy wrote:
> On 2/21/2013 4:22 PM, Matej Cepl wrote:
>> as my method to commemorate Aaron Swartz, I have decided to port his
>> html2text to work fully with the latest python 3.3. After some time
>> dealing with various bugs, I have now in my repo
>> https://github.com/mcepl/html2text (branch python3) working solution
>> which works all the way to python 3.2 (inclusive;
>> https://travis-ci.org/mcepl/html2text). However, the last problem
>> remains. This
>>
>> <li>Run this command:
>> <pre>ls -l *.html</pre></li>
>> <li>?</li>
>>
>> should lead to
>>
>> * Run this command:
>>
>> ls -l *.html
>>
>> * ?
>>
>> but it doesn’t. It leads to this (with python 3.3 only)
>>
>> * Run this command:
>> ls -l *.html
>>
>> * ?
>>
>> Does anybody know about something which changed in modules re or
>> http://docs.python.org/3.3/whatsnew/changelog.html between 3.2 and
>> 3.3, which could influence this script?
>
> Search the changelog or 3.3 Misc/NEWS for items affecting those two
> modules. There are at least 4.
> http://docs.python.org/3.3/whatsnew/changelog.html
>
> It is faintly possible that the switch from narrow/wide builds to
> unified builds somehow affected that. Have you tested with 2.7/3.2 on
> both narrow and wide unicode builds?
So, in the end, I went the long way and bisected cpython to
find the commit which broke my tests, and it seems that the
culprit is http://hg.python.org/cpython/rev/123f2dc08b3e, so it is
clearly something Unicode related.
Unfortunately, that really doesn't tell me what exactly is broken
(is it a known regression?) or whether there is a known workaround.
Could anybody suggest a way to find bugs on
http://bugs.python.org related to a particular commit? (A plain
search for 123f2dc0 didn’t find anything.)
Any thoughts?
Matěj
P.S.: Crossposting to python-devel in the hope that somebody
understands more about that particular commit. For that reason
I have also intentionally not trimmed the original messages, to
preserve context.
--
http://www.ceplovi.cz/matej/, Jabber: mcepl<at>ceplovi.cz
GPG Finger: 89EF 4BC6 288A BF43 1BAB 25C3 E09F EF25 D964 84AC
When you're happy that cut and paste actually works I think it's
a sign you've been using X-Windows for too long.
-- from /. discussion on poor integration between KDE and
GNOME