tl;dr: I have a patch that improves CPython performance by up to 5-10%
on macro benchmarks. Benchmark results on
Macbook Pro/Mac OS X, desktop CPU/Linux, server CPU/Linux are available
at . There are no slowdowns that I could reproduce consistently.
There are two different optimizations that yield this speedup:
LOAD_METHOD/CALL_METHOD opcodes and per-opcode cache in ceval loop.
LOAD_METHOD & CALL_METHOD
We had a lot of conversations with Victor about his PEP 509, and he sent
me a link to his amazing compilation of notes about CPython performance
. One optimization that he pointed out to me was LOAD/CALL_METHOD
opcodes, an idea that originated in PyPy.
There is a patch that implements this optimization; it's tracked here:
. There are some low level details that I explained in the issue,
but I'll go over the high level design in this email as well.
Every time you access a method attribute on an object, a BoundMethod
object is created. It is a fairly expensive operation, despite a
freelist of BoundMethods (so that memory allocation is generally
avoided). The idea is to detect what looks like a method call in the
compiler, and emit a pair of specialized bytecodes for that.
So instead of LOAD_GLOBAL/LOAD_ATTR/CALL_FUNCTION we will have
LOAD_GLOBAL/LOAD_METHOD/CALL_METHOD.
LOAD_METHOD looks at the object on top of the stack and checks whether
the name resolves to a method or to a regular attribute. If it's a
method, we push the unbound method object and the object itself onto the
stack. If it's an attribute, we push the resolved attribute and NULL.
When CALL_METHOD looks at the stack, it knows how to call the unbound
method properly (pushing the object as the first argument), or how to
call a regular callable when NULL was pushed instead of an object.
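The stack protocol above can be modeled in pure Python. This is only a
sketch: the helper names are hypothetical, descriptor handling is
simplified to plain functions, and the real opcodes are implemented in C
inside the ceval loop.

```python
import types

def load_method(obj, name):
    """Sketch of LOAD_METHOD: resolve 'name' on obj and return a pair
    (callable, receiver-or-None) without creating a BoundMethod."""
    for klass in type(obj).__mro__:          # simplified type lookup
        if name in klass.__dict__:
            attr = klass.__dict__[name]
            # A plain function on the type, not shadowed in the
            # instance dict: push it unbound, together with the receiver.
            if (isinstance(attr, types.FunctionType)
                    and name not in getattr(obj, '__dict__', {})):
                return attr, obj
            break
    # Regular attribute (or shadowed method): resolved value plus "NULL".
    return getattr(obj, name), None

def call_method(meth, receiver, *args):
    """Sketch of CALL_METHOD: insert the receiver as the first
    argument when LOAD_METHOD pushed one."""
    if receiver is not None:
        return meth(receiver, *args)
    return meth(*args)
```

The point of the split is visible here: in the common case no bound
method object is ever materialized; the function and the receiver travel
on the stack separately.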
This idea alone makes CPython around 2-4% faster. And it surely doesn't
make it slower. I think it's a safe bet to at least implement this
optimization in CPython 3.6.
So far, the patch only optimizes positional-only method calls. It's
possible to optimize all kinds of calls, but this will require 3 more
opcodes (explained in the issue). We'll need to do some careful
benchmarking to see if it's really needed.
Per-opcode cache in ceval
While reading PEP 509, I was thinking about how we can use
dict->ma_version in ceval to speed up globals lookups. One of the key
assumptions (and this is what makes JITs possible) is that real-life
programs don't modify globals and rebind builtins (often), and that most
code paths operate on objects of the same type.
In CPython, all pure Python functions have code objects. When you call
a function, ceval executes its code object in a frame. Frames contain
contextual information, including pointers to the globals and builtins
dicts. The key observation here is that almost all code objects always
have the same pointers to the globals (the module they were defined in) and
to the builtins. And it's not a good programming practice to mutate
globals or rebind builtins.
Let's look at this function:
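The function's source did not survive in this copy of the message; a
definition consistent with the disassembly that follows (with 'ham' as a
hypothetical global) would be:

```python
def spam():
    # 'print' and 'ham' are both resolved via LOAD_GLOBAL at runtime;
    # 'ham' is assumed to be defined elsewhere in the module.
    print(ham)
```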
Here are its opcodes:
2 0 LOAD_GLOBAL 0 (print)
3 LOAD_GLOBAL 1 (ham)
6 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
10 LOAD_CONST 0 (None)
The opcodes we want to optimize are the LOAD_GLOBALs at offsets 0 and 3.
Let's look at the first one, which loads the 'print' function from builtins. The
opcode knows the following bits of information:
- its offset (0),
- its argument (0 -> 'print'),
- its type (LOAD_GLOBAL).
And these bits of information will *never* change. So if this opcode
could resolve the 'print' name (from globals or builtins, likely the
latter) and save the pointer to it somewhere, along with
globals->ma_version and builtins->ma_version, it could, on its second
call, just load this cached info back, check that the globals and
builtins dict haven't changed and push the cached ref to the stack.
That would save it from doing two dict lookups.
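The caching idea can be modeled in pure Python with a dict subclass
carrying a version counter (a toy stand-in for PEP 509's dict->ma_version;
the class names here are hypothetical, and only item assignment and
deletion bump the version in this sketch):

```python
class VersionedDict(dict):
    """Toy stand-in for a dict with PEP 509's ma_version field."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.version = 0
    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self.version += 1          # any mutation bumps the version
    def __delitem__(self, key):
        super().__delitem__(key)
        self.version += 1

class CachedGlobalLoad:
    """Sketch of one LOAD_GLOBAL cache entry: it remembers the resolved
    object plus the globals/builtins versions it observed."""
    def __init__(self, globals_d, builtins_d, name):
        self.g, self.b, self.name = globals_d, builtins_d, name
        self.cached = None         # (g_version, b_version, obj)
    def load(self):
        if (self.cached is not None
                and self.cached[0] == self.g.version
                and self.cached[1] == self.b.version):
            return self.cached[2]  # fast path: zero dict lookups
        # Slow path: the usual globals-then-builtins resolution.
        obj = self.g[self.name] if self.name in self.g else self.b[self.name]
        self.cached = (self.g.version, self.b.version, obj)
        return obj
```

A mutation of either dict changes its version, so the next load falls
back to the slow path and refreshes the cache.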
We can also optimize LOAD_METHOD. There is a high chance that 'obj' in
'obj.method()' will be of the same type every time we execute the code
object. So if we had an opcode cache, LOAD_METHOD could then cache
a pointer to the resolved unbound method, a pointer to obj.__class__,
and tp_version_tag of obj.__class__. Then it would only need to check
if the cached object type is the same (and that it wasn't modified) and
that obj.__dict__ doesn't override 'method'. Long story short, this
caching really speeds up method calls on types implemented in C.
list.append becomes very fast, because list doesn't have a __dict__, so
the check is very cheap (with cache).
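A pure-Python model of that check might look as follows (the class name
is hypothetical, and the caller-supplied counter stands in for the real
tp_version_tag, which CPython maintains internally):

```python
class MethodCacheEntry:
    """Sketch of a per-opcode LOAD_METHOD cache keyed on the
    receiver's class and a class-version counter."""
    def __init__(self):
        self.klass = None
        self.version = None
        self.method = None

    def load(self, obj, name, class_version):
        klass = type(obj)
        if (klass is self.klass and class_version == self.version
                and name not in getattr(obj, '__dict__', {})):
            return self.method               # fast path: no lookup at all
        # Slow path: full attribute lookup, then repopulate the cache.
        self.klass, self.version = klass, class_version
        self.method = getattr(klass, name)
        return self.method
```

For a type like list, which has no per-instance __dict__, the
instance-dict check is trivially cheap, which is why list.append
benefits so much.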
A straightforward implementation of such a cache is simple but consumes
a lot of memory that would just be wasted, since we only need the
cache for LOAD_GLOBAL and LOAD_METHOD opcodes. So we have to be creative
about the cache design. Here's what I came up with:
1. We add a few fields to the code object.
2. ceval will count how many times each code object is executed.
3. When the code object is executed over ~900 times, we mark it as
"hot". We also create an 'unsigned char' array "MAPPING", with length
set to match the length of the code object. So we have a 1-to-1 mapping
between opcodes and MAPPING array.
4. For the next ~100 calls, while the code object is "hot", LOAD_GLOBAL and
LOAD_METHOD do "MAPPING[opcode_offset()]++".
5. After 1024 total calls to the code object, the ceval loop iterates
through MAPPING, counting all opcodes that were executed more than 50 times.
6. We then create an array of cache structs "CACHE" (here's a link to
the updated code.h file: ). We update MAPPING to be a mapping
between opcode position and position in the CACHE. The code object is
now marked as "optimized".
7. When the code object is "optimized", LOAD_METHOD and LOAD_GLOBAL use
the CACHE array for fast path.
8. When there is a cache miss, i.e. the builtins/globals/obj.__dict__
were mutated, the opcode marks its entry in 'CACHE' as deoptimized, and
it will never try to use the cache again.
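The warm-up scheme in steps 1-8 can be sketched as a small pure-Python
model (class and constant names are hypothetical; the real bookkeeping
lives in the C code object and the ceval loop):

```python
# Constants follow the message: ~900 calls to become "hot", build the
# cache at call 1024, keep opcodes executed more than 50 times.
HOT_AT, OPTIMIZE_AT, MIN_COUNT = 900, 1024, 50

class CodeProfile:
    """Toy model of the per-code-object warm-up bookkeeping."""
    def __init__(self, n_opcodes):
        self.n = n_opcodes
        self.calls = 0
        self.mapping = None   # per-offset counters, later CACHE indexes
        self.cache = None     # compact array: one entry per hot opcode

    def on_call(self):
        self.calls += 1
        if self.calls == HOT_AT:                 # step 3: mark as "hot"
            self.mapping = [0] * self.n
        elif self.calls == OPTIMIZE_AT:          # steps 5-6: optimize
            hot = [i for i, c in enumerate(self.mapping) if c > MIN_COUNT]
            self.cache = [{} for _ in hot]       # placeholder cache entries
            index = {off: j for j, off in enumerate(hot)}
            # MAPPING now maps opcode offset -> CACHE index (or -1).
            self.mapping = [index.get(i, -1) for i in range(self.n)]

    def profiling(self):
        return self.mapping is not None and self.cache is None

    def record(self, offset):                    # step 4: count executions
        if self.profiling():
            self.mapping[offset] += 1
```

Only opcodes that actually ran often enough get a CACHE slot, which is
what keeps the memory overhead proportional to the hot code, not to the
whole bytecode.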
Here's a link to the issue tracker with the first version of the patch:
. I'm working on the patch in a github repo here: .
There are many things about this algorithm that we can improve/tweak.
Perhaps we should profile code objects longer, or account for time they
were executed. Maybe we shouldn't deoptimize opcodes on their first
cache miss. Maybe we can come up with better data structures. We also
need to profile the memory and see how much more this cache will require.
One thing I'm certain about is that we can get a 5-10% speedup of
CPython with relatively low memory impact. And I think it's worth doing.
If you're interested in this kind of optimization, please help with
code reviews, ideas, profiling and benchmarks. The latter is especially
important: I'd never have imagined how hard it is to come up with a good
macro benchmark.
I also want to thank my company MagicStack (magic.io) for sponsoring
this work.
Saw recent discussion:
I remember trying WPython; it was fast. Unfortunately it feels it came at
the wrong time when development was invested in getting py3k out the door.
It also had a lot of other ideas like *_INT instructions which allowed
the oparg to be a constant int rather than needing to LOAD_CONST one.
Anyways, I'll stop reminiscing.
abarnert has started an experiment with wordcode:
I've personally benchmarked this fork with positive results. This
experiment seeks to be conservative-- it doesn't seek to introduce new
opcodes or combine BINARY_OP's all into a single op where the currently
unused-in-wordcode arg then states the kind of binary op (à la COMPARE_OP).
I've submitted a pull request which is working on fixing tests & updating
Bringing this up on the list to figure out if there's interest in a basic
wordcode change. It feels like there are no downsides: faster code, smaller
bytecode, simpler interpretation of bytecode (The Nth instruction starts at
the 2Nth byte if you count EXTENDED_ARG as an instruction). The only
downside is the transitional cost.
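That offset arithmetic is easy to see in a toy decoder (a sketch of the
wordcode layout only, not CPython's actual decoding machinery):

```python
def decode_wordcode(code: bytes):
    """Sketch of fixed-width wordcode decoding: instruction N always
    starts at byte 2*N (one opcode byte followed by one oparg byte)."""
    assert len(code) % 2 == 0, "wordcode is always an even number of bytes"
    return [(code[i], code[i + 1]) for i in range(0, len(code), 2)]
```

With the old variable-width bytecode you have to walk the stream to find
instruction N; here it is a constant-time index.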
What'd be necessary for this to be pulled upstream?
After talking to Guido and Serhiy, we present the next revision
of this PEP. It is a compromise that we are all happy with,
and a relatively restricted rule that makes additions to PEP 8
basically unnecessary.
I think the discussion has shown that supporting underscores in
the from-string constructors is valuable, therefore this is now
added to the specification section.
The remaining open question is about the reverse direction: do
we want a string formatting modifier that adds underscores as
thousands separators?
Title: Underscores in Numeric Literals
Author: Georg Brandl, Serhiy Storchaka
Type: Standards Track
Post-History: 10-Feb-2016, 11-Feb-2016
Abstract and Rationale
This PEP proposes to extend Python's syntax and number-from-string
constructors so that underscores can be used as visual separators for
digit grouping purposes in integral, floating-point and complex number
literals.
This is a common feature of other modern languages, and can aid
readability of long literals, or literals whose value should clearly
separate into parts, such as bytes or words in hexadecimal notation.
# grouping decimal numbers by thousands
amount = 10_000_000.0
# grouping hexadecimal addresses by words
addr = 0xDEAD_BEEF
# grouping bits into nibbles in a binary literal
flags = 0b_0011_1111_0100_1110
# same, for string conversions
flags = int('0b_1111_0000', 2)
The current proposal is to allow one underscore between digits, and
after base specifiers in numeric literals. The underscores have no
semantic meaning, and literals are parsed as if the underscores were
absent.
The production list for integer literals would therefore look like this:
integer: decinteger | bininteger | octinteger | hexinteger
decinteger: nonzerodigit (["_"] digit)* | "0" (["_"] "0")*
bininteger: "0" ("b" | "B") (["_"] bindigit)+
octinteger: "0" ("o" | "O") (["_"] octdigit)+
hexinteger: "0" ("x" | "X") (["_"] hexdigit)+
bindigit: "0" | "1"
hexdigit: digit | "a"..."f" | "A"..."F"
For floating-point and complex literals::
floatnumber: pointfloat | exponentfloat
pointfloat: [digitpart] fraction | digitpart "."
exponentfloat: (digitpart | pointfloat) exponent
digitpart: digit (["_"] digit)*
fraction: "." digitpart
exponent: ("e" | "E") ["+" | "-"] digitpart
imagnumber: (floatnumber | digitpart) ("j" | "J")
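A direct transcription of the integer productions above into Python
regular expressions (an illustrative sketch, not part of the PEP):

```python
import re

# One regex per production from the grammar above.
digit      = r"[0-9]"
decinteger = rf"[1-9](?:_?{digit})*|0(?:_?0)*"
bininteger = r"0[bB](?:_?[01])+"
octinteger = r"0[oO](?:_?[0-7])+"
hexinteger = r"0[xX](?:_?[0-9a-fA-F])+"
INTEGER    = re.compile(
    rf"{decinteger}|{bininteger}|{octinteger}|{hexinteger}")

def is_valid_integer_literal(s: str) -> bool:
    """True if 's' is an integer literal under the proposed grammar."""
    return INTEGER.fullmatch(s) is not None
```

Note how the `(["_"] digit)*` shape in the grammar becomes `(?:_?[0-9])*`:
each underscore must be immediately followed by a digit, which rules out
doubled, leading and trailing underscores by construction.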
Following the same rules for placement, underscores will be allowed in
the following constructors:
- ``int()`` (with any base)
- ``float()``
- ``complex()``
- ``Decimal()`` (in the ``decimal`` module)
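On an interpreter that implements the proposal (it eventually landed in
Python 3.6), the from-string constructors accept the same placement:

```python
# Underscores are ignored when converting from strings, with any base.
assert int('10_000') == 10000
assert int('0b_1111_0000', 2) == 240
assert float('1_000.5') == 1000.5
assert complex('1_000j') == 1000j
```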
Those languages that do allow underscore grouping implement a large
variety of rules for allowed placement of underscores. In cases where
the language spec contradicts the actual behavior, the actual behavior
is listed. ("single" or "multiple" refers to allowing runs of
consecutive underscores.)
* Ada: single, only between digits _
* C# (open proposal for 7.0): multiple, only between digits _
* C++14: single, between digits (different separator chosen) _
* D: multiple, anywhere, including trailing _
* Java: multiple, only between digits _
* Julia: single, only between digits (but not in float exponent parts)
* Perl 5: multiple, basically anywhere, although docs say it's
restricted to one underscore between digits _
* Ruby: single, only between digits (although docs say "anywhere")
* Rust: multiple, anywhere, except for between exponent "e" and digits
* Swift: multiple, between digits and trailing (although textual
description says only "between digits") _
Underscore Placement Rules
Instead of the relatively strict rule specified above, the use of
underscores could be limited. As seen in other languages, common rules
include:
* Only one consecutive underscore allowed, and only between digits.
* Multiple consecutive underscores allowed, but only between digits.
* Multiple consecutive underscores allowed, in most positions except
for the start of the literal, or special positions like after a
decimal point.
the common use cases, and does not allow for syntax that would have to
be discouraged in style guides anyway.
A less common rule would be to allow underscores only every N digits
(where N could be 3 for decimal literals, or 4 for hexadecimal ones).
This is unnecessarily restrictive, especially considering the
separator placement is different in different cultures.
A proposed alternate syntax was to use whitespace for grouping.
Although strings are a precedent for combining adjoining literals, the
behavior can lead to unexpected effects which are not possible with
underscores. Also, no other language is known to use this rule,
except for languages that generally disregard any whitespace.
C++14 introduces apostrophes for grouping (because underscores
introduce ambiguity with user-defined literals), which is not
considered because of the use in Python's string literals. _
It has been proposed _ to extend the number-to-string formatting
language to allow ``_`` as a thousands separator, where currently only
``,`` is supported. This could be used to easily generate code with
more readable literals.
A preliminary patch that implements the specification given above has
been posted to the issue tracker. _
..  http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3499.html
..  http://dlang.org/spec/lex.html#integerliteral
..  http://perldoc.perl.org/perldata.html#Scalar-value-constructors
..  http://doc.rust-lang.org/reference.html#number-literals
..  https://github.com/dotnet/roslyn/issues/216
..  http://archive.adaic.com/standards/83lrm/html/lrm-02-04.html#2.4
..  http://ruby-doc.org/core-2.3.0/doc/syntax/literals_rdoc.html#label-Numbers
..  https://mail.python.org/pipermail/python-dev/2016-February/143283.html
..  http://bugs.python.org/issue26331
This document has been placed in the public domain.
There is an old discussion about the performance of PyMem_Malloc()
memory allocator. CPython stresses memory allocators a lot. The last
time I gathered statistics was for PEP 454:
"For example, the Python test suites calls malloc() , realloc() or
free() 270,000 times per second on average."
I proposed a simple change: modify PyMem_Malloc() to use the pymalloc
allocator, which is faster for allocations smaller than 512 bytes, or
fall back to malloc() (which is the current internal allocator of
PyMem_Malloc()).
This tiny change makes Python up to 6% faster on some specific (macro)
benchmarks, and it doesn't seem to make Python slower on any benchmark.
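The proposed dispatch rule is tiny; here is a toy model of it in Python
(the function and allocator arguments are hypothetical stand-ins for the
real C allocators, used only to illustrate the size-based routing):

```python
SMALL_REQUEST_THRESHOLD = 512   # bytes; pymalloc's small-object limit

def pymem_malloc(size, pymalloc_alloc, system_malloc):
    """Toy model of the proposed PyMem_Malloc dispatch: small requests
    go to pymalloc's arenas, larger ones fall back to system malloc()."""
    if size < SMALL_REQUEST_THRESHOLD:
        return pymalloc_alloc(size)
    return system_malloc(size)
```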
Do you see any drawback of using pymalloc for PyMem_Malloc()?
Does anyone recall the rationale for having two families of memory allocators?
FYI Python has 3 families since 3.4: PyMem, PyObject but also PyMem_Raw!
Since pymalloc is only used for small memory allocations, I understand
that small objects will no longer be allocated on the heap, but only
in pymalloc arenas, which are allocated with mmap(). The advantage of
arenas is that it's possible to "punch holes" in the memory when a
whole arena is freed, whereas the heap memory has the famous
"fragmentation" issue because the heap is a single contiguous memory
The libc malloc() uses mmap() for allocations larger than a threshold
which is now dynamic, and initialized to 128 kB or 256 kB by default
(I don't recall exactly the default value).
Is there a risk of *higher* memory fragmentation if we start to use
pymalloc for PyMem_Malloc()? Does someone know how to test it?
For the last ~36 hours I have stopped receiving emails for messages
posted in the bug tracker. Is anyone else having this problem? Has
anything changed recently?
I have had it set to send to my gmail.com address since the beginning.
At the moment the last bug message email is
<https://bugs.python.org/issue19959#msg262569> with “Date: Mon, 28 Mar
2016 12:19:49 +0000”. I have checked spam and they are not going there.
Earlier this year I had to set up a rule to avoid lots of tracker
emails suddenly going to spam. I suspect there was something about the
emails that Google doesn’t like (though I don’t understand the
technical details). Maybe this has recently gotten worse at the Google
end.
Summary: There are two prospective Google Summer of Code (GSOC) students
applying to work on writing a gui interface to the basic pip functions
needed by beginners. I expect Google to accept their proposals. Before
I commit to mentoring a student (sometime in April), I would like to be
sure, by addressing any objections now, that I will be able to commit
the code when ready (August or before).
In February 2015, Raymond Hettinger opened tracker issue
"IDLE to provide menu options for using PIP"
The menu options would presumably open dialog boxes defined in a new
module such as idlelib.pipgui. Raymond gave a list of 9 features he
thought would be useful to pip beginners.
Donald Stufft (pip maintainer) answered that he already wanted someone
to write a pip gui, to be put somewhere, and that he would give advice
on interfacing (which he has).
I answered that I had also had a vague idea of a pip gui, and thought it
should be a stand-alone window invoked by a single IDLE menu item, just
as turtledemo can be now. Instead of multiple dialogs (for multiple
IDLE menu items), there could be, for instance, multiple tabs in a
ttk.Notebook. Some pages might implement more than one of the features on
Raymond's list.
Last September, I did some proof-of-concept experiments and changed the
title to "IDLE to provide menu link to PIP gui". In January, when Terri
Oda requested Core Python GSOC project ideas, I suggested the pip gui
project. I believe Raymond's list can easily be programmed in the time
allotted. I also volunteered to help mentor.
Since then, two students have submitted competent prototypes (on the
tracker issue above) that show that they can write a basic tkinter app
and revise in response to reviews.
My current plan is to add idlelib/pipgui.py (or perhaps pip.py) to 3.5
and 3.6. The file will be structured so that it can either be run as a
separate process ('python -m idlelib.pipgui' either at a console or in a
subprocess call) or imported into a running process. IDLE would
currently use a subprocess call, but if IDLE is restructured into a
single-window, multi-tab application, it might switch to using an import.
I would document the new IDLE menu entry in the current IDLE page.
Separately from the pip gui project, I plan, at some point, to add a new
'idlelib' section that documents public entry points to generally useful
idlelib components. If I do that before next August, I would add an
entry for pipgui (which would say that details of the GUI are subject to
change).
1. One might argue that if pipgui is written so as to not depend on
IDLE, then it, like turtledemo, should be located elsewhere, possibly in
Tools/scripts. I would answer that managing packages, unlike running
turtle demos, *is* an IDE function.
2. One might argue that adding a new module with a public entry point,
in a maintenance release, somehow abuses the license granted by PEP 434,
in a way that declaring a public interface in an existing module would
not. If this is sustained, I could not document the new module for 3.5.
Terry Jan Reedy
Python 3 is becoming more and more popular and is close to a dangerous
point where it could become more popular than Python 2. The PSF decided
that it's time to devise a new secret plan to ensure that Python users suffer
again with a new major release breaking all their legacy code.
The PSF is happy to announce that the new Python release will be
Python 8!
Why the version 8? It's just to be greater than Perl 6 and PHP 7, but
it's also a mnemonic for PEP 8. By the way, each minor release will now
multiply the version by 2. With Python 8 released in 2016 and one
release every two years, we will beat Firefox 44 in 2022 (Python 64) and
Windows 2003 in 2032 (Python 2048).
A major release requires a major change to justify a version bump: the
new killer feature is that it's no longer possible to import a module
which does not respect PEP 8. It ensures that all your code is pure.
$ python8 -c 'import keyword'
Lib/keyword.py:16:1: E122 continuation line missing indentation or outdented
Lib/keyword.py:16:1: E265 block comment should start with '# '
Lib/keyword.py:50:1: E122 continuation line missing indentation or outdented
ImportError: no pep8, no glory
Good news: since *no* module of the current Python 3 standard library
respects PEP 8, the standard library will be simplified to a single
module, which is new in Python 8: pep8. The standard library will
move to the Python Cheeseshop (PyPI), answering an old and popular
request.
DON'T PANIC! You are still able to import your legacy code into
Python 8, you just have to rename all your modules to add a "_noqa" suffix
to the filename. For example, rename utils.py to utils_noqa.py. A side
effect is that you have to update all imports. For example, replace
"import django" with "import django_noqa". After a study of the PSF,
it's a best option to split again the Python community and make sure
that all users are angry.
The plan is that in 10 years, at least 50% of the 77,000 packages on the
Python cheeseshop will be updated to get the "_noqa" tag. After 2020,
the PSF will start to sponsor trolls to harass users of the legacy
Python 3 to force them to migrate to Python 8.
Python 8 is a work in progress (it's still an alpha version): the
standard library has not been removed yet. Fortunately, trying to import
any module of the standard library already fails.
Don't hesitate to propose more ideas to make Python 8 more incompatible
with Python 3!
Note: The change is already effective in the default branch of Python:
On Fri, Apr 1, 2016 at 8:44 AM, Victor Stinner <victor.stinner(a)gmail.com> wrote:
> You should now try Python 8 and try find if a module can still be imported ;-)
Okay.... I can fire up interactive Python and 'import this'. But I
can't run 'make'. This will be interesting!