webmaster has already heard from 4 people who cannot install it.
I sent them to the bug tracker or to python-list, but they seem
not to have gone to either place. Is there a guide I should be
sending them to, along the lines of 'how to debug installation problems'?
Laura
Hi,
tl;dr I have a patch that improves CPython performance by up to 5-10%
on macro benchmarks. Benchmark results on a MacBook Pro/Mac OS X, a
desktop CPU/Linux, and a server CPU/Linux are available at [1]. There
are no slowdowns that I could reproduce consistently.
Two different optimizations yield this speedup:
the LOAD_METHOD/CALL_METHOD opcodes and a per-opcode cache in the ceval loop.
LOAD_METHOD & CALL_METHOD
-------------------------
We had a lot of conversations with Victor about his PEP 509, and he sent
me a link to his amazing compilation of notes about CPython performance
[2]. One optimization he pointed out to me was the LOAD_METHOD/CALL_METHOD
opcodes, an idea that originated in PyPy.
There is a patch that implements this optimization; it's tracked at
[3]. There are some low-level details that I explain in the issue,
but I'll go over the high-level design in this email as well.
Every time you access a method attribute on an object, a BoundMethod
object is created. This is a fairly expensive operation, even though
there is a freelist of BoundMethods (so memory allocation is generally
avoided). The idea is to detect what looks like a method call in the
compiler and emit a pair of specialized bytecodes for it.
So instead of LOAD_GLOBAL/LOAD_ATTR/CALL_FUNCTION we will have
LOAD_GLOBAL/LOAD_METHOD/CALL_METHOD.
LOAD_METHOD looks at the object on top of the stack and checks whether
the name resolves to a method or to a regular attribute. If it's a
method, we push the unbound method object and the object onto the
stack. If it's an attribute, we push the resolved attribute and NULL.
When CALL_METHOD looks at the stack, it knows how to call the unbound
method properly (passing the object as the first argument), or how to
call a regular callable.
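For reference, here's a tiny way to see the sequence being replaced. On
CPython 3.5/3.6, a method call like the one below compiles to LOAD_ATTR
followed by CALL_FUNCTION (illustrative snippet; exact offsets and
argument indexes vary by version):

import dis

def call_method(obj):
    return obj.method(42)

# On CPython 3.5/3.6 this prints something like:
#   LOAD_FAST      0 (obj)
#   LOAD_ATTR      0 (method)
#   LOAD_CONST     1 (42)
#   CALL_FUNCTION  1
#   RETURN_VALUE
# With the patch, the LOAD_ATTR/CALL_FUNCTION pair would become
# LOAD_METHOD/CALL_METHOD.
dis.dis(call_method)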
This idea makes CPython around 2-4% faster, and it certainly doesn't
make it slower. I think it's a safe bet to at least implement this
optimization in CPython 3.6.
So far, the patch only optimizes positional-only method calls. It's
possible to optimize all kinds of calls, but that will require 3 more
opcodes (explained in the issue). We'll need to do some careful
benchmarking to see if it's really needed.
Per-opcode cache in ceval
-------------------------
While reading PEP 509, I was thinking about how we can use
dict->ma_version in ceval to speed up globals lookups. One of the key
assumptions (and this is what makes JITs possible) is that real-life
programs don't (often) modify globals or rebind builtins, and that most
code paths operate on objects of the same type.
In CPython, all pure Python functions have code objects. When you call
a function, ceval executes its code object in a frame. Frames contain
contextual information, including pointers to the globals and builtins
dicts. The key observation here is that almost all code objects always
have the same pointers to the globals (of the module they were defined
in) and to the builtins. And it's not good programming practice to
mutate globals or rebind builtins.
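You can see part of that observation from pure Python: a function defined
at module level always carries the globals dict of its defining module
(illustrative snippet):

def spam_check():
    pass

# The function's globals pointer is the module's dict and never changes
# for this particular function object.
assert spam_check.__globals__ is globals()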
Let's look at this function:
def spam():
    print(ham)
Here are its opcodes:
  2           0 LOAD_GLOBAL              0 (print)
              3 LOAD_GLOBAL              1 (ham)
              6 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
              9 POP_TOP
             10 LOAD_CONST               0 (None)
             13 RETURN_VALUE
The opcodes we want to optimize are the two LOAD_GLOBALs, at offsets 0
and 3. Let's look at the first one, which loads the 'print' function
from builtins. The opcode knows the following bits of information:
- its offset (0),
- its argument (0 -> 'print'),
- its type (LOAD_GLOBAL).
And these bits of information will *never* change. So if this opcode
could resolve the 'print' name (from globals or builtins, likely the
latter) and save the pointer to it somewhere, along with
globals->ma_version and builtins->ma_version, then on its second
execution it could just load this cached info back, check that the
globals and builtins dicts haven't changed, and push the cached
reference onto the stack. That would save it from doing two dict lookups.
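Here's a rough, self-contained Python sketch of that fast path.
Everything here is illustrative: the real code would be C inside ceval,
and the 'version' fields stand in for PEP 509's ma_version tag.

from types import SimpleNamespace

def load_global(cache, name, globals_ns, builtins_ns):
    # Fast path: both dict versions match what we cached last time.
    if (cache.ptr is not None
            and cache.globals_version == globals_ns.version
            and cache.builtins_version == builtins_ns.version):
        return cache.ptr                         # no dict lookups at all
    # Slow path: do the two normal lookups, then refill the cache.
    obj = globals_ns.d.get(name)
    if obj is None:
        obj = builtins_ns.d[name]
    cache.ptr = obj
    cache.globals_version = globals_ns.version
    cache.builtins_version = builtins_ns.version
    return obj

globals_ns = SimpleNamespace(d={}, version=1)            # stand-in for a module dict
builtins_ns = SimpleNamespace(d={'print': print}, version=1)
cache = SimpleNamespace(ptr=None, globals_version=0, builtins_version=0)

load_global(cache, 'print', globals_ns, builtins_ns)     # first run: fills the cache
load_global(cache, 'print', globals_ns, builtins_ns)     # second run: cache hit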
We can also optimize LOAD_METHOD. There is a high chance that 'obj' in
'obj.method()' will be of the same type every time we execute the code
object. So if we had an opcode cache, LOAD_METHOD could cache
a pointer to the resolved unbound method, a pointer to obj.__class__,
and the tp_version_tag of obj.__class__. It would then only need to
check that the cached type is the same (and hasn't been modified) and
that obj.__dict__ doesn't override 'method'. Long story short, this
caching really speeds up method calls on types implemented in C.
list.append becomes very fast, because list instances don't have a
__dict__, so the check is very cheap (with the cache).
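And a similarly hand-wavy sketch of the LOAD_METHOD check. Again purely
illustrative: '_version_tag' stands in for the C-level tp_version_tag,
which isn't visible from Python.

def load_method(cache, obj, name):
    tp = type(obj)
    if (cache.get('type') is tp
            and cache.get('tag') == getattr(tp, '_version_tag', None)
            and name not in getattr(obj, '__dict__', {})):
        return cache['func'], obj                # cache hit: unbound method + self
    func = getattr(tp, name)                     # cache miss: resolve normally
    cache['type'] = tp
    cache['func'] = func
    cache['tag'] = getattr(tp, '_version_tag', None)
    return func, obj

cache = {}
func, self_obj = load_method(cache, [], 'append')   # first call: slow path
func, self_obj = load_method(cache, [], 'append')   # cache hit: list has no __dict__
func(self_obj, 42)                                  # same as [].append(42)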
A straightforward implementation of such a cache would consume a lot of
memory, most of it wasted, since we only need the cache for the
LOAD_GLOBAL and LOAD_METHOD opcodes. So we have to be creative about
the cache design. Here's what I came up with (a rough Python model of
the scheme follows the list):
1. We add a few fields to the code object.
2. ceval will count how many times each code object is executed.
3. When the code object is executed more than ~900 times, we mark it as
"hot". We also create an 'unsigned char' array "MAPPING", with its
length set to match the length of the code object, so there is a 1-to-1
mapping between opcode positions and MAPPING entries.
4. For the next ~100 calls, while the code object is "hot", LOAD_GLOBAL
and LOAD_METHOD do "MAPPING[opcode_offset()]++".
5. After 1024 calls to the code object, the ceval loop iterates through
MAPPING, counting all opcodes that were executed more than 50 times.
6. We then create an array of cache structs "CACHE" (here's a link to
the updated code.h file: [6]). We update MAPPING to be a mapping
between opcode position and position in the CACHE. The code object is
now "optimized".
7. When the code object is "optimized", LOAD_METHOD and LOAD_GLOBAL use
the CACHE array for the fast path.
8. When there is a cache miss, i.e. the builtins/globals/obj.__dict__
were mutated, the opcode marks its entry in CACHE as deoptimized, and
it never tries to use the cache again.
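To make steps 3-8 concrete, here's that rough Python model of the
bookkeeping. The thresholds and the MAPPING/CACHE names approximate the
numbers above; everything else (the class name, dict-based cache
entries) is illustrative, since the real implementation would be C code
on the code object and in ceval.

HOT_THRESHOLD = 900        # step 3: calls before the code object becomes "hot"
PROFILE_CALLS = 100        # step 4: length of the profiling window
MIN_RUNS = 50              # step 5: opcode must run this often to get a CACHE slot

class CodeCacheModel:
    def __init__(self, n_offsets):
        self.calls = 0
        self.mapping = None        # per-offset counters, later offset -> CACHE slot
        self.cache = None          # list of per-opcode cache entries
        self.n_offsets = n_offsets

    def on_call(self):
        self.calls += 1
        if self.calls == HOT_THRESHOLD:
            self.mapping = [0] * self.n_offsets           # start profiling
        elif self.calls == HOT_THRESHOLD + PROFILE_CALLS:
            counters, self.mapping = self.mapping, [None] * self.n_offsets
            self.cache = []
            for offset, runs in enumerate(counters):      # steps 5-6
                if runs > MIN_RUNS:
                    self.mapping[offset] = len(self.cache)
                    self.cache.append({})                 # stand-in cache struct

    def on_cacheable_opcode(self, offset):
        if self.mapping is None:
            return None                                   # not hot yet
        if self.cache is None:
            self.mapping[offset] += 1                     # profiling phase (step 4)
            return None
        slot = self.mapping[offset]                       # optimized phase (step 7)
        return None if slot is None else self.cache[slot]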
Here's a link to the issue tracker with the first version of the patch:
[5]. I'm working on the patch in a GitHub repo here: [4].
Summary
-------
There are many things about this algorithm that we can improve/tweak.
Perhaps we should profile code objects for longer, or take into account
how long they run. Maybe we shouldn't deoptimize opcodes on their first
cache miss. Maybe we can come up with better data structures. We also
need to profile the memory usage and see how much more this cache will require.
One thing I'm certain about is that we can get a 5-10% speedup of
CPython with relatively low memory impact. And I think that's worth
exploring!
If you're interested in this kind of optimization, please help with
code reviews, ideas, profiling and benchmarks. The latter are especially
important; I never imagined how hard it is to come up with a good macro
benchmark.
I also want to thank my company MagicStack (magic.io) for sponsoring
this work.
Thanks,
Yury
[1] https://gist.github.com/1st1/aed69d63a2ff4de4c7be
[2] http://faster-cpython.readthedocs.org/index.html
[3] http://bugs.python.org/issue26110
[4] https://github.com/1st1/cpython/tree/opcache2
[5] http://bugs.python.org/issue26219
[6] https://github.com/python/cpython/compare/master...1st1:opcache2?expand=1#d…
Since we're all talking about making Python faster, I thought I'd drop
some previous ideas I've had here in case (1) someone wants to actually
do them, and (2) they really are new ideas that haven't failed in the
past. Mostly I was thinking about startup time.
Here is the list of modules imported on a clean startup on my Windows,
US-English machine (taken from -v output and cleaned up a bit):
import _frozen_importlib
import _imp
import sys
import '_warnings'
import '_thread'
import '_weakref'
import '_frozen_importlib_external'
import '_io'
import 'marshal'
import 'nt'
import '_thread'
import '_weakref'
import 'winreg'
import 'zipimport'
import '_codecs'
import 'codecs'
import 'encodings.aliases'
import 'encodings'
import 'encodings.mbcs'
import '_signal'
import 'encodings.utf_8'
import 'encodings.latin_1'
import '_weakrefset'
import 'abc'
import 'io'
import 'encodings.cp437'
import 'errno'
import '_stat'
import 'stat'
import 'genericpath'
import 'ntpath'
import '_collections_abc'
import 'os'
import '_sitebuiltins'
import 'sysconfig'
import '_locale'
import '_bootlocale'
import 'encodings.cp1252'
import 'site'
Obviously, the easiest first step is to remove or delay unnecessary
imports. But a while ago I used a native profiler to trace through this,
and the most impactful modules were the encodings:
import 'encodings.mbcs'
import 'encodings.utf_8'
import 'encodings.latin_1'
import 'encodings.cp437'
import 'encodings.cp1252'
While I don't doubt that we need all of these for *some* reason,
aliases, cp437 and cp1252 are relatively expensive modules to import,
mostly because they have large static dictionaries or data structures
generated at startup.
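To get a rough feel for that cost on your own machine, something like
the following works (purely illustrative; the numbers vary a lot
between platforms and runs):

import importlib
import sys
import time

for name in ('encodings.aliases', 'encodings.cp437', 'encodings.cp1252'):
    sys.modules.pop(name, None)              # force a fresh import
    start = time.perf_counter()
    importlib.import_module(name)
    print(name, time.perf_counter() - start)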
Given that this is static and mostly read-only information [1], I see no
reason why we couldn't either generate completely static versions of
these modules or, better yet, compile the resulting data structures into
the core binary.
([1]: If some people rely on being able to write to some of the encoding
data, I vote for breaking that in 3.6 and making it read-only.)
This is probably the code snippet that bothered me the most:
### Encoding table
encoding_table=codecs.charmap_build(decoding_table)
It shows up in many of the encodings modules, and while it is not a bad
function in itself, we are obviously generating a known data structure
on every startup. Storing these in static data is a tradeoff between
disk space and startup performance, and one I think is likely to be
worthwhile.
Anyway, just an idea if someone wants to try it and see what
improvements we can get. I'd love to do it myself, but when it actually
comes to finding time I keep coming up short.
Cheers,
Steve
P.S. If you just want to discuss optimisation techniques or benchmarking
in general, without specific application to CPython 3.6, there's a whole
internet out there. Please don't make me the cause of a pointless
centithread. :)
I am looking for tutorials on, basically, how to write the Python language itself.
I also want to know how to write wrappers for things like ALSA,
DirectX, CoreAudio, JACK, and PulseAudio.
I would be looking to write wrappers for CoreAudio on Mac and CoreAudio for
Windows, as well as ALSA. I think it would benefit you to have this
information.
I am looking to aid in the development of computers and languages, and tutorials
on how to write the actual Python language would be super useful.
But yeah, I'm looking to write applications like Guitarix, Rakarrack, Blender,
Ardour, and LMMS.
I am also going to build my own computers, so these tutorials would help a
lot.
Anyhow, thanks for your time.
anyhow thanks for your time
Check out and cd into Python trunk.
% grep -Ri win16 * | wc
10 66 625
% grep -Ri nextstep * | wc
23 119 1328
% grep -Ri rhapsody * | wc
47 269 3390
% grep -Ri msdos * | wc
56 381 3895
% grep -Ri ms-dos * | wc
20 180 1425
win16! *laughs* *wipes tear from eye*
It's currently 2016. Perhaps it's time to remove all vestiges of these
unsupported operating systems nobody's cared about since a year that
started with a '1'?
//arry/
p.s. I suspect some of the uses of "rhapsody" are actually live code
used for OS X. So this isn't necessarily dead code; some of it is
merely long-out-of-date comments.
p.p.s. At least there were no references to "taligent", "copland", or
"gershwin"...!
How should we resolve a conflict between documentation and implementation
when the current implementation is incorrect, but third-party code may
implicitly depend on it?
For example, see issue26198. Currently, overflowing the predefined
buffer for the "es#" and "et#" format units causes a TypeError (with a
misleading message, but that is another story). The correct and
*documented* exception is ValueError. User code can depend on the current
behavior, because TypeError is what is raised now for this type of
error, and it is also what is raised for other types of errors. It is
unlikely that the authors of such code read the documentation, otherwise
this issue would have been reported earlier. On the other hand, it looks
like these format units are rarely used with a predefined buffer (never
in the stdlib since 3.5).
I think it is obvious that the code in the development branch should be
changed to raise the documented and more logical exception. But what about
bugfix releases? Changing the documentation would be misleading, while
changing the code could break existing code (unlikely, but possible).
Hi Yury,
> An off-topic: have you ever tried hg.python.org/benchmarks
> or compare MicroPython vs CPython? I'm curious if MicroPython
> is faster -- in that case we'll try to copy some optimization
> ideas.
I've tried a small number of those benchmarks, but not in any rigorous
way, and not enough to compare properly with CPython. Maybe one day I
(or someone) will get to it and report results :)
One thing that makes MP fast is the use of pointer tagging and the
stuffing of small integers within object pointers. Thus integer
arithmetic below 2**30 (on a 32-bit arch) requires no heap allocation.
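Modelled very loosely in Python, the tagging idea looks like this
(MicroPython does this on C machine words, not Python ints; this is
only to show the trick):

def tag_small_int(value):
    return (value << 1) | 1          # low bit set marks an inline small int

def is_small_int(word):
    return (word & 1) == 1

def untag_small_int(word):
    return word >> 1                 # arithmetic shift recovers the value

word = tag_small_int(12345)
assert is_small_int(word) and untag_small_int(word) == 12345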
> Do you use opcode dictionary caching only for LOAD_GLOBAL-like
> opcodes? Do you have an equivalent of LOAD_FAST, or you use
> dicts to store local variables?
The opcodes that have dict caching are:
LOAD_NAME
LOAD_GLOBAL
LOAD_ATTR
STORE_ATTR
LOAD_METHOD (not implemented yet in mainline repo)
For local variables we use LOAD_FAST and STORE_FAST (and DELETE_FAST).
Actually, there are 16 dedicated opcodes for loading from positions
0-15, and 16 for storing to these positions. Eg:
LOAD_FAST_0
LOAD_FAST_1
...
Mostly this is done to save RAM, since LOAD_FAST_0 is 1 byte.
> If we change the opcode size, it will probably affect libraries
> that compose or modify code objects. Modules like "dis" will
> also need to be updated. And that's probably just a tip of the
> iceberg.
>
> We can still implement your approach if we add a separate
> private 'unsigned char' array to each code object, so that
> LOAD_GLOBAL can store the key offsets. It should be a bit
> faster than my current patch, since it has one less level
> of indirection. But this way we lose the ability to
> optimize LOAD_METHOD, simply because it requires more memory
> for its cache. In any case, I'll experiment!
The problem with that approach (having a separate array for
offset_guess) is: how do you know where to look in that array for a
given LOAD_GLOBAL opcode? The second LOAD_GLOBAL in your bytecode should
look at the second entry in the array, but how does it know that?
I'd love to experiment with implementing my original caching idea in
CPython, but I have no time!
Cheers,
Damien.
Hi,
Summary: FAT Python is not faster, but it will be ;-)
--
When I started FAT Python as a fork of CPython 3.6, I put
everything in the same repository. In recent weeks, I focused on splitting
my giant patch (10k lines) into small, reviewable patches. I wrote 3
PEPs (509 dict version, 510 function specialization, 511 code
transformers) and I enhanced the API to make it usable for more use
cases than just FAT Python. I also created the fatoptimizer (the AST
optimizer) and fat (runtime dependency of the optimizer) projects on
GitHub to clearly separate what should live outside the Python core. For all
links, see:
http://faster-cpython.readthedocs.org/fat_python.html
For the fatoptimizer project, my constraint is to be able to run the
full Python test suite unmodified. In practice, I have to disable some
optimizations by putting a "__fatoptimizer__ = {...}" configuration in
some test files. For example, I have to disable constant folding on
test_bool because it tests that False+2 gives 2 at runtime, whereas
the optimizer replaces False+2 directly with 2 during compilation.
Well, test_bool.py is not the best example, because all its tests pass
with the constant folding optimization (if I comment out my
"__fatoptimizer__ = {...}" change).
This constraint ensures that the optimizer "works" and doesn't break
(too much ;-)) the Python semantics, but it makes it more difficult to
implement powerful optimizations.
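As a toy illustration of the kind of AST transform involved (this is
not fatoptimizer's code, and on the 3.5/3.6 AST the constant nodes were
Num/NameConstant rather than Constant):

import ast

class FoldAdd(ast.NodeTransformer):
    """Fold constant '+' expressions at compile time (toy example)."""
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if (isinstance(node.op, ast.Add)
                and isinstance(node.left, ast.Constant)
                and isinstance(node.right, ast.Constant)):
            folded = ast.Constant(node.left.value + node.right.value)
            return ast.copy_location(folded, node)
        return node

tree = ast.parse("x = False + 2")
tree = ast.fix_missing_locations(FoldAdd().visit(tree))
namespace = {}
exec(compile(tree, "<folded>", "exec"), namespace)
print(namespace["x"])   # 2, computed during compilation rather than at runtime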
I also found and fixed various kinds of bugs: in my code, obviously,
but also in various places in the Python core. Some bugs only concern
AST transformers, which are a new feature, but I had to fix them. For
example, Python didn't support negative line number deltas in the
co_lnotab of code objects, so line numbers were all wrong in
optimized code. I merged my enhancement into the default branch of
CPython (issue #26107).
In short, I focused on having something working (respecting the Python
semantics), rather than spending time on writing optimizations.
--
When I asked explicitly "Is someone opposed to this PEP 509 [dict
version]?", Barry Warsaw answered that a performance analysis is
required. An extract from his mail:
"I still think this is maintenance and potential performance
overhead we don't want to commit to long term unless it enables
significant optimization. Since you probably can't prove that without
some experimentation, this API should be provisional."
Last week, I ran some benchmarks and I have to admit that I was
disappointed. Not only does fatoptimizer not make Python faster, it
makes Python much slower on some tests!
http://fatoptimizer.readthedocs.org/en/latest/benchmarks.html
I quickly identified a major performance issue when nested functions
are specialized, especially in Lib/json/encoder.py (exercised by the
bm_json_v2.py benchmark). I fixed my optimizer so that it no longer
specializes nested functions. This simple change fixed the main
performance issue. Reminder: in performance-critical code, don't use
nested functions! I may propose patches for Lib/json/encoder.py
to stop using nested functions.
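The reason nested functions interact badly with specialization is easy
to see: a new function object is created on every call to the enclosing
function, so anything attached to one particular function object is
thrown away almost immediately (illustrative snippet):

def make_encoder():
    def encode(obj):              # a fresh function object on every call
        return str(obj)
    return encode

print(make_encoder() is make_encoder())    # False: two distinct function objects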
I only ran benchmarks with the optimizer enabled. I now have to
measure the overhead of my patches (PEP 509, 510 and 511), which add
the API for AST optimizers. The overhead must be negligible; for me,
that's a requirement of the whole project. The changes must not make
Python slower when the optimizer is not used.
fatoptimizer is faster on microbenchmarks, but I had to write some
optimizations manually:
http://fatoptimizer.readthedocs.org/en/latest/microbenchmarks.html
IMHO fatoptimizer is not faster on macro benchmarks because it is not
smart enough (yet) to generate the most interesting optimizations,
like function inlining and specialization for argument types. You can
estimate the potential speedup by manually specializing your functions.
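One classic manual trick along those lines is the default-argument idiom
below, which moves a global lookup to function definition time. It is
not what fatoptimizer generates, just a hint of the kind of win
specialization is after:

def total_lengths(items):
    return sum(len(s) for s in items)            # 'sum' and 'len' looked up each call

def total_lengths_fast(items, _sum=sum, _len=len):
    return _sum(_len(s) for s in items)          # resolved once, at definition time

print(total_lengths_fast(["a", "bb", "ccc"]))    # 6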
--
Barry also wrote: "Did you address my suggestion on python-ideas to
make the new C API optionally compiled in?"
Well, that is an option, but I would prefer to have the API for AST
optimizers built directly into Python.
The first beta of Python 3.6 is scheduled for September 2016
(the deadline for new features in Python 3.6), so I still have a few
months to implement more powerful optimizations and prove that it can
be faster ;-)
Victor