
Hi, I'm only a simple Python developer, not a Type Hinting expert, and I don't know if you have already discussed this somewhere, but: with the upcoming official support for Type Hinting in Python, does that mean CPython could use this information to store variables in more efficient data structures, and not only to check types? Could we one day get better performance with Type Hinting, as you do with Cython (explicit type declarations) or PyPy (type guessing)? Is it realistic to expect that one day, or have I missed some mountains ;-) ? If so, better performance would be a great consequence of Type Hinting, and more people would be interested in this feature, as we have seen with AsyncIO (BTW, I'm working on benchmarks for this, I'll publish them on the AsyncIO ML). Regards. -- Ludovic Gasc

On 21 December 2014 at 09:55, Ludovic Gasc <gmludo@gmail.com> wrote:
The primary goals of the type hinting standardisation effort are improving program correctness (through enhanced static analysis) and API documentation (through clearer communication of expectations for input and output, both in the code, in documentation, and in IDEs). It should also allow the development of more advanced techniques for function signature based dispatch and other forms of structured pattern matching.
There's also the fact that with both Numba and PyPy now supporting selective JIT acceleration of decorated functions within the context of a larger CPython application, as well as Cython's existing support for precompilation as a C extension, the pattern of profiling to find performance critical areas, and finding ways to optimise those, now seems well established. (Hence my suggestion the other day that we could likely use an introductory how to guide on performance profiling, which could also provide suggestions for optimisation tools to explore once the hot spots have been found). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
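As a minimal sketch of the profiling workflow Nick describes (the function and numbers below are placeholders, not taken from any of the projects mentioned), the standard library's cProfile and pstats modules are enough to find hot spots:

    import cProfile
    import pstats

    def hot_loop(n):
        # Deliberately slow pure-Python work, standing in for real application code.
        return sum(i * i for i in range(n))

    profiler = cProfile.Profile()
    profiler.enable()
    hot_loop(10**6)
    profiler.disable()

    # Print the ten most expensive calls by cumulative time.
    pstats.Stats(profiler).sort_stats('cumulative').print_stats(10)

The same data can be collected without touching the code at all via "python -m cProfile -s cumulative myscript.py" (myscript.py being a placeholder), which is roughly the shape a profiling HOWTO would likely start from.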

What Nick is trying to say is that mypy and other optimizers have been quite successful without type hints. I believe that Jim Baker is hopeful that he will be able to do some interesting things for Jython with type hints though. --Guido On Sat, Dec 20, 2014 at 5:15 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido)

Hi Nick, Thanks for your answer. I understand the primary goal, and I'm not completely naive on this question: a long time ago, I used Type Hinting a lot with PHP 5. Nevertheless, in the Python community you can find a lot of libraries that improve performance based on type handling, with different optimization strategies (numba, pypy, pythran, cython, llvm-py, shedskin...). To my knowledge, no other dynamic language has the same number of libraries for this, which means the Python community clearly has this problem. It's like the async pattern: in Python you have plenty of libraries (Twisted, eventlet, gevent, stackless, tornado...) and now, with AsyncIO, the community should converge on it. And yes, I understand that it's almost impossible to create a silver bullet that improves performance automagically, but, to my simple dev eyes, the common pattern I see in pythran, cython... is type handling. It isn't the only strategy they use to improve performance, but it's the most visible part in the example code I've seen. Guido: "optimizers have been quite successful without type hints" <= Certainly, but instead of losing time trying to guess the right data structure, maybe it would be faster if the developer stated directly what he wants. To be honest, I'm a little bit tired of hearing biases like "Python is slow", "not good for performance", "you must use C/Java/Erlang/Go...". For me, Python has the right compromise: you can write readable source code quickly, and you have ways to speed it up when you need to. The more primitives we have in CPython to build performant applications, the easier it will be to convince people to use Python. Regards. -- Ludovic Gasc On 21 Dec 2014 02:15, "Nick Coghlan" <ncoghlan@gmail.com> wrote:

On Mon, Dec 22, 2014 at 6:23 AM, Ludovic Gasc <gmludo@gmail.com> wrote:
There may be something to that. Most of the people I've heard moaning that "Python is slow" aren't backing that up with any actual facts. Telling people "Add these type hints to your code if you feel like it" might satisfy their need for red tape and the gut feeling that it's speeding up the code. ChrisA

On 22/12/14 00:20, Chris Angelico wrote:
There may be something to that. Most of the people I've heard moaning that "Python is slow" aren't backing that up with any actual facts.
One fact is that the fastest Random Forests classifier known to man is written in Python (with the addition of some Cython in critical places). For those who don't know, Random Forests is one of the strongest algorithms (if not the strongest) for estimating or predicting a probability p as a non-linear function of N variables. Here are some interesting slides: http://www.slideshare.net/glouppe/accelerating-random-forests-in-scikitlearn Just look at what they could make Python and Cython do, compared to e.g. the pure C++ solution in OpenCV. Even more interesting, the most commonly used implementation of Random Forests is the version for R, which is written in Fortran. There is actually a pure Python version which is faster... So who says Python is slow? I think today it is mostly people who don't know what they are talking about. We see Python running on the biggest HPC systems today. We see Python algorithms beating anything we can throw at them in C++ or Fortran. It is time to realize that the flexibility of Python is not just a slowdown, it is also a speed boost, because it makes it easier to write flexible and complex algorithms. But yes, Python needs help from Numba or Cython, and sometimes even Fortran (f2py), to achieve its full potential. The main take-home lesson from those slides, by the way, is the need to (1) profile Python code to identify bottlenecks -- humans are very bad at this kind of guesswork -- and (2) use code annotation in Cython (compile with -a) to limit the Python overhead. Sturla
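For readers who want to see what the scikit-learn implementation Sturla refers to looks like from the user's side, here is a minimal sketch (assuming scikit-learn is installed; the synthetic dataset and parameter values are arbitrary, chosen only for illustration):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # A synthetic classification problem stands in for real data.
    X, y = make_classification(n_samples=10000, n_features=20, random_state=0)

    # n_jobs=-1 builds the trees on all available cores; the heavy lifting
    # happens in the Cython code paths mentioned in the slides.
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
    clf.fit(X, y)
    print(clf.score(X, y))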

On 22 December 2014 at 22:05, Sturla Molden <sturla.molden@gmail.com> wrote:
Yep - the challenge is to:
a) communicate the potential of these tools effectively to new Python users
b) ensure these tools are readily available to them (without turning into a build nightmare)
A profiling HOWTO (especially one that highlights visual tools like SnakeViz) may be helpful on the first point, but the second one is a fair bit harder to address in the general case. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Thank you everybody for all the information, very interesting. I'll reply to everybody in the same e-mail:

1. Python 3 usage: Where I work, we switched all new projects to Python 3 almost one year ago. To be honest, it wasn't to be up-to-date, it was for a new feature not natively present in Python 2: AsyncIO. We made a lot of Twisted (Telephony+WebSockets) and Django/Flask daemons (WebServices), but I wanted to: A. Create all-in-one daemons (Telephony+WebSockets+WebServices) to share our business logic more easily within the same project. B. Simplify the architecture: Twisted and Django are very complicated for our simple needs. The side effects are: A. Our productivity is better, we finish our projects for our clients more quickly, because we share more source code and the architecture is simpler to handle. B. The performance we get is better than in the past. Why am I telling you this? To give you a concrete example: if you want to motivate Python developers from the battlefield to migrate to Python 3, you need to add features/performance/... to Python 3. Not add all PyPI libraries to CPython, but add features you can't add easily via a library, like yield from, AsyncIO or Type Hinting. On production systems, who cares whether it's Python 2/3, Go, Erlang...? Certainly not clients, nor management. If you want us to use Python 3, please give us arguments to "sell" the migration to Python 3 within the company.

2. Python 3 deployment: I use Pythonz: https://github.com/saghul/pythonz to quickly deploy a Python version on a new server with an Ansible recipe. Certainly, some sysadmins may be shocked by this behaviour because it's forbidden by packaging religion, but who cares? We lose less time deploying, it's more reproducible between the dev environment and production, and upgrades are very easy via Ansible.

3. PyPy usage: I've made some benchmarks of the same WebService running as a Flask daemon on PyPy and as an AsyncIO daemon on CPython, and it was very interesting: for our needs, the asynchronous pattern gives us more performance than PyPy. Yes, I know, I've compared apples with pears, but in the end I only want to know how many customers I can stack on the same server. The more I stack, the less it costs my company. I'm waiting for Python 3.3 support in PyPy to push that into production with a real daemon. No AsyncIO, no PyPy.

4. Python performance: Two things changed my mind about the "Python is slow" bias: A. Python High Performance: http://shop.oreilly.com/product/0636920028963.do B. Web Framework Benchmarks: http://www.techempower.com/benchmarks/ Especially B.: this benchmark isn't the "truth", but at least you can compare a lot of languages/frameworks on examples closer to my use cases than other benchmarks are. For example, if you compare Python frameworks with Erlang frameworks on "multiple queries", you can see that Python is in fact very good. I used to think that Erlang was very complicated to code in compared to Python but would give better performance in all cases; I had the wrong opinion. Finally, everybody has biases about programming language performance.

5. Type Hinting - performance booster (the original point of my e-mail): Thank you Guido, your example is clear. I understand that it will be step-by-step, which is a good thing. I took the liberty of sending an e-mail because I wasn't sure I understood the global roadmap correctly. 
I'm the first to understand that it's very difficult, and maybe CPython will never use Type Hinting to improve performance, because it isn't really simple to implement. I'm pretty sure I can't help you implement it, but at least I can promote it to others. Thank you everybody again for your attention. Met vriendelijke groeten, -- Ludovic Gasc On Mon, Dec 22, 2014 at 6:06 PM, Sturla Molden <sturla.molden@gmail.com> wrote:

Am 23.12.14 um 00:56 schrieb Ludovic Gasc:
I'm waiting Python 3.3 support in PyPy to push that on production with a real daemon. No AsyncIO, No PyPy.
Trollius should work on PyPy already: https://pypi.python.org/pypi/trollius Trollius is a port of the Tulip project (asyncio module, PEP 3156) to Python 2. Trollius works on Python 2.6-3.5. It has been tested on Windows, Linux, Mac OS X, FreeBSD and OpenIndiana. http://trollius.readthedocs.org/en/latest/changelog.html 2014-07-30: Version 1.0.1 This release supports PyPy and has better support for asyncio coroutines, especially in debug mode. Mike
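For context, here is a minimal sketch of the asyncio coroutine style under discussion, using the Python 3.4 generator-based API; Trollius exposes essentially the same API on Python 2, with "yield From(...)" standing in for "yield from":

    import asyncio

    @asyncio.coroutine
    def handle(delay):
        # Pretend to wait on some I/O-bound work.
        yield from asyncio.sleep(delay)
        return delay * 2

    loop = asyncio.get_event_loop()
    print(loop.run_until_complete(handle(0.1)))
    loop.close()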

On 22 December 2014 at 05:23, Ludovic Gasc <gmludo@gmail.com> wrote:
Perhaps the most effective thing anyone could do to make significant progress in the area of CPython performance is to actually get CodeSpeed working properly with anything other than PyPy, as automated creation of clear metrics like that can be incredibly powerful as a motivational tool (think about the competition on JavaScript benchmarks between different browser vendors, or the way the PyPy team use speed.pypy.org as a measure of their success in making new versions faster). Work on speed.python.org (as a cross-implementation counterpart to speed.pypy.org) was started years ago, but no leader ever emerged to drive the effort to completion (and even a funded development effort by the PSF failed to produce a running instance of the service). Another possible approach would be to create a JavaScript front end for PyPy (along the lines of the PyPy-based Topaz interpreter for Ruby, or the HippyVM interpreter for PHP), and make a serious attempt at displacing V8 at the heart of Node.js. (The Node.js build system for binary extensions is already written in Python, so why not the core interpreter as well? There's also the fact that Node.js regularly ends up running on no longer supported versions of V8, as V8 is written to meet the needs of Chrome, not those of the server-side JavaScript community). One key advantage of the latter approach is that the more general purpose PyPy infrastructure being competitive with the heavily optimised JavaScript interpreters created by the browser vendors on a set of industry standard performance benchmarks is much, much stronger evidence of PyPy's raw speed than being faster than the not-known-for-its-speed CPython interpreter on a set of benchmarks originally chosen specifically by Google for the Unladen Swallow project. Even with Topaz being one of the fastest Ruby interpreters, to the point of Oracle Labs using it as a relative benchmark for comparison of JRuby's performance in http://www.slideshare.net/ThomasWuerthinger/graal-truffle-ethdec2013, that's still relatively weak evidence for raw speed, since Ruby in general is also not well known for being fast. (Likewise, HippyVM being faster than Facebook's HHVM is impressive, but vulnerable to the same counter-argument that people make for Python and Ruby, "If you care about raw speed, why are you still using PHP?") Objective benchmarks and real world success stories are the kinds of things that people find genuinely persuasive - otherwise we're just yet another programming language community making self-promoting claims on the internet without adequate supporting evidence. 
( http://economics.sas.upenn.edu/~jesusfv/comparison_languages.pdf is an example of providing good supporting evidence that compares Numba and Cython, amongst many other alternatives, to the speed of raw C++ and FORTRAN for evaluation of a particular numeric model - given the benefits they offer in maintainability relative to the lower level languages, they fare extremely well on the speed front) As things stand, we have lots of folks wanting *someone else* to do the work to counter the inaccurate logic of "CPython-the-implementation tends to prioritise portability and maintainability over raw speed, therefore Python-the-language is inherently slow", yet very few volunteering to actually do the work needed to counter it effectively in a global sense (rather than within the specific niches currently targeted by the PyPy, Numba, and Cython development teams - those teams do some extraordinarily fine work that doesn't get the credit it deserves due to a mindset amongst many users that only CPython performance counts in cross-language comparisons). Regards, Nick. P.S. As noted earlier, a profiling and optimising HOWTO in the standard documentation set would also make a lot of sense as a way of making these alternatives more discoverable, but again, it needs a volunteer to write it (or at least an initial draft which can then be polished in review on Rietveld). -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Ludovic, If I understand you correctly you would like type annotations to be used to speed up Python programs. You are not alone, but it's a really hard problem, because nobody wants their program to break. I think that eventually this will be possible, but in the short term, it would be a much, much bigger effort to create such a compiler than it is just to define (and reach agreement on) the syntax for annotations. I like to take smaller steps first, especially when it comes to defining syntax. Maybe once we have a standard type hinting notation someone will write a compiler. In the meantime, let me give you an example of why it's such a hard problem. Suppose I have created a library that contains a function that finds the longest of two strings:

    def longest(a, b):
        return a if len(a) >= len(b) else b

Now I start adding annotations, and I change this to:

    def longest(a: str, b: str) -> str:
        return a if len(a) >= len(b) else b

But suppose someone else is using the first version of my library and they have found it useful for other sequences besides strings, e.g. they call longest() with two lists. That wasn't what I had written the function for. But their program works. Should it be broken when the annotations are added? Who gets the blame? How should it be fixed? Now, I intentionally chose this example to be quite controversial, and there are different possible answers to these questions. But the bottom line is that this kind of situation (and others that can be summarized as "duck typing") make it hard to create a Python compiler -- so hard that nobody has yet cracked the problem completely. Maybe once we have agreed on type hinting annotations someone will figure it out. In the meantime, I think "offline" static type checking such as done by mypy is quite useful in and of itself. Also, IDEs will be able to use the information conveyed by type hinting annotations (especially in stub files). So I am content with limiting the scope of the proposal to these use cases. Salut, --Guido On Sun, Dec 21, 2014 at 11:23 AM, Ludovic Gasc <gmludo@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido)
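To make the duck-typing hazard Guido describes concrete: the annotated function still runs happily with lists today, because CPython ignores annotations at runtime -- it is only a hypothetical annotation-trusting compiler (or a static checker such as mypy) that would treat the call as an error:

    def longest(a: str, b: str) -> str:
        return a if len(a) >= len(b) else b

    # Downstream code written against the unannotated version:
    result = longest([1, 2, 3], [4, 5])  # runs fine, returns [1, 2, 3]
    print(result)

    # A static checker would flag the call above (list is not str), and a
    # compiler that specialised the function for str could break it outright.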

On 22 December 2014 at 14:39, Guido van Rossum <guido@python.org> wrote:
I'd like to echo this particular sentiment, since my other replies could easily give the impression I disagreed. I'm actually optimistic that once this kind of annotation becomes popular, folks will find ways to make use of it for performance improvements (even if it's just priming a JIT cache to reduce warmup time). The only idea I disagree with is that it's necessary to *wait* for this feature for Pythonistas to start countering the "Python is slow" meme - the 3.5 release is still almost a year away, and the migration of the overall ecosystem from Python 2 to 3 is definitely still a work in progress. As a result, I believe that in the near term, it's better to look at the barriers to adoption for existing tools like PyPy, Numba and Cython and start figuring out whether those barriers can be reduced in some way. In the case of PyPy, deployment of web services using it can be awkward relative to CPython. A prebuilt image on DockerHub that includes uWSGI + the PyPy plugin + PyPy could be a good way to reduce barriers to entry for Linux based deployments. A separate prebuilt image on DockerHub could also be a good way to make it easy for folks to experiment with building web services based on the PyPy STM branch. Another barrier to PyPy adoption is the level of use of cffi vs hand crafted C extensions. While the inclusion of pip with CPython makes cffi more readily available, there's also in-principle agreement to incorporate it (and its dependencies) into the standard library. Folks interested in increasing PyPy's popularity may want to get in touch with the cffi developers and see what remaining barriers (other than available time) there are to moving forward with that proposal as a PEP for Python 3.5. (Standard library backports have more credibility as dependencies in many situations, since the standard library inclusion acts as a clear endorsement of the original project) For Numba and Cython, challenges are more likely to be centred around the process of getting started with the tools, and providing a suitable build environment. Again, Docker images can help, but in this case, by providing a preconfigured build environment that can either run in a local container on Linux, or under a Linux VM on Mac OS X or Windows. None of those Docker related ideas need to be filtered through the CPython core development team, though, nor even through the development teams of the individual projects. All it takes is someone sufficiently motivated to figure out how to do it, publish their work, and then approach the relevant project to say "Hey, I did this, does anyone else want to help promote and maintain it, and perhaps give it an official project blessing?". Unlike the Docker image ideas, putting together the cffi PEP *does* need involvement from both the developers of the cffi project and the rest of the CPython core development team, but if the cffi folks are amenable, it's conceivable someone else could do the actual PEP wrangling. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
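As a sketch of the cffi style Nick contrasts with hand-crafted C extensions (this assumes a Unix system with a standard C library; the printf example mirrors the one in cffi's own overview documentation):

    from cffi import FFI

    ffi = FFI()
    ffi.cdef("int printf(const char *format, ...);")  # declare the C signature we need
    C = ffi.dlopen(None)                               # load the standard C library

    arg = ffi.new("char[]", b"world")                  # allocate a C char array
    C.printf(b"hello, %s!\n", arg)

The same source runs unmodified on CPython and PyPy, which is exactly why lowering the barriers around cffi matters for PyPy adoption.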

On 21/12/14 02:15, Nick Coghlan wrote:
Numba is not very mature, but it can already JIT accelerate most Python bytecode to speeds comparable to -O2 in C. And not just that: it does not need type hints, it also has an "autojit" mode that can infer types at runtime. Why did Unladen Swallow fail where Numba succeeded? I think for two reasons: First, Numba was designed for a particular purpose: numerical computing. No offence to Google, but the numerics folks are the ones who really understand how to beat gigaflops out of the CPU. Numba was created by Travis Oliphant, who was also the creator of the modern NumPy package (as opposed to Jim Hugunin's original Numeric package and NASA's Numarray package). Unladen Swallow made the mistake of trying to accelerate "everything". Second, LLVM has improved. At the time the swallow was hatched, LLVM was not really useful as a JIT compiler framework. It was, in comparison to GCC, even a lousy static compiler. Now it has matured, and it is excellent as a JIT compiler as well as a static compiler. It is still a little behind GCC and the Intel compilers on static optimization, but more than good enough to be the default compiler on Mac OS X and FreeBSD. It should also be mentioned that Numba (or just NumbaPro?) integrates with CUDA, and it can actually run Python bytecode on the GPU. For certain data-parallel tasks it can make your Python program yield a teraflop with a modern Nvidia GPU. I also want to mention Cython. It has also shown that compiled Python can often run at the speed of C++. But its real merit is the ease with which we can write C extensions for Python. I think in the long run, CPython could benefit from Numba as well as Cython. Numba can be used to boost performance critical code without having to resort to C. Cython can e.g. be used to implement the standard library without hand-written C modules. It would lower the threshold for contribution as well as improve readability. Sturla
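A minimal sketch of the decorator-based JIT usage Sturla and Nick mention (assuming Numba is installed; the kernel itself is arbitrary, chosen only for illustration). With no signature given, @jit infers the argument types at the first call, which is the "autojit" behaviour described above:

    from numba import jit

    @jit  # no signature: types are inferred when the function is first called
    def escape_count(c_re, c_im, max_iter):
        # Count Mandelbrot iterations until the point escapes.
        z_re = z_im = 0.0
        for i in range(max_iter):
            z_re, z_im = z_re * z_re - z_im * z_im + c_re, 2.0 * z_re * z_im + c_im
            if z_re * z_re + z_im * z_im > 4.0:
                return i
        return max_iter

    print(escape_count(-0.7, 0.3, 1000))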

Ludovic Gasc schrieb am 21.12.2014 um 00:55:
Regarding Cython, it's actually unlikely that it will bring anything. The proposed feature is about specifying Python object types for function arguments and (to a certain extent) local variables, and Cython is already pretty good in guessing those or doing optimistic optimisations that just say "if it looks like you're using a dict, let's generate special code that speeds up dict access and leaves other stuff to a fallback code path". The type semantics are even different for builtin types. When you type a variable as "dict" in Cython, it will reject subtypes as there's no use in typing a variable as "dict" if it can't be optimised as a dict due to potential subtype overrides. The normal optimistic "fast for dicts, works for all objects" code that Cython generates here is exactly as fast as code that would allow only dict or subtypes. Typing a variable as arbitrary size Python "int" is also useless as there is no advantage Cython can draw from it, but even typing a variable as Python "float", which is identical to a C double, isn't helpful due to the semantic difference for subtypes. Cython could be slightly improved to also generate special casing low level code paths for Python integer and floating point operations, but then, why waste time doing that when simply typing a variable as plain C int or double gives you full native speed without any of the hassle of runtime type checks, bloated special casing code paths, etc.? And typing variables with user implemented Python classes is equally useless as there are no special optimisations that Cython could apply to them. So, from the POV of Cython, I do see the advantage of a syntax that integrates with Python's own syntax (Cython's "pure Python mode" [1] has provided similar typing support for years now), but please don't expect any performance advantage from this. Stefan [1] http://docs.cython.org/src/tutorial/pure.html
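For readers unfamiliar with the pure Python mode Stefan links to, here is a small sketch of what that typing looks like (the function is illustrative; the code runs unchanged under plain CPython, where the cython module's declarations are no-ops, and gains C-level arithmetic when compiled with Cython):

    import cython

    @cython.locals(n=cython.int, i=cython.int, total=cython.double)
    def harmonic(n):
        # With plain C int/double types, the loop compiles to native arithmetic;
        # interpreted under CPython, it is just ordinary Python.
        total = 0.0
        for i in range(1, n + 1):
            total += 1.0 / i
        return total

    print(harmonic(10**6))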

On 24/12/14 12:45, Stefan Behnel wrote:
Numba is also very similar to Cython here. There is a set of types and expressions for which it can produce an efficient code path, and otherwise it falls back to calling the CPython interpreter. Psyco also used a similar strategy. This is precisely where Unladen Swallow went astray: they tried to accelerate every code path. Cython's advantage is that we can mix "down to the iron" C or C++ with Python, without having to use or know the Python C API. And of course Cython is not only for speed. It is also for making it easier to write C or C++ extensions for any thinkable reason. Sturla

On 12/24/2014 06:40 AM, Sturla Molden wrote:
My thought is that making Python easier to multi-process on multi-core CPUs will be where the biggest performance gains are. Think of 100-core chips in as soon as 5 or 6 years. (Doubling about every two years; you can get 16-core CPUs now.) It's not uncommon to have multi-term expressions where each term can be processed independently of the others:

    result = foo(e1, e2, ... eN)

where each expression "eN" may be a call to other functions, which may also have multiple terms. If it can be determined at parse time that they can't possibly affect each other, then these terms can be evaluated in parallel, and byte code, or compiled code, that does that can be generated. (I'm not sure it's doable, but it seems like it should be possible.) And... the same type of data dependence graphs needed to do that can aid in checking correctness of code. Which is the near term benefit. Cheers, Ron
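What Ron sketches can already be done explicitly today; the following is a hedged illustration (foo and e1..e3 are placeholders for genuinely independent, side-effect-free terms) of evaluating the terms in parallel with the standard library:

    from concurrent.futures import ProcessPoolExecutor

    # Stand-ins for independent sub-expressions of a larger result.
    def e1(): return sum(range(10**6))
    def e2(): return sum(range(10**6, 2 * 10**6))
    def e3(): return sum(range(2 * 10**6, 3 * 10**6))

    def foo(a, b, c):
        return a + b + c

    if __name__ == '__main__':
        with ProcessPoolExecutor() as pool:
            # Each term is submitted to its own worker process.
            futures = [pool.submit(term) for term in (e1, e2, e3)]
            result = foo(*(f.result() for f in futures))
        print(result)

The hard part, as the rest of the thread points out, is proving automatically that the terms really are independent -- which is where the dependency analysis Ron mentions would come in.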

On 12/26/2014 11:13 AM, Antoine Pitrou wrote:
Which is silly: 100 cores, or making Python easier to multi-process? The 5 or 6 years figure is my optimistic expectation for high end workstations and servers. Double that time for a typical desktop, and maybe triple it for wearable devices. Currently you can get 8-core high end desktop systems, and up to 18-core workstations with Windows 8. They probably run Python too. I think the unknown is how much time it will take, not whether or not it will happen. Cheers, Ron

I think the 5-6 year estimate is pessimistic. Take a look at http://en.wikipedia.org/wiki/Xeon_Phi for some background. I have it on a sort of nudge-and-wink authority that Intel already has in-house chips with 128 cores, and has distributed prototypes to a limited set of customers/partners. More good reasons to look at PyPy-STM, which has reached the stage of "useful" I think. On Fri, Dec 26, 2014 at 1:03 PM, Ron Adam <ron3200@gmail.com> wrote:
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On Fri, 26 Dec 2014 13:11:19 -0700 David Mertz <mertz@gnosis.cx> wrote:
I think the 5-6 year estimate is pessimistic. Take a look at http://en.wikipedia.org/wiki/Xeon_Phi for some background.
"""Intel Many Integrated Core Architecture or Intel MIC (pronounced Mick or Mike[1]) is a *coprocessor* computer architecture""" Enough said. It's not a general-purpose chip. It's meant as a competitor against the computational use of GPU, not against traditional general-purpose CPUs. Regards Antoine.

On Fri, Dec 26, 2014 at 1:39 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Yes and no: The cores of Intel MIC are based on a modified version of P54C design, used in the original Pentium. The basis of the Intel MIC architecture is to leverage x86 legacy by creating a x86-compatible multiprocessor architecture that can utilize existing parallelization software tools. Programming tools include OpenMP, OpenCL, Cilk/Cilk Plus and specialised versions of Intel's Fortran, C++ and math libraries. x86 is pretty general purpose, but also yes it's meant to compete with GPUs too. But also, there are many projects--including Numba--that utilize GPUs for "general computation" (or at least to offload much of the computation). The distinctions seem to be blurring in my mind. But indeed, as many people have observed, parallelization is usually non-trivial, and the presence of many cores is a far different thing from their efficient utilization.
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On Dec 26, 2014, at 23:05, David Mertz <mertz@gnosis.cx> wrote:
I think what we're eventually going to see is that optimized, explicit parallelism is very hard, but general-purpose implicit parallelism is pretty easy if you're willing to accept a lot of overhead. When people start writing a lot of code that takes 4x as much CPU but can run on 64 cores instead of 2 and work with a dumb ring cache instead of full coherence, that's when people will start selling 128-core laptops. And it's not going to be new application programming techniques that make that happen, it's going to be things like language-level STM, implicit parallelism libraries, kernel schedulers that can migrate low-utilization processes into low-power auxiliary cores, etc.

On Sat, Dec 27, 2014 at 01:28:14AM +0100, Andrew Barnert wrote:
I disagree. PyParallel works fine with existing programming techniques: Just took a screen share of a load test between a normal Python 3.3 release build and the debugged-up-the-wazoo flaky PyParallel 0.1-ish, and it undeniably crushes the competition. (Then crashes, 'cause you can't have it all.) https://www.youtube.com/watch?v=JHaIaOyfldo

Keep in mind that's a full debug build, but not only that, I've butchered every PyObject and added like, 6 more 8-byte pointers to it; coupled with excessive memory guard tests at every opportunity that result in a few thousand hash tables being probed to check for ptr address membership. The thing is slooooooww. And even with all that in place, check out the results:

Python33:
    Running 10s test @ http://192.168.1.15:8000/index.html
      8 threads and 64 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency    13.69ms   11.59ms   27.93ms   52.76%
        Req/Sec   222.14    234.53      1.60k    86.91%
      Latency Distribution
         50%    5.67ms
         75%   26.75ms
         90%   27.36ms
         99%   27.93ms
      16448 requests in 10.00s, 141.13MB read
      Socket errors: connect 0, read 7, write 0, timeout 0
    Requests/sec:   1644.66
    Transfer/sec:     14.11MB

PyParallel v0.1, exploiting all cores:
    Running 10s test @ http://192.168.1.15:8080/index.html
      8 threads and 8 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency     2.32ms    2.29ms   27.57ms   92.89%
        Req/Sec   540.82    154.01      0.89k    75.34%
      Latency Distribution
         50%    1.68ms
         75%    2.00ms
         90%    3.57ms
         99%   11.26ms
      40828 requests in 10.00s, 350.47MB read
    Requests/sec:   4082.66
    Transfer/sec:     35.05MB

~2.5 times improvement even with all its warts. And it's still not even close to being loaded enough -- 35% of a gigabit link being used and about half core use. No reason it couldn't do 100,000 requests/s. Recent thread on python-ideas with a bit more information: https://mail.python.org/pipermail/python-ideas/2014-November/030196.html Core concepts: https://speakerdeck.com/trent/pyparallel-how-we-removed-the-gil-and-exploite... Trent.

On Dec 27, 2014, at 8:34, Trent Nelson <trent@snakebite.org> wrote:
Then what are you disagreeing with? My whole point is that it's not going to be new application programming techniques that make parallelism accessible.
Sure, sloooooww code that's 8x as parallel runs 2.5x as fast. What's held things back for so long is that people insist on code that's almost as fast on 1- or 2-core machines and also scales to 8-core machines. That silly constraint is what's held us back. And now that mainstream machines are 2 to 8 cores instead of 1 to 2, and the code you have to be almost as fast as is still sequential, things are starting to change. Even when things like PyParallel or PyPy's STM aren't optimized at all, they're already winning.

On Fri, 26 Dec 2014 14:03:42 -0600 Ron Adam <ron3200@gmail.com> wrote:
This :-)
The 5 or 6 years figure is my optimistic expectation for high end workstations and servers.
I don't see how that's optimistic. Most workloads are intrinsically serial, not parallel. Expecting to get a 100-core general purpose CPU is expecting to get something that's unfit for most daily tasks, which is rather pessimistic. If the industry had followed the enthusiastic predictions from 5 years ago, the average desktop CPU would probably have 16+ HW threads right now - which it doesn't: the average core count stagnates between 2 and 4. Sure, some specific workloads in scientific computing may benefit - but if I understand correctly you can already release the GIL using Cython, and perhaps soon using Numba. Besides the serial nature of most workloads, there are other limits to multicore scalability, such as DRAM access latency and bandwidth. There's little point in having 100 CPU cores if they all compete for memory access as executing multiple threads simultaneously reduces the locality of accesses and therefore the efficiency of on-chip caches. Regards Antoine.

Antoine Pitrou schrieb am 26.12.2014 um 21:37:
I second this. Parallelisation continues to be difficult (and most often impossible or close enough), except for the trivial cases. As long as that holds, large multi-core chips will remain special purpose. Also, don't forget Knuth's famous quote on premature optimisation. In ~97% of the cases, (non-trivial) parallelisation is simply not needed, and thus is better not done. Stefan

On 12/26/2014 02:37 PM, Antoine Pitrou wrote:
Depends on what and how it's used.
This is changing. It's a feedback loop, as new hardware becomes available, software engineers learn to take advantage of it, and as they do, it drives the market... and then more hardware improvements are added by hardware engineers. Hey, they need a pay check, and investors need dividends, and there are enough people interested to make everyone think it's worth doing. <shrug> It's a huge economic machine we are talking about, and it's moving forward. I doubt we could stop it if we wanted to. So it's a process that is changing across the board. If you think of one part of that staying the same, either the hardware or the software, the other parts may seem "silly", but if you think of it all changing at once, it starts to make more sense. I wanted to find a good example, and the first thing that came to mind is the ever present web browser. That is probably the one piece of software that is run by more users directly than any other software program. A quick search brought up an interesting talk by Jack Moffitt about the Mozilla SERVO project and the Rust programming language. It's a bit long but interesting. At one point, (18:30 to 19:45 minute marks), he mentions a tree design where you spawn threads for each child, and they spawn threads for each of their children. http://www.infoq.com/presentations/servo-parallel-browser That of course is the software engineering side of things, and then you also have the hardware engineering side, but both sides are actively being developed at the same time.
Yes, I think it got off to a slower start than many expected. And with only a few cores, it makes more sense to distribute process's rather than threads. But I think this will change as the number of cores increase, and the techniques to use them also develop. Until we have the kind of fine grained threading Jack Moffitt was describing.
A web browser is about as mainstream as you can get. And its presence is big enough to drive the computer market wherever it goes. ;-)
It will be interesting to see how this changes. ;-) Cheers, Ron

On Fri, 26 Dec 2014 19:40:25 -0600 Ron Adam <ron3200@gmail.com> wrote:
Yeah, so what? Spawning threads isn't hard, exploiting parallelism is.
Yes, I think it got off to a slower start than many expected.
It didn't "get off to a slower start". It's actually perfectly in line with predictions of people like me or Linus Torvalds :-)
A web browser is about as mainstream as you can get.
Then let's talk about it again when a web browser manages to fully exploit 100 threads without thrashing CPU caches. Regards Antoine.

On 12/27/2014 04:00 AM, Antoine Pitrou wrote:
On Fri, 26 Dec 2014 19:40:25 -0600 Ron Adam<ron3200@gmail.com> wrote:
Yeah, so what? Spawning threads isn't hard, exploiting parallelism is.
I think you are using a stricter definition of parallelism here. If many programs spawn many threads, and those threads are distributed across many cores, then you have a computer that gains from having more cores. Yes, there are a lot of technical difficulties with doing that efficiently. The point I was trying to make at the start of this is that type hints can be used to identify parts of programs that don't depend on other parts, and that some of those parts can be run in parallel. Basically it involves converting a sequential program into a dependency tree, and allowing non-interdependent child nodes to run in parallel. The type hints may help with this. As for the technical details of building a chip with 100 cores and how to manage the caches for them, I'm happy to let the chip engineers work on that. I do believe we will see 100-core chips in a few years regardless of whether or not it makes complete sense to do so. Keep in mind that other languages may be able to take advantage of them much more easily than Python can, including newly emerging languages built specifically to use parallelism more effectively.
Yes, I think it got off to a slower start than many expected.
It didn't "get off to a slower start". It's actually perfectly in line with predictions of people like me or Linus Torvalds:-)
"... than many expected." It wasn't surprising to me either.
A web browser is about as mainstream as you can get.
Then let's talk about it again when a web browser manages to fully exploit 100 threads without thrashing CPU caches.
How about 80% utilisation with only 20% thrashing? ;-) Cheers, Ron

On Sat, 27 Dec 2014 14:09:59 -0600 Ron Adam <ron3200@gmail.com> wrote:
How about 80% utilisation with only 20% thrashing? ;-)
Well, how about you give actual examples instead of throwing numbers around. Otherwise let's wait 5 other years to see how the prediction of 100-core general purpose CPUs pans out. Regards Antoine.

Antoine Pitrou <solipsis@pitrou.net> wrote:
I don't see how that's optimistic. Most workloads are intrinsically serial, not parallel.
Computer graphics is intrinsically data-parallel, hence the GPU. A computer with a 128-core CPU would have no use for a GPU. Taking more market share from Nvidia and AMD would be one reason why Intel might want to produce such a chip. It would also remove the need for dedicated video RAM and special vertex buffer objects, and thus simplify the coding of 3D graphics.
You can do this under Numba too. Sturla

On Sat, Dec 27, 2014 at 01:54:02AM +0000, Sturla Molden wrote:
Or, to put it another way, a computer with a GPU has no need for a 128-core CPU. I think it is more likely that general purpose desktop computers will start using GPUs than that they will start using 100+ core CPUs. Apart from graphics, and some "embarrassingly parallel" tasks, most tasks are inherently serial. E.g. it takes 9 months to make a baby, you can't do it in 1 month by using 9 women or 100 chickens. Even those which aren't inherently serial usually have some serial components, and Amdahl's Law puts an upper bound on how much of a performance increase you can get by parallelising it: http://en.wikipedia.org/wiki/Amdahl's_law I expect that the number of cores used by general purpose desktops will increase very slowly. It makes sense for servers to use as many cores as possible, since they typically run many CPU-bound tasks in parallel, but that doesn't apply so much to desktops and it certainly doesn't apply to wearable computers. The idea of a wearable computer using a general-purpose CPU with 100 cores strikes me as sheer fantasy: most of the cores will be idling all of the time, the effect on battery life will be terrible, and the heat generated prohibitive. TL;DR I expect that two or four cores will remain standard for desktop computers for a long time, and one core machines will still be around for a while. Massively parallel computing will remain niche, the GIL is not really the bottleneck some people think it is, and when it is a bottleneck, existing ways of working around it are still very effective. -- Steven
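The bound Steven cites is easy to evaluate; a small sketch (the 90% figure is arbitrary, purely for illustration):

    def amdahl_speedup(parallel_fraction, n_cores):
        # Amdahl's Law: speedup = 1 / ((1 - P) + P / N)
        return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)

    # Even with 90% of the work parallelisable, 100 cores give less than 10x:
    for cores in (2, 4, 8, 100):
        print(cores, round(amdahl_speedup(0.9, cores), 2))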

Sturla Molden writes:
Taking more market shared from Nvidia and AMD would be one reason why Intel might want to produce such a chip.
You've put your finger on a very important class of reasons. And if history is any guide, they will become cheaper quickly. We should not underestimate the ability of consumers to be impressed by easily grasped numbers like "cores" and "GHz" once price comes within reach.

Antoine Pitrou <solipsis@pitrou.net> wrote:
I don't see how that's optimistic. Most workloads are intrinsically serial, not parallel.
Henry Ford invented a solution to the parallelization of many repetitive serial tasks 100 years ago. His solution is known as a conveyor belt or pipeline. If you can split up a serial task into a series of smaller subtasks, you can chain them as a pipeline of worker threads. It often shows up in signal processing and multimedia. Take e.g. a look at the design of VTK. You also have it in e.g. asynchronous i/o if you use threads and queues instead of coroutines to set up a pipeline. Then there is a big class of data-parallel tasks, such as e.g. in computer graphics. You have e.g. more than a million pixels on a screen, and each pixel must be processed independently. MapReduce is also a buzzword that describes a certain data-parallel task. You also find it in scientific computing, e.g. in linear algebra where we use libraries like BLAS and LAPACK. Then there are the ForkJoin tasks, a modern buzzword for a certain type of divide and conquer. A classical example is the FFT. Mergesort would be another example. Take a look at a statement like

    a = [foobar(y) for y in sorted(x)]

Here we have a data-parallel iteration over sorted(x), and the evaluation of sorted(x) is fork-join parallel. Is it unthinkable that a future compiler could figure this out on its own? Sturla
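Today the data-parallel part of that statement has to be written out explicitly; a minimal sketch of what such a compiler would effectively be generating (foobar is a placeholder for an expensive computation that depends only on its argument):

    from multiprocessing import Pool

    def foobar(y):
        # Placeholder: some expensive, independent per-item computation.
        return y * y

    if __name__ == '__main__':
        x = list(range(1000, 0, -1))
        with Pool() as pool:
            # Each foobar(y) is independent, so the map can fan out across cores.
            a = pool.map(foobar, sorted(x))
        print(a[:5])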

On 22/12/14 17:42, Nick Coghlan wrote:
For b), Anaconda by Continuum Analytics and Canopy by Enthought are doing a great job. But today it is also possible to create a minimalistic environment with only a few 'pip install' commands. It is not a build nightmare anymore. Sturla

Thank you everybody for all pieces of information, very interesting. I reply to everybody in the same e-mail: 1. For Python 3 usage: Where I work, we switched all new projects to Python 3 since almost one year. To be honest with you, it wasn't to be up-to-date, but it was for a new feature not natively present in Python 2: AsyncIO. We made a lot of Twisted (Telephony+WebSockets) and Django/Flask daemons (WebServices), but I wanted to: A. Create daemons all-in-one (Telephony+WebSockets+WebServices) to share easier our business logic for a same project. B. Simplify architecture: Twisted and Django are very complicated for our simple needs. The side effects are: A. Our productivity is better, we finish quicker our projects for our clients, because we share more source code and the architecture is simpler to handle. B. The performances we have are better compare to the past. Why I tell you that ? To give you a concrete example: if you want to motivate Python developers from the battlefield to migrate to Python 3, you need to add features/performances/... in Python 3. Not add all PyPI libraries in CPython, but add features you can't add easily via a library, like yield from, AsyncIO or Type Hinting. On production systems, who cares is Python 2/3, Go, Erlang... ? Certainly not clients and even management people. If you want that we use Python 3, please give us arguments to "sell" the migration to Python 3 in the company. 2. For Python 3 deployment: I use Pythonz: https://github.com/saghul/pythonz to quickly deploy a Python version on a new server with an Ansible recipe. Certainly, some sys admins could be shocked by this behaviour because it's forbidden in packaging religion, but who cares ? We lose less time to deploy, it's more reproducible between dev environment and production, and upgrades are very easy via Ansible. 3. About PyPy usage: I've made some benchmarks with the same WebService between a Flask daemon on PyPy and an AsyncIO daemon on CPython, it was very interesting: Compare to our needs, asynchronous pattern give us more performance that PyPy. Yes, I know, I've compared apples with pears, but at the end, I only want how many customers I can stack on the same server. More I stack, less it costs for my company. I'm waiting Python 3.3 support in PyPy to push that on production with a real daemon. No AsyncIO, No PyPy. 4. About Python performance: Two things changed my mind about the bias "Python is slow": A. Python High Performance: http://shop.oreilly.com/product/0636920028963.do B. Web Framework Benchmarks: http://www.techempower.com/benchmarks/ Especially with B.: this benchmark isn't the "truth", but at least, you can compare a lot of languages/frameworks based on examples more closer than my use cases, compare to others benchmarks. But, for example, if you compare Python frameworks with Erlang frameworks on "multiple queries", you can see that, in fact, Python is very good. In my mind, Erlang is very complicated to code compare to Python, but you'll have better performances in all cases, I had a wrong opinion. Finally, everybody has bias about programming languages performance. 5. Type Hinting - Performance booster (the first goal of my e-mail): Thank you Guido, your example is clear. I understand that it will be step-by-step, it's a good thing. I took the liberty to send an e-mail, because I didn't sure to understand correctly the global roadmap. 
I'm the first to understand that it's very difficult, and maybe that CPython will never use Type Hinting to improve performances, because it isn't really simple to implement. I'm pretty sure I can't help you to implement that, but at least, I can promote to others. Thank you everybody again for your attention. Met vriendelijke groeten, -- Ludovic Gasc On Mon, Dec 22, 2014 at 6:06 PM, Sturla Molden <sturla.molden@gmail.com> wrote:

Am 23.12.14 um 00:56 schrieb Ludovic Gasc:
I'm waiting Python 3.3 support in PyPy to push that on production with a real daemon. No AsyncIO, No PyPy.
Trollius should work on PyPy already: https://pypi.python.org/pypi/trollius Trollius is a portage of the Tulip project (asyncio module, PEP 3156) on Python 2. Trollius works on Python 2.6-3.5. It has been tested on Windows, Linux, Mac OS X, FreeBSD and OpenIndiana. http://trollius.readthedocs.org/en/latest/changelog.html 2014-07-30: Version 1.0.1 This release supports PyPy and has a better support of asyncio coroutines, especially in debug mode. Mike

On 22 December 2014 at 05:23, Ludovic Gasc <gmludo@gmail.com> wrote:
Perhaps the most effective thing anyone could do to make significant progress in the area of CPython performance is to actually get CodeSpeed working properly with anything other than PyPy, as automated creation of clear metrics like that can be incredibly powerful as a motivational tool (think about the competition on JavaScript benchmarks between different browser vendors, or the way the PyPy team use speed.pypy.org as a measure of their success in making new versions faster). Work on speed.python.org (as a cross-implementation counterpart to speed.pypy.org) was started years ago, but no leader ever emerged to drive the effort to completion (and even a funded development effort by the PSF failed to produce a running instance of the service). Another possible approach would be to create a JavaScript front end for PyPy (along the lines of the PyPy-based Topaz interpreter for Ruby, or the HippyVM interpreter for PHP), and make a serious attempt at displacing V8 at the heart of Node.js. (The Node.js build system for binary extensions is already written in Python, so why not the core interpreter as well? There's also the fact that Node.js regularly ends up running on no longer supported versions of V8, as V8 is written to meet the needs of Chrome, not those of the server-side JavaScript community). One key advantage of the latter approach is that the more general purpose PyPy infrastructure being competitive with the heavily optimised JavaScript interpreters created by the browser vendors on a set of industry standard performance benchmarks is much, much stronger evidence of PyPy's raw speed than being faster than the not-known-for-its-speed CPython interpreter on a set of benchmarks originally chosen specifically by Google for the Unladen Swallow project. Even with Topaz being one of the fastest Ruby interpreters, to the point of Oracle Labs using it as a relative benchmark for comparison of JRuby's performance in http://www.slideshare.net/ThomasWuerthinger/graal-truffle-ethdec2013, that's still relatively weak evidence for raw speed, since Ruby in general is also not well known for being fast. (Likewise, HippyVM being faster than Facebook's HHVM is impressive, but vulnerable to the same counter-argument that people make for Python and Ruby, "If you care about raw speed, why are you still using PHP?") Objective benchmarks and real world success stories are the kinds of things that people find genuinely persuasive - otherwise we're just yet another programming language community making self-promoting claims on the internet without adequate supporting evidence. 
( http://economics.sas.upenn.edu/~jesusfv/comparison_languages.pdf is an example of providing good supporting evidence that compares Numba and Cython, amongst many other alternatives, to the speed of raw C++ and FORTRAN for evaluation of a particular numeric model - given the benefits they offer in maintainability relative to the lower level languages, they fare extremely well on the speed front) As things stand, we have lots of folks wanting *someone else* to do the work to counter the inaccurate logic of "CPython-the-implementation tends to prioritise portability and maintainability over raw speed, therefore Python-the-language is inherently slow", yet very few volunteering to actually do the work needed to counter it effectively in a global sense (rather than within the specific niches currently targeted by the PyPy, Numba, and Cython development teams - those teams do some extraordinarily fine work that doesn't get the credit it deserves due to a mindset amongst many users that only CPython performance counts in cross-language comparisons). Regards, Nick. P.S. As noted earlier, a profiling and optimising HOWTO in the standard documentation set would also make a lot of sense as a way of making these alternatives more discoverable, but again, it needs a volunteer to write it (or at least an initial draft which then be polished in review on Reitveld). -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Ludovic, If I understand you correctly you would like type annotations to be used to speed up Python programs. You are not alone, but it's a really hard problem, because nobody wants their program to break. I think that eventually this will be possible, but in the short term, it would be a much, much bigger effort to create such a compiler than it is just to define (and reach agreement on) the syntax for annotations. I like to take smaller steps first, especially when it comes to defining syntax. Maybe once we have a standard type hinting notation someone will write a compiler. In the meantime, let me give you an example of why it's such a hard problem. Suppose I have created a library that contains a function that finds the longer of two strings:

    def longest(a, b):
        return a if len(a) >= len(b) else b

Now I start adding annotations, and I change this to:

    def longest(a: str, b: str) -> str:
        return a if len(a) >= len(b) else b

But suppose someone else is using the first version of my library and they have found it useful for other sequences besides strings, e.g. they call longest() with two lists. That wasn't what I had written the function for. But their program works. Should it be broken when the annotations are added? Who gets the blame? How should it be fixed? Now, I intentionally chose this example to be quite controversial, and there are different possible answers to these questions. But the bottom line is that this kind of situation (and others that can be summarized as "duck typing") makes it hard to create a Python compiler -- so hard that nobody has yet cracked the problem completely. Maybe once we have agreed on type hinting annotations someone will figure it out. In the meantime, I think "offline" static type checking such as done by mypy is quite useful in and of itself. Also, IDEs will be able to use the information conveyed by type hinting annotations (especially in stub files). So I am content with limiting the scope of the proposal to these use cases. Salut, --Guido On Sun, Dec 21, 2014 at 11:23 AM, Ludovic Gasc <gmludo@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido)
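To make the example above concrete, here is a hedged sketch of how the two callers fare once the annotations exist; the checker behaviour noted in the comments is what a mypy-style static analyser would be expected to report, not verbatim output from any particular tool:

    def longest(a: str, b: str) -> str:
        return a if len(a) >= len(b) else b

    # The annotated version still works for the original use case:
    print(longest("spam", "eggs"))          # "spam"

    # The third-party caller that passes lists keeps working at runtime,
    # because CPython does not enforce annotations...
    print(longest([1, 2, 3], [4, 5]))       # [1, 2, 3]

    # ...but an offline checker such as mypy would now flag the second
    # call as passing a list where str is expected, which is exactly the
    # "who gets the blame?" question raised above. A compiler that
    # specialised longest() for str would go further and actually break
    # that caller.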

On 22 December 2014 at 14:39, Guido van Rossum <guido@python.org> wrote:
I'd like to echo this particular sentiment, since my other replies could easily give the impression I disagreed. I'm actually optimistic that once this kind of annotation becomes popular, folks will find ways to make use of it for performance improvements (even if it's just priming a JIT cache to reduce warmup time). The only idea I disagree with is that it's necessary to *wait* for this feature for Pythonistas to start countering the "Python is slow" meme - the 3.5 release is still almost a year away, and the migration of the overall ecosystem from Python 2 to 3 is definitely still a work in progress. As a result, I believe that in the near term, it's better to look at the barriers to adoption for existing tools like PyPy, Numba and Cython and start figuring out whether those barriers can be reduced in some way. In the case of PyPy, deployment of web services using it can be awkward relative to CPython. A prebuilt image on DockerHub that includes uWSGI + the PyPy plugin + PyPy could be a good way to reduce barriers to entry for Linux based deployments. A separate prebuilt image on DockerHub could also be a good way to make it easy for folks to experiment with building web services based on the PyPy STM branch. Another barrier to PyPy adoption is the level of use of cffi vs hand crafted C extensions. While the inclusion of pip with CPython makes cffi more readily available, there's also in-principle agreement to incorporate it (and its dependencies) into the standard library. Folks interested in increasing PyPy's popularity may want to get in touch with the cffi developers and see what remaining barriers (other than available time) there are to moving forward with that proposal as a PEP for Python 3.5. (Standard library backports have more credibility as dependencies in many situations, since the standard library inclusion acts as a clear endorsement of the original project) For Numba and Cython, challenges are more likely to be centred around the process of getting started with the tools, and providing a suitable build environment. Again, Docker images can help, but in this case, by providing a preconfigured build environment that can either run in a local container on Linux, or under a Linux VM on Mac OS X or Windows. None of those Docker related ideas need to be filtered through the CPython core development team, though, nor even through the development teams of the individual projects. All it takes is someone sufficiently motivated to figure out how to do it, publish their work, and then approach the relevant project to say "Hey, I did this, does anyone else want to help promote and maintain it, and perhaps give it an official project blessing?". Unlike the Docker image ideas, putting together the cffi PEP *does* need involvement from both the developers of the cffi project and the rest of the CPython core development team, but if the cffi folks are amenable, it's conceivable someone else could do the actual PEP wrangling. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
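For anyone weighing cffi against hand-crafted C extensions, a minimal ABI-level sketch looks roughly like this; it assumes a POSIX platform, where dlopen(None) gives access to the standard C library:

    from cffi import FFI

    ffi = FFI()
    # Declare the C function we want to call, using plain C syntax.
    ffi.cdef("int printf(const char *format, ...);")

    # dlopen(None) loads the C library itself on most POSIX systems.
    C = ffi.dlopen(None)

    # char* arguments are passed as byte strings or cdata objects.
    arg = ffi.new("char[]", b"world")
    C.printf(b"hello, %s!\n", arg)

The same declarations work unchanged on PyPy, which is the point of preferring this style over the CPython C API for code that is meant to be portable across interpreters.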

On 21/12/14 02:15, Nick Coghlan wrote:
Numba is not very mature, but it can already JIT accelerate most of Python bytecode to speeds comparable to -O2 in C. And not just that: it does not need type hints, as it also has an "autojit" mode that can infer types at runtime. Why did Unladen Swallow fail where Numba succeeded? I think for two reasons: First, Numba was designed for a particular purpose: numerical computing. No offence to Google, but the numerics folks are the ones who really understand how to beat gigaflops out of the CPU. Numba was created by Travis Oliphant, who was also the creator of the modern NumPy package (as opposed to Jim Hugunin's original Numerics package and NASA's Numarray packages). Unladen Swallow made the mistake of trying to accelerate "everything". Second, LLVM has improved. At the time the swallow was hatched, LLVM was not really useful as a JIT compiler framework. In comparison to GCC it was even a lousy static compiler. Now it has matured and it is excellent as a JIT compiler as well as a static compiler. It is still a little behind GCC and Intel compilers on static optimization, but more than good enough to be the default compiler on MacOSX and FreeBSD. It should also be mentioned that Numba (or just NumbaPro?) integrates with CUDA, and it can actually run Python bytecode on the GPU. For certain data-parallel tasks it can make your Python program yield a teraflop with a modern Nvidia GPU. I also want to mention Cython. It has also shown that compiled Python can often run at the speed of C++. But its real merit is the ease with which we can write C extensions for Python. I think in the long run, CPython could benefit from Numba as well as Cython. Numba can be used to boost performance critical code without having to resort to C. Cython can e.g. be used to implement the standard library without hand-written C modules. It would lower the threshold for contribution as well as improve readability. Sturla
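For readers who have not tried Numba, a minimal sketch of the decorator-based usage looks roughly like this; the exact decorator options have shifted between releases, so treat the nopython flag as an assumption about the version in use:

    import numpy as np
    from numba import jit

    @jit(nopython=True)   # compile to machine code, no CPython object fallback
    def sum_of_squares(a):
        total = 0.0
        for i in range(a.shape[0]):
            total += a[i] * a[i]
        return total

    data = np.arange(1000000, dtype=np.float64)
    # The first call triggers type inference and JIT compilation;
    # subsequent calls run at close-to-C speed for this kind of loop.
    print(sum_of_squares(data))

Only the decorated function is accelerated; the rest of the program keeps running on the ordinary CPython interpreter, which is what makes this usable inside a larger application.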

Ludovic Gasc schrieb am 21.12.2014 um 00:55:
Regarding Cython, it's actually unlikely that it will bring anything. The proposed feature is about specifying Python object types for function arguments and (to a certain extent) local variables, and Cython is already pretty good at guessing those or doing optimistic optimisations that just say "if it looks like you're using a dict, let's generate special code that speeds up dict access and leaves other stuff to a fallback code path". The type semantics are even different for builtin types. When you type a variable as "dict" in Cython, it will reject subtypes as there's no use in typing a variable as "dict" if it can't be optimised as a dict due to potential subtype overrides. The normal optimistic "fast for dicts, works for all objects" code that Cython generates here is exactly as fast as code that would allow only dict or subtypes. Typing a variable as an arbitrary-size Python "int" is also useless as there is no advantage Cython can draw from it, but even typing a variable as Python "float", which is identical to a C double, isn't helpful due to the semantic difference for subtypes. Cython could be slightly improved to also generate special-casing low-level code paths for Python integer and floating point operations, but then, why waste time doing that when simply typing a variable as plain C int or double gives you full native speed without any of the hassle of runtime type checks, bloated special-casing code paths, etc.? And typing variables with user-implemented Python classes is equally useless as there are no special optimisations that Cython could apply to them. So, from the POV of Cython, I do see the advantage of a syntax that integrates with Python's own syntax (Cython's "pure Python mode" [1] has provided similar typing support for years now), but please don't expect any performance advantage from this. Stefan [1] http://docs.cython.org/src/tutorial/pure.html
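For reference, the pure Python mode mentioned above keeps the typing in ordinary Python syntax; a minimal sketch (assuming a reasonably recent Cython) looks something like this, and the same file still runs unmodified under plain CPython, where the decorators are no-ops:

    import cython

    @cython.locals(n=cython.int, i=cython.int, total=cython.double)
    def harmonic(n):
        # When compiled by Cython, i and total become C variables and the
        # loop body runs at C speed; under plain CPython nothing changes.
        total = 0.0
        for i in range(1, n + 1):
            total += 1.0 / i
        return total

    if cython.compiled:
        print("running as a compiled extension module")
    else:
        print("running as plain Python")
    print(harmonic(1000000))

Note that the speed here comes from the C-level int and double declarations, not from annotating anything as a Python-level type, which is Stefan's point above.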

On 24/12/14 12:45, Stefan Behnel wrote:
Numba is also very similar to Cython here. There is a set of types and expressions for which it can produce an efficient code path, and otherwise it falls back to calling the CPython interpreter. Psyco also used a similar strategy. This is precisely where Unladen Swallow went astray: they tried to accelerate every code path. Cython's advantage is that we can mix "down to the iron" C or C++ with Python, without having to use or know the Python C API. And of course Cython is not only for speed. It is also for making it easier to write C or C++ extensions for any conceivable reason. Sturla

On 12/24/2014 06:40 AM, Sturla Molden wrote:
My thought is that making Python easier to multi-process on multi-core CPUs will be where the biggest performance gains will be. Think of 100-core chips as soon as 5 or 6 years from now. (Doubling about every two years, and you can get 16-core CPUs now.) It's not uncommon to have multi-term expressions where each term can be processed independently of the others: result = foo(e1, e2, ... eN), where each expression "eN" may be a call to other functions, which may also have multiple terms. If it can be determined at parse time that they can't possibly affect each other, then these terms can be evaluated in parallel, and bytecode, or compiled code, that does that can be generated. (I'm not sure it's doable, but it seems like it should be possible.) And... the same type of data dependence graph needed to do that can aid in checking the correctness of code, which is the near-term benefit. Cheers, Ron
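Whether a compiler could ever prove that independence safely is an open question, but the execution model being described can at least be sketched by hand today with concurrent.futures; foo, e1 and e2 below are made-up stand-ins for the independent terms in the example above:

    from concurrent.futures import ProcessPoolExecutor

    def e1():
        # Stand-in for an expensive, side-effect-free sub-expression.
        return sum(i * i for i in range(2000000))

    def e2():
        # Another independent sub-expression.
        return sum(i * 3 for i in range(2000000))

    def foo(x, y):
        return x + y

    if __name__ == "__main__":
        # Sequential form:  result = foo(e1(), e2())
        # If e1 and e2 provably cannot affect each other, they can be
        # evaluated in parallel before foo() combines the results:
        with ProcessPoolExecutor() as pool:
            f1 = pool.submit(e1)
            f2 = pool.submit(e2)
            result = foo(f1.result(), f2.result())
        print(result)

The hard part, of course, is the "provably cannot affect each other" step, which is exactly the dependency analysis being discussed.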

On 12/26/2014 11:13 AM, Antoine Pitrou wrote:
Which is silly: 100 cores, or making Python easier to multi-process? The 5 or 6 years figure is my optimistic expectation for high-end workstations and servers. Double that time for the typical desktop, and maybe triple that for wearable devices. Currently you can get 8-core high-end desktop systems, and up to 18-core workstations with Windows 8. They probably run Python too. I think the unknown is how much time it will take, not whether or not it will happen. Cheers, Ron

I think the 5-6 year estimate is pessimistic. Take a look at http://en.wikipedia.org/wiki/Xeon_Phi for some background. I have it on a sort of nudge-and-wink authority that Intel already has in-house chips with 128 cores, and has distributed prototypes to a limited set of customers/partners. More good reasons to look at PyPy-STM, which has reached the stage of "useful" I think. On Fri, Dec 26, 2014 at 1:03 PM, Ron Adam <ron3200@gmail.com> wrote:
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On Fri, 26 Dec 2014 13:11:19 -0700 David Mertz <mertz@gnosis.cx> wrote:
I think the 5-6 year estimate is pessimistic. Take a look at http://en.wikipedia.org/wiki/Xeon_Phi for some background.
"""Intel Many Integrated Core Architecture or Intel MIC (pronounced Mick or Mike[1]) is a *coprocessor* computer architecture""" Enough said. It's not a general-purpose chip. It's meant as a competitor against the computational use of GPU, not against traditional general-purpose CPUs. Regards Antoine.

On Fri, Dec 26, 2014 at 1:39 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Yes and no: The cores of Intel MIC are based on a modified version of the P54C design, used in the original Pentium. The basis of the Intel MIC architecture is to leverage x86 legacy by creating an x86-compatible multiprocessor architecture that can utilize existing parallelization software tools. Programming tools include OpenMP, OpenCL, Cilk/Cilk Plus and specialised versions of Intel's Fortran, C++ and math libraries. x86 is pretty general purpose, but also yes it's meant to compete with GPUs too. But also, there are many projects--including Numba--that utilize GPUs for "general computation" (or at least to offload much of the computation). The distinctions seem to be blurring in my mind. But indeed, as many people have observed, parallelization is usually non-trivial, and the presence of many cores is a far different thing from their efficient utilization.
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On Dec 26, 2014, at 23:05, David Mertz <mertz@gnosis.cx> wrote:
I think what we're eventually going to see is that optimized, explicit parallelism is very hard, but general-purpose implicit parallelism is pretty easy if you're willing to accept a lot of overhead. When people start writing a lot of code that takes 4x as much CPU but can run on 64 cores instead of 2 and work with a dumb ring cache instead of full coherence, that's when people will start selling 128-core laptops. And it's not going to be new application programming techniques that make that happen, it's going to be things like language-level STM, implicit parallelism libraries, kernel schedulers that can migrate low-utilization processes into low-power auxiliary cores, etc.

On Sat, Dec 27, 2014 at 01:28:14AM +0100, Andrew Barnert wrote:
I disagree. PyParallel works fine with existing programming techniques: Just took a screen share of a load test between normal Python 3.3 release build, and the debugged-up-the-wazzo flaky PyParallel 0.1-ish, and it undeniably crushes the competition. (Then crashes, 'cause you can't have it all.) https://www.youtube.com/watch?v=JHaIaOyfldo Keep in mind that's a full debug build, but not only that, I've butchered every PyObject and added like, 6 more 8-byte pointers to it; coupled with excessive memory guard tests at every opportunity that result in a few thousand hash tables being probed to check for ptr address membership. The thing is slooooooww. And even with all that in place, check out the results:

Python33:
  Running 10s test @ http://192.168.1.15:8000/index.html
    8 threads and 64 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency    13.69ms   11.59ms   27.93ms   52.76%
      Req/Sec    222.14    234.53     1.60k    86.91%
    Latency Distribution
       50%    5.67ms
       75%   26.75ms
       90%   27.36ms
       99%   27.93ms
    16448 requests in 10.00s, 141.13MB read
    Socket errors: connect 0, read 7, write 0, timeout 0
  Requests/sec:   1644.66
  Transfer/sec:     14.11MB

PyParallel v0.1, exploiting all cores:
  Running 10s test @ http://192.168.1.15:8080/index.html
    8 threads and 8 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency     2.32ms    2.29ms   27.57ms   92.89%
      Req/Sec    540.82    154.01     0.89k    75.34%
    Latency Distribution
       50%    1.68ms
       75%    2.00ms
       90%    3.57ms
       99%   11.26ms
    40828 requests in 10.00s, 350.47MB read
  Requests/sec:   4082.66
  Transfer/sec:     35.05MB

~2.5 times improvement even with all its warts. And it's still not even close to being loaded enough -- 35% of a gigabit link being used and about half core use. No reason it couldn't do 100,000 requests/s. Recent thread on python-ideas with a bit more information: https://mail.python.org/pipermail/python-ideas/2014-November/030196.html Core concepts: https://speakerdeck.com/trent/pyparallel-how-we-removed-the-gil-and-exploite... Trent.

On Dec 27, 2014, at 8:34, Trent Nelson <trent@snakebite.org> wrote:
Then what are you disagreeing with? My whole point is that it's not going to be new application programming techniques that make parallelism accessible.
Sure, sloooooww code that's 8x as parallel runs 2.5x as fast. What's held things back for so long is that people insist on code that's almost as fast on 1- or 2-core machines and also scales to 8-core machines. That silly constraint is what's held us back. And now that mainstream machines are 2 to 8 cores instead of 1 to 2, and the code you have to be almost as fast as is still sequential, things are starting to change. Even when things like PyParallel or PyPy's STM aren't optimized at all, they're already winning.

On Fri, 26 Dec 2014 14:03:42 -0600 Ron Adam <ron3200@gmail.com> wrote:
This :-)
The 5 or 6 years figure is my optimistic expectation for high end workstations and servers.
I don't see how that's optimistic. Most workloads are intrinsically serial, not parallel. Expecting to get a 100-core general purpose CPU is expecting to get something that's unfit for most daily tasks, which is rather pessimistic. If the industry had followed the enthusiastic predictions from 5 years ago, the average desktop CPU would probably have 16+ HW threads right now - which it doesn't: the average core count stagnates between 2 and 4. Sure, some specific workloads in scientific computing may benefit - but if I understand correctly you can already release the GIL using Cython, and perhaps soon using Numba. Besides the serial nature of most workloads, there are other limits to multicore scalability, such as DRAM access latency and bandwidth. There's little point in having 100 CPU cores if they all compete for memory access as executing multiple threads simultaneously reduces the locality of accesses and therefore the efficiency of on-chip caches. Regards Antoine.
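On the "release the GIL" point, that is indeed already possible today. As a rough, hedged sketch of the pattern in Numba (the nogil option is real, but whether it pays off for a given workload is an assumption to be verified by measurement):

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor
    from numba import jit

    @jit(nopython=True, nogil=True)
    def chunk_sum(a):
        # Compiled code that releases the GIL while it runs, so several
        # threads can execute it truly in parallel.
        total = 0.0
        for i in range(a.shape[0]):
            total += a[i]
        return total

    data = np.random.rand(8000000)
    chunks = np.array_split(data, 4)

    with ThreadPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(chunk_sum, chunks))
    print(sum(partials))

For a memory-bandwidth-bound loop like this one, the DRAM limits mentioned above may cap the speedup well below 4x, which rather supports the point being made.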

Antoine Pitrou schrieb am 26.12.2014 um 21:37:
I second this. Parallelisation continues to be difficult (and most often impossible or close enough), except for the trivial cases. As long as that holds, large multi-core chips will remain special purpose. Also, don't forget Knuth's famous quote on premature optimisation. In ~97% of cases, (non-trivial) parallelisation is simply not needed, and thus is better not done. Stefan

On 12/26/2014 02:37 PM, Antoine Pitrou wrote:
Depends on what and how it's used.
This is changing. It's a feedback loop, as new hardware becomes available, software engineers learn to take advantage of it, and as they do, it drives the market... and then more hardware improvements are added by hardware engineers. Hey, they need a pay check, and investors need dividends, and there are enough people interested to make everyone think it's worth doing. <shrug> It's a huge economic machine we are talking about, and it's moving forward. I doubt we could stop it if we wanted to. So it's a process that is changing across the board. If you think of one part of that staying the same, either the hardware or the software, the other parts may seem "silly", but if you think of it all changing at once, it starts to make more sense. I wanted to find a good example, and the first thing that came to mind is the ever present web browser. That is probably the one piece of software that is run by more users directly than any other software program. A quick search brought up an interesting talk by Jack Moffitt about the Mozilla SERVO project and the Rust programming language. It's a bit long but interesting. At one point, (18:30 to 19:45 minute marks), he mentions a tree design where you spawn threads for each child, and they spawn threads for each of their children. http://www.infoq.com/presentations/servo-parallel-browser That of course is the software engineering side of things, and then you also have the hardware engineering side, but both sides are actively being developed at the same time.
Yes, I think it got off to a slower start than many expected. And with only a few cores, it makes more sense to distribute processes rather than threads. But I think this will change as the number of cores increases, and the techniques to use them also develop. Until we have the kind of fine-grained threading Jack Moffitt was describing.
A web browser is about as mainstream as you can get. And its presence is big enough to drive the computer market wherever it goes. ;-)
It will be interesting to see how this changes. ;-) Cheers, Ron

On Fri, 26 Dec 2014 19:40:25 -0600 Ron Adam <ron3200@gmail.com> wrote:
Yeah, so what? Spawning threads isn't hard, exploiting parallelism is.
Yes, I think it got off to a slower start than many expected.
It didn't "get off to a slower start". It's actually perfectly in line with predictions of people like me or Linus Torvalds :-)
A web browser is about as mainstream as you can get.
Then let's talk about it again when a web browser manages to fully exploit 100 threads without thrashing CPU caches. Regards Antoine.

On 12/27/2014 04:00 AM, Antoine Pitrou wrote:
On Fri, 26 Dec 2014 19:40:25 -0600 Ron Adam <ron3200@gmail.com> wrote:
Yeah, so what? Spawning threads isn't hard, exploiting parallelism is.
I think you are using a stricter definition of parallelism here. If many programs spawn many threads, and those threads are distributed across many cores, then you have a computer that gains from having more cores. Yes, there are a lot of technical difficulties with doing that efficiently. The point I was trying to make at the start of this is that type hints can be used to identify parts of programs that don't depend on other parts, and that some of those parts can be run in parallel. Basically it involves converting a sequential program into a dependency tree, and allowing non-interdependent child nodes to run in parallel. The type hints may help with this. As for the technical details of building a chip with 100 cores and how to manage the caches for them, I'm happy to let the chip engineers work on that. I do believe we will see 100-core chips in a few years regardless of whether or not it makes complete sense to do so. Keep in mind that other languages may be able to take advantage of them much more easily than Python can, including newly emerging languages built specifically to use parallelism more effectively.
Yes, I think it got off to a slower start than many expected.
It didn't "get off to a slower start". It's actually perfectly in line with predictions of people like me or Linus Torvalds:-)
"... than many expected." It wasn't surprising to me either.
A web browser is about as mainstream as you can get.
Then let's talk about it again when a web browser manages to fully exploit 100 threads without thrashing CPU caches.
How about 80% utilisation with only 20% thrashing? ;-) Cheers, Ron

On Sat, 27 Dec 2014 14:09:59 -0600 Ron Adam <ron3200@gmail.com> wrote:
How about 80% utilisation with only 20% thrashing? ;-)
Well, how about you give actual examples instead of throwing numbers around. Otherwise let's wait 5 other years to see how the prediction of 100-core general purpose CPUs pans out. Regards Antoine.

Antoine Pitrou <solipsis@pitrou.net> wrote:
I don't see how that's optimistic. Most workloads are intrinsically serial, not parallel.
Computer graphics is intrinsically data-parallel, hence the GPU. A computer with a 128-core CPU would have no use for a GPU. Taking more market share from Nvidia and AMD would be one reason why Intel might want to produce such a chip. It would also remove the need for dedicated video RAM and special vertex buffer objects, and thus simplify the coding of 3D graphics.
You can do this under Numba too. Sturla

On Sat, Dec 27, 2014 at 01:54:02AM +0000, Sturla Molden wrote:
Or, to put it another way, a computer with a GPU has no need for a 128-core CPU. I think it is more likely that general-purpose desktop computers will start using GPUs than that they will start using 100+ core CPUs. Apart from graphics, and some "embarrassingly parallel" tasks, most tasks are inherently serial. E.g. it takes 9 months to make a baby, you can't do it in 1 month by using 9 women or 100 chickens. Even those which aren't inherently serial usually have some serial components, and Amdahl's Law puts an upper bound on how much of a performance increase you can get by parallelising it: http://en.wikipedia.org/wiki/Amdahl's_law I expect that the number of cores used by general-purpose desktops will increase very slowly. It makes sense for servers to use as many cores as possible, since they typically run many CPU-bound tasks in parallel, but that doesn't apply so much to desktops and it certainly doesn't apply to wearable computers. The idea of a wearable computer using a general-purpose CPU with 100 cores strikes me as sheer fantasy: most of the cores will be idling all of the time, the effect on battery life will be terrible, and the heat generated prohibitive. TL;DR I expect that two or four cores will remain standard for desktop computers for a long time, and one-core machines will still be around for a while. Massively parallel computing will remain niche, the GIL is not really the bottleneck some people think it is, and when it is a bottleneck, existing ways of working around it are still very effective. -- Steven
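To put rough numbers on that bound, a tiny sketch of Amdahl's Law itself (the 90% parallel fraction is purely an illustrative assumption):

    def amdahl_speedup(parallel_fraction, n_cores):
        # S(n) = 1 / ((1 - p) + p / n)
        return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)

    # Even a program that is 90% parallelisable tops out at 10x,
    # no matter how many cores you throw at it.
    for cores in (2, 4, 16, 100, 10000):
        print(cores, round(amdahl_speedup(0.90, cores), 2))
    # Prints roughly: 2 1.82, 4 3.08, 16 6.4, 100 9.17, 10000 9.99

The flattening between 16 and 100 cores is the reason adding cores to a mostly-serial workload stops paying off so quickly.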

Sturla Molden writes:
Taking more market shared from Nvidia and AMD would be one reason why Intel might want to produce such a chip.
You've put your finger on a very important class of reasons. And if history is any guide, they will become cheaper quickly. We should not underestimate the ability of consumers to be impressed by easily grasped numbers like "cores" and "GHz" once price comes within reach.

Antoine Pitrou <solipsis@pitrou.net> wrote:
I don't see how that's optimistic. Most workloads are intrinsically serial, not parallel.
Henry Ford invented a solution to parallelization of many repetitive serial tasks 100 years ago. His solution is known as a conveyor belt or pipeline. If you can split up a serial task into a series of smaller subtasks, you can chain them as a pipeline of worker threads. It often shows up in signal processing and multimedia. Take e.g. a look at the design of VTK. You also have it in e.g. asynchronous i/o if you use threads and queues instead of coroutines to set up a pipeline. Then there is a big class of data-parallel tasks, such as e.g. in computer graphics. You e.g. have more than a million pixels on a screen, and each pixel must be processed independently. MapReduce is also a buzzword that describes a certain data-parallel task. You also find it in scientific computing, e.g. in linear algebra where we use libraries like BLAS and LAPACK. Then there are the ForkJoin tasks, a modern buzzword for a certain type of divide and conquer. A classical example is the FFT. Mergesort would be another example. Take a look at a statement like a = [foobar(y) for y in sorted(x)] Here we have a data-parallel iteration over sorted(x) and the evaluation of sorted(x) is fork-join parallel. Is it unthinkable that a future compiler could figure this out on its own? Sturla
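A compiler that spots this automatically remains hypothetical, but the data-parallel reading of that statement can already be spelled out by hand; foobar below is a made-up placeholder for whatever independent per-element work is intended:

    from concurrent.futures import ProcessPoolExecutor

    def foobar(y):
        # Placeholder for some independent per-element computation.
        return y * y

    if __name__ == "__main__":
        x = [5, 3, 8, 1, 9, 2]

        # Serial form:         a = [foobar(y) for y in sorted(x)]
        # Data-parallel form:  farm each element out to a worker process;
        # the order of the results still matches sorted(x).
        with ProcessPoolExecutor() as pool:
            a = list(pool.map(foobar, sorted(x)))
        print(a)

The fork-join evaluation of sorted(x) itself is not shown here; a hypothetical parallelising compiler would have to handle both halves, which is what makes the question interesting.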
participants (14)
- Andrew Barnert
- Antoine Pitrou
- Chris Angelico
- David Mertz
- Guido van Rossum
- Ludovic Gasc
- Mike Müller
- Nick Coghlan
- Ron Adam
- Stefan Behnel
- Stephen J. Turnbull
- Steven D'Aprano
- Sturla Molden
- Trent Nelson