[Python-ideas] Type Hinting - Performance booster ?
Trent Nelson
trent at snakebite.org
Sat Dec 27 08:34:33 CET 2014
On Sat, Dec 27, 2014 at 01:28:14AM +0100, Andrew Barnert wrote:
> On Dec 26, 2014, at 23:05, David Mertz <mertz at gnosis.cx> wrote:
>
> > On Fri, Dec 26, 2014 at 1:39 PM, Antoine Pitrou
> > <solipsis at pitrou.net> wrote:
> >> On Fri, 26 Dec 2014 13:11:19 -0700 David Mertz <mertz at gnosis.cx>
> >> wrote:
> >> > I think the 5-6 year estimate is pessimistic. Take a look at
> >> > http://en.wikipedia.org/wiki/Xeon_Phi for some background.
> >>
> >> """Intel Many Integrated Core Architecture or Intel MIC (pronounced
> >> Mick or Mike[1]) is a *coprocessor* computer architecture"""
> >>
> >> Enough said. It's not a general-purpose chip. It's meant as a
> >> competitor against the computational use of GPU, not against
> >> traditional general-purpose CPUs.
> >
> > Yes and no:
> >
> > The cores of Intel MIC are based on a modified version of P54C
> > design, used in the original Pentium. The basis of the Intel MIC
> > architecture is to leverage x86 legacy by creating an x86-compatible
> > multiprocessor architecture that can utilize existing
> > parallelization software tools. Programming tools include OpenMP,
> > OpenCL, Cilk/Cilk Plus and specialised versions of Intel's Fortran,
> > C++ and math libraries.
> >
> > x86 is pretty general-purpose, but yes, it's also meant to compete
> > with GPUs. At the same time, there are many projects--including
> > Numba--that utilize GPUs for "general computation" (or at least to
> > offload much of the computation). The distinctions seem to be
> > blurring in my mind.
> >
> > But indeed, as many people have observed, parallelization is usually
> > non-trivial, and the presence of many cores is a far different thing
> > from their efficient utilization.
>
> I think what we're eventually going to see is that optimized, explicit
> parallelism is very hard, but general-purpose implicit parallelism is
> pretty easy if you're willing to accept a lot of overhead. When people
> start writing a lot of code that takes 4x as much CPU but can run on
> 64 cores instead of 2 and work with a dumb ring cache instead of full
> coherence, that's when people will start selling 128-core laptops. And
> it's not going to be new application programming techniques that make
> that happen, it's going to be things like language-level STM, implicit
> parallelism libraries, kernel schedulers that can migrate
> low-utilization processes into low-power auxiliary cores, etc.
I disagree. PyParallel works fine with existing programming techniques.

Just took a screen capture of a load test pitting a normal Python 3.3
release build against the debugged-up-the-wazoo, flaky PyParallel
0.1-ish, and it undeniably crushes the competition. (Then crashes,
'cause you can't have it all.)
https://www.youtube.com/watch?v=JHaIaOyfldo
Keep in mind that's a full debug build. Not only that, I've butchered
every PyObject and added something like six more 8-byte pointers to it,
coupled with excessive memory guard tests at every opportunity, which
result in a few thousand hash tables being probed to check for pointer
address membership.

The thing is slooooooww. And even with all that in place, check out the
results:
Python33:

    Running 10s test @ http://192.168.1.15:8000/index.html
      8 threads and 64 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency    13.69ms   11.59ms  27.93ms   52.76%
        Req/Sec   222.14    234.53     1.60k    86.91%
      Latency Distribution
         50%    5.67ms
         75%   26.75ms
         90%   27.36ms
         99%   27.93ms
      16448 requests in 10.00s, 141.13MB read
      Socket errors: connect 0, read 7, write 0, timeout 0
    Requests/sec:   1644.66
    Transfer/sec:     14.11MB
PyParallel v0.1, exploiting all cores:

    Running 10s test @ http://192.168.1.15:8080/index.html
      8 threads and 8 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency     2.32ms    2.29ms  27.57ms   92.89%
        Req/Sec   540.82    154.01     0.89k    75.34%
      Latency Distribution
         50%    1.68ms
         75%    2.00ms
         90%    3.57ms
         99%   11.26ms
      40828 requests in 10.00s, 350.47MB read
    Requests/sec:   4082.66
    Transfer/sec:     35.05MB
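(That's wrk output above, by the way. I don't have the exact invocation
handy, but it would have been something along these lines:

    wrk --threads 8 --connections 64 --duration 10s --latency \
        http://192.168.1.15:8000/index.html

with --connections 8 against port 8080 for the PyParallel run.)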
That's a ~2.5x improvement even with all its warts. And it's still not
even close to being loaded enough -- only about 35% of a gigabit link
being used, and about half the cores. No reason it couldn't do
100,000 requests/s.
Recent thread on python-ideas with a bit more information:
https://mail.python.org/pipermail/python-ideas/2014-November/030196.html
Core concepts: https://speakerdeck.com/trent/pyparallel-how-we-removed-the-gil-and-exploited-all-cores
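Rough sketch of the programming model, for flavour (this is from the
deck above; the 0.1 API is still in flux, so treat the names
async.server/async.register/async.run and the data_received signature
as approximate rather than final):

    import async

    class Echo:
        # Protocol callbacks run in parallel contexts on whichever
        # core the IO completion lands on -- no GIL involved.
        def data_received(self, transport, data):
            # Whatever you return gets written back out on the socket.
            return data

    server = async.server('0.0.0.0', 8080)
    async.register(transport=server, protocol=Echo)
    async.run()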
Trent.