[Python-ideas] Type Hinting - Performance booster ?

Sat Dec 27 08:34:33 CET 2014

On Sat, Dec 27, 2014 at 01:28:14AM +0100, Andrew Barnert wrote:
> On Dec 26, 2014, at 23:05, David Mertz <mertz at gnosis.cx> wrote:
> 
> > On Fri, Dec 26, 2014 at 1:39 PM, Antoine Pitrou
> > <solipsis at pitrou.net> wrote:
> >> On Fri, 26 Dec 2014 13:11:19 -0700 David Mertz <mertz at gnosis.cx>
> >> wrote:
> >> > I think the 5-6 year estimate is pessimistic.  Take a look at
> >> > http://en.wikipedia.org/wiki/Xeon_Phi for some background.
> >> 
> >> """Intel Many Integrated Core Architecture or Intel MIC (pronounced
> >> Mick or Mike[1]) is a *coprocessor* computer architecture"""
> >> 
> >> Enough said. It's not a general-purpose chip. It's meant as a
> >> competitor against the computational use of GPU, not against
> >> traditional general-purpose CPUs.
> > 
> > Yes and no:
> > 
> > The cores of Intel MIC are based on a modified version of P54C
> > design, used in the original Pentium. The basis of the Intel MIC
> > architecture is to leverage x86 legacy by creating a x86-compatible
> > multiprocessor architecture that can utilize existing
> > parallelization software tools. Programming tools include OpenMP,
> > OpenCL, Cilk/Cilk Plus and specialised versions of Intel's Fortran,
> > C++ and math libraries.
> >  
> > x86 is pretty general purpose, but also yes it's meant to compete
> > with GPUs too.  But also, there are many projects--including
> > Numba--that utilize GPUs for "general computation" (or at least to
> > offload much of the computation).  The distinctions seem to be
> > blurring in my mind.
> > 
> > But indeed, as many people have observed, parallelization is usually
> > non-trivial, and the presence of many cores is a far different thing
> > from their efficient utilization.
> 
> I think what we're eventually going to see is that optimized, explicit
> parallelism is very hard, but general-purpose implicit parallelism is
> pretty easy if you're willing to accept a lot of overhead. When people
> start writing a lot of code that takes 4x as much CPU but can run on
> 64 cores instead of 2 and work with a dumb ring cache instead of full
> coherence, that's when people will start selling 128-core laptops. And
> it's not going to be new application programming techniques that make
> that happen, it's going to be things like language-level STM, implicit
> parallelism libraries, kernel schedulers that can migrate
> low-utilization processes into low-power auxiliary cores, etc.

I disagree.  PyParallel works fine with existing programming techniques:

I just took a screen capture of a load test comparing a normal Python 3.3
release build against the debugged-up-the-wazoo, flaky PyParallel 0.1-ish,
and PyParallel undeniably crushes the competition.  (Then it crashes,
'cause you can't have it all.)

        https://www.youtube.com/watch?v=JHaIaOyfldo

Keep in mind that the PyParallel side is a full debug build.  Not only
that, I've butchered every PyObject and added, like, 6 more 8-byte
pointers to it, and there are excessive memory guard tests at every
opportunity, which result in a few thousand hash tables being probed to
check for pointer address membership.

The thing is slooooooww.  And even with all that in place, check out the
results:

Python33:

Running 10s test @ http://192.168.1.15:8000/index.html

  8 threads and 64 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    13.69ms   11.59ms  27.93ms   52.76%
    Req/Sec   222.14    234.53     1.60k    86.91%
  Latency Distribution
     50%    5.67ms
     75%   26.75ms
     90%   27.36ms
     99%   27.93ms
  16448 requests in 10.00s, 141.13MB read
  Socket errors: connect 0, read 7, write 0, timeout 0

Requests/sec:   1644.66
Transfer/sec:     14.11MB

PyParallel v0.1, exploiting all cores:

Running 10s test @ http://192.168.1.15:8080/index.html

  8 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.32ms    2.29ms  27.57ms   92.89%
    Req/Sec   540.82    154.01     0.89k    75.34%
  Latency Distribution
     50%    1.68ms
     75%    2.00ms
     90%    3.57ms
     99%   11.26ms
  40828 requests in 10.00s, 350.47MB read

Requests/sec:   4082.66
Transfer/sec:     35.05MB

That's a ~2.5x improvement in requests/sec, even with all its warts.
And it's still not even close to being loaded enough -- 35% of a
gigabit link being used and the cores only about half utilized.  No
reason it couldn't do 100,000 requests/s.
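
As a rough sanity check on the figures above, the arithmetic can be
sketched in a few lines of Python.  This is only a back-of-the-envelope
calculation, not part of the benchmark: the function names are made up
for illustration, and whether wrk's "MB" means 10^6 or 2^20 bytes is an
assumption, so both are tried.

```python
# Figures copied from the two wrk runs quoted above.
BASELINE_RPS = 1644.66         # Python 3.3, Requests/sec
PYPARALLEL_RPS = 4082.66       # PyParallel v0.1, Requests/sec
PYPARALLEL_MB_PER_S = 35.05    # PyParallel v0.1, Transfer/sec
TARGET_RPS = 100000            # the "no reason it couldn't" figure

GIGABIT = 10**9                # link capacity in bits per second


def speedup(new_rps, old_rps):
    """Hypothetical helper: throughput ratio between two runs."""
    return new_rps / old_rps


def link_utilization(mb_per_s, bytes_per_mb, link_bits_per_s=GIGABIT):
    """Hypothetical helper: fraction of the link used by the bytes read.

    bytes_per_mb is an assumption about the unit (10**6 or 2**20).
    """
    return mb_per_s * bytes_per_mb * 8 / link_bits_per_s


print("speedup:            {:.2f}x".format(
    speedup(PYPARALLEL_RPS, BASELINE_RPS)))
print("link use (MB=10^6): {:.1%}".format(
    link_utilization(PYPARALLEL_MB_PER_S, 10**6)))
print("link use (MB=2^20): {:.1%}".format(
    link_utilization(PYPARALLEL_MB_PER_S, 2**20)))
print("still needed:       {:.1f}x".format(
    speedup(TARGET_RPS, PYPARALLEL_RPS)))
```

The ratio comes out at about 2.48, in line with the ~2.5x above.  The
link figure depends on the unit assumption and presumably counts only the
bytes wrk reports as read, so it sits somewhat below what the wire
actually carries once TCP/IP and Ethernet framing are included.
Reaching 100,000 requests/s would take roughly another 24x on top of
the PyParallel number.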

Recent thread on python-ideas with a bit more information:

https://mail.python.org/pipermail/python-ideas/2014-November/030196.html

Core concepts: https://speakerdeck.com/trent/pyparallel-how-we-removed-the-gil-and-exploited-all-cores

        Trent.