On 7/26/2010 7:36 AM, Guido van Rossum wrote:
> According to CSP advocates, this approach will break down when you
> need more than 8-16 cores, since cache coherence breaks down at 16
> cores. Then you would have to figure out a message-passing approach
> (but the messages would have to be very fast).

Catching up on Python-Dev after 3 months of travel (lucky me!), so apologies for a "blast from the past" as I'm 6 weeks late in replying here.

Think of the hardware implementation of cache coherence as a MIL: a memory interleave lock, or a micro-interpreter lock (the hardware is interpreting what the compiled software is doing).

That is not so different from Python's GIL, just at a lower level.

I didn't read the CSP advocacy papers, but experience with early parallel systems at CMU, Tandem Computers, and Teradata strongly implies that multiprocessing of some sort will always be able to scale larger than memory-coherent cores -- if the application can be made parallel at all.

It is interesting to note that all the parallel systems mentioned above implemented fast message passing hardware of various sorts (affected by available technologies of their times).

It is also interesting to note the similarities between some of the extreme multi-way cache-coherence approaches and the various message-passing hardware... some of the papers that talk about exceeding 16 cores were going down a message-passing road to achieve it.  Maybe something new has been discovered in the last 8 years, since I've not been following the research... the only thing I've read in that area in the last 8 years is the loss of Jim Gray at sea... but the IEEE paper you posted later seems to confirm my suspicion that there has not yet been a breakthrough.

The point of the scalability remark, though, is that while lots of problems can be solved on a multi-core system, problems also grow bigger, and there will likely always be problems that cannot be solved on a multi-core (single cache coherent memory) system.  Those problems will require message passing solutions.  Experience with the systems above has shown that switching from a multi-core (semaphore based) design to a message passing design is usually a rewrite.

Perhaps the existence of the GIL, forcing a message-passing solution to be created early, is a blessing in disguise for the design of large-scale applications.  For years I've been hearing about problems for which the data is too large to share and the calculation is too complex to parallelize, but once the available hardware is exhausted as the problem grows, the only path to larger scale is message-passing parallelism... forcing a redesign of applications that outgrew the available hardware.
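To make the point concrete, here is a minimal sketch of the message-passing style the GIL nudges you toward, using the stdlib multiprocessing module.  The sum-of-squares workload and the chunking scheme are invented for illustration; the structure (workers own their data, all communication goes through queues) is the part that matters, because it survives a move from one box to many.

```python
# Message-passing parallelism: workers share nothing and communicate
# only through queues, so a rewrite is not needed when the problem
# outgrows a single cache-coherent machine.
from multiprocessing import Process, Queue

def worker(inbox, outbox):
    # Each worker owns its chunk; no shared memory, no semaphores.
    for chunk in iter(inbox.get, None):        # None is the stop sentinel
        outbox.put(sum(x * x for x in chunk))  # stand-in for real work

def main(nworkers=4):
    inbox, outbox = Queue(), Queue()
    procs = [Process(target=worker, args=(inbox, outbox))
             for _ in range(nworkers)]
    for p in procs:
        p.start()
    # Strided chunks of 0..999; any partitioning of the data works.
    chunks = [list(range(i, 1000, nworkers)) for i in range(nworkers)]
    for chunk in chunks:
        inbox.put(chunk)
    for _ in procs:
        inbox.put(None)                        # one sentinel per worker
    total = sum(outbox.get() for _ in chunks)  # one result per chunk
    for p in procs:
        p.join()
    return total

if __name__ == "__main__":
    print(main())  # same answer as the serial sum of squares
```

The overhead here is the pickling and copying of each chunk through the queues, which is exactly the message-passing tax mentioned below; the payoff is that nothing in the design assumes a single coherent memory.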

That said, applications that do fit in available hardware generally can run a little faster with some sort of shared memory approach: message passing does have overhead.

I have CDO. It's like OCD, but in alphabetical order. The way it should be!
(a Facebook group is named this, except for a misspelling.)