multi-core software

Dave Angel davea at ieee.org
Sun Jun 7 21:00:32 EDT 2009


Lew wrote:
> Scott David Daniels wrote:
>> the nub of the problem is not in the benchmarks.  There is something
>> to be said for the good old days when you looked up the instruction
>> timings that you used in a little document for your machine, and could
>> know the cost of any loop.  We are faster now, but part of the cost of
>> that speed is that timing is a black art.
>
> Those good old days never existed.  Those manuals never accounted for 
> things that affected timing even then, like memory latency or refresh 
> time.  SRAM cache made things worse, since the published timings never 
> mentioned cache-miss delays.  Though memory cache might seem a recent 
> innovation, it's been around a while.  It would be challenging to find 
> any published timing since the commercialization of computers that 
> would actually tell the cost of any loop.
>
> Things got worse when chips like the '86 family acquired multiple 
> instructions for doing loops, still worse when pre-fetch pipelines 
> became deeper and wider, absolutely Dark Art due to multi-level memory 
> caches becoming universal, and 
> throw-your-hands-up-and-leave-for-the-corner-bar with multiprocessor 
> NUMA systems.  OSes and high-level languages complicate the matter - 
> you never know how much time slice you'll get or how your source got 
> compiled or optimized by run-time.
>
> So the good old days are a matter of degree and self-deception - it 
> was easier to fool ourselves then that we could at least guess timings 
> proportionately if not absolutely, but things have definitely become 
> more unpredictable as the hardware has evolved.
>
Nonsense.  The 6502 with static memory was precisely predictable, and 
many programmers (working in machine language, naturally) counted on 
it.  Similarly the Novix 4000, when programmed in its native Forth.
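To make that concrete, here is a minimal sketch of the cycle counting
those programmers did, using the published 6502 timings (LDY immediate =
2 cycles, DEY = 2, BNE = 3 when taken / 2 when it falls through, no page
crossing); the Python below just does the bookkeeping:

    # Worked example: total cycle count for a classic 6502 delay loop,
    #       LDY #count
    # loop: DEY
    #       BNE loop
    # using the published per-instruction timings (no page-boundary crossing).

    LDY_IMM = 2        # LDY #imm
    DEY = 2            # DEY
    BNE_TAKEN = 3      # branch taken, same page
    BNE_NOT_TAKEN = 2  # branch falls through

    def delay_loop_cycles(count):
        """Cycles for LDY #count / DEY / BNE loop on a 6502 with static RAM."""
        body = (count - 1) * (DEY + BNE_TAKEN)   # every pass but the last branches back
        last = DEY + BNE_NOT_TAKEN               # final pass falls through
        return LDY_IMM + body + last

    print(delay_loop_cycles(10))   # 51 cycles

On a 1 MHz part with static RAM, those 51 cycles are exactly 51
microseconds, every single time.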

And prior to that, I worked on several machines (in fact, I wrote the 
assembler and debugger for two of them) where the only variable was the 
delay every two milliseconds for dynamic memory refresh.  Separate 
control memory and data memory, and every instruction precisely 
clocked.  No instruction prefetch, no cache memory.  What you see is 
what you get.

Would I want to go back there?  No.  Sub-megahertz clocks with much less 
happening on each clock mean we were operating at way under 0.01% of 
present-day speed.
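
To put rough numbers on that, assume something like a 1 MHz machine
averaging a few cycles per instruction against a 3 GHz core retiring a
couple of instructions per cycle (the figures are illustrative
assumptions, not measurements):

    # Back-of-the-envelope comparison behind the "under 0.01%" figure.
    old_ips = 1e6 / 3        # ~1 MHz clock, roughly 3 cycles per instruction
    new_ips = 3e9 * 2        # ~3 GHz clock, a couple of instructions per cycle

    ratio = old_ips / new_ips
    print(f"{ratio:.4%}")    # about 0.0056% -- comfortably under 0.01%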
