Integer micro-benchmarks [Smalltalk Multi-Precision Numerics vis-a-vis Highly Optimized Fixed Size Integer C++ Numerics]

Andrea Ferro AndreaF at UrkaDVD.it
Tue Apr 24 17:43:33 EDT 2001


Foreword: I have actually never had a chance to use Smalltalk. Several years ago
I personally purchased a Smalltalk/V box just to have a look at this language,
as part of my never-satisfied curiosity, but I never succeeded in selling its
use professionally. A few months ago my curiosity led me to subscribe to this
NG just to read the posts. I currently have no idea where that old box is. But
this thread intrigues me.

Second Foreword: I have just seen the other post from David (the one with the
explanation of SmallIntegers), but I'm replying to this post because it's the
one that made me go and run my own test session.

"David Simmons" <pulsar at qks.com> wrote in message
news:SC4F6.17406$Jh5.17803347 at news1.rdc1.sfba.home.com...
> This benchmark is actually a reasonable test of compiler (loop) optimization
> and integer numerics optimizations.
>
> On a 1.2GHz Athlon processor with:
>     512 MB memory
>     Dual Raid ATA 100 IBM Hard drives
>     Windows 2000 Professional Service Pack 2

I'm on a PIII 700 with Win2K Pro SP1. Memory, disk and similar things should
have no influence on these tests.

> I ran two different tests. The first one loops to generate "only"
> SmallInteger sums; the (900,000) loop. The second test is based on the
> originally posted sample and performs the 1,000,000 count loop; and thus
> generates a significant number of large integers.

Makes sense (especially after reading your correction in the other post)!

> For each test I launched the application, ran it once and discarded the
> result and shutdown. Then I repeated this same procedure three more times
> and used the mean of the three runs.

It shouldn't have made much difference! I was not that paranoid: I just ran the
tests.

> Unoptimized C++ code: (Visual Studio 6 Service Pack 5 w/processor pack 1)
>     730ms for loop over 900,000
>     808ms for loop over 1,000,000
>
> Best optimized C++ inlined code: (Visual Studio 6 Service Pack 5 w/processor
> pack 1)
>     332ms for loop over 900,000
>         2.1988 : 1 Unoptimized C++ Ratio
>     374ms for loop over 1,000,000
>         2.1604: 1 Unoptimized C++ Ratio

I'm using the VC7 beta 1 compiler. The number of switches is huge. It does not
make much sense to time the "non-optimized" build, since that is probably just
"minimally optimized", with "minimally" having a huge variance. I know VC7 is
better than VC6, and I am probably better at getting the best optimizations out
of a C++ compiler (I'm a C++ expert) than you are, but my timings (on a
supposedly slower machine) for the best optimization are terribly different. To
get stable results I had to run the loop 10 times longer than you did and
disable some optimizations!!! Read on.

> **
> Variations in the 3rd decimal place of above numbers are the result of
> windows performance counter precision and OS process/thread slicing
> variations.

I attacked this by looping longer. Read on.

> For the C++ code and SmallScript the QueryPerformanceCounters call was used
> to obtain the millisecond timings. Presumably similar timers are used within
> VisualWorks and Dolphin (but I don't know).
>
> NOTE: Use of GetTickCount() has rounding loss that can result in reporting
> of times which are up to 20ms less than actual time.
> **

I suspect some Smalltalk implementations may use that. But it really does not
matter if you repeat the loop enough times.

> To get a sense of what further adaptive inline compilation can achieve I
> note that SmallScript on the v4 vAOS Platform VM will execute the 900,000
> loop case in 1,215ms when the triangle method is inlined. If we assumed that
> the JIT/compiler was capable of hoisting the invariant calculation of 10
> triangle out of the loop then we would see a time of 95ms for SmallScript on
> the v4 AOS Platform VM.

Well, I'm not that much of a Smalltalk expert. But I'll tell you how I did my
test, and you can then suggest what I should do differently.

> I should also point out that this kind of test represents a *worst-case*
> type of scenario for Smalltalk (dynamic/script language) performance (where
> it is handling arbitrary/multi-precision arithmetic) vis-a-vis statically
> typed and highly optimized C++ code performing fixed size integer truncated
> arithmetic.

Yes, I know. And in fact I was surprised that it really is this fast, beside the
fact that C++ with a good implementation is much faster than you think.

> Cincom VisualWorks 5i3NC
>     1,457ms for loop over 900,000
>         1.9959 : 1 Unoptimized C++ Ratio:
>         4.3886 : 1 Optimized C++ Ratio
>     1,789ms for loop over 1,000,000
>         2.4507 : 1 Unoptimized C++ Ratio:
>         5.3886 : 1 Optimized C++ Ratio

This is the one I'm using. What I did was add the following method to Integer:

  Integer>>triangle
        | result |
        result := 0.
        1 to: self do: [ :i | result := result + i].
        ^result

then I selected the following in a workspace and did a "print it":

    | sum time |
    sum := 0.
    time := Time millisecondsToRun: [
        9000000 timesRepeat: [
            sum := sum + 10 triangle]].
    Array with: sum with: time.

It did print:

#(495000000 2849)

Just to get rid of the little variations in measurement, I tried this:


    | sum tot time |
    tot := 0.
    time := Time millisecondsToRun: [
        10 timesRepeat: [
            sum := 0.
            9000000 timesRepeat: [
                sum := sum + 10 triangle].
            tot := tot + (sum / 1000000) ]].
    Array with: tot with: time.

And got:

#(4950 28810)

On the C++ side I tried this:

#include <windows.h>
#include <iostream>

int triangle( int x );

int main()
{
    LARGE_INTEGER start, end, freq;

    QueryPerformanceFrequency( &freq );
    int tot = 0;
    QueryPerformanceCounter( &start );
    for ( int j = 0; j < 10; j++ ) {
        int sum = 0;
        for ( int i = 0; i != 9000000; ++i )
            sum += triangle( 10 );
        tot += sum / 1000000;    // same scaling as the Smalltalk version
    }
    QueryPerformanceCounter( &end );

    __int64 delta = end.QuadPart - start.QuadPart;
    __int64 mils  = delta * 1000 / freq.QuadPart;

    std::cout << "Time = " << mils << " result=" << tot << std::endl;
}

and I had a second file with this:

int triangle( int x )
{
    int result = 0;
    for ( int i = 1; i <= x; ++i )
        result += i;
    return result;
}

Despite triangle being in a different compilation unit, VC7 with all
optimizations on optimized everything away, just precomputed the final result
and passed it to std::cout as a constant!!!! So I disabled "Global
Optimization" (that is, I told it not to optimize across compilation units) and
here we go:

Time = 5585 result=4950

The order of magnitude is the same as your case.

However, note: Smalltalk always has all the code at hand, while C++ generally
compiles one module at a time. If I tell VC7 to optimize across compilation
units, it inlines the triangle function and ... it even optimizes it out
entirely by precomputing the result!!

Note that it is REALLY smart! I played with it and got it to a point where it
did not optimize the call out completely, but it noticed that triangle has no
side effects and always returns the same result for the same input, so ... it
translated

 for ( int j = 0; j < 10; j++ ) {
     int sum = 0;
     for ( int i = 0; i != 9000000; ++i )
         sum += triangle( 10 );
     tot += sum / 1000000;
 }

into something like

 for ( int j = 0; j < 10; j++ )
     tot += triangle( 10 ) * 9;

(since sum is always 9,000,000 * triangle(10), the division by 1,000,000 folds
into the constant factor 9).

Not bad for an optimizer!

--

Andrea Ferro

---------
Brainbench C++ Master. Scored higher than 97% of previous takers
Scores: Overall 4.46, Conceptual 5.0, Problem-Solving 5.0
More info http://www.brainbench.com/transcript.jsp?pid=2522556
