Integer micro-benchmarks [Smalltalk Multi-Precision Numerics vis-a-vis Highly Optimized Fixed Size Integer C++ Numerics]

David Simmons pulsar at qks.com
Tue Apr 24 22:11:38 EDT 2001


"Andrea Ferro" <AndreaF at UrkaDVD.it> wrote in message
news:9c4sjf$qo$1 at serv1.iunet.it...
...nice stuff snipped for brevity...
> >
> > On a 1.2GHz Athlon processor with:
> >     512 MB memory
> >     Dual Raid ATA 100 IBM Hard drives
> >     Windows 2000 Professional Service Pack 2
>
> I'm on a PIII 700 with Win2K Pro SP1. Memory, disk and similar stuff
> should not be influential for these tests.

I provided this so folks would know the basics of the machine configuration
when they read the numbers.  The memory size matters only insofar as it
indicates whether virtual-memory pressure might be a factor. The raid drive
comments are not pertinent to this post, but in general they make a big
difference for virtual-memory performance, application launch times, and
Visual Studio compilation speeds.

>
> > I ran two different tests. The first one loops to generate "only"
> > SmallInteger sums; the (900,000) loop. The second test is based on the
> > originally posted sample and performs the 1,000,000 count loop; and thus
> > generates a significant number of large integers.
>
> Makes sense (especially after reading your correction in the other post)!
>
> > For each test I launched the application, ran it once and discarded the
> > result and shutdown. Then I repeated this same procedure three more
> > times and used the mean of the three runs.
>
> It shouldn't have been much different! I was not that paranoid: I just ran
> the tests.

The main reason for launching and quitting was to jiggle the OS memory
management and the "working sets" of other applications. This can affect
code loading and hardware cache behavior, as well as process/thread-slicing
interrupts for the active foreground application.

Also, all things being equal, the application will load the second time into
the same physical memory space and the cpu caches will still contain fresh
data. This allows for more consistent runs.

It would have been an unfair test to launch, say, VisualWorks (or Dolphin)
once, run, and then quit. Relative to SmallScript's size and launch times,
they both have many more elements in their IDEs/images and thus use
significantly more resources. In VisualWorks' case I wanted to ensure it had
every chance to warm its caches.

Although I didn't explain it carefully, I also ran the VisualWorks and
Dolphin code two or three times within a single session for a given image
before taking readings -- to warm up any hardware caches as well as their
JIT caches. I did not do this for SmallScript on the AOS VM, I simply ran it
as a script which launches a new process each time.

Possibly SmallScript might have been faster if I had done a couple of runs
before taking a reading -- but I was more interested in a from-scratch
script-execution number for the SmallScript results. However, the cache
impact should only have made a significant difference in the 10,000,000 run
case -- where GC and allocation might come into play. On the cited platform,
the basic SmallScript AOS Platform VM lifecycle, "launch to quit + some
script", is on the order of 30-50 ms.

Basically I wanted to compare the Smalltalk systems in the mode I expected
them to be most commonly used in. Since SmallScript scripts will generally
execute by launching as a single-run process, or via its DLL loaded within
some parent process and then unloaded each time, it was important to me to
measure it accordingly. This is a different mode from traditional Smalltalk
IDE images such as we see in products like VisualWorks and Dolphin.

>
> > Unoptimized C++ code: (Visual Studio 6 Service Pack 5 w/processor pack 1)
> >     730ms for loop over 900,000
> >     808ms for loop over 1,000,000
> >
> > Best optimized C++ inlined code: (Visual Studio 6 Service Pack 5
> > w/processor pack 1)
> >     332ms for loop over 900,000
> >         2.1988 : 1 Unoptimized C++ Ratio
> >     374ms for loop over 1,000,000
> >         2.1604 : 1 Unoptimized C++ Ratio
>
> I'm using the VC7 beta 1 compiler. The number of switches is huge. It does
> not make much sense to time it "non optimized", since that's probably just
> "minimally optimized", with "minimally" having a huge variance. I know VC7
> is better than VC6, and probably I'm much better at getting the best
> optimizations from a C++ compiler (I'm a C++ expert) than you are, but my
> timings (on a supposedly slower machine) for the best optimization are
> terribly different. To get better results I had to run the loop 10 times
> more than you and disable some optimizations!!! Read ahead.

I'm not sure I followed what you're trying to say here. I certainly have VS7
and the corresponding C++ compiler for it. But I'd be surprised if VS7
generated dramatically different code than VS6 for this simple function
example unless you've turned on global optimization and enabled static
analysis and computation that resulted in some loop hoisting -- in which
case I'd expect much more than a factor of 10 difference.

Background comment: I should mention that I work in C++ (for a variety of
projects) probably at least as much as Smalltalk, and use it extensively in
the VM architecture. I would say that I'm reasonably expert in all aspects
of the language (including templates) and exploit many of its features in
the AOS Platform VM design.  I'm certainly familiar with most if not all of
its esoterica from the language standard, and I'm intimate with the quirks,
extensions, implementation and performance issues of the Visual Studio
family's C++ compilers.

So if there is a problem with the C++ number it is more likely due to a real
error :-( on my part rather than a lack of understanding of C++. Or possibly
just problems in Visual Studio's code generation vis-a-vis what a better
static optimizer would produce.

Better still, for our true testing purposes of raw numerics performance,
would be what a hand-coded assembly routine would do. I probably should
hand-code an assembly version, since I'm really expert in that area and in
issues of pipelining -- I have often had to do this for small hot-spot areas
where a C++ (or other static language) compiler just wasn't satisfactory.

>
> > **
> > Variations in the 3rd decimal place of above numbers are the result of
> > windows performance counter precision and OS process/thread slicing
> > variations.
>
> I attacked this by looping longer. Read ahead.
>
> > For the C++ code and SmallScript the QueryPerformanceCounter call was
> > used to obtain the millisecond timings. Presumably similar timers are
> > used within VisualWorks and Dolphin (but I don't know).
> >
> > NOTE: Use of GetTickCount() has rounding loss that can result in
> > reporting of times which are up to 20ms less than actual time.
> > **
>
> I suspect some smalltalk implementations may use that. But it really does
> not matter if you do it so many times.

Sure, you're right, that's just good statistical analysis.

I should probably make the somewhat lame comment that my goal in publishing
the numbers was not to carry out the same level of effort and review I would
apply to a formal technical paper on the topic -- such as one might read in
an OOPSLA proceeding. I was not trying to be particularly rigorous. To be
fair, I'm quite busy with SmallScript and related matters, and taking the
hours to do the tests and report them approaches the limit of my regular
flex time for this kind of stuff under my schedule.

>
> > To get a sense of what further adaptive inline compilation can achieve,
> > I note that SmallScript on the v4 AOS Platform VM will execute the
> > 900,000 loop case in 1,215ms when the triangle method is inlined. If we
> > assumed that the JIT/compiler was capable of hoisting the invariant
> > calculation of 10 triangle out of the loop, then we would see a time of
> > 95ms for SmallScript on the v4 AOS Platform VM.
>
> Well. I'm not that expert at Smalltalk. But I'll tell you how I did my
> test and you can eventually suggest what I should do.
>
> > I should also point out that this kind of test represents a *worst-case*
> > type of scenario for Smalltalk (dynamic/script language) performance
> > (where it is handling arbitrary/multi-precision arithmetic) vis-a-vis
> > statically typed and highly optimized C++ code performing fixed-size
> > integer truncated arithmetic.
>
> Yes. I know. And in fact I was surprised that it really is this fast,
> besides the fact that C++ is much faster than you think on a good
> implementation.
>
> > Cincom VisualWorks 5i3NC
> >     1,457ms for loop over 900,000
> >         1.9959 : 1 Unoptimized C++ Ratio:
> >         4.3886 : 1 Optimized C++ Ratio
> >     1,789ms for loop over 1,000,000
> >         2.4507 : 1 Unoptimized C++ Ratio:
> >         5.3886 : 1 Optimized C++ Ratio
>
> This is the one I'm using. And what I did is add the following to Integer:
>
>   Integer>>triangle
>         | result |
>         result := 0.
>         1 to: self do: [ :i | result := result + i].
>         ^result
>
> then selecting the following in a workspace and "print it"
>
>     |sum time|
>     sum := 0.
>     time := Time millisecondsToRun: [
>            9000000 timesRepeat: [
>                 sum := sum + 10 triangle]].
>     Array with: sum with: time.
>
> It did print:
>
> #(495000000 2849)
>
> Just to get rid of the little variations in measurements I tried this:
>
>
>     |sum tot time|
>     tot := 0.
>     time := Time millisecondsToRun: [
>         10 timesRepeat: [
>             sum := 0.
>             9000000 timesRepeat: [
>                 sum := sum + 10 triangle].
>         tot := tot + (sum / 1000000) ]].
>     Array with: tot with: time.
>
> And got:
>
> #(4950 28810)

Minor nit that should have made no difference: you put the totals
calculation inside the millisecond timing loop. Ahh, never mind. I see you
did the same with the subsequent C++ code.

>
> On the C++ side I tried this:
>
> #include <windows.h>
> #include <iostream>
>
> int triangle( int x );
>
> int main()
> {
>  LARGE_INTEGER start, end, freq;
>
>  QueryPerformanceFrequency(&freq);
>  int tot = 0;
>  QueryPerformanceCounter(&start);
>  for ( int j=0; j<10; j++ ) {
>   int sum=0;
>   for (int i = 0; i != 9000000; ++i)
>    sum += triangle( 10 );
>   tot+=sum / 1000000;
>  }
>  QueryPerformanceCounter(&end);
>
>  __int64 delta = end.QuadPart - start.QuadPart;
>  __int64 mils = delta * 1000 / freq.QuadPart;
>
>  std::cout << "Time = " << mils <<  " result=" << tot << std::endl;
> }
>
> and I had a second file with this:
>
> int triangle( int x )
> {
>  int result = 0;
>  for (int i = 1; i <= x; ++i)
>   result += i;
>  return result;
> }
>
> Despite it being in a different compilation unit, I had VC7 with all
> optimizations on optimize everything away and just precompute the final
> result and call std::cout with it as a constant!!!! So I disabled "Global
> Optimization" (that is, I told it not to optimize across compilation
> units) and here we go:
>
> Time = 5585 result=4950
>
> The order of magnitude is the same as your case.
>
> However note: Smalltalk always has all the code at hand.

As you probably know, unlike C++ the Smalltalk compiler can't optimize
(inline) messages away because it can't rely on method definitions being
invariant. I.e., between the time a method is compiled and when it is
invoked, the methods it references may be replaced with new versions. Thus,
even if type inferencing could ascertain the correct binding information
(which is possible in this example), it doesn't help.

The only place where type information and bindings can be reliably and
safely inlined, given a dynamic and adaptive versioning execution
architecture, is in the JIT, which has the knowledge and timing to detect
when an inlined method has become stale and to regenerate (dynamically) the
inlined references.

This kind of virtual machine execution architecture not only means that
adaptive compilation leads to more version/change tolerant systems, but also
means that, in the case where there are variants, the JIT can use
heuristically acquired variant knowledge to (simplify and) optimize for
actual use cases which a "pure" static compilation analysis could not.

> C++ generally compiles one module at a time. If I tell VC7 to optimize
> across compilation units, it goes and inlines the triangle function and
> ... it even optimizes it out by precomputing the result!!

Global optimization in VC7 is one of the new features I had been waiting
for -- I just hadn't tried to compile the AOS Platform VM with it because it
is too bleeding-edge for me to rely on -- I've been bitten too many times
that way, particularly with VS releases. I use VS7 where required to work
with .NET-related IDE operations, C#, VB, and debugging.

Note that, even though it is in a different compilation unit, the compiler
actually does have all the code in hand, and my understanding is that it can
also do some level of analysis on the binary machine code within a library.
I'm really surprised that "Microsoft's" C++ compiler was sophisticated
enough at invariant analysis to compute the entire result at compile time. I
guess all the Microsoft dollars spent by their research group on advanced
languages (including ML) have paid off and been incorporated -- very
sweet -- I had not tested for this case.

>
> Note that it is REALLY smart! I played with it and I could get it to a
> point where it did not optimize the call out completely, but noted that
> triangle had no side effects and always returned the same result for the
> same input, so ... it translated
>
>  for ( int j=0; j<10; j++ ) {
>   int sum=0;
>   for (int i = 0; i != 9000000; ++i)
>    sum += triangle( 10 );
>   tot+=sum / 1000000;
>  }
>
> into something like
>
>  tot+= triangle( 10 )*9;
>
> Not bad for an optimizer!

I would add that I've actually seen VS6 do seemingly similarly impressive
static analysis and reduction -- especially with inlining -- including early
VS6 (before the SP2 releases) being too aggressive with register coloring
and getting it wrong ;-).  It begins to rival some of the better FORTRAN
optimizers -- but there are much better hardcore C++ compilers than VS6. I
don't know about VS7, although, given some of the Microsoft .NET platform
numerics, I can see why it might have been worth investing a lot in the
static analysis :(.

A reasonable implementation of an adaptive-inlining JIT should have been
able to reduce the entire Smalltalk code set to a single computed result --
and minimally should have been able to hoist the invariant 10 triangle out
of the loop.  It is probably worth obtaining a copy of SELF and trying it --
or seeing how this performs in Java on a HotSpot VM.

However, at the end of the day, the test we were originally interested in
was not a comparison of static-compilation invariant analysis. After all,
static analysis should be able to compute the entire problem at compile
time -- which, surprisingly, VS7 was actually capable of doing. That was the
point of my comment in my original post that a SmallScript JIT with adaptive
inlining, invoking the call with a single argument (the nLoops parameter),
would have run the entire test in 95ms.

In my tests I wrote the C++ code as a series of functions. I built this as a
DLL and then invoked it directly from SmallScript.

static int triangle(int nInnerSum);
extern "C" DLLExport
void TimeTriangle(int nLoops, int nInnerSumInterations);

Where the central body of TimeTriangle was:

{
...
        for(int i = 0; i != nLoops; ++i)
            sum += triangle(nInnerSumInterations);
...
}

I'm assuming that VS7 would move (hoist) the triangle() invocation outside
the loop and substitute its computed value. So, build the test with VS6 or
one of the GNU compilers. Or see if you can turn off the loop-invariant
analyzer, introduce some volatile constructs, or place nInnerSumInterations
into a shared global to throw off the compiler's aliasing analyzer.

The static analysis and subsequent computation are not testing the actual
numeric-performance issues we were interested in, although the loop-invariant
hoisting is.

-- Dave S.

>
> --
>
> Andrea Ferro
>
> ---------
> Brainbench C++ Master. Scored higher than 97% of previous takers
> Scores: Overall 4.46, Conceptual 5.0, Problem-Solving 5.0
> More info http://www.brainbench.com/transcript.jsp?pid=2522556




