Integer micro-benchmarks [Smalltalk Multi-Precision Numerics vis-a-vis Highly Optimized Fixed Size Integer C++ Numerics]

Andrea Ferro AndreaF at UrkaDVD.it
Thu Apr 26 19:17:22 EDT 2001


"David Simmons" <pulsar at qks.com> wrote in message
news:unqF6.19479$Jh5.19552912 at news1.rdc1.sfba.home.com...
>
...
> Better still for our true testing purposes of raw numerics performance, what
> a hand coded assembly routine would do. I probably should hand code an
> assembly version since I'm really expert in that area and issues of
> pipelining -- I have often had to do this for small hot-spot areas where a
> C++ (or other static language) compiler just wasn't satisfactory.

There's not much data cache work involved here. It's most register stuff. And
code is so tiny that should go to cache at first pass.

I'm talking assembler/C/C++ here, not smalltalk, obviously.

You can trick around the triangle algorithm by changing it to (x*x+x)>>1 but
this is just a bit too much to ask from any optimizer (beside the problem of
overflow).

I've posted the VC7 generated assembly (removed all the SEGMENT, ASSUME and
other meta stuff) just a couple of minutes ago.

> > However note: Smalltalk always has all the code at hand.
>
> As you probably know, unlike C++ the Smalltalk compiler can't optimize
> (inline) messages away because it can't rely on method definitions being
> invariant. I.e., between the time a method is compiled and when it is
> invoked, the methods it references may be replaced with new versions. Thus,
> even if type inferencing could ascertain the correct binding information
> (which is possible in this example), it doesn't help.
>
> The only place where type information and bindings can be reliably and
> safely inlined, given a dynamic and adaptive versioning execution
> architecture, is in the JIT which has the knowledge and timing to know when
> an inlined method has become stale and regenerate (dynamically) the inline
> references.
>
> This kind of virtual machine execution architecture not only means that
> adaptive compilation leads to more version/change tolerant systems, but also
> means that, in the case where there are variants, the JIT can use
> heuristically acquired variant knowledge to (simplify and) optimize for
> actual use cases which a "pure" static compilation analysis could not.

That is exactly why I would expect smalltalk to be either slower than it is (if
it does things over again as it did in the 70s) or much faster. Someone was
surprised to see your SmallScript benchmarks. I was not. I know nothing of
SmallScript, but that it is being done right now. And that means that, having
the time and budget to do it at state of the art, it can actually do VERY FAST.

Pushing your reasoning to the extreme it could find out at run time that
triangle is invariant for invariant arguments and once it called it once it
could optimize the loop with a multiplication. However there's a balance to put
in the equation. That kind of heuristic would be itself expensive and not all
programs are just tight loops. This takes me to a curiosity of mines.

Fiew years ago I heard rumours of experiments to add some sort of type awareness
to Smalltalk. Now, making smalltalk statically typed is a nonsense. But
*allowing* type specification of parameters, return types and some other
semantic aids (like the concept of constness to indicate a method is not
supposed to modify the object) could maybe enable very optimized execution. What
is the *current* smalltalk community position on this sort of reasoning.

/DisclaimerOn
I do not want to start a flame on this. I'm not suggesting in any way that
Smalltalk should be typed. I'm just asking what reasoning is behind NOT even
allowing that sort of meta information. Smalltalk now has namespaces and parcels
and protocols and other usefull stuff that was no-no for Smaltalkers in the
past. An *optional* meta info that tells to the JIT and to the compiler that a
method has no side effects or that it returns an object guaranteed to be of a
given class should not actually change the language. Or does it?
/DisclaimerOff

> Global optimization in VC7 is one of the new features I had been waiting
> for -- I just hadn't tried to compile the AOS Platform VM with it because it
> is too bleeding edge for me rely on -- I've been bitten too many times that
> way -- particularly with VS releases. I use VS7 where required to work with
> .NET related IDE operations, C#, VB, and debugging.
>
> Note that just because it is in a different compilation unit, the compiler
> actually does have all the code in hand and my understanding is that it can
> also do some level of analysis on the binary-machine code within a library.
> I'm really surprised that "Microsoft's" C++ compiler was sophisticated
> enough at doing invariant analysis to the point of computing the entire
> result at compile time. I guess all the Microsoft dollars spent by their
> research group on advanced languages (including ML) has paid off and was
> incorporated -- very sweet -- I had not tested for this case.

It does it on triangle. Then it can do it on a loop that sums triangle results
(say 9000000 times). If you put the third loop on that ... somehow it doesnt. So
it is not perfect. No clue why. But still it's pretty good.

For example with this code (same if you put it in different files or all in
one):

int triangle( int x )
{
 int result = 0;
 for (int i = 1; i <= x; ++i)
  result += i;
 return result;
}
int loop()
{
 int sum=0;
 for (int i = 0; i != 9000000; ++i)
  sum += triangle( 10 );
 return sum;
}
int callLoop()
{
 int tot = 0;
 for ( int j=0; j!=10; ++j )
  tot += (loop() / 1000000);
 return tot;
}

the assembly generated is:

?triangle@@YIHH at Z PROC NEAR    ; triangle, COMDAT
 mov ecx, 1
 xor eax, eax
 cmp edx, ecx
 jl SHORT $L274
 npad 5
$L272:
 add eax, ecx
 inc ecx
 cmp ecx, edx
 jle SHORT $L272
$L274:
 ret 0
?triangle@@YIHH at Z ENDP     ; triangle

?loop@@YIHXZ PROC NEAR     ; loop, COMDAT
 mov eax, 495000000    ; 1d8119c0H
 ret 0
?loop@@YIHXZ ENDP     ; loop

?callLoop@@YIHXZ PROC NEAR    ; callLoop, COMDAT
 xor eax, eax
 mov ecx, 10     ; 0000000aH
$L280:
 add eax, 495    ; 000001efH
 dec ecx
 jne SHORT $L280
 ret 0
?callLoop@@YIHXZ ENDP     ; callLoop


Go figure why it did not precompute 4950 in callLoop!!


> I would add that I've actually seen VS6 do seemingly similarly impressive
> static analysis and reduction -- especially with inlining --including in
> early VS6 before SP2 releases being too agressive with register coloring and
> getting it wrong ;-).  It begins to rival some of the better FORTRAN
> optimizers -- but there are much better hardcore C++ compilers and VS6. I
> don't know about VS7, although, given some of the Microsoft.NET platform
> numerics I can see why it might have been worth investing a lot in the
> static analysis :(.

hehe. Yes. VC6 had problems in coloring. And I guess VC7 is not perfect yet too.
But MS is really doing great stuff in this area. And I guess .NET virtual
environment is also not bad.

> A reasonable implementation of an adapative inlining JIT should have been
> able to analyze the entire Smalltalk code set into a single computed
> result -- and minimally should have been able to convert the 10 triangle
> invariant away by moving out of the loop.  It is probably worth obtaining a
> copy of SELF and trying it -- or seeing how this performs in Java on a
> HotSpot VM.

hmmmm I'm still to keep up with Smalltalk "news". I'll take self some time in
the future. I told you, no matter how much I like this sort of languages I
couldn't yet "sell" their use here in Italy. But maybe that's because I'm doing
too much embedded stuff!

> In my tests I wrote the C++ code as a series of functions. I built this as a
> DLL and then invoked it directly from SmallScript.
>
> static int triangle(int nInnerSum);
> extern "C" DLLExport
> void TimeTriangle(int nLoops, int nInnerSumInterations);
>
> Where the central body of TimeTriangle was:
>
> {
> ...
>         for(int i = 0; i != nLoops; ++i)
>             sum += triangle(nInnerSumInterations);
> ...
> }
>
> I'm assuming that VS7 would move (hoist) the triangle() invocation outside
> the loop and substitute its computed value.

For some reason it is not. I maybe overlooking something: it's 1:15 am :-)

--

Andrea Ferro

---------
Brainbench C++ Master. Scored higher than 97% of previous takers
Scores: Overall 4.46, Conceptual 5.0, Problem-Solving 5.0
More info http://www.brainbench.com/transcript.jsp?pid=2522556






More information about the Python-list mailing list