Integer micro-benchmarks [Smalltalk Multi-Precision Numerics vis-a-vis Highly Optimized Fixed Size Integer C++ Numerics]

Mon Apr 23 21:26:10 EDT 2001

This benchmark is actually a reasonable test of compiler (loop) optimization
and integer numerics optimizations.

On a 1.2GHz Athlon processor with:
    512 MB memory
    Dual Raid ATA 100 IBM Hard drives
    Windows 2000 Professional Service Pack 2

I ran two different tests. The first loops 900,000 times and generates
"only" SmallInteger sums. The second is based on the originally posted
sample and loops 1,000,000 times, and thus generates a significant number
of large integers.

For each test I launched the application, ran it once, discarded the
result, and shut down. I then repeated this same procedure three more times
and used the mean of those three runs.
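
The procedure above can be sketched in C++ roughly as follows. This is a
minimal sketch, not the harness used for these numbers: it repeats the work
inside one process instead of relaunching the application, it uses
std::chrono::steady_clock as a portable stand-in for the platform timers,
and the names time_once_ms and mean_after_warmup_ms are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Time a single invocation of `work`, in milliseconds.
double time_once_ms(const std::function<void()>& work) {
    using clock = std::chrono::steady_clock;  // monotonic clock (assumed stand-in)
    const auto start = clock::now();
    work();
    const auto stop = clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Run `work` once and discard that timing, then return the mean of
// `measured_runs` further timings (three in the procedure described above).
double mean_after_warmup_ms(const std::function<void()>& work,
                            int measured_runs = 3) {
    assert(measured_runs > 0);
    time_once_ms(work);  // warm-up run, result discarded
    double total_ms = 0.0;
    for (int i = 0; i < measured_runs; ++i) {
        total_ms += time_once_ms(work);
    }
    return total_ms / measured_runs;
}
```

Passing the benchmark loop as `work` yields one mean figure per test;
relaunching the process between runs, as described above, additionally
removes any state carried over from the previous run.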

Unoptimized C++ code: (Visual Studio 6 Service Pack 5 w/processor pack 1)
    730ms for loop over 900,000
    808ms for loop over 1,000,000

Best optimized C++ inlined code: (Visual Studio 6 Service Pack 5 w/processor
pack 1)
    332ms for loop over 900,000
        2.1988 : 1 Unoptimized C++ Ratio
    374ms for loop over 1,000,000
        2.1604 : 1 Unoptimized C++ Ratio
**
Variations in the third decimal place of the above numbers are the result
of Windows performance counter precision and OS process/thread slicing
variations.

For the C++ code and SmallScript, the QueryPerformanceCounter call was used
to obtain the millisecond timings. Presumably similar timers are used within
VisualWorks and Dolphin (but I don't know).

NOTE: GetTickCount() has rounding loss that can result in reported times
which are up to 20ms less than the actual time.
**

In the following tables, "shorter" times and "smaller" ratios are better.

In summary, a Smalltalk VM (without using adaptive type-based inlining) can
achieve roughly a 4:1 ratio of performance relative to highly optimized C++
code performing "pure" integer numerics. SmallScript's v4 AOS Platform
VM/Jitter *has not* been aggressively tuned for numerics or similar
operations, so I would expect nominal improvements of some form as it
matures.

To get a sense of what further adaptive inline compilation can achieve, I
note that SmallScript on the v4 AOS Platform VM will execute the 900,000
loop case in 1,215ms when the triangle method is inlined. If we assume that
the JIT/compiler is capable of hoisting the invariant calculation of 10
triangle out of the loop, then we would see a time of 95ms for SmallScript
on the v4 AOS Platform VM.
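
Hoisting the invariant call can be illustrated with a small C++ sketch. The
benchmark source is not reproduced in this post, so the kernel below is an
assumption: triangle(n) is taken to be the sum 1 + 2 + ... + n computed
iteratively, and the names sum_naive and sum_hoisted are hypothetical.

```cpp
#include <cstdint>

// Assumed definition: the n-th triangular number, computed iteratively so
// that each call has a visible cost.
std::int64_t triangle(std::int64_t n) {
    std::int64_t t = 0;
    for (std::int64_t i = 1; i <= n; ++i) {
        t += i;
    }
    return t;
}

// Naive form: triangle(10) is recomputed on every iteration.
std::int64_t sum_naive(std::int64_t iterations) {
    std::int64_t sum = 0;
    for (std::int64_t i = 0; i < iterations; ++i) {
        sum += triangle(10);
    }
    return sum;
}

// Hoisted form: the loop-invariant value is computed once, before the loop.
std::int64_t sum_hoisted(std::int64_t iterations) {
    const std::int64_t invariant = triangle(10);
    std::int64_t sum = 0;
    for (std::int64_t i = 0; i < iterations; ++i) {
        sum += invariant;
    }
    return sum;
}
```

Both functions return the same sum; the hoisted form leaves only an
addition in the loop body, which is the transformation a JIT would have to
prove safe before applying it to a dynamically dispatched method.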

Remember: The Smalltalk code is performing multi-precision arithmetic and
thus has a significant number of overflow and type checks it performs. An
aggressive adaptive inlining JIT compiler could dynamically eliminate many
of the type checks by calculating the type graph for the code-flow tree and
generating separate versions based on likely types. Presumably this would
also allow most of the intermediate values to be retained in registers
rather than in stack local memory. The resulting ratio for multi-precision
numerics would most likely be somewhere around 2 : 1 with highly optimized
C++ performing *non-multi-precision-arithmetic*.
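
The kind of check involved can be sketched in C++ as follows. This is only
an illustration, not any particular VM's implementation: the 30-bit signed
small-integer range is an assumption, and is_small and checked_small_add
are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Assumed small-integer range (30-bit signed); real VMs differ.
constexpr std::int32_t kSmallMin = -(std::int32_t{1} << 29);
constexpr std::int32_t kSmallMax = (std::int32_t{1} << 29) - 1;

// True when `v` fits in the assumed small-integer range.
inline bool is_small(std::int64_t v) {
    return v >= kSmallMin && v <= kSmallMax;
}

// Adds two small integers. On success stores the sum in `out` and returns
// true. Returns false when the sum leaves the small-integer range, which
// is where a VM would switch to its multi-precision (large integer) path.
bool checked_small_add(std::int32_t a, std::int32_t b, std::int32_t& out) {
    assert(is_small(a) && is_small(b));
    // Widen first so the addition itself cannot overflow.
    const std::int64_t wide =
        static_cast<std::int64_t>(a) + static_cast<std::int64_t>(b);
    if (!is_small(wide)) {
        return false;  // caller must fall back to large-integer arithmetic
    }
    out = static_cast<std::int32_t>(wide);
    return true;
}
```

Fixed-size C++ arithmetic performs only the addition; every range test and
branch above is extra work on each operation unless the compiler can prove
the operand types and ranges ahead of time.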

I should also point out that this kind of test represents a *worst-case*
type of scenario for Smalltalk (dynamic/script language) performance (where
it is handling arbitrary/multi-precision arithmetic) vis-a-vis statically
typed and highly optimized C++ code performing fixed size integer truncated
arithmetic.
=================================
SmallScript v4 AOS Platform VM
    1,328ms for loop over 900,000
        1.819 : 1 Unoptimized C++ Ratio
        4.000 : 1 Optimized C++ Ratio
    1,576ms for loop over 1,000,000
        (GC tuning for tight memory raises this to 1,874ms)
        2.159 : 1 Unoptimized C++ Ratio (2.567 : 1)
        4.747 : 1 Optimized C++ Ratio (5.644 : 1)

Cincom VisualWorks 5i3NC
    1,457ms for loop over 900,000
        1.9959 : 1 Unoptimized C++ Ratio
        4.3886 : 1 Optimized C++ Ratio
    1,789ms for loop over 1,000,000
        2.4507 : 1 Unoptimized C++ Ratio
        5.3886 : 1 Optimized C++ Ratio

Dolphin Smalltalk Professional 4.01
    12,086ms for loop over 900,000
        16.556 : 1 Unoptimized C++ Ratio
        36.404 : 1 Optimized C++ Ratio
    13,434ms for loop over 1,000,000
        18.403 : 1 Unoptimized C++ Ratio
        40.464 : 1 Optimized C++ Ratio
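
The ratios in these tables are plain quotients of the measured times, and
can be rechecked with a small C++ sketch (the function and constant names
are hypothetical). Recomputing from the timings above, the printed
1,000,000-loop ratios appear to be taken against the 900,000-loop C++ times
(for example 1,576 / 332 is about 4.747); against the 1,000,000-loop
optimized C++ time the same measurement gives 1,576 / 374, about 4.214.

```cpp
#include <cassert>

// Timings in milliseconds, copied from the figures reported above.
constexpr double kCppOptimized900k = 332.0;
constexpr double kCppOptimized1M = 374.0;
constexpr double kSmallScript900k = 1328.0;
constexpr double kSmallScript1M = 1576.0;

// Ratio of a measured time to a baseline time; smaller is better.
double ratio(double measured_ms, double baseline_ms) {
    assert(baseline_ms > 0.0);
    return measured_ms / baseline_ms;
}
```

For instance, ratio(kSmallScript900k, kCppOptimized900k) gives the 4.000
figure listed for the 900,000 loop.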

I did not test VisualAge, Squeak, GNU Smalltalk, or GemStone. I did an
empirical (1,000,000) loop test with Smalltalk/X from CampSmalltalk#1 and it
ran in roughly 5,212ms (this is only rough because I had to run the test on
different hardware, using SmallScript and C++ as a baseline for scaling the
result).

-- Dave Simmons
www.qks.com / www.smallscript.com

"Bob Nemec" <bobn at home.com> wrote in message
news:D72166C0036950F6.D162DFAAECC52CD9.F2A613E2A85334A6 at lp.airnews.net...
> brangdon at cix.co.uk says...
> > To make the measurements less dependent on specific hardware, I suggest
> > we express speeds as a proportion of C++'s speed doing the same thing
> > but using fixed integers.
> >
> Interesting idea: a standard set of cross-language, cross-platform
> benchmarks.
>
> However, in general I think benchmarks do Smalltalk a disservice.
> Small tight independent chunks of code are not Smalltalk's strength;
> large complex systems are.
>
> The standard argument (which I agree with) is that Smalltalk can scale
> better than any other language, and that the truth of that statement
> becomes more self evident the larger your systems get.
>
> FWIW: I ran your little benchmark on VA, VW, Squeak and GemStone
> (care to publish a Windows EXE with your C++ code? ... no compiler on this
> machine).
> The ratios are:
> VW: 1.0
> VA: 3.37
> Squeak: 18.13
> GS:  25.61
>
> Details...
> "VW 3500"
> #(550000000 3500)
> #(550000000 3475)
> #(550000000 3712)
>
> "VA 11790"
> (550000000 11790)
> (550000000 11797)
> (550000000 11787)
>
> "Squeak 63440"
> #(550000000 63411)
> #(550000000 63471)
>
> "GemStone 89650"
> anArray( 550000000, 88897)
> anArray( 550000000, 90400)
> --
> Bob Nemec
> Newcastle Objects
> bobn at home.com