Integer micro-benchmarks [Smalltalk Multi-Precision Numerics vis-a-vis Hig

David Simmons pulsar at qks.com
Tue Apr 24 15:29:14 EDT 2001


"Dave Harris" <brangdon at cix.co.uk> wrote in message
news:memo.20010424181525.43463C at brangdon.madasafish.com...
> pulsar at qks.com (David Simmons) wrote (abridged):
> > I ran two different tests. The first one loops to generate "only"
> > SmallInteger sums; the (900,000) loop. The second test is based on the
> > originally posted sample and performs the 1,000,000 count loop; and
> > thus generates a significant number of large integers.
>
> Does this mean SmallScript's SmallInteger range is smaller than
> Dolphin's? I tried to pick numbers that would give a long enough period
> to measure, but without overflowing the local SmallIntegers.

(I am now writing this comment after having written my responses to this and
the next paragraph of comments in your post. Your line of inquiry about
SmallInteger ranges was right on -- see my comments on the "wrong" reported
loop counts in the next section. However, having written this explanation of
tag bits and SmallInteger representation, I thought I would leave it here
anyway.)

I think most (if not all) Smalltalks use the same SmallInteger encoding.

Here's how it works:

On a 32-bit cpu architecture the OOP pointers are 32 bits. Smalltalk is a
dynamically typed language, which means that type information is not required
for any expression, and variables are generic (typeless) containers.
Therefore all object pointers (OOPs) must be self-describing. I.e., it must
be possible to ascertain an object's type/class from its pointer. (You
probably knew all this, but it doesn't hurt to explain it for clarity.)

For certain types of objects, their use is frequent and significant enough to
justify some form of optimized representation.

The typical technique is based on recognizing that pointers to "real"
objects will be aligned on some power-of-2 multiple -- typically 4-byte
boundaries, or (for modern cpu cache efficiency) 16-byte or 32-byte
boundaries. That means a "real" pointer will always have its least
significant bits equal to zero. E.g., alignment on 16-byte boundaries means
that every "real" pointer must be a multiple of 16, so the lower 4 bits
(2^4 == 16) will be clear.

Recognizing that fact means that a design can cleanly use those bits to
indicate a pointer to a "virtual" object. This technique is called tag-bit
pointers and has long been used in the Smalltalk and Lisp/Scheme communities
to efficiently encode certain object types. Lisp implementations tend to use
more bits (and thus have more virtual tag-bit types) than Smalltalk
implementations, but the technique is the same.
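
(For concreteness, here is a rough C++ sketch of the idea -- it is not the
actual AOS Platform VM code, just an illustration assuming 32-bit OOPs and
at-least-4-byte object alignment, so the low two bits are free to carry a
tag:)

    #include <cstdint>

    typedef uint32_t Oop;             // a 32-bit object pointer (OOP)

    const Oop kTagMask        = 0x3;  // the two low alignment bits carry the tag
    const Oop kRealPointerTag = 0x0;  // 00 => an ordinary (aligned) heap pointer

    // A "real" object pointer still has its alignment bits clear.
    inline bool isRealPointer(Oop oop) {
        return (oop & kTagMask) == kRealPointerTag;
    }

    // Anything with a nonzero tag encodes a "virtual" (immediate) object.
    inline bool isVirtualObject(Oop oop) {
        return (oop & kTagMask) != kRealPointerTag;
    }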

Ok, with that bit of background I can explain the SmallInteger
representation.

On the v4 AOS Platform VM, there are three types of virtual object pointers:
SmallInteger, Character, WeakId. These correspond to the lower (tag-bit)
combinations (11, 10, 01). The 11 combination is used for SmallInteger
because it has a different parity than the other two forms, which enables
exploitation of the x86 processor's parity status tracking and subsequent
branch use. Ideally, a cpu designed to support high-performance (tag-bit)
virtual machines would have explicit tag-bit instructions and a type-lookup
base register to support high-performance tag-bit actions. [I should also
note that internally the virtual machine may use the next two bits (XX00) to
mark various "real" object pointers it passes around internally -- but it is
too complicated to explain why.]
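
(Continuing the same illustrative C++ sketch: the description above only
pins down the 11 combination for SmallInteger; which of 10/01 belongs to
Character versus WeakId is not stated, so that assignment below is a guess:)

    // Tag combinations for the three virtual pointer types described above.
    const Oop kSmallIntegerTag = 0x3;  // 11 -- stated above
    const Oop kCharacterTag    = 0x2;  // 10 -- assumed; could equally be 01
    const Oop kWeakIdTag       = 0x1;  // 01 -- assumed; could equally be 10

    // Of the three immediate tags, only 11 has an even number of set bits,
    // which is what lets an x86 implementation branch on the parity flag
    // after masking the low bits.
    inline bool isSmallInteger(Oop oop) {
        return (oop & kTagMask) == kSmallIntegerTag;
    }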

This means that out of the 32 bits available for a virtual-object pointer,
two (2) are used for the type/tag bits. Of the remaining 30 bits available
to represent a SmallInteger value, the high bit holds the sign. Thus a
30-bit SmallInteger carries a 29-bit magnitude plus sign, represented in
2's complement notation on modern processors. This gives positive numbers
the range 0..0x1FFFFFFF and negative numbers the range -(1..0x20000000).
The maximum positive SmallInteger value, 0x1FFFFFFF, is decimal 536,870,911,
which has an internal SmallInteger OOP representation of 0x7FFFFFFF.
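
(As a worked example of that encoding, still in the same illustrative C++ --
assuming the SmallInteger value simply occupies the 30 bits above the tag,
which is consistent with 536,870,911 encoding as the OOP 0x7FFFFFFF, though
the real VM layout may of course differ:)

    // Encode a 30-bit signed value as a SmallInteger OOP: shift it above the
    // two tag bits and OR in the 11 tag.
    inline Oop encodeSmallInteger(int32_t value) {
        return ((Oop)value << 2) | kSmallIntegerTag;
    }

    // Decode: an arithmetic shift right by 2 drops the tag and restores the sign.
    inline int32_t decodeSmallInteger(Oop oop) {
        return (int32_t)oop >> 2;
    }

    // Example: the maximum positive SmallInteger.
    //   encodeSmallInteger(536870911 /* 0x1FFFFFFF */) == 0x7FFFFFFF
    //   decodeSmallInteger(0x7FFFFFFF)                 == 536870911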

So, given the original benchmark example, we note that its resultant "sum"
for a loop of 10,000,000 iterations totalled 550,000,000, and for a loop of
9,000,000 the sum totalled 495,000,000. From those numbers we can see that a
sum of 495,000,000 is within the SmallInteger domain, but a sum of
550,000,000 exceeds the maximum positive SmallInteger value of 536,870,911.
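
(To make the range check explicit, in the same illustrative C++:)

    const int32_t kMaxSmallInteger = 0x1FFFFFFF;        // 536,870,911

    // 9,000,000-iteration run: the sum still fits in a SmallInteger.
    static_assert(495000000 <= kMaxSmallInteger, "still a SmallInteger");

    // 10,000,000-iteration run: the sum overflows into LargeInteger territory.
    static_assert(550000000 >  kMaxSmallInteger, "needs a LargeInteger");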

>
> My loop was actually 10,000,000 iterations, ie 10 times the figures you
> say. The actual numbers shouldn't matter too much, as long as they are
> large enough to reduce timing error.

Ahh, foobar, my error!

In writing the last part of the preceding section's paragraph I see what
happened.

The count values written in the table in my post are missing a decimal
place relative to the actual loop counts used for the reported times.

When I began the writeup of that post, I looked at the 10000000 number and
the 9000000 number and said to myself, these long numbers are hard to read,
I should put commas in to break them up. So I prepared a template for
reporting the results of each benchmark. The cut-&-paste template was
written up with loop counts which were missing a decimal place.

I.e., while I was writing the template (with comma-separated counts -- to be
helpful in reading) I dropped a zero from each count. I then cut and pasted
that template for each of the different benchmarks. I then went and spent
the majority of the time filling in the template with the accurate set of
reported timing numbers for the one-decimal-place-larger counts of
10,000,000 and 9,000,000 loops.

****
The times and ratios reported in my original post are accurate (for the
larger loop counts); they were cut-and-pasted values from the actual runs
(and I just re-verified them). The times reported in the tables within my
post are actually from runs with loop counts of 10,000,000 and 9,000,000, NOT
the tables' reported counts of 1,000,000 and 900,000.

I can't believe I missed this loop count decimal place during my pre-post
proofread. I still have not gotten used to the speed of my new machine's
1.2GHz processor, and my gut instinct in reading it did not get triggered --
I was focusing my analytical thoughts on the ratios.
****

>
>
> > NOTE: Use of GetTickCount() has rounding loss that can result in
> > reporting of times which are up to 20ms less than actual time.
>
> On my machine, that would be an error of less than 2%, which I felt was
> insignificant.
>
>
> > Dolphin Smalltalk Professional 4.01
> >     12,086ms for loop over 900,000 <<<< **** 9,000,000 ****
> >         36.404 : 1 Optimized C++ Ratio
>
> I am surprised you got 36:1 here, because my machine was giving 16:1. I'm
> using 4.00, though.

Did you build the C/C++ code using Visual Studio 6 with optimizations?

It really makes a difference which optimization settings you have on and
what C++ compiler you use. Borland C++, gnu, Intel, Microsoft,...

I just re-ran the benchmarks and got the same result.

>
>
> > Cincom VisualWorks 5i3NC
> >     1,457ms for loop over 900,000 <<<< **** 9,000,000 ****
> >         4.3886 : 1 Optimized C++ Ratio
>
> Thanks.
>
>
> > SmallScript v4 AOS Platform VM
> >     1,328ms for loop over 900,000 <<<< **** 9,000,000 ****
> >         4.000 : 1 Optimized C++ Ratio
>
> Impressive. Thanks for the results.

Note that, to be on par with C++ inlining, we want to compare against the
inlined SmallScript version, which reports a time of 1,163ms for a
3.472 : 1 ratio.

I should add that cpu code cache behavior, and the corresponding alignment of
jitted method code, affects the performance. This means that the same code
executed in a different memory configuration or method layout *may* report
run times that differ by roughly +/- 10%.

>
>
> > I should also point out that this kind of test represents a
> > *worst-case* type of scenario for Smalltalk (dynamic/script
> > language) performance (where it is handling arbitrary/multi-precision
> > arithmetic) vis-a-vis statically typed and highly optimized C++
> > code performing fixed size integer truncated arithmetic.
>
> Yes; the idea was not to compare Smalltalk to fixed-size C++ code, but to
> some variable-sized C++ code. The fixed-size code just provides a
> baseline for measurements between different machines. I'm not sure it's
> working, though, as the Dolphin results vary so much.
>
>   Dave Harris, Nottingham, UK | "Weave a circle round him thrice,
>       brangdon at cix.co.uk      |   And close your eyes with holy dread,
>                               |  For he on honey dew hath fed
>  http://www.bhresearch.co.uk/ |   And drunk the milk of Paradise."
