
> I think there is nothing quite broken for PyPy. It just has a very long warm-up time.

I think the jit has warmed up, for a few different reasons. Maybe some of them are leading me astray, though.
> That means every additional run takes only 5 seconds.

Right. First, the way I interpret that is that no matter how many more copies of the data are added, the jit is not getting measurably faster, i.e. it's warmed up.
Second, diz prints out progress information at regular intervals, and it noticeably speeds up. The jit is certainly kicking in. Cool. Third, there are traces listed by jitviewer showing the asm code generated. Again, the jit is clearly doing its thing.

Perhaps you're implying that the jit is constantly improving the code and will do better with a longer run. OK, there's no shortage of larger files to try. When I try the 10MB file dickens from http://www.data-compression.info/Corpora/SilesiaCorpus/index.htm, which is over 10x longer, and look at the asm code generated (I'm looking at the various traces of output() at line 84), the code gen remains the "same" as when the shorter frank.txt is used, where "same" means the parts that seem large still seem large. I do see a new trace, "line 84 run 13165807 times", so the jit is still changing things; I just can't see the improvement. I think the jit is doing its thing and more time doesn't change much. If you want me to try larger files I will.
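As a sanity check on the leveling off, this is the sort of probe I have in mind (a minimal sketch, not the diz code itself): run identical work repeatedly and watch the per-run times drop and then flatten once the jit is warm.

    # Minimal warm-up probe (a sketch, not the diz code): identical work,
    # run repeatedly; per-run times should drop and then level off.
    import time

    def work():
        total = 0
        for i in xrange(2000000):
            total = (total ^ i) & 0xFFFFFFFF
        return total

    for n in range(10):
        t0 = time.time()
        work()
        print("run %d: %.3f seconds" % (n, time.time() - t0))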
> I think there is nothing quite broken for PyPy.

OK. The PyPy asm is working fine, certainly correct, and faster than CPython, but it's still somewhat lengthier than I was expecting. Could you explain why a couple of parts use much larger alternatives to what I'd expect to see?
For line 84, "if (self.low ^ self.high) & 0x80000000 == 0:", I was hoping for code along the lines of (Linux, 64 bit):

    LOAD_ATTR low
        mov r10,QWORD PTR [rbx+0x8]
    LOAD_ATTR high
        mov r11,QWORD PTR [rbx+0x16]
    BINARY_XOR
        mov rax, r10
        mov rdx, r11
        xor rax, rdx
    BINARY_AND
        and rax, 0x80000000
    COMPARE_OP ==
        jnz after_if

Instead, I see both BINARY_XOR and BINARY_AND call a function instead of using xor and and. Why? Is there something I can change in my code to let those instructions be used instead? Can xoring two ints really cause an exception?

    BINARY_XOR
        p120 = call(ConstClass(rbigint.xor), p118, p119, descr=<Callr 8 rr EF=3>)
            mov QWORD PTR [rbp-0x170],rdx
            mov QWORD PTR [rbp-0x178],rax
            mov rdi,r15
            mov rsi,r13
            mov r11d,0x26db2a0
            call r11
        guard_no_exception(descr=<Guard239>)
            cmp QWORD PTR ds:0x457d6e0,0x0
            jne 0x3aed3fa2

The load of self.low seems involved as well. Is there something in the diz code that causes PyPy to think it could ever be None/Null? Is the map lookup for the location of low in self ever condensed to a single instruction? It seems like the location is calculated using the map at every self.low LOAD_ATTR. Isn't the point of the map that the "slot" is always the same and could be baked into the load assembly instruction? So my fantasy

    LOAD_ATTR low
        mov r10,QWORD PTR [rbx+0x8]

is really

    LOAD_ATTR low
        p33 = ((pypy.objspace.std.mapdict.W_ObjectObjectSize5)p10).inst_map
            mov r10,QWORD PTR [rbx+0x30]
        guard(p33 == ConstPtr(ptr34))
            movabs r11,0x7fe837ed7be8
            cmp r10,r11
            jne 0x3aed3ac1
        p35 = ((pypy.objspace.std.mapdict.W_ObjectObjectSize5)p10).inst__value0
            mov r10,QWORD PTR [rbx+0x8]
        guard_nonnull_class(p35, ConstClass(W_LongObject), descr=<Guard213>)
            cmp r10,0x1
            jb 0x3aed3ad2
            cmp DWORD PTR [r10],0x1308
            jne 0x3aed3ad8
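(Writing this out, I notice the guard_nonnull_class is on W_LongObject, so the jit seems to have specialized self.low as a long, which would explain the rbigint.xor call above. For reference, this is the kind of type-stable class I'm describing; the names are made up, it's not the diz code:)

    # Made-up sketch, not the diz code: the fields start as ints and are
    # never assigned None, so the attribute stays type-stable for the jit.
    class RangeCoder(object):
        def __init__(self):
            self.low = 0                # an int from the start, never None
            self.high = 0xFFFFFFFF      # note: already a long on 32-bit

        def top_bits_equal(self):
            # the line 84 expression
            return (self.low ^ self.high) & 0x80000000 == 0

    coder = RangeCoder()
    print(coder.top_bits_equal())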
> On 32-bit there is extra time needed --- both on PyPy and on CPython --- because the numbers you use overflow signed 32-bit ints, and it needs longs.
Where are numbers larger than 32 bits ever assigned to self.low or self.high? I agree that some expressions have temporaries larger than 32 bits, but they're always reduced back to 32 bits before being stored. If I missed an offending line and fix it to stay within 32 bits, will PyPy go with it? This seems quite important for Windows, which only has a 32-bit version.
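To be concrete, the pattern I mean is along these lines (a sketch, not the actual diz lines). Writing it out also shows one catch: on a 32-bit build sys.maxint is 2**31 - 1, so a value masked to 32 bits is still stored as a long whenever its top bit is set, which may be exactly your point.

    # Sketch of the update pattern I mean, not the actual diz lines.
    import sys

    MASK32 = 0xFFFFFFFF

    def shift_out(low, high):
        # the temporaries can exceed 32 bits...
        low = (low << 1) & MASK32           # ...but are masked when stored
        high = ((high << 1) | 1) & MASK32
        return low, high

    low, high = shift_out(0x40000000, 0xBFFFFFFF)
    # On a 32-bit build, low is now 0x80000000, which exceeds sys.maxint
    # (2**31 - 1) and so is a long despite fitting in 32 unsigned bits.
    print(low <= sys.maxint, isinstance(low, int))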
> an alternate implementation for longs, for example for numbers that fit into two regular-sized integers.

Too complicated for my needs. My code can be happy with only 32 bits, I just want PyPy to be happy too.
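(If I follow the idea, it's something like this toy version; my own illustration, nothing like the real pypy.objspace.std.smalllongobject.py, which is RPython:)

    # My own toy illustration of the two-integer idea, not PyPy's code.
    class TwoWordLong(object):
        def __init__(self, value):
            self.lo = value & 0xFFFFFFFF           # lower 32 bits
            self.hi = (value >> 32) & 0xFFFFFFFF   # upper 32 bits

        def value(self):
            return (self.hi << 32) | self.lo

    n = TwoWordLong(0x123456789)
    print(hex(n.value()))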
Thanks Armin,

-Roger

________________________________
From: Armin Rigo <arigo@tunes.org>
To: Roger Flores <aidembb@yahoo.com>
Cc: "pypy-dev@python.org" <pypy-dev@python.org>
Sent: Friday, March 1, 2013 2:34 AM
Subject: Re: [pypy-dev] Slow int code

Hi Roger,

On Thu, Feb 28, 2013 at 6:00 PM, Roger Flores <aidembb@yahoo.com> wrote:
> OK then. Unzip it, grab a text file large enough to warm up the jit, and run the line to generate the log for jitviewer.
I think there is nothing quite broken for PyPy. It just has a very long warm-up time. On my 64-bit laptop, it runs frank.txt in 14+10 seconds. If I replace frank.txt by 5 concatenated copies of it, it takes 34+29 seconds. That means every additional run takes only 5 seconds. For comparison the CPython time, also on 64-bit, is 21+21 seconds for frank.txt.

It's not very clear why, but the warm-up time can be much smaller or bigger depending on the example; it's some current and future research to improve on that aspect.

On 32-bit there is extra time needed --- both on PyPy and on CPython --- because the numbers you use overflow signed 32-bit ints, and it needs longs. On PyPy we could in theory improve for that case, e.g. by writing an alternate implementation for longs, for example for numbers that fit into two regular-sized integers. (There is actually one implementation for that, but it's not complete and limited so far, so not enabled by default; see pypy.objspace.std.smalllongobject.py. A more complete version would really use two integers, rather than a "long long" integer.)

A bientôt,

Armin.