I think there is nothing quite broken for PyPy. It just has a very long warm-up time.
I think the jit has warmed up, for a few different reasons. Maybe some of them are leading me astray though.
That means every additional run takes only 5 seconds.
Right. The way I interpret that is no matter how many more copies of data added, the jit is not getting measurably faster, i.e. it's warmed up.
Second, diz prints out progress information at regular intervals, and it noticeably speeds up. The jit is certainly kicking in. Cool.
Third, there are traces listed by jitviewer showing the asm code generated. Again, clearly the jit is doing its thing.
Perhaps you're implying that the jit is constantly improving the code and will do better with a longer run. OK, there's no shortage of larger files to try. When I try the 10MB file dickens from http://www.data-compression.info/Corpora/SilesiaCorpus/index.htm, which is over 10x longer, and look at the asm code generated (I'm looking at the various traces of output() at line 84), the code gen remains the "same" as when the shorter frank.txt is used, where "same" means the parts that seemed large still seem large. I do see a new trace "line 84 run 13165807 times", so the jit is still changing things - I just can't see the improvement.
I think the jit is doing its thing and more time doesn't change much. If you want me to try larger files I will.
I think there is nothing quite broken for PyPy.
OK. The PyPy asm is working fine, certainly correct, and faster than CPython, but it's still somewhat lengthier than I was expecting. Could you explain why a couple of parts use much larger alternatives to what I'd expect to see?
For line 84, "if (self.low ^ self.high) & 0x80000000 == 0:", I was hoping for code along the lines of:
(Linux, 64 bit)
LOAD_ATTR low
    mov r10,QWORD PTR [rbx+0x8]
LOAD_ATTR high
    mov r11,QWORD PTR [rbx+0x16]
BINARY_XOR
    mov rax, r10
    mov rdx, r11
    xor rax, rdx
BINARY_AND
    and rax, 0x80000000
COMPARE_OP ==
    jnz after_if
I see both BINARY_XOR and BINARY_AND call a function instead of emitting xor and and instructions. Why? Is there something I can change in my code to let those instructions be used instead? Can xoring two ints really cause an exception?
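For reference, my understanding of line 84 is that it's the usual top-bit test in an arithmetic coder: (low ^ high) has bit 31 clear exactly when low and high agree in their top bit, meaning that bit is decided and can be shifted out. A minimal sketch of that loop (hypothetical names, not diz's actual code):

```python
# Sketch of arithmetic-coder renormalization (hypothetical names).
# (low ^ high) & 0x80000000 == 0 holds exactly when low and high
# share the same top bit, i.e. the next output bit is decided.
TOP_BIT = 0x80000000
MASK32 = 0xFFFFFFFF

def renormalize(low, high, emit):
    """Shift out decided top bits, keeping low/high within 32 bits."""
    while (low ^ high) & TOP_BIT == 0:
        emit((high & TOP_BIT) >> 31)          # the agreed-upon top bit
        low = (low << 1) & MASK32             # shift, stay in 32 bits
        high = ((high << 1) | 1) & MASK32     # shift in a 1 on the low end
    return low, high

bits = []
low, high = renormalize(0x30000000, 0x5FFFFFFF, bits.append)
# one bit (0) is emitted, then the top bits differ and the loop stops
```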
BINARY_XOR
p120 = call(ConstClass(rbigint.xor), p118, p119, descr=<Callr 8 rr EF=3>)
    mov QWORD PTR [rbp-0x170],rdx
    mov QWORD PTR [rbp-0x178],rax
    mov rdi,r15
    mov rsi,r13
    mov r11d,0x26db2a0
    call r11
guard_no_exception(descr=<Guard239>)
    cmp QWORD PTR ds:0x457d6e0,0x0
    jne 0x3aed3fa2
The load of self.low seems involved as well. Is there something in the diz code that causes pypy to think it could ever be None/Null? Is the map lookup for the location of low in self ever condensed to a single instruction? It seems like the location is calculated using the map at every self.low LOAD_ATTR. Isn't the point of the map that the "slot" is always the same and could be baked into the load assembly instruction?
so my fantasy:

LOAD_ATTR low
    mov r10,QWORD PTR [rbx+0x8]

versus the actual:

LOAD_ATTR low
p33 = ((pypy.objspace.std.mapdict.W_ObjectObjectSize5)p10).inst_map
    mov r10,QWORD PTR [rbx+0x30]
guard(p33 == ConstPtr(ptr34))
    jne 0x3aed3ac1
p35 = ((pypy.objspace.std.mapdict.W_ObjectObjectSize5)p10).inst__value0
    mov r10,QWORD PTR [rbx+0x8]
guard_nonnull_class(p35, ConstClass(W_LongObject), descr=<Guard213>)
    cmp DWORD PTR [r10],0x1308
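For what it's worth, here is a toy sketch of the map technique as I understand it (my own illustration, not PyPy's real mapdict code): instances that gained their attributes in the same order share one map object, and the map pins each attribute to a fixed storage index. That's why, once the guard on the map holds, the slot is indeed always the same:

```python
# Toy sketch of the "map" (hidden class) technique - not PyPy's
# actual mapdict implementation. Instances with the same attribute
# history share one Map, which fixes each attribute's storage index.
class Map:
    def __init__(self, index_for=None):
        self.index_for = index_for or {}  # attr name -> storage index
        self.transitions = {}             # attr name -> next Map

    def with_attr(self, name):
        """Return the (shared) map reached by adding one attribute."""
        if name not in self.transitions:
            nxt = dict(self.index_for)
            nxt[name] = len(nxt)
            self.transitions[name] = Map(nxt)
        return self.transitions[name]

EMPTY_MAP = Map()

class Obj:
    def __init__(self):
        self.map = EMPTY_MAP
        self.storage = []

    def setattr(self, name, value):
        idx = self.map.index_for.get(name)
        if idx is None:
            self.map = self.map.with_attr(name)
            self.storage.append(value)
        else:
            self.storage[idx] = value

    def getattr(self, name):
        # once the map is known, this is a fixed-index load
        return self.storage[self.map.index_for[name]]

a, b = Obj(), Obj()
a.setattr("low", 0); a.setattr("high", 0xFFFFFFFF)
b.setattr("low", 5); b.setattr("high", 9)
assert a.map is b.map  # same attribute history -> same shared map
```

So the guard(p33 == ConstPtr(...)) in the trace checks the map hasn't changed; if it could be proven constant, the load could in principle collapse to one instruction.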
On 32-bit there is extra time needed --- both on PyPy and on CPython --- because the numbers you use overflow signed 32-bit ints, and it needs longs.
Where are numbers larger than 32 bits ever assigned to self.low or self.high? I agree that some expressions have temporaries larger than 32 bits, but they're always reduced back to 32 bits before being stored. If I missed an offending line and fix it to stay within 32 bits, will PyPy go along with it? This seems quite important for Windows, which only has a 32-bit version.
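To make sure I understand what "fits in 32 bits" means here: if low/high live in the usual 0..0xFFFFFFFF range of an arithmetic coder (an assumption on my part), then values at or above 0x80000000 fit in 32 unsigned bits yet still overflow a signed 32-bit int, which I gather is what the fast int path uses on a 32-bit build:

```python
# A value can fit in 32 *unsigned* bits and still overflow the
# *signed* 32-bit ints that a 32-bit interpreter uses for its fast
# int path (where sys.maxint == 2**31 - 1 on Python 2).
SIGNED_32_MAX = 2**31 - 1

def fits_signed_32(value):
    """True if value fits the machine-int fast path on a 32-bit build."""
    return -2**31 <= value <= SIGNED_32_MAX

for value in (0x7FFFFFFF, 0x80000000, 0xFFFFFFFF):
    print(hex(value), fits_signed_32(value))
```

So keeping low/high masked to 0xFFFFFFFF isn't enough on 32-bit; they'd have to stay below 0x80000000.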
an alternate implementation for longs, for example for numbers that fit into two regular-sized integers.
Too complicated for my needs. My code can be happy with only 32 bits, I just want PyPy to be happy too.
Thanks Armin, -Roger
On Thu, Feb 28, 2013 at 6:00 PM, Roger Flores firstname.lastname@example.org wrote:
OK then. Unzip it, grab a text file large enough to warm up the jit, and run the line to generate the log for jitviewer.
I think there is nothing quite broken for PyPy. It just has a very long warm-up time. On my 64-bit laptop, it runs frank.txt in 14+10 seconds. If I replace frank.txt by 5 concatenated copies of it, it takes 34+29 seconds. That means every additional run takes only 5 seconds. For comparison the CPython time, also on 64-bit, is 21+21 seconds for frank.txt. It's not very clear why, but the warm-up time can be much smaller or bigger depending on the example; it's some current and future research to improve on that aspect.
On 32-bit there is extra time needed --- both on PyPy and on CPython --- because the numbers you use overflow signed 32-bit ints, and it needs longs. On PyPy we could in theory improve for that case, e.g. by writing an alternate implementation for longs, for example for numbers that fit into two regular-sized integers. (There is actually one implementation for that, but it's not complete and limited so far, so not enabled by default; see pypy.objspace.std.smalllongobject.py. A more complete version would really use two integers, rather than a "long long" integer.)