Hi guys. I've been looking at two simple routines using jitviewer to figure out why they're so much slower than expected.

I've also noticed that http://pypy.org/performance.html has the line "Bad examples include doing computations with large longs – which is performed by unoptimizable support code." I'm worried that my 32-bit int code is falling into this, and I'm wondering what I can do to avoid it.

Trivial code like

    if (self.low ^ self.high) & 0x80000000 == 0:

is expanding into several dozen asm instructions. I suspect that lines like

    self.low = (self.low << 1) & 0xffffffff

with their shift left are convincing the jit that the int needs 64 bits (a large long?) instead of 32.

Ideas? The asm is clearly operating on QWORDs and calling routines to do the bit arithmetic instead of using single instructions. Is this what that line in performance.html is warning about?

-Roger

BTW, Fijal's jitviewer is a *must see* for anyone interested in how pypy makes their code fast!
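For context, the two statements quoted above would typically sit in a renormalization loop along these lines (a sketch only; the class, method, and surrounding logic are assumptions, not the actual diz.py source):

    class Encoder(object):
        def __init__(self):
            self.low = 0
            self.high = 0xffffffff

        def output(self, bits):
            # Shift out bits while low and high agree in their top bit,
            # keeping both values masked to 32 bits.
            while (self.low ^ self.high) & 0x80000000 == 0:
                bits.append((self.high >> 31) & 1)
                self.low = (self.low << 1) & 0xffffffff
                self.high = ((self.high << 1) | 1) & 0xffffffff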
On Wed, Feb 27, 2013 at 12:10 PM, Roger Flores <aidembb@yahoo.com> wrote:
> I'm worried that my 32 bit int code is falling into this, and I'm wondering what I can do to avoid it?

In that context "large longs" means HUNDREDS or THOUSANDS of bits, not 64 :) Can you show us a full runnable example that illustrates this?

Alex
Would you like a paste from jitviewer, or the source code to run and examine with jitviewer?

-Roger
On Wed, Feb 27, 2013 at 12:32 PM, Roger Flores <aidembb@yahoo.com> wrote:
> Would you like a paste from jitviewer or the source code to run and examine with jitviewer?

The original source code would be best!

Thanks,
Alex
I'll email the code separately because I'm not sure everyone wants a tiny zip. Anyone is welcome to it. It's a newer version of the compressor I entered into the Large Text Compression Benchmark.

I'm running it as:

    PYPYLOG=jit-log-opt,jit-backend:dizlog.pypylog pypy diz.py -p -t frank.txt

You can get frank.txt from http://www.gutenberg.org/ebooks/84 (and rename it), or substitute a similar file.

Examine the second line in output():

    if (self.low ^ self.high) & 0x80000000 == 0:

The remaining lines are similar. Also, the routine encode(), listed one line above in jitviewer, has the same issues. If I comment out the two calls to encode(), I save a huge percentage of time (up to 40% in some configurations).

-Roger
Hi Roger,

On Wed, Feb 27, 2013 at 9:55 PM, Roger Flores <aidembb@yahoo.com> wrote:
> I'll email the code separately because I'm not sure everyone wants a tiny zip.
I'm sure no-one would mind receiving a tiny zip; or just use a paste site if your program is a single module (http://bpaste.net/).

A bientôt,

Armin.
> I'm sure no-one would mind receiving a tiny zip;
OK then. Unzip it, grab a text file large enough to warm up the jit, and run the line to generate the log for jitviewer.

The issue is the codegen for the output() function. Is there anything about my Python code that's unintentionally confusing for PyPy? Say, is there by chance a command that will show the types that PyPy annotates for the classes in a program? That way I could easily check that the important data structures are understood correctly, type-wise.

Thanks,

-Roger
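One low-tech way to spot-check this on either interpreter is to dump the attribute types of an instance at runtime (a sketch; enc stands in for whatever Encoder instance diz.py builds, and dump_attr_types is not an existing helper):

    def dump_attr_types(obj):
        # Print each instance attribute with the concrete type it holds;
        # e.g. "low: long" would show that a value has been promoted to long.
        for name, value in sorted(vars(obj).items()):
            print("%s: %s" % (name, type(value).__name__))

    dump_attr_types(enc)   # enc is a hypothetical Encoder instance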
Hi Roger,

On Thu, Feb 28, 2013 at 6:00 PM, Roger Flores <aidembb@yahoo.com> wrote:
> OK then. Unzip it, grab a text file large enough to warm up the jit, and run the line to generate the log for jitviewer.
I think there is nothing quite broken for PyPy. It just has a very long warm-up time. On my 64-bit laptop, it runs frank.txt in 14+10 seconds. If I replace frank.txt with 5 concatenated copies of it, it takes 34+29 seconds. That means every additional run takes only 5 seconds. For comparison, the CPython time, also on 64-bit, is 21+21 seconds for frank.txt. It's not very clear why, but the warm-up time can be much smaller or bigger depending on the example; improving that aspect is a topic of current and future research.

On 32-bit there is extra time needed --- both on PyPy and on CPython --- because the numbers you use overflow signed 32-bit ints, so they need longs. On PyPy we could in theory improve that case, e.g. by writing an alternate implementation of longs for numbers that fit into two regular-sized integers. (There is actually one such implementation, but it's incomplete and limited so far, so not enabled by default; see pypy/objspace/std/smalllongobject.py. A more complete version would really use two integers, rather than a "long long" integer.)

A bientôt,

Armin.
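One way to see the marginal-cost effect Armin describes is to time repeated passes over the same input and watch the per-pass time settle as the JIT warms up (a rough sketch; compress() is a placeholder for the real diz.py work, not its actual API):

    import time

    def compress(text):
        # Placeholder workload standing in for the real compressor loop.
        total = 0
        for ch in text:
            total = (total * 31 + ord(ch)) & 0xffffffff
        return total

    data = open("frank.txt").read()
    prev = 0.0
    start = time.time()
    for passes in range(1, 6):
        compress(data)
        elapsed = time.time() - start
        # The marginal time per pass approximates steady-state speed once warm.
        print("pass %d: %.1fs total, +%.1fs marginal" % (passes, elapsed, elapsed - prev))
        prev = elapsed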
> I think there is nothing quite broken for PyPy. It just has a very long warm-up time.

I think the jit has warmed up, for a few different reasons. Maybe some of them are leading me astray though.
> That means every additional run takes only 5 seconds.

Right. The way I interpret that is that no matter how many more copies of data are added, the jit is not getting measurably faster, i.e. it's warmed up.
Second, diz prints out progress information at regular intervals, and it noticeably speeds up. The jit is certainly kicking in. Cool.

Third, there are traces listed by jitviewer showing the asm code generated. Again, clearly the jit is doing its thing.

Perhaps you're implying that the jit is constantly improving the code and will do better with a longer run. OK, there's no shortage of larger files to try. When I try the 10MB file dickens from http://www.data-compression.info/Corpora/SilesiaCorpus/index.htm, which is over 10x longer, and look at the asm code generated (I'm looking at the various traces of output() at line 84), the codegen remains the "same" as when the shorter frank.txt is used, where "same" means the parts that seem large still seem large. I do see a new trace "line 84 run 13165807 times", so the jit is still changing things - I just can't see the improvement.

I think the jit is doing its thing and more time doesn't change much. If you want me to try larger files I will.
> I think there is nothing quite broken for PyPy.

OK. The PyPy asm is working fine, certainly correct, and faster than CPython, but it's still somewhat lengthier than I was expecting. Could you explain why a couple of parts use much larger alternatives to what I'd expect to see?
For line 84 "if (self.low ^ self.high) & 0x80000000 == 0:" I was hopingfor code along the lines of: (Linux, 64 bit) LOAD_ATTR low LOAD_ATTR high mov r10,QWORD PTR [rbx+0x8] mov r11,QWORD PTR [rbx+0x16] BINARY_XOR mov rax, r10 mov rdx, r11 xor rax, rdx BINARY_AND and rax, 0x80000000 COMPARE_OP == jnz after_if I see both BINARY_XOR and BINARY_AND call a function instead of xor and and. Why? Is there something I can change in my code to let those instructions be used instead? Can xoring two ints really cause an exception? BINARY_XOR p120 = call(ConstClass(rbigint.xor), p118, p119, descr=<Callr 8 rr EF=3>) mov QWORD PTR [rbp-0x170],rdx mov QWORD PTR [rbp-0x178],rax mov rdi,r15 mov rsi,r13 mov r11d,0x26db2a0 call r11 guard_no_exception(descr=<Guard239>) cmp QWORD PTR ds:0x457d6e0,0x0 jne 0x3aed3fa2 The load of self.low seems involved as well. Is there something in the diz code that causes pypy to think it could ever be None/Null? Is the map lookup for the location of low in self ever condensed to a single instruction? It seems like the location is calculated using the map at every self.low LOAD_ATTR. Isn't the point of the map that the "slot" is always the same and could be baked into the load assembly instruction? so my fantasy LOAD_ATTR low mov r10,QWORD PTR [rbx+0x8] is really LOAD_ATTR low p33 = ((pypy.objspace.std.mapdict.W_ObjectObjectSize5)p10).inst_map mov r10,QWORD PTR [rbx+0x30] guard(p33 == ConstPtr(ptr34)) movabs r11,0x7fe837ed7be8 cmp r10,r11 jne 0x3aed3ac1 p35 = ((pypy.objspace.std.mapdict.W_ObjectObjectSize5)p10).inst__value0 mov r10,QWORD PTR [rbx+0x8] guard_nonnull_class(p35, ConstClass(W_LongObject), descr=<Guard213>) cmp r10,0x1 jb 0x3aed3ad2 cmp DWORD PTR [r10],0x1308 jne 0x3aed3ad8
> On 32-bit there is extra time needed --- both on PyPy and on CPython --- because the numbers you use overflow signed 32-bit ints, and it needs longs.
Where are numbers larger than 32 bits ever assigned to self.low or self.high? I agree that some expressions have temporaries larger than 32 bits, but they're always reduced back to 32 bits before being stored. If I've missed an offending line and fix it to stay within 32 bits, will PyPy go along with it? This seems quite important for Windows, which only has a 32-bit version.
> an alternate implementation for longs, for example for numbers that fit into two regular-sized integers.

Too complicated for my needs. My code can be happy with only 32 bits; I just want PyPy to be happy too.
Thanks Armin,

-Roger
Hi Roger,

On Fri, Mar 1, 2013 at 10:13 PM, Roger Flores <aidembb@yahoo.com> wrote:
> I see both BINARY_XOR and BINARY_AND call a function instead of xor and and. Why? Is there something I can change in my code to let those instructions be used instead? Can xoring two ints really cause an exception?
If it becomes calls to, I guess, rbigint.xor and rbigint.and, it's because these are objects of the Python type "long". PyPy must respect the Python semantics: if you do, for example, "(1 << 80) >> 80", the result is not 1, but 1L.

This said, I guess that the code is expected to manipulate 32-bit unsigned integers, and so should not overflow to long on 64-bit machines. You may want to investigate, in the normal Python way, why you get "long" objects there --- this is not a particularity of PyPy; you can investigate on CPython if you prefer.

If you're running this code on 32-bit machines, then no chance: you're getting "long"s. The only solution would be along the lines I described before, with a W_TwoIntsLongObject, or even a W_UnsignedIntLongObject that can store all long objects between 0 and 2**32-1.

A bientôt,

Armin.
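Following Armin's suggestion to investigate in the normal Python way, here is a quick Python 2 check (a sketch; its result differs between 32-bit and 64-bit builds, on CPython and PyPy alike):

    import sys
    print(sys.maxint)            # 2147483647 on 32-bit builds, 2**63-1 on 64-bit

    low = 0x40000000
    low = (low << 1) & 0xffffffff
    # On a 32-bit build both 0x80000000 and the mask 0xffffffff exceed
    # sys.maxint, so the result is a Python long; on 64-bit it stays an int.
    print(type(low))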
Thanks Armin. This explains a lot; I understand it better now.

-Roger
On 03/01/2013 10:13 PM, Roger Flores wrote:
> I see both BINARY_XOR and BINARY_AND call a function instead of xor and and. Why? Is there something I can change in my code to let those instructions be used instead? Can xoring two ints really cause an exception?
Are you *sure* you are running on a 64 bit machine? When I run diz.py on a 64 bit machine, the BINARY_XOR bytecodes turn into int_xor low level operations, as expected. On 32 bit I get the call, as you pasted.

Anyway, to debug where low and high turn into Python longs, you can put the following properties in arithmetic32.Encoder:

    def get_low(self):
        return self._low
    def set_low(self, val):
        assert isinstance(val, int)
        self._low = val
    low = property(get_low, set_low)

    def get_high(self):
        return self._high
    def set_high(self, val):
        assert isinstance(val, int)
        self._high = val
    high = property(get_high, set_high)

Indeed, they didn't trigger on 64 bit for me. On 32 bit they obviously trigger because of the line

    self.low = (self.low << 1) & 0xffffffff

(0xffffffff is a Python long on 32 bit).

Cheers,

Carl Friedrich
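A hypothetical usage example of those debugging properties, reduced to a self-contained stand-in class (Probe is not part of diz.py):

    class Probe(object):                     # minimal stand-in for arithmetic32.Encoder
        def get_low(self):
            return self._low
        def set_low(self, val):
            assert isinstance(val, int)      # rejects Python longs
            self._low = val
        low = property(get_low, set_low)

    p = Probe()
    p.low = 0x40000000                       # fine: fits in a machine int
    p.low = (p.low << 1) & 0xffffffff        # raises AssertionError on a 32-bit
                                             # build, where 0xffffffff is a long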
Participants (4):

- Alex Gaynor
- Armin Rigo
- Carl Friedrich Bolz
- Roger Flores