
hi, I'm currently trying to find a way to generate SIMD code for Python loops that operate on numpy arrays. The IRC is quite busy, so I think I'd rather post this as an email...

I have familiarized myself with PyPy, read most of the docs, and read the code in the metainterp to understand how the traces are built. I translated PyPy with --withmod-micronumpy enabled (and installed the numpy PyPy fork in a virtual env) and gave the jitviewer a try.

Here is a sample program that does not really make sense, but I think it would contain opportunities to generate SIMD instructions:

```
import numpy

def invoke():
    a = numpy.zeros(1024)
    b = numpy.zeros(1024)
    c = numpy.zeros(1024)
    # (2)
    for i in range(0, 1024):
        c[i] = a[i] + b[i]
    # (3)
    c = a + b

if __name__ == '__main__':
    i = 0
    # (1)
    while i < 500:
        invoke()
        i += 1
```

Here is the trace of invoke visible in jitviewer (uploaded to https://postimg.org/image/kbntluw55/full/).

Here are some questions that would really help me get going:

(1) Is there a better way to get loops hot?

(2) I cannot really figure out what all those trace/loop parameters are. Obviously I can guess some, but most of them do not really make sense to me. Am I missing some settings that would display more information for the trace? In addition, I do not really see any chance to generate a SIMD loop for (2): every array access is (arguably) guarded by an array index overflow check, and I think skipping that check would be invalid.

(3) There is no trace generated for this instruction? Does this internally call a C function?

(4) What in your opinion makes sense to vectorize with SIMD instructions? Could you provide some sample loops/code (ranging from simple to complex loops)?

I would really appreciate any help and/or a push in the right direction.

Best,
Richard

Sorry, I put https instead of http: http://postimg.org/image/kbntluw55/full/

On 02/24/2015 07:36 PM, Maciej Fijalkowski wrote:

Hi Richard, I will respond inline.

On Tue, Feb 24, 2015 at 8:18 PM, Richard Plangger <rich@pasra.at> wrote:
(1) Is there a better way to get loops hot?

no, not really (you can lower the threshold though, see pypy --jit help for details; it's global only)
(2) I cannot really figure out what all those trace/loop parameters are. [...]

Not really, no, you can't get better info, but I can probably explain. Some of the values can be easily promoted to constants (array sizes for multiplication, for example). What you can do for the array bound checks is to unroll this loop e.g. 4 times and do ONE array bound check instead of 4 (checking x + 4). Actually it's possible to have an optimization that removes all counters and has ONE number to track the iteration (it's not that hard if the arrays are similar; you can precompute whether they match).
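To make the unrolling idea concrete, here is a minimal sketch at the plain-Python level (an illustration only, not what the JIT actually emits; the add_unrolled function and its signature are made up for this example):

```
def add_unrolled(a, b, c, n):
    """Compute c[i] = a[i] + b[i] with one bound check per block of four."""
    i = 0
    # main loop: the single test (i + 4 <= n) covers four element accesses
    while i + 4 <= n:
        c[i]     = a[i]     + b[i]
        c[i + 1] = a[i + 1] + b[i + 1]
        c[i + 2] = a[i + 2] + b[i + 2]
        c[i + 3] = a[i + 3] + b[i + 3]
        i += 4
    # scalar tail for the remaining 0-3 elements
    while i < n:
        c[i] = a[i] + b[i]
        i += 1
```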
(3) There is no trace generated for this instruction? Does this internally call a C function?
No, the trace follows the loop and does not compile the + (but the + has its own loop anyway).
(4) What in your opinion makes sense to vectorize with SIMD instructions?

I can rewrite this loop in a vectorizable way if you want.

On 02/24/2015 11:17 PM, Maciej Fijalkowski wrote:
Files could be virtually large, since UIAM physical allocation does not occur until write, or at least is controllable.

The idea then would be to write programs with a warm-up prelude function, and then have a checkpointing module with a method that could write a specially stubbed ELF file along with all the file data, so that the ELF would be an executable whose _start would get back to the checkpoint module, where everything would be restored as it was checkpointed, and execution would continue as if just returning from the call to the checkpointing method, which would be after the forced warm-up prelude.

Sorry if I am intruding.

Regards,
Bengt Richter

Hello Bengt,

If I'm not mistaken, I think this FAQ item should at least partly answer your question: http://pypy.readthedocs.org/en/latest/faq.html#couldn-t-the-jit-dump-and-rel...

Regards

On Wed, Feb 25, 2015 at 1:12 AM, Bengt Richter <bokr@oz.net> wrote:
-- Vincent Legoll

Hi Vincent,

I was aware of the FAQ item (my similar question long ago may have helped put it in the FAQ ;-)

AIUI the main issue is re-establishing the memory mapping, which I think could be re-established by mmap-ing the saved files, if those files were created through mmap in the first place (along with what lsof might show at checkpoint time). But in order to capture memory, malloc and the gc would have to be reworked to operate on memory by appending zeroes to mmap-ed files serving as memory pools. If the program is running single-threaded at the time the checkpoint method is called, there wouldn't be an issue of restoring blocked multiple threads. I would guess the remaining things would be the state of open files, but IWT their state could be saved and reopens and seeks could be done on resume start-up. A question would be whether the JIT discards warmed-up code so that it would not be seen on resume, but how would it know there wasn't going to be an ordinary return from the checkpointing call?

With multiple mmap-ed files serving as memory pools, maybe they could even be put in a hash-identified directory and gzipped for re-setup by the ELF resumption stub. The idea for an ELF stub would be to write a C program with ELF data space such that you could copy it and append resumption data to produce the data space seen by the resuming program. Something along the idea of the PE/COFF boot stub for the Linux kernel, but just for the individual pypy-warm-resume (at a minimum the gzipped mmap files archive name, if the latter is not actually appended to the stub program).

My hunch is that between the stackless guy Christian and Armin, they could figure it out ;-) Maybe it's worth a re-think, if only to say "no, we really mean no" in the FAQ ;-)

Regards,
Bengt Richter

On 02/25/2015 07:14 AM, Vincent Legoll wrote:

Hi Bengt, On 25 February 2015 at 15:20, Bengt Richter <bokr@oz.net> wrote:
Maybe it's worth a re-think, if only to say "no, we really mean no" in the FAQ ;-)
It's unclear to me if you're suggesting something more than adding a checkpointing system to stop and restart the process. It's a hack that might work in some cases and not in others. For example, re-opening the files and re-seeking them to the same position as before --- it seems interesting but I somehow doubt it is very helpful in general.

Another point of view on your suggestion is "just use os.fork()", as this solves all these hard issues out of the box. Of course you can't save the forked process to disk, but that seems like a small price to pay and it works on today's PyPy.

A bientôt,
Armin.
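As a rough illustration of the "just use os.fork()" suggestion (a sketch only; the work() function and the iteration counts are invented, and this is Unix-specific):

```
import os

def work():
    return sum(i * i for i in range(100))

def warm_up():
    # run the hot code enough times for the JIT to compile it
    for _ in range(10000):
        work()

if __name__ == '__main__':
    warm_up()
    # each forked child inherits the parent's memory, including the
    # already-compiled machine code, so it starts "warm"
    for _ in range(4):
        if os.fork() == 0:
            work()
            os._exit(0)
    for _ in range(4):
        os.wait()
```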

I coincidentally read something related yesterday. It's not completely on topic, but was an interesting bit of info about the Firefox OS JIT and the use of fork() to amortize some costs. You can start reading at: http://www.realworldtech.com/forum/?threadid=146598&curpostid=147184

That's a digression in a (long but interesting) thread about the Mill CPU architecture (currently non-existent).

On Sat, Feb 28, 2015 at 6:23 PM, Armin Rigo <arigo@tunes.org> wrote:
-- Vincent Legoll

Thx for your response. Yes, I would be interested in a rewritten loop.

After browsing the jitviewer longer I found the following:

```
label(p0, f16, p2, p3, p55, i54, i59, p20, i18, p6, i24, p36, i34, i40, f32, i15, i23, i31, i39, i48, i58, i60, descr=TargetToken(41748336))
(numpy_call2: no get_printable_location)
f62 = raw_load(i15, i24, descr=<ArrayF 8>)
guard_not_invalidated(descr=<Guard0x246fe60>)
i63 = i24 + i23
f64 = raw_load(i31, i40, descr=<ArrayF 8>)
i65 = i40 + i39
f66 = f62 + f64
raw_store(i48, i59, f66, descr=<ArrayF 8>)
i67 = i54 + 1
i68 = i59 + i58
i69 = i67 >= i60
guard(i69 is false)    show bridge (run 2799 times, ~0%)
(numpy_call2: no get_printable_location)
jump(p0, f62, p2, p3, p55, i67, i68, p20, i18, p6, i63, p36, i34, i65, f64, i15, i23, i31, i39, i48, i58, i60, descr=TargetToken(41748336))
```

Is this the trace that calculates a = b + c internally? This looks much more like what I had in mind. Here are some questions that I have:

1) What does the prefix p/i mean in the label arguments? I guess it is a box, with i -> integer and p -> pointer, but why does raw_load(i31, ...) operate on an i31 instead of a p31?

2) Why does every raw_load/raw_store have its own index calculation (e.g. i63 = i40 + i39) instead of just using one common index?

3) Are the numpy arrays or python arrays aligned in memory? If not, would it be hard to change their alignment when they are allocated?

Best,
Richard

On 02/24/2015 11:17 PM, Maciej Fijalkowski wrote:

On Wed, Feb 25, 2015 at 11:28 AM, Richard Plangger <rich@pasra.at> wrote:
Is this the trace that calculates a = b + c internally?

yes
1) What does the prefix p/i mean in the label arguments?

p is a gc pointer, i is an integer or raw pointer (we should invent a new name for it, it's biting us in the JIT)
2) Why does every raw_load/raw_store have its own index calculation (e.g. i63 = i40 + i39) instead of just using one common index?
because we did not optimize it correctly ;-) It's about specializing the loop and detecting that you can reuse the same indexing
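A rough sketch of the redundancy being described, in plain Python rather than trace operations (the function names are invented; assume all three arrays have the same element size and stride):

```
# before: each array keeps its own running offset, advanced separately,
# roughly what the trace above does with i63, i65 and i68
def add_separate_offsets(a, b, c, n, stride=1):
    ia = ib = ic = 0
    for _ in range(n):
        c[ic] = a[ia] + b[ib]
        ia += stride
        ib += stride
        ic += stride

# after: once the strides are proven equal, one offset can drive all
# three accesses, which is what the missing optimization would detect
def add_shared_offset(a, b, c, n, stride=1):
    i = 0
    for _ in range(n):
        c[i] = a[i] + b[i]
        i += stride
```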
3) Are the numpy arrays or python arrays aligned in memory? If not, would it be hard to change their alignment when they are allocated?
we control the alignment (I think they're 8-byte aligned now, but they can be made 16-byte aligned). Remember that numpy arrays don't have to be contiguous, so you have to be a bit careful
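For illustration, a small example (assuming stock numpy semantics) of why the contiguity caveat matters for vectorization:

```
import numpy

a = numpy.zeros(1024)              # float64, dense 8-byte stride
view = a[::2]                      # every second element, same buffer

print(a.flags['C_CONTIGUOUS'])     # True  -> adjacent elements in memory
print(view.flags['C_CONTIGUOUS'])  # False -> strided view
print(a.strides, view.strides)     # (8,) vs (16,): the view skips elements,
                                   # so loading adjacent SIMD lanes is not valid
```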

participants (5)
- Armin Rigo
- Bengt Richter
- Maciej Fijalkowski
- Richard Plangger
- Vincent Legoll