> clojure.examples.factorial=>  (dis.dis *)
>    0           0 LOAD_FAST                0 (__argsv__)
>                3 LOAD_ATTR                0 (__len__)
>                6 CALL_FUNCTION            0

I didn't look in depth at the bytecode produced by your compiler, but this 
sequence is far from optimal.
In PyPy we have custom opcodes for calling methods (LOOKUP_METHOD and 
CALL_METHOD), which are much faster than LOAD_ATTR/CALL_FUNCTION. See e.g. 
how this piece of code gets compiled:

 >>>> def foo(x):
....     return foo.__len__()
 >>>> import dis
 >>>> dis.dis(foo)
   2           0 LOAD_GLOBAL              0 (foo)
               3 LOOKUP_METHOD            1 (__len__)
               6 CALL_METHOD              0
               9 RETURN_VALUE
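
If you want to check which opcodes a given interpreter emits for a method 
call without reading the full dis output, something like the following 
should work. It is only a rough sketch: call_len and opcode_names are names 
I made up for illustration, and it assumes Python 3.4 or later, where 
dis.get_instructions is available. The names you get back depend on the 
interpreter and its version, so I am not showing any expected output.

```python
import dis


def call_len(x):
    # Same shape as the example above: a method lookup followed by a call.
    return x.__len__()


def opcode_names(func):
    """Return the opcode names of func, in order, for the running interpreter."""
    return [instr.opname for instr in dis.get_instructions(func)]


# Hypothetical helper usage: list the opcodes emitted for the method call.
names = opcode_names(call_len)
print(names)
```

Running the same file under both interpreters makes it easy to see whether 
the method call goes through a dedicated pair of opcodes or through a 
generic attribute load followed by a call.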

In general, I suggest using the jitviewer to look at the code the JIT 
generates: it shows you how many low-level operations are emitted for each 
opcode, and you can compare with the same algorithm written in Python to see 
what causes the most slowdown.


