From d.s.seljebotn at astro.uio.no Fri Jun 1 15:49:21 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Fri, 01 Jun 2012 15:49:21 +0200 Subject: [Cython] SEP 201 draft: Native callable objects In-Reply-To: References: <4FC77A5C.50009@astro.uio.no> <4FC7C6A4.3060404@astro.uio.no> Message-ID: <4FC8C861.5040509@astro.uio.no> On 05/31/2012 10:13 PM, Robert Bradshaw wrote: > On Thu, May 31, 2012 at 12:29 PM, Dag Sverre Seljebotn > wrote: >> On 05/31/2012 08:50 PM, Robert Bradshaw wrote: >>> >>> On Thu, May 31, 2012 at 7:04 AM, Dag Sverre Seljebotn >>> wrote: >>>> >>>> [Discussion on numfocus at googlegroups.com please] >>>> >>>> I've uploaded a draft-state SEP 201 (previously CEP 1000): >>>> >>>> https://github.com/numfocus/sep/blob/master/sep201.rst >>>> >>>> """ >>>> Many callable objects are simply wrappers around native code. This holds >>>> for >>>> any Cython function, f2py functions, manually written CPython extensions, >>>> Numba, etc. >>>> >>>> Obviously, when native code calls other native code, it would be nice to >>>> skip the significant cost of boxing and unboxing all the arguments. >>>> """ >>>> >>>> >>>> The thread about this on the Cython list is almost endless: >>>> >>>> http://thread.gmane.org/gmane.comp.python.cython.devel/13416/focus=13443 >>>> >>>> There was a long discussion on the key-comparison vs. interned-string >>>> approach. I've written both up in SEP 201 since it was the major point of >>>> contention. There was some benchmarks starting here: >>>> >>>> http://thread.gmane.org/gmane.comp.python.cython.devel/13416/focus=13443 >>>> >>>> And why provide a table and not a get_function_pointer starting here: >>>> >>>> http://thread.gmane.org/gmane.comp.python.cython.devel/13416/focus=13443 >>>> >>>> For those who followed that and don't want to read the entire spec, the >>>> aspect of flags is new. 
How do we avoid duplicating
>>>> entries/checking against two signatures for cases like a GIL-holding
>>>> caller wanting to call a nogil function? My take: For key-comparison
>>>> you can compare under a mask, for interned-string we should have an
>>>> additional flags field.
>>>>
>>>> The situation is a bit awkward: The Cython list consensus (well, me and
>>>> Robert Bradshaw) decided on what is "Approach 1" (key-comparison) in SEP
>>>> 201. I pushed for that.
>>>>
>>>> Still, now that a month has passed, I just think key-comparison is too
>>>> ugly, and that the interning mechanism shouldn't be *that* hard to code
>>>> up, probably 500 lines of C code if one just requires the GIL in a first
>>>> iteration, and that keeping the spec simpler is more important.
>>>>
>>>> So I'm tentatively proposing Approach 2.
>>>
>>> I'm still not convinced that a hybrid approach, where signatures below
>>> some cutoff are compiled down to keys, is not a worthwhile approach.
>>> This gets around variable-length keys (both the complexity and
>>> possible runtime costs for long keys) and allows simple libraries to
>>> produce and consume fast callables without participating in the
>>> interning mechanism.
>>
>> I still think this gives us the "worst of both worlds", all the
>> disadvantages and none of the advantages.
>
> It avoids one of the primary disadvantages of keys, namely the
> variable-length complexity.
>
>> How many simple libraries are there really? Cython on one end, the
>> magnificently complicated NumPy ufuncs on the other? Thinking big, perhaps
>> PyPy and Julia? Cython, PyPy, Julia would all have to deal with long
>> signatures anyway. And NumPy ufuncs are already complicated so even more
>> low-level stuff wouldn't hurt.
>
> I was thinking of, for example, a differential equation solver written
> in C, C++, or Fortran that could take a PyNativeCallableTable*
> directly, primarily avoiding welding this spec to Python.
I'm not sure how real-world that is in the end. But the size of
Cython-generated code would be kept down for most modules, as it wouldn't
need to bundle an interner.

AND, a problem with interning is spreading the signature strings all over
memory (in the event you actually need to look at the contents). With a
smart interner I guess this can be eliminated to some extent, but it's
much better if one doesn't have to worry at all -- and if all short
signatures are keys, you don't.

Playing along:

a) It'd be very nice to avoid explicit decoding. I think one should be
able to cast the key to char[]; this a) avoids having to allocate a
buffer on the stack to pass to a Decode function, b) lets you inspect
the table in a debugger easily.

b) Flags are needed in addition to interning; GIL status and exception
return values do not require exact matches. I think more than 3 bits are
needed for flags => our minimal padded table entry size is actually 24
bytes! (And this is OK; my benchmarks weren't affected by 8-byte vs.
16-byte comparisons, branching is so dominant.)

Now, 16 bits seems about right for flags, so this means we can actually
use 14-char keys for free (12 for signature data, one for \0, one for a
guard byte).

That pushes the number of non-interning signatures high enough to make it
really fit 95% of the use-cases. I feel 6 chars is a little low; remember
that a "pointer to a double complex" is "&Zd" by itself unless we play
with the encoding.

BUT, it then gets rather complicated to have things work on little-endian
vs. big-endian, as the guard byte must be in different positions.
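To make the hybrid idea concrete, here is a toy Python model (not the SEP 201 spec; all names and the interner stub are invented for illustration): signatures that fit in the fixed-width key field are stored inline and compared by value, while longer ones fall back to an interned object and compare by identity. KEY_BYTES mirrors the 14-char field above (12 bytes of signature data, one '\0', one guard byte).

```python
# Toy model of the hybrid inline-key/interned-key scheme.

KEY_BYTES = 12  # usable signature bytes in the 14-char field (\0 + guard excluded)

_intern_table = {}  # stand-in for a real cross-module interner


def make_key(signature: bytes):
    """Short signatures become inline keys; long ones get interned."""
    if len(signature) <= KEY_BYTES:
        return ("inline", signature)           # compare by value (memcmp-style)
    return ("interned", _intern_table.setdefault(signature, signature))


def keys_match(a, b):
    """Inline keys compare by value; interned keys by identity (pointer)."""
    kind_a, val_a = a
    kind_b, val_b = b
    if kind_a != kind_b:
        return False
    return val_a == val_b if kind_a == "inline" else val_a is val_b


# "&Zd" (pointer to double complex) easily fits inline:
short_key = make_key(b"&Zd")
long_key = make_key(b"d" * 40)
```

The point of the split is that callers with short signatures never touch the interner at all, while long signatures still get constant-time comparison via pointer identity.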
If you want to align the pointer you get this:

typedef struct {
    void *funcptr;
    union {
        union {
            struct {
                uint16_t interned_flags;
                uint16_t padding1;
                uint32_t padding2;
                uint64_t interned_sig;
            };
            struct {
                uint16_t flags;
                char sig[14];
            };
        } big_endian;
        union {
            struct {
                uint64_t interned_sig;
                uint32_t padding1;
                uint16_t padding2;
                uint16_t interned_flags;
            };
            struct {
                char sig[14];
                uint16_t flags;
            };
        } little_endian;
    };
};

(interned_flags and flags are really the same; I just didn't want to mess
with the struct alignment.)

So I think this is *almost* there, but it certainly gets complicated
because of the endianness issues.

Of course, an alternative is to not have interned_sig be 64-bit aligned.
Or, play with adapting the string/guard bytes in the middle, but that
sort of breaks a) above.

Thinking about this is psychologically difficult because it's very likely
bikeshedding, but OTOH once the spec is out in the wild it will never be
worth it to change, so some care is called for... oh well, at least I'm
having fun!
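To see the endianness concern concretely, here is a small illustrative sketch using Python's `struct` module (the field widths come from the struct above; the flag value is made up). The same 16 bytes must read back either as an inline 14-char signature plus 16-bit flags, or as a 64-bit-aligned interned key plus flags, which forces mirrored field orders on little- vs. big-endian machines:

```python
import struct

sig = b"&Zd".ljust(14, b"\0")  # inline signature, '\0'-padded to 14 bytes
flags = 0x0003                  # made-up flag bits, for illustration only

# Little-endian layout: sig in the low bytes, flags in the top two bytes.
little = struct.pack("<14sH", sig, flags)
# Big-endian layout: flags first, sig in the remaining bytes.
big = struct.pack(">H14s", flags, sig)

# Both layouts are 16 bytes and recover the same logical fields,
# but the byte positions of the guard/flag region are mirrored.
l_sig, l_flags = struct.unpack("<14sH", little)
b_flags, b_sig = struct.unpack(">H14s", big)
```

This is only a demonstration of why the guard byte lands in different positions; the actual entry layout would be fixed by the spec, not negotiated at runtime.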
Dag From d.s.seljebotn at astro.uio.no Fri Jun 1 16:25:18 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Fri, 01 Jun 2012 16:25:18 +0200 Subject: [Cython] SEP 201 draft: Native callable objects In-Reply-To: <4FC8C861.5040509@astro.uio.no> References: <4FC77A5C.50009@astro.uio.no> <4FC7C6A4.3060404@astro.uio.no> <4FC8C861.5040509@astro.uio.no> Message-ID: <4FC8D0CE.60903@astro.uio.no> On 06/01/2012 03:49 PM, Dag Sverre Seljebotn wrote: > On 05/31/2012 10:13 PM, Robert Bradshaw wrote: >> On Thu, May 31, 2012 at 12:29 PM, Dag Sverre Seljebotn >> wrote: >>> On 05/31/2012 08:50 PM, Robert Bradshaw wrote: >>>> >>>> On Thu, May 31, 2012 at 7:04 AM, Dag Sverre Seljebotn >>>> wrote: >>>>> >>>>> [Discussion on numfocus at googlegroups.com please] >>>>> >>>>> I've uploaded a draft-state SEP 201 (previously CEP 1000): >>>>> >>>>> https://github.com/numfocus/sep/blob/master/sep201.rst >>>>> >>>>> """ >>>>> Many callable objects are simply wrappers around native code. This >>>>> holds >>>>> for >>>>> any Cython function, f2py functions, manually written CPython >>>>> extensions, >>>>> Numba, etc. >>>>> >>>>> Obviously, when native code calls other native code, it would be >>>>> nice to >>>>> skip the significant cost of boxing and unboxing all the arguments. >>>>> """ >>>>> >>>>> >>>>> The thread about this on the Cython list is almost endless: >>>>> >>>>> http://thread.gmane.org/gmane.comp.python.cython.devel/13416/focus=13443 >>>>> >>>>> >>>>> There was a long discussion on the key-comparison vs. interned-string >>>>> approach. I've written both up in SEP 201 since it was the major >>>>> point of >>>>> contention. 
There was some benchmarks starting here: >>>>> >>>>> http://thread.gmane.org/gmane.comp.python.cython.devel/13416/focus=13443 >>>>> >>>>> >>>>> And why provide a table and not a get_function_pointer starting here: >>>>> >>>>> http://thread.gmane.org/gmane.comp.python.cython.devel/13416/focus=13443 >>>>> >>>>> >>>>> For those who followed that and don't want to read the entire spec, >>>>> the >>>>> aspect of flags is new. How do we avoid to duplicate entries/check >>>>> against >>>>> two signatures for cases like a GIL-holding caller wanting to call a >>>>> nogil >>>>> function? My take: For key-comparison you can compare under a mask, >>>>> for >>>>> interned-string we should have additional flags field. >>>>> >>>>> The situation is a bit awkward: The Cython list consensus (well, me >>>>> and >>>>> Robert Bradshaw) decided on what is "Approach 1" (key-comparison) >>>>> in SEP >>>>> 201. I pushed for that. >>>>> >>>>> Still, now that a month has passed, I just think key-comparison is too >>>>> ugly, >>>>> and that the interning mechanism shouldn't be *that* hard to code up, >>>>> probably 500 lines of C code if one just requires the GIL in a first >>>>> iteration, and that keeping the spec simpler is more important. >>>>> >>>>> So I'm tentatively proposing Approach 2. >>>> >>>> >>>> I'm still not convinced that a hybrid approach, where signatures below >>>> some cutoff are compiled down to keys, is not a worthwhile approach. >>>> This gets around variable-length keys (both the complexity and >>>> possible runtime costs for long keys) and allows simple libraries to >>>> produce and consume fast callables without participating in the >>>> interning mechanism. >>> >>> I still think this gives us the "worst of both worlds", all the >>> disadvantages and none of the advantages. >> >> It avoids the one of the primary disadvantage of keys, namely the >> variable length complexity. >> >>> How many simple libraries are there really? 
Cython on one end, the >>> magnificently complicated NumPy ufuncs on the other? Thinking big, >>> perhaps >>> PyPy and Julia? Cython, PyPy, Julia would all have to deal with long >>> signatures anyway. And NumPy ufuncs are already complicated so even more >>> low-level stuff wouldn't hurt. >> >> I was thinking of, for example, a differential equation solver written >> in C, C++, or Fortran that could take a PyNativeCallableTable* >> directly, primarily avoiding welding this spec to Python. > > I'm not sure how real-world that is in the end. But, the size of Cython > generated code would be kept down for most modules as it wouldn't need > to bundle an interner. > > AND, a problem with interning is spreading the signature strings all > over memory (in the event you actually need to look at the contents). > With a smart interner I guess this can be eliminated to some extent, but > much better if one doesn't have to worry, and if all short signatures > are keys you don't. > > Playing along: > > a) It'd be very nice to avoid explicit decoding. I think one should be > able to cast the key to char[]; this a) avoids having to allocate a > buffer on the stack to pass to a Decode function, b) let's you inspect > the table in a debugger easily. > > b) Flags are needed in addition to interning; GIL status and exception > return values do not require exact matches. I think more than 3 bits are > needed for flags => our minimal padded table entry size is actually 24 > bytes! (And this is OK, my benchmarks weren't affected by 8-byte vs > 16-byte comparisons, branching is so dominating.) > > Now, 16 bits seems about right for flags, so this means we can actually > for free use 14-char keys (12 for signature data, one for \0, one for > guard) > > That pushes the number of non-interning signatures high enough to make > it really fit 95% of the use-cases. I feel 6 chars is a little low, > remember that a "pointer to a double complex" is "&Zd" by itself unless > we play with encoding. 
>
> BUT, it then gets rather complicated to have things work on
> little-endian vs. big-endian though as the guard byte must be in
> different positions. If you want to align the pointer you get this:
>
> typedef struct {
>     void *funcptr;
>     union {
>         union {
>             struct {
>                 uint16_t interned_flags;
>                 uint16_t padding1;
>                 uint32_t padding2;
>                 uint64_t interned_sig;
>             };
>             struct {
>                 uint16_t flags;
>                 char sig[14];
>             };
>         } big_endian;
>
>         union {
>             struct {
>                 uint64_t interned_sig;
>                 uint32_t padding1;
>                 uint16_t padding2;
>                 uint16_t interned_flags;
>             };
>             struct {
>                 char sig[14];
>                 uint16_t flags;
>             };
>         } little_endian;
>     };
> };
>
> (interned_flags and flags is really the same, I just didn't want to mess
> with the struct alignment)
>
> So I think this is *almost* there, but it certainly gets complicated
> because of endianness issues.
>
> Of course, an alternative is to not have the interned_sig be 64-bit
> aligned. Or, play with adapting the string/guard bytes in the middle,
> but that sort of breaks a) above.

OK, now I feel silly. If we need flags (as I believe we do), and the
flag-containing quadword is being compared anyway, there's no reason at
all to play tricks with aligned pointers and guard bytes.

The simplest approach with a 128-bit compare (which, as I said, doesn't
hurt one bit, and may be needed anyway to filter on GIL-ness) is then

struct {
    union {
        char *interned_sig;
        char signature[8];
    };
    uint64_t flags;  /* first 8 bits always 0, for terminating \0 */
    void *funcptr;
};

One could also complicate this again to eat a few more flag bits for
signature chars...

Dag

From d.s.seljebotn at astro.uio.no Mon Jun 4 21:44:11 2012
From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn)
Date: Mon, 04 Jun 2012 21:44:11 +0200
Subject: [Cython] Hash-based vtables
Message-ID: <4FCD100B.7000008@astro.uio.no>

Me and Robert had a long discussion on the NumFOCUS list about this
already, but I figured it was better to continue it and provide more
in-depth benchmark results here.
It's basically a new idea of how to provide a vtable based on perfect
hashing, which should be a lot simpler to implement than what I first
imagined.

I'll write down some context first; if you're familiar with this,
skip ahead a bit...

This means that you can do fast dispatches *without* the messy business
of binding vtable slots at compile time. To be concrete, this might e.g.
take the form

def f(obj):
    obj.method(3.4)  # try to find a vtable with "void method(double)" in it

or, a more typed approach,

# File A
cdef class MyImpl:
    cdef double method(double x): return x * x

# File B
# Here we never know about MyImpl, hence "duck-typed"
@cython.interface
class MyIntf:
    cdef double method(double x): pass

def f(MyIntf obj):
    # obj *can* be a MyImpl instance, or whatever else that supports
    # that interface
    obj.method(3.4)

Now, the idea to implement this is:

a) Both caller and callee pre-hash the name/argument string
"mymethod:iidd" to 64 bits of hash data (probably the lower 64 bits of
md5).

b) Callee (MyImpl) generates a vtable of its methods by *perfect*
hashing. What you do is define a final hash fh as a function of the
pre-hash ph, for instance

fh = ((ph >> vtable.r1) ^ (ph >> vtable.r2) ^ (ph >> vtable.r3)) & vtable.m

(Me and Robert are benchmarking different functions to use here.) By
playing with r1, r2, r3, you have 64**3 choices of hash function, and
will be able to pick a combination which gives *no* (or very few)
collisions.

c) Caller then combines the pre-hash generated at compile time with
r1, r2, r3, m stored in the vtable header, in order to find the final
location in the hash table.

The exciting thing is that in the benchmarks, the performance penalty
over a C++-style vtable is actually very slight. (Of course you can
cache a proper vtable, but the fact that you get so close without caring
about caching means that this can be done much faster.)
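Steps a)-c) above can be sketched as a short runnable Python program (the signature strings, table size, and search order are illustrative, not what Cython would actually ship):

```python
import hashlib
from itertools import product


def prehash(sig: str) -> int:
    """Step a): lower 64 bits of md5 of the name/argument string."""
    return int.from_bytes(hashlib.md5(sig.encode()).digest()[:8], "little")


def find_perfect_hash(prehashes, table_size):
    """Step b): search shift parameters (r1, r2, r3) so that
    fh = ((ph >> r1) ^ (ph >> r2) ^ (ph >> r3)) & m
    maps every pre-hash to a distinct slot. table_size must be a
    power of two; m = table_size - 1 is the mask."""
    m = table_size - 1
    for r1, r2, r3 in product(range(64), repeat=3):
        slots = {((ph >> r1) ^ (ph >> r2) ^ (ph >> r3)) & m for ph in prehashes}
        if len(slots) == len(prehashes):
            return r1, r2, r3
    return None


def lookup_slot(ph, r1, r2, r3, m):
    """Step c): the caller recomputes the same fh from the compile-time
    pre-hash and the parameters stored in the vtable header."""
    return ((ph >> r1) ^ (ph >> r2) ^ (ph >> r3)) & m


sigs = ["method:dd", "other:ii", "area:d", "norm:ddd"]
phs = [prehash(s) for s in sigs]
params = find_perfect_hash(phs, 8)
```

Note the search is done once, at compile time (or module load), by the callee; the caller's hot path is just two shifts, two xors, and a mask.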
Back to my and Robert's discussion on benchmarks: I've uploaded
benchmarks here:

https://github.com/dagss/hashvtable/tree/master/dispatchbench

I've changed how the benchmark timings are taken to give more robust
numbers (at least for me); you want to look at the 'min' column.

I changed the benchmark a bit so that it benchmarks a *callsite*. So we
don't pass 'h' on the stack, but either a) look it up in a global
variable (the default), or b) make it a compile-time constant (an
immediate in assembly; compile with -DIMHASH).

Similarly, the ID is either an "interned" global variable, or an
immediate (-DIMID).

The results are very different on my machine depending on this aspect.
My conclusions:

- Three shifts with masking, two shifts with a "fallback slot" (allowing
  for a single collision), and two shifts with two masks all allow
  constructing good vtables. In the case of only two shifts, one
  colliding method gets the twoshift+fback performance and the rest get
  the twoshift performance.

- Performance is really more affected by whether hashes are immediates
  or global variables than by the hash function. This is in contrast to
  the interning vs. key benchmarks -- so I think that if we looked up the
  vtable through PyTypeObject, rather than getting the vtable directly,
  the loads of the global variables could potentially be masked by that.

- My conclusion: Just use the lower bits of md5 *both* for the hashing
  and the ID-ing (don't bother with any interning), and compile the
  thing as a 64-bit immediate. This can cause crashes/stack smashes etc.
if there's a lower-64-bits-of-md5 collision, but a) the probability is
incredibly small, b) it would only matter in situations that should cause
an AttributeError anyway, c) if we really care, we can always use an
interning-like mechanism to validate on module loading that its hashes
don't collide with other hashes (and raise an exception:
"Congratulations, you've discovered a phenomenal md5 collision, get in
touch with the Cython devs and we'll work around it right away").

The RTTI (i.e. the char*) is also put in there, but is not used for
comparison and is not interned.

At least, that's what I think we should do for duck-style vtables.

Do we then go to all the pain of defining key-encoding, interning etc.
just for SEP 201? Isn't it easier to just mandate an md5 dependency and
be done with it? (After all, md5 usually comes with Python in the md5
and hashlib modules.)

direct: Early binding
index: Call slot 0 (C++-style vtable/function pointer)
noshift: h & m1
oneshift: (h >> r1) & m1
twoshift: ((h >> r1) ^ (h >> r2)) & m1
twoshift+fback: hash doesn't
threeshift: ((h >> r1) ^ (h >> r2) ^ (h >> r3)) & m1
doublemask: ((h >> r1) & m1) ^ ((h >> r2) & m2)
doublemask2: ((h >> r1) & m1) ^ ((h & m2) >> r2)

Default distutils build (-O2):
------------------------------

Hash globalvar, ids globalvar

direct:         min=5.38e-09 mean=5.45e-09 std=3.79e-11 val=1600000000.000000
index:          min=5.38e-09 mean=5.44e-09 std=3.09e-11 val=1200000000.000000
noshift:        min=5.99e-09 mean=6.14e-09 std=6.63e-11 val=1200000000.000000
oneshift:       min=6.47e-09 mean=6.53e-09 std=3.21e-11 val=1200000000.000000
twoshift:       min=7.00e-09 mean=7.08e-09 std=3.73e-11 val=1200000000.000000
twoshift+fback: min=7.54e-09 mean=7.61e-09 std=4.46e-11 val=1200000000.000000
threeshift:     min=7.54e-09 mean=7.64e-09 std=3.79e-11 val=1200000000.000000
doublemask:     min=7.56e-09 mean=7.64e-09 std=3.02e-11 val=1200000000.000000
doublemask2:    min=7.55e-09 mean=7.62e-09 std=3.24e-11 val=1200000000.000000

hash immediate, ids globalvar

direct:         min=5.38e-09 mean=5.45e-09 std=3.87e-11 val=1600000000.000000
index:          min=5.40e-09 mean=5.45e-09 std=2.92e-11 val=1200000000.000000
noshift:        min=5.38e-09 mean=5.44e-09 std=3.48e-11 val=1200000000.000000
oneshift:       min=5.90e-09 mean=5.99e-09 std=4.05e-11 val=1200000000.000000
twoshift:       min=6.09e-09 mean=6.17e-09 std=3.52e-11 val=1200000000.000000
twoshift+fback: min=7.00e-09 mean=7.08e-09 std=3.64e-11 val=1200000000.000000
threeshift:     min=6.47e-09 mean=6.55e-09 std=6.04e-11 val=1200000000.000000
doublemask:     min=6.46e-09 mean=6.50e-09 std=3.37e-11 val=1200000000.000000
doublemask2:    min=6.46e-09 mean=6.51e-09 std=3.04e-11 val=1200000000.000000

all immediate:

direct:         min=5.39e-09 mean=5.50e-09 std=5.22e-11 val=1600000000.000000
index:          min=5.38e-09 mean=5.51e-09 std=6.25e-11 val=1200000000.000000
noshift:        min=5.38e-09 mean=5.51e-09 std=6.90e-11 val=1200000000.000000
oneshift:       min=5.40e-09 mean=5.51e-09 std=5.35e-11 val=1200000000.000000
twoshift:       min=5.94e-09 mean=6.06e-09 std=5.91e-11 val=1200000000.000000
twoshift+fback: min=7.06e-09 mean=7.19e-09 std=5.39e-11 val=1200000000.000000
threeshift:     min=5.96e-09 mean=6.07e-09 std=5.54e-11 val=1200000000.000000
doublemask:     min=5.88e-09 mean=6.01e-09 std=6.06e-11 val=1200000000.000000
doublemask2:    min=5.94e-09 mean=6.05e-09 std=6.16e-11 val=1200000000.000000

-O3 build
---------

all globalvars:

direct:         min=1.61e-09 mean=1.63e-09 std=1.40e-11 val=1600000000.000000
index:          min=5.38e-09 mean=5.43e-09 std=2.82e-11 val=1200000000.000000
noshift:        min=6.04e-09 mean=6.13e-09 std=4.76e-11 val=1200000000.000000
oneshift:       min=6.46e-09 mean=6.54e-09 std=3.82e-11 val=1200000000.000000
twoshift:       min=7.01e-09 mean=7.06e-09 std=3.41e-11 val=1200000000.000000
twoshift+fback: min=7.57e-09 mean=7.64e-09 std=3.47e-11 val=1200000000.000000
threeshift:     min=7.54e-09 mean=7.63e-09 std=4.17e-11 val=1200000000.000000
doublemask:     min=7.54e-09 mean=7.61e-09 std=3.64e-11 val=1200000000.000000
doublemask2:    min=7.55e-09 mean=7.63e-09 std=3.35e-11 val=1200000000.000000

hash immediate, ids globalvar:

direct:         min=1.61e-09 mean=1.66e-09 std=3.30e-11 val=1600000000.000000
index:          min=5.40e-09 mean=5.50e-09 std=4.94e-11 val=1200000000.000000
noshift:        min=5.38e-09 mean=5.49e-09 std=6.02e-11 val=1200000000.000000
oneshift:       min=5.95e-09 mean=6.06e-09 std=6.64e-11 val=1200000000.000000
twoshift:       min=5.96e-09 mean=6.13e-09 std=7.22e-11 val=1200000000.000000
twoshift+fback: min=7.02e-09 mean=7.18e-09 std=7.04e-11 val=1200000000.000000
threeshift:     min=6.52e-09 mean=6.65e-09 std=6.43e-11 val=1200000000.000000
doublemask:     min=6.50e-09 mean=6.62e-09 std=5.28e-11 val=1200000000.000000
doublemask2:    min=6.52e-09 mean=6.63e-09 std=5.23e-11 val=1200000000.000000

all immediate:

direct:         min=1.61e-09 mean=1.62e-09 std=9.77e-12 val=1600000000.000000
index:          min=5.38e-09 mean=5.39e-09 std=1.71e-11 val=1200000000.000000
noshift:        min=5.38e-09 mean=5.40e-09 std=2.41e-11 val=1200000000.000000
oneshift:       min=5.38e-09 mean=5.40e-09 std=1.81e-11 val=1200000000.000000
twoshift:       min=5.92e-09 mean=5.93e-09 std=1.43e-11 val=1200000000.000000
twoshift+fback: min=7.00e-09 mean=7.01e-09 std=2.20e-11 val=1200000000.000000
threeshift:     min=5.92e-09 mean=5.94e-09 std=1.99e-11 val=1200000000.000000
doublemask:     min=5.79e-09 mean=5.82e-09 std=2.32e-11 val=1200000000.000000
doublemask2:    min=5.92e-09 mean=5.94e-09 std=2.25e-11 val=1200000000.000000

From d.s.seljebotn at astro.uio.no Mon Jun 4 22:55:56 2012
From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn)
Date: Mon, 04 Jun 2012 22:55:56 +0200
Subject: [Cython] Hash-based vtables
In-Reply-To: <4FCD100B.7000008@astro.uio.no>
References: <4FCD100B.7000008@astro.uio.no>
Message-ID: <4FCD20DC.6090906@astro.uio.no>

On 06/04/2012 09:44 PM, Dag Sverre Seljebotn wrote:
> Me and Robert had a long discussion on the NumFOCUS list about this
> already, but I figured it was better to continue it and provide more
> in-depth benchmark results here.
> > It's basically a new idea of how to provide a vtable based on perfect > hashing, which should be a lot simpler to implement than what I first > imagined. > > I'll write down some context first, if you're familiar with this > skip ahead a bit.. > > This means that you can do fast dispatches *without* the messy > business of binding vtable slots at compile time. To be concrete, this > might e.g. take the form > > def f(obj): > obj.method(3.4) # try to find a vtable with "void method(double)" in it > > or, a more typed approach, > > # File A > cdef class MyImpl: > def double method(double x): return x * x > > # File B > # Here we never know about MyImpl, hence "duck-typed" > @cython.interface > class MyIntf: > def double method(double x): pass > > def f(MyIntf obj): > # obj *can* be MyImpl instance, or whatever else that supports > # that interface > obj.method(3.4) > > > Now, the idea to implement this is: > > a) Both caller and callee pre-hash name/argument string > "mymethod:iidd" to 64 bits of hash data (probably lower 64 bits of > md5) > > b) Callee (MyImpl) generates a vtable of its methods by *perfect* > hashing. What you do is define a final hash fh as a function > of the pre-hash ph, for instance > > fh = ((ph >> vtable.r1) ^ (ph >> vtable.r2) ^ (ph >> vtable.r3)) & vtable.m > > (Me and Robert are benchmarking different functions to use here.) By > playing with r1, r2, r3, you have 64**3 choices of hash function, and > will be able to pick a combination which gives *no* (or very few) > collisions. > > c) Caller then combines the pre-hash generated at compile-time, with > r1, r2, r3, m stored in the vtable header, in order to find the > final location in the hash-table. > > The exciting thing is that in benchmark, the performance penalty is > actually very slight over a C++-style v-table. (Of course you can > cache a proper vtable, but the fact that you get so close without > caring about caching means that this can be done much faster.) 
> > Back to my and Robert's discussion on benchmarks: > > I've uploaded benchmarks here: > > https://github.com/dagss/hashvtable/tree/master/dispatchbench > > I've changed the benchmark taking to give more robust numbers (at > least for me), you want to look at the 'min' column. > > I changed the benchmark a bit so that it benchmarks a *callsite*. > So we don't pass 'h' on the stack, but either a) looks it up in a global > variable (default), or b) it's a compile-time constant (immediate in > assembly) (compile with -DIMHASH). > > Similarly, the ID is either an "interned" global variable, or an > immediate (-DIMID). > > The results are very different on my machine depending on this aspect. > My conclusions: > > - Both three shifts with masking, two shifts with a "fallback slot" > (allowing for a single collision), three shifts, two shifts with > two masks allows for constructing good vtables. In the case of only > two shifts, one colliding method gets the twoshift+fback > performance and the rest gets the twoshift performance. > > - Performance is really more affected by whether hashes are > immediates or global variables than the hash function. This is in > contrast to the interning vs. key benchmarks -- so I think that if > we looked up the vtable through PyTypeObject, rather than getting > the vtable directly, the loads of the global variables could > potentially be masked by that. > > - My conclusion: Just use lower bits of md5 *both* for the hashing > and the ID-ing (don't bother with any interning), and compile the > thing as a 64-bit immediate. This can cause crashes/stack smashes > etc. 
if there's a lower-64-bits-of-md5 collision, but a) the
> probability is incredibly small, b) it would only matter in
> situations that should cause an AttributeError anyway, c) if we
> really care, we can always use an interning-like mechanism to
> validate on module loading that its hashes don't collide with
> other hashes (and raise an exception: "Congratulations, you've
> discovered a phenomenal md5 collision, get in touch with the Cython
> devs and we'll work around it right away").

What I forgot to mention:

- I really want to avoid linear probing, just because of the code bloat
  in call sites. With two shifts, when there was a failure to find a
  perfect hash it was always possible to find one with a single
  collision.

- Probing for the hash with two shifts is lightning fast; it can take a
  while with three shifts (though you can always spend more memory on a
  bigger table to make it fast again). However, it makes me uneasy to
  penalize the performance of calling one of the random methods, so I'm
  really in favour of three-shifts or double-mask (to be decided when
  investigating the performance of probing for parameters in more
  detail).

- I tried using SSE to do the shifts in parallel and failed (miserable
  performance). The problem is quickly moving things between general
  purpose registers and SSE registers, and the lack of SSE
  immediates/constants in the instruction stream. At least, what my gcc
  4.6 generates appeared to use the stack to communicate between SSE
  registers and general purpose registers (but I can't have been doing
  the right thing...).

>
> The RTTI (i.e. the char*) is also put in there, but is not used for
> comparison and is not interned.
>
> At least, that's what I think we should do for duck-style vtables.
>
> Do we then go to all the pain of defining key-encoding, interning
> etc. just for SEP 201? Isn't it easier to just mandate an md5 dependency
> and be done with it?
(After all, md5 usually comes with Python in the > md5 and hashlib modules) > > direct: Early-binding > index: Call slot 0 (C++-style vtable/function pointer) > noshift: h & m1 > oneshift: (h >> r1) & m1 > twoshift: ((h >> r1) ^ (h >> r2)) & m1 > twoshift+fback: hash doesn't I meant: Hash collision and then, after a branch miss, look up the one fallback slot in the vtable header. Dag > threeshift: ((h >> r1) ^ (h >> r2) ^ (h >> r3)) & m1 > doublemask: ((h >> r1) & m1) ^ ((h >> r2) & m2) > doublemask2: ((h >> r1) & m1) ^ ((h & m2) >> r2) > > Default distutils build (-O2): > ------------------------------ > > Hash globalvar, ids globalvar > > direct: min=5.38e-09 mean=5.45e-09 std=3.79e-11 val=1600000000.000000 > index: min=5.38e-09 mean=5.44e-09 std=3.09e-11 val=1200000000.000000 > noshift: min=5.99e-09 mean=6.14e-09 std=6.63e-11 val=1200000000.000000 > oneshift: min=6.47e-09 mean=6.53e-09 std=3.21e-11 val=1200000000.000000 > twoshift: min=7.00e-09 mean=7.08e-09 std=3.73e-11 val=1200000000.000000 > twoshift+fback: min=7.54e-09 mean=7.61e-09 std=4.46e-11 > val=1200000000.000000 > threeshift: min=7.54e-09 mean=7.64e-09 std=3.79e-11 val=1200000000.000000 > doublemask: min=7.56e-09 mean=7.64e-09 std=3.02e-11 val=1200000000.000000 > doublemask2: min=7.55e-09 mean=7.62e-09 std=3.24e-11 val=1200000000.000000 > > hash immediate, ids globalvar > > direct: min=5.38e-09 mean=5.45e-09 std=3.87e-11 val=1600000000.000000 > index: min=5.40e-09 mean=5.45e-09 std=2.92e-11 val=1200000000.000000 > noshift: min=5.38e-09 mean=5.44e-09 std=3.48e-11 val=1200000000.000000 > oneshift: min=5.90e-09 mean=5.99e-09 std=4.05e-11 val=1200000000.000000 > twoshift: min=6.09e-09 mean=6.17e-09 std=3.52e-11 val=1200000000.000000 > twoshift+fback: min=7.00e-09 mean=7.08e-09 std=3.64e-11 > val=1200000000.000000 > threeshift: min=6.47e-09 mean=6.55e-09 std=6.04e-11 val=1200000000.000000 > doublemask: min=6.46e-09 mean=6.50e-09 std=3.37e-11 val=1200000000.000000 > doublemask2: min=6.46e-09 mean=6.51e-09 
std=3.04e-11 val=1200000000.000000 > > all immediate: > > direct: min=5.39e-09 mean=5.50e-09 std=5.22e-11 val=1600000000.000000 > index: min=5.38e-09 mean=5.51e-09 std=6.25e-11 val=1200000000.000000 > noshift: min=5.38e-09 mean=5.51e-09 std=6.90e-11 val=1200000000.000000 > oneshift: min=5.40e-09 mean=5.51e-09 std=5.35e-11 val=1200000000.000000 > twoshift: min=5.94e-09 mean=6.06e-09 std=5.91e-11 val=1200000000.000000 > twoshift+fback: min=7.06e-09 mean=7.19e-09 std=5.39e-11 > val=1200000000.000000 > threeshift: min=5.96e-09 mean=6.07e-09 std=5.54e-11 val=1200000000.000000 > doublemask: min=5.88e-09 mean=6.01e-09 std=6.06e-11 val=1200000000.000000 > doublemask2: min=5.94e-09 mean=6.05e-09 std=6.16e-11 val=1200000000.000000 > > -O3 build > --------- > > all globalvars: > > direct: min=1.61e-09 mean=1.63e-09 std=1.40e-11 val=1600000000.000000 > index: min=5.38e-09 mean=5.43e-09 std=2.82e-11 val=1200000000.000000 > noshift: min=6.04e-09 mean=6.13e-09 std=4.76e-11 val=1200000000.000000 > oneshift: min=6.46e-09 mean=6.54e-09 std=3.82e-11 val=1200000000.000000 > twoshift: min=7.01e-09 mean=7.06e-09 std=3.41e-11 val=1200000000.000000 > twoshift+fback: min=7.57e-09 mean=7.64e-09 std=3.47e-11 > val=1200000000.000000 > threeshift: min=7.54e-09 mean=7.63e-09 std=4.17e-11 val=1200000000.000000 > doublemask: min=7.54e-09 mean=7.61e-09 std=3.64e-11 val=1200000000.000000 > doublemask2: min=7.55e-09 mean=7.63e-09 std=3.35e-11 val=1200000000.000000 > > hash immediate, ids globalvar: > > direct: min=1.61e-09 mean=1.66e-09 std=3.30e-11 val=1600000000.000000 > index: min=5.40e-09 mean=5.50e-09 std=4.94e-11 val=1200000000.000000 > noshift: min=5.38e-09 mean=5.49e-09 std=6.02e-11 val=1200000000.000000 > oneshift: min=5.95e-09 mean=6.06e-09 std=6.64e-11 val=1200000000.000000 > twoshift: min=5.96e-09 mean=6.13e-09 std=7.22e-11 val=1200000000.000000 > twoshift+fback: min=7.02e-09 mean=7.18e-09 std=7.04e-11 > val=1200000000.000000 > threeshift: min=6.52e-09 mean=6.65e-09 std=6.43e-11 
val=1200000000.000000 > doublemask: min=6.50e-09 mean=6.62e-09 std=5.28e-11 val=1200000000.000000 > doublemask2: min=6.52e-09 mean=6.63e-09 std=5.23e-11 val=1200000000.000000 > > all immediate: > > direct: min=1.61e-09 mean=1.62e-09 std=9.77e-12 val=1600000000.000000 > index: min=5.38e-09 mean=5.39e-09 std=1.71e-11 val=1200000000.000000 > noshift: min=5.38e-09 mean=5.40e-09 std=2.41e-11 val=1200000000.000000 > oneshift: min=5.38e-09 mean=5.40e-09 std=1.81e-11 val=1200000000.000000 > twoshift: min=5.92e-09 mean=5.93e-09 std=1.43e-11 val=1200000000.000000 > twoshift+fback: min=7.00e-09 mean=7.01e-09 std=2.20e-11 > val=1200000000.000000 > threeshift: min=5.92e-09 mean=5.94e-09 std=1.99e-11 val=1200000000.000000 > doublemask: min=5.79e-09 mean=5.82e-09 std=2.32e-11 val=1200000000.000000 > doublemask2: min=5.92e-09 mean=5.94e-09 std=2.25e-11 val=1200000000.000000 > > _______________________________________________ > cython-devel mailing list > cython-devel at python.org > http://mail.python.org/mailman/listinfo/cython-devel From robertwb at gmail.com Mon Jun 4 23:43:07 2012 From: robertwb at gmail.com (Robert Bradshaw) Date: Mon, 4 Jun 2012 14:43:07 -0700 Subject: [Cython] Hash-based vtables In-Reply-To: <4FCD20DC.6090906@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> Message-ID: On Mon, Jun 4, 2012 at 1:55 PM, Dag Sverre Seljebotn wrote: > On 06/04/2012 09:44 PM, Dag Sverre Seljebotn wrote: >> >> Me and Robert had a long discussion on the NumFOCUS list about this >> already, but I figured it was better to continue it and provide more >> in-depth benchmark results here. >> >> It's basically a new idea of how to provide a vtable based on perfect >> hashing, which should be a lot simpler to implement than what I first >> imagined. >> >> I'll write down some context first, if you're familiar with this >> skip ahead a bit.. 
>> >> This means that you can do fast dispatches *without* the messy >> business of binding vtable slots at compile time. To be concrete, this >> might e.g. take the form >> >> def f(obj): >> obj.method(3.4) # try to find a vtable with "void method(double)" in it >> >> or, a more typed approach, >> >> # File A >> cdef class MyImpl: >> def double method(double x): return x * x >> >> # File B >> # Here we never know about MyImpl, hence "duck-typed" >> @cython.interface >> class MyIntf: >> def double method(double x): pass >> >> def f(MyIntf obj): >> # obj *can* be MyImpl instance, or whatever else that supports >> # that interface >> obj.method(3.4) >> >> >> Now, the idea to implement this is: >> >> a) Both caller and callee pre-hash name/argument string >> "mymethod:iidd" to 64 bits of hash data (probably lower 64 bits of >> md5) >> >> b) Callee (MyImpl) generates a vtable of its methods by *perfect* >> hashing. What you do is define a final hash fh as a function >> of the pre-hash ph, for instance >> >> fh = ((ph >> vtable.r1) ^ (ph >> vtable.r2) ^ (ph >> vtable.r3)) & >> vtable.m >> >> (Me and Robert are benchmarking different functions to use here.) By >> playing with r1, r2, r3, you have 64**3 choices of hash function, and >> will be able to pick a combination which gives *no* (or very few) >> collisions. >> >> c) Caller then combines the pre-hash generated at compile-time, with >> r1, r2, r3, m stored in the vtable header, in order to find the >> final location in the hash-table. >> >> The exciting thing is that in benchmark, the performance penalty is >> actually very slight over a C++-style v-table. (Of course you can >> cache a proper vtable, but the fact that you get so close without >> caring about caching means that this can be done much faster.) One advantage about caching a vtable is that one can possibly put in adapters for non-exact matches. It also opens up the possibility of putting in stubs to call def methods if they exist. 
This needs to be fleshed out more, (another CEP :) but could provide for a backwards-compatible easy first implementation. >> Back to my and Robert's discussion on benchmarks: >> >> I've uploaded benchmarks here: >> >> https://github.com/dagss/hashvtable/tree/master/dispatchbench >> >> I've changed the benchmark taking to give more robust numbers (at >> least for me), you want to look at the 'min' column. >> >> I changed the benchmark a bit so that it benchmarks a *callsite*. >> So we don't pass 'h' on the stack, but either a) looks it up in a global >> variable (default), or b) it's a compile-time constant (immediate in >> assembly) (compile with -DIMHASH). >> >> Similarly, the ID is either an "interned" global variable, or an >> immediate (-DIMID). >> >> The results are very different on my machine depending on this aspect. >> My conclusions: >> >> - Both three shifts with masking, two shifts with a "fallback slot" >> (allowing for a single collision), three shifts, two shifts with >> two masks allows for constructing good vtables. In the case of only >> two shifts, one colliding method gets the twoshift+fback >> performance and the rest gets the twoshift performance. >> >> - Performance is really more affected by whether hashes are >> immediates or global variables than the hash function. This is in >> contrast to the interning vs. key benchmarks -- so I think that if >> we looked up the vtable through PyTypeObject, rather than getting >> the vtable directly, the loads of the global variables could >> potentially be masked by that. >> >> - My conclusion: Just use lower bits of md5 *both* for the hashing >> and the ID-ing (don't bother with any interning), and compile the >> thing as a 64-bit immediate. This can cause crashes/stack smashes >> etc. 
if there's lower-64bit-of-md5 collisions, but a) the >> probability is incredibly small, b) it would only matter in >> situations that should cause an AttributeError anyway, c) if we >> really care, we can always use an interning-like mechanism to >> validate on module loading that its hashes doesn't collide with >> other hashes (and raise an exception "Congratulations, you've >> discovered a phenomenal md5 collision, get in touch with cython >> devs and we'll work around it right away"). Due to the birthday paradox, this seems a bit risky. Maybe it's because I regularly work with collections much bigger than 2^32, and I suppose we're talking about unique method names and signatures here, but still... I wonder what the penalty would be for checking the full 128 bit hash. (Storing it could allow for greater entropy in the optimal hash table search as well). > What I forgot to mention: > > ?- I really want to avoid linear probing just because of the code bloat in > call sites. That's a good point. What about flags--are we throwing out the idea of masking? > With two shifts, when there was a failure to find a perfect hash > it was always possible to find one with a single collision. > > ?- Probing for the hash with two shifts is lightning fast, it can take a > while with three shifts (though you can always spend more memory on a bigger > table to make it fast again). However, it makes me uneasy to penalize the > performance of calling one of the random methods, so I'm really in favour of > three-shifts or double-mask (to be decided when investigating the > performance of probing for parameters in more detail). > > ?- I tried using SSE to do shifts in parallel and failed (miserable > performance). The problem is quickly moving things between general purpose > registers and SSE registers, and the lack of SSE immediates/constants in the > instruction stream. 
At least, what my gcc 4.6 generates appeared to use the > stack to communicate between SSE registers and general purpose registers > (but I can't have been doing the right thing..). > > > >> >> The RTTI (i.e. the char*) is also put in there, but is not used for >> comparison and is not interned. >> >> At least, that's what I think we should do for duck-style vtables. >> >> Do we then go to all the pain of defining key-encoding, interning >> etc. just for SEP 201? Isn't it easier to just mandate a md5 dependency >> and be done with it? (After all, md5 usually comes with Python in the >> md5 and hashlib modules) >> >> direct: Early-binding >> index: Call slot 0 (C++-style vtable/function pointer) >> noshift: h & m1 >> oneshift: (h >> r1) & m1 >> twoshift: ((h >> r1) ^ (h >> r2)) & m1 >> twoshift+fback: hash doesn't >> >> >> I meant: Hash collision and then, after a branch miss, look up the one >> fallback slot in the vtable header. We could also do a fallback table. Usually it'd be empty; occasionally it'd have one element in it. It'd always be possible to make this big enough to avoid collisions in a worst-case scenario. BTW, this is a general static char* -> void* dictionary, I bet it could possibly have other uses. (It may also be a well-studied problem, though a bit hard to search for...) I suppose we could reduce it to read-optimized int -> int mappings.
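[The pre-hash/perfect-hash scheme in steps (a)-(c) quoted above can be sketched in a few lines of Python. This is a rough illustration only, not the eventual C implementation: the example signatures, the 4-bit table size, and the little-endian byte order are all arbitrary choices for the sketch.]

```python
import hashlib
import itertools

def prehash(signature):
    # Step (a): lower 64 bits of md5 of the "name:argcodes" string.
    digest = hashlib.md5(signature.encode('ascii')).digest()
    return int.from_bytes(digest[:8], 'little')

def find_perfect_hash(prehashes, table_bits=4):
    # Step (b): probe shift parameters r1, r2, r3 until the "threeshift"
    # function ((h >> r1) ^ (h >> r2) ^ (h >> r3)) & m maps every
    # pre-hash to a distinct slot -- a perfect hash for this method set.
    m = (1 << table_bits) - 1
    for r1, r2, r3 in itertools.product(range(64), repeat=3):
        slots = {((h >> r1) ^ (h >> r2) ^ (h >> r3)) & m for h in prehashes}
        if len(slots) == len(prehashes):  # no collisions
            return r1, r2, r3, m
    raise ValueError("no perfect hash found; grow the table")

# Step (c): a call site recomputes the same function, combining its
# compile-time pre-hash with r1, r2, r3, m read from the vtable header.
sigs = ['mymethod:iidd', 'othermethod:d', 'third:ii']
hashes = [prehash(s) for s in sigs]
r1, r2, r3, m = find_perfect_hash(hashes)
slot = ((hashes[0] >> r1) ^ (hashes[0] >> r2) ^ (hashes[0] >> r3)) & m
```

In C the call-site half of this is just two shifts, an xor, a mask and an indexed load, which is why the benchmarked penalty over a fixed-slot vtable is so small.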
- Robert From d.s.seljebotn at astro.uio.no Tue Jun 5 00:07:30 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Tue, 05 Jun 2012 00:07:30 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> Message-ID: <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> Robert Bradshaw wrote: >On Mon, Jun 4, 2012 at 1:55 PM, Dag Sverre Seljebotn > wrote: >> On 06/04/2012 09:44 PM, Dag Sverre Seljebotn wrote: >>> >>> Me and Robert had a long discussion on the NumFOCUS list about this >>> already, but I figured it was better to continue it and provide more >>> in-depth benchmark results here. >>> >>> It's basically a new idea of how to provide a vtable based on >perfect >>> hashing, which should be a lot simpler to implement than what I >first >>> imagined. >>> >>> I'll write down some context first, if you're familiar with this >>> skip ahead a bit.. >>> >>> This means that you can do fast dispatches *without* the messy >>> business of binding vtable slots at compile time. To be concrete, >this >>> might e.g. take the form >>> >>> def f(obj): >>> obj.method(3.4) # try to find a vtable with "void method(double)" in >it >>> >>> or, a more typed approach, >>> >>> # File A >>> cdef class MyImpl: >>> def double method(double x): return x * x >>> >>> # File B >>> # Here we never know about MyImpl, hence "duck-typed" >>> @cython.interface >>> class MyIntf: >>> def double method(double x): pass >>> >>> def f(MyIntf obj): >>> # obj *can* be MyImpl instance, or whatever else that supports >>> # that interface >>> obj.method(3.4) >>> >>> >>> Now, the idea to implement this is: >>> >>> a) Both caller and callee pre-hash name/argument string >>> "mymethod:iidd" to 64 bits of hash data (probably lower 64 bits of >>> md5) >>> >>> b) Callee (MyImpl) generates a vtable of its methods by *perfect* >>> hashing. 
What you do is define a final hash fh as a function >>> of the pre-hash ph, for instance >>> >>> fh = ((ph >> vtable.r1) ^ (ph >> vtable.r2) ^ (ph >> vtable.r3)) & >>> vtable.m >>> >>> (Me and Robert are benchmarking different functions to use here.) By >>> playing with r1, r2, r3, you have 64**3 choices of hash function, >and >>> will be able to pick a combination which gives *no* (or very few) >>> collisions. >>> >>> c) Caller then combines the pre-hash generated at compile-time, with >>> r1, r2, r3, m stored in the vtable header, in order to find the >>> final location in the hash-table. >>> >>> The exciting thing is that in benchmark, the performance penalty is >>> actually very slight over a C++-style v-table. (Of course you can >>> cache a proper vtable, but the fact that you get so close without >>> caring about caching means that this can be done much faster.) > >One advantage about caching a vtable is that one can possibly put in >adapters for non-exact matches. It also opens up the possibility of >putting in stubs to call def methods if they exist. This needs to be >fleshed out more, (another CEP :) but could provide for a >backwards-compatible easy first implementation. > >>> Back to my and Robert's discussion on benchmarks: >>> >>> I've uploaded benchmarks here: >>> >>> https://github.com/dagss/hashvtable/tree/master/dispatchbench >>> >>> I've changed the benchmark taking to give more robust numbers (at >>> least for me), you want to look at the 'min' column. >>> >>> I changed the benchmark a bit so that it benchmarks a *callsite*. >>> So we don't pass 'h' on the stack, but either a) looks it up in a >global >>> variable (default), or b) it's a compile-time constant (immediate in >>> assembly) (compile with -DIMHASH). >>> >>> Similarly, the ID is either an "interned" global variable, or an >>> immediate (-DIMID). >>> >>> The results are very different on my machine depending on this >aspect. 
>>> My conclusions: >>> >>> - Both three shifts with masking, two shifts with a "fallback slot" >>> (allowing for a single collision), three shifts, two shifts with >>> two masks allows for constructing good vtables. In the case of only >>> two shifts, one colliding method gets the twoshift+fback >>> performance and the rest gets the twoshift performance. >>> >>> - Performance is really more affected by whether hashes are >>> immediates or global variables than the hash function. This is in >>> contrast to the interning vs. key benchmarks -- so I think that if >>> we looked up the vtable through PyTypeObject, rather than getting >>> the vtable directly, the loads of the global variables could >>> potentially be masked by that. >>> >>> - My conclusion: Just use lower bits of md5 *both* for the hashing >>> and the ID-ing (don't bother with any interning), and compile the >>> thing as a 64-bit immediate. This can cause crashes/stack smashes >>> etc. if there's lower-64bit-of-md5 collisions, but a) the >>> probability is incredibly small, b) it would only matter in >>> situations that should cause an AttributeError anyway, c) if we >>> really care, we can always use an interning-like mechanism to >>> validate on module loading that its hashes doesn't collide with >>> other hashes (and raise an exception "Congratulations, you've >>> discovered a phenomenal md5 collision, get in touch with cython >>> devs and we'll work around it right away"). > >Due to the birthday paradox, this seems a bit risky. Maybe it's >because I regularly work with collections much bigger than 2^32, and I >suppose we're talking about unique method names and signatures here, >but still... I wonder what the penalty would be for checking the full >128 bit hash. (Storing it could allow for greater entropy in the >optimal hash table search as well). > >> What I forgot to mention: >> >> ?- I really want to avoid linear probing just because of the code >bloat in >> call sites. > >That's a good point. 
What about flags--are we throwing out the idea of >masking? > >> With two shifts, when there was a failure to find a perfect hash >> it was always possible to find one with a single collision. >> >> ?- Probing for the hash with two shifts is lightning fast, it can >take a >> while with three shifts (though you can always spend more memory on a >bigger >> table to make it fast again). However, it makes me uneasy to penalize >the >> performance of calling one of the random methods, so I'm really in >favour of >> three-shifts or double-mask (to be decided when investigating the >> performance of probing for parameters in more detail). >> >> ?- I tried using SSE to do shifts in parallel and failed (miserable >> performance). The problem is quickly moving things between general >purpose >> registers and SSE registers, and the lack of SSE immediates/constants >in the >> instruction stream. At least, what my gcc 4.6 generates appeared to >use the >> stack to communicate between SSE registers and general purpose >registers >> (but I can't have been doing the right thing..). >> >> >> >>> >>> The RTTI (i.e. the char*) is also put in there, but is not used for >>> comparison and is not interned. >>> >>> At least, that's what I think we should do for duck-style vtables. >>> >>> Do we then go to all the pain of defining key-encoding, interning >>> etc. just for SEP 201? Isn't it easier to just mandate a md5 >dependency >>> and be done with it? (After all, md5 usually comes with Python in >the >>> md5 and hashlib modules) >>> >>> direct: Early-binding >>> index: Call slot 0 (C++-style vtable/function pointer) >>> noshift: h & m1 >>> oneshift: (h >> r1) & m1 >>> twoshift: ((h >> r1) ^ (h >> r2)) & m1 >>> twoshift+fback: hash doesn't >> >> >> I meant: Hash collision and then, after a branch miss, look up the >one >> fallback slot in the vtable header. > >We could also do a fallback table. Usually it'd be empty, Occasionally >it'd have one element in it. 
It'd always be possible to make this big >enough to avoid collisions in a worst-case scenario. > >BTW, this is a general static char* -> void* dictionary, I bet it >could possibly have other uses. (It may also be a well-studied >problem, though a bit hard to search for...) I suppose we could reduce >it to read-optimized int -> int mappings. The C FAQ says 'if you know the contents of your hash table up front you can devise a perfect hash', but no details, probably just hand-waving. 128 bits gives more entropy for perfect hashing: some but not much since each shift r is hardwired to one 64 bit subset. From the interning/key benchmarks, checking the full 128 bits would probably not be noticeable in microbenchmarks, it's more about using an extra register and bloating the instruction cache and data cache a bit etc, stuff that can only be measured in production. The alternative is having a collision detection registry. If it complains, you're told where to edit Cython (perhaps a datafile) so that the pre-hash function changes:

    if signature equals 'foo:ddffi':  # known collision with 'bar:ii'
        use high 64 bits of md5
    else:
        use low 64 bits of md5

Each such collision is documented in the cep/sep. But 128 bits and then relying on luck is perhaps simpler... If we need flags, let's say that 92 bits suffice for hash and use 16 for flags... But I was thinking that you'd have separate tables for nogil callers and gil-holding callers so that you didn't need to scan for matching flags. We really want this to be branch-miss-free. Still, flags are good for error return codes etc. Do you agree on forgetting about the encoded keys/interning even for SEP 201? There's only so much effort to go around and I'd much rather use md5 and these hash tables everywhere. Dag > >- Robert >_______________________________________________ >cython-devel mailing list >cython-devel at python.org >http://mail.python.org/mailman/listinfo/cython-devel -- Sent from my Android phone with K-9 Mail.
Please excuse my brevity. From robertwb at gmail.com Tue Jun 5 00:30:38 2012 From: robertwb at gmail.com (Robert Bradshaw) Date: Mon, 4 Jun 2012 15:30:38 -0700 Subject: [Cython] Hash-based vtables In-Reply-To: <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> Message-ID: On Mon, Jun 4, 2012 at 3:07 PM, Dag Sverre Seljebotn wrote: > > > Robert Bradshaw wrote: > >>On Mon, Jun 4, 2012 at 1:55 PM, Dag Sverre Seljebotn >> wrote: >>> On 06/04/2012 09:44 PM, Dag Sverre Seljebotn wrote: >>>> >>>> Me and Robert had a long discussion on the NumFOCUS list about this >>>> already, but I figured it was better to continue it and provide more >>>> in-depth benchmark results here. >>>> >>>> It's basically a new idea of how to provide a vtable based on >>perfect >>>> hashing, which should be a lot simpler to implement than what I >>first >>>> imagined. >>>> >>>> I'll write down some context first, if you're familiar with this >>>> skip ahead a bit.. >>>> >>>> This means that you can do fast dispatches *without* the messy >>>> business of binding vtable slots at compile time. To be concrete, >>this >>>> might e.g. 
take the form >>>> >>>> def f(obj): >>>> obj.method(3.4) # try to find a vtable with "void method(double)" in >>it >>>> >>>> or, a more typed approach, >>>> >>>> # File A >>>> cdef class MyImpl: >>>> def double method(double x): return x * x >>>> >>>> # File B >>>> # Here we never know about MyImpl, hence "duck-typed" >>>> @cython.interface >>>> class MyIntf: >>>> def double method(double x): pass >>>> >>>> def f(MyIntf obj): >>>> # obj *can* be MyImpl instance, or whatever else that supports >>>> # that interface >>>> obj.method(3.4) >>>> >>>> >>>> Now, the idea to implement this is: >>>> >>>> a) Both caller and callee pre-hash name/argument string >>>> "mymethod:iidd" to 64 bits of hash data (probably lower 64 bits of >>>> md5) >>>> >>>> b) Callee (MyImpl) generates a vtable of its methods by *perfect* >>>> hashing. What you do is define a final hash fh as a function >>>> of the pre-hash ph, for instance >>>> >>>> fh = ((ph >> vtable.r1) ^ (ph >> vtable.r2) ^ (ph >> vtable.r3)) & >>>> vtable.m >>>> >>>> (Me and Robert are benchmarking different functions to use here.) By >>>> playing with r1, r2, r3, you have 64**3 choices of hash function, >>and >>>> will be able to pick a combination which gives *no* (or very few) >>>> collisions. >>>> >>>> c) Caller then combines the pre-hash generated at compile-time, with >>>> r1, r2, r3, m stored in the vtable header, in order to find the >>>> final location in the hash-table. >>>> >>>> The exciting thing is that in benchmark, the performance penalty is >>>> actually very slight over a C++-style v-table. (Of course you can >>>> cache a proper vtable, but the fact that you get so close without >>>> caring about caching means that this can be done much faster.) >> >>One advantage about caching a vtable is that one can possibly put in >>adapters for non-exact matches. It also opens up the possibility of >>putting in stubs to call def methods if they exist. 
This needs to be >>fleshed out more, (another CEP :) but could provide for a >>backwards-compatible easy first implementation. >> >>>> Back to my and Robert's discussion on benchmarks: >>>> >>>> I've uploaded benchmarks here: >>>> >>>> https://github.com/dagss/hashvtable/tree/master/dispatchbench >>>> >>>> I've changed the benchmark taking to give more robust numbers (at >>>> least for me), you want to look at the 'min' column. >>>> >>>> I changed the benchmark a bit so that it benchmarks a *callsite*. >>>> So we don't pass 'h' on the stack, but either a) looks it up in a >>global >>>> variable (default), or b) it's a compile-time constant (immediate in >>>> assembly) (compile with -DIMHASH). >>>> >>>> Similarly, the ID is either an "interned" global variable, or an >>>> immediate (-DIMID). >>>> >>>> The results are very different on my machine depending on this >>aspect. >>>> My conclusions: >>>> >>>> - Both three shifts with masking, two shifts with a "fallback slot" >>>> (allowing for a single collision), three shifts, two shifts with >>>> two masks allows for constructing good vtables. In the case of only >>>> two shifts, one colliding method gets the twoshift+fback >>>> performance and the rest gets the twoshift performance. >>>> >>>> - Performance is really more affected by whether hashes are >>>> immediates or global variables than the hash function. This is in >>>> contrast to the interning vs. key benchmarks -- so I think that if >>>> we looked up the vtable through PyTypeObject, rather than getting >>>> the vtable directly, the loads of the global variables could >>>> potentially be masked by that. >>>> >>>> - My conclusion: Just use lower bits of md5 *both* for the hashing >>>> and the ID-ing (don't bother with any interning), and compile the >>>> thing as a 64-bit immediate. This can cause crashes/stack smashes >>>> etc. 
if there's lower-64bit-of-md5 collisions, but a) the >>>> probability is incredibly small, b) it would only matter in >>>> situations that should cause an AttributeError anyway, c) if we >>>> really care, we can always use an interning-like mechanism to >>>> validate on module loading that its hashes doesn't collide with >>>> other hashes (and raise an exception "Congratulations, you've >>>> discovered a phenomenal md5 collision, get in touch with cython >>>> devs and we'll work around it right away"). >> >>Due to the birthday paradox, this seems a bit risky. Maybe it's >>because I regularly work with collections much bigger than 2^32, and I >>suppose we're talking about unique method names and signatures here, >>but still... I wonder what the penalty would be for checking the full >>128 bit hash. (Storing it could allow for greater entropy in the >>optimal hash table search as well). >> >>> What I forgot to mention: >>> >>> ?- I really want to avoid linear probing just because of the code >>bloat in >>> call sites. >> >>That's a good point. What about flags--are we throwing out the idea of >>masking? >> >>> With two shifts, when there was a failure to find a perfect hash >>> it was always possible to find one with a single collision. >>> >>> ?- Probing for the hash with two shifts is lightning fast, it can >>take a >>> while with three shifts (though you can always spend more memory on a >>bigger >>> table to make it fast again). However, it makes me uneasy to penalize >>the >>> performance of calling one of the random methods, so I'm really in >>favour of >>> three-shifts or double-mask (to be decided when investigating the >>> performance of probing for parameters in more detail). >>> >>> ?- I tried using SSE to do shifts in parallel and failed (miserable >>> performance). The problem is quickly moving things between general >>purpose >>> registers and SSE registers, and the lack of SSE immediates/constants >>in the >>> instruction stream. 
At least, what my gcc 4.6 generates appeared to >>use the >>> stack to communicate between SSE registers and general purpose >>registers >>> (but I can't have been doing the right thing..). >>> >>> >>> >>>> >>>> The RTTI (i.e. the char*) is also put in there, but is not used for >>>> comparison and is not interned. >>>> >>>> At least, that's what I think we should do for duck-style vtables. >>>> >>>> Do we then go to all the pain of defining key-encoding, interning >>>> etc. just for SEP 201? Isn't it easier to just mandate a md5 >>dependency >>>> and be done with it? (After all, md5 usually comes with Python in >>the >>>> md5 and hashlib modules) >>>> >>>> direct: Early-binding >>>> index: Call slot 0 (C++-style vtable/function pointer) >>>> noshift: h & m1 >>>> oneshift: (h >> r1) & m1 >>>> twoshift: ((h >> r1) ^ (h >> r2)) & m1 >>>> twoshift+fback: hash doesn't >>> >>> >>> I meant: Hash collision and then, after a branch miss, look up the >>one >>> fallback slot in the vtable header. >> >>We could also do a fallback table. Usually it'd be empty, Occasionally >>it'd have one element in it. It'd always be possible to make this big >>enough to avoid collisions in a worst-case scenario. >> >>BTW, this is a general static char* -> void* dictionary, I bet it >>could possibly have other uses. (It may also be a well-studied >>problem, though a bit hard to search for...) I suppose we could reduce >>it to read-optimized int -> int mappings. > > > The C FAQ says 'if you know the contents of your hash table up front you can devise a perfect hash', but no details, probably just hand-waving. I just found http://cmph.sourceforge.net/ which looks quite interesting. Though the resulting hash functions are supposedly cheap, I have the feeling that branching is considered cheap in this context. > 128 bits gives more entropy for perfect hashing: some but not much since each shift r is hardwired to one 64 bit subset. True. 
I don't have a good way to quantify the correlation between different shifts of the same value (vs. truly random values) but it didn't seem to be very significant in the experiments. > From the interning/key benchmarks, checking the full 128 bits would probably not be noticeable in microbenchmarks, it's more about using an extra register and bloating the instruction cache and data cache a bit etc, stuff that can only be measured in production. One could make the check optionally omitted at compile time. It would still bloat the table, but not by much (or at all if we share with flag bits as suggested below). > The alternative is having a collision detection registry. If it complains, you're told where to edit Cython (perhaps a datafile) so that the pre-hash function changes: > > if signature equals 'foo:ddffi' > # known collision with 'bar:ii' > Use high 64 bits of md5 > Else: > Use low 64 bits of md5 > > Each such collision is documented in the cep/sep. > > But 128 bit and then relying on luck is perhaps simpler... Much. > If we need flags, lets say that 92 bits suffice. for hash and use 16 for flags... > > But i was thinking that you'd have separate tables for nogil callers and gil-holding callers so that you didn't need to scan for matching flags. We really want this to be branch-miss-free. Still, flags are good for error return codes etc. Duplicate tables work as long as there aren't too many orthogonal considerations. Is the GIL the only one? What about "I can propagate errors?" Now we're up to 4 tables... > Do you agree on forgetting about the encoded keys/interning even for SEP 201? There's only so much effort to go around and I'd much rather use md5 and these hash tables everywhere. Yes, for sure!
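[The birthday-paradox worry raised above can be made concrete with a back-of-the-envelope calculation; the signature counts below are arbitrary examples, not figures from the thread.]

```python
import math

def collision_probability(n, bits):
    # Standard birthday approximation: P ~ 1 - exp(-n(n-1) / 2^(bits+1)).
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-n * (n - 1) / 2.0 ** (bits + 1))

# A million distinct method signatures under a 64-bit hash: negligible.
p_64 = collision_probability(10**6, 64)
# 2^32 items -- the collection sizes mentioned above -- under 64 bits:
# close to a coin flip, which is the risk being pointed out.
p_big = collision_probability(2**32, 64)
# The same 2^32 items under the full 128-bit md5: astronomically safe.
p_128 = collision_probability(2**32, 128)
```

So 64 bits is fine for the plausible number of distinct method signatures in one process, but the margin evaporates at the scales discussed, which is the argument for comparing the full 128 bits.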
- Robert From stefan_ml at behnel.de Tue Jun 5 09:25:44 2012 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 05 Jun 2012 09:25:44 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: <4FCD100B.7000008@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> Message-ID: <4FCDB478.3070000@behnel.de> Dag Sverre Seljebotn, 04.06.2012 21:44: > This can cause crashes/stack smashes > etc. if there's lower-64bit-of-md5 collisions, but a) the > probability is incredibly small, b) it would only matter in > situations that should cause an AttributeError anyway, c) if we > really care, we can always use an interning-like mechanism to > validate on module loading that its hashes doesn't collide with > other hashes (and raise an exception "Congratulations, you've > discovered a phenomenal md5 collision, get in touch with cython > devs and we'll work around it right away"). I'm not a big fan of such an attitude. If this happens at runtime, it can induce any cost from cheap-at-test-time to hugely-expensive-in-production. Thinking with my evil hat on, this can potentially be data triggered from the outside (e.g. if a JIT compiler is involved at one end), thus possibly even leading to a security hole. We should try to produce software that others can build a business on. Stefan From stefan_ml at behnel.de Tue Jun 5 10:07:10 2012 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 05 Jun 2012 10:07:10 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> Message-ID: <4FCDBE2E.2070205@behnel.de> Dag Sverre Seljebotn, 05.06.2012 00:07: > The C FAQ says 'if you know the contents of your hash table up front you can devise a perfect hash', but no details, probably just hand-waving. 
> > 128 bits gives more entropy for perfect hashing: some but not much since each shift r is hardwired to one 64 bit subset. Perfect hashing can be done with any fixed size data set (although it's not guaranteed to always be the most efficient solution). It doesn't matter if you use 64 bits or 128 bits. If 4 bits is enough, go with that. The advantage of perfect hashing of a fixed size data set is that the hash table has no free space and a match is guaranteed to be exact. However, the problem in this specific case is that the caller and the callee do not agree on the same set of entries, so there may be collisions during the lookup (of a potentially very large set of signatures) that were not anticipated in the perfect hash table layout (of the much smaller set of provided signatures). Perfect hashing works here as well, but it loses one of its main advantages over other hashing schemes. You then have to compare the entries exactly after the lookup in order to make sure that you didn't run into a collision, thus losing time again that you just won with the hashing. But at least you only have to do exactly one such comparison, so that's an advantage over a hashing scheme that allows collisions also in the layout. Maybe you can even handle mismatches more quickly by adding a dedicated "empty" entry for them that most (all?) anticipated mismatches would hash to.
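[The verify-after-lookup step described above -- a caller may probe with a signature the table never anticipated, so each slot must store its id for one exact comparison -- might look like this toy sketch. Python stands in for the C tables; the names and the 8-slot table are invented for illustration.]

```python
def build_vtable(entries, r1, r2, r3, m):
    # Each slot stores (pre-hash id, function) so that a lookup can
    # verify an exact match. (r1, r2, r3, m) must be a perfect hash
    # for this entry set, i.e. chosen so the assert never fires.
    table = [(0, None)] * (m + 1)
    for h, func in entries:
        slot = ((h >> r1) ^ (h >> r2) ^ (h >> r3)) & m
        assert table[slot][1] is None, "not a perfect hash for these entries"
        table[slot] = (h, func)
    return table

def lookup(table, h, r1, r2, r3, m):
    slot = ((h >> r1) ^ (h >> r2) ^ (h >> r3)) & m
    stored_h, func = table[slot]
    # The single exact comparison: an unanticipated signature can land
    # in an occupied slot, so the stored id must match before the
    # function pointer is trusted.
    return func if stored_h == h else None

# A probe with an unknown signature id returns None; real code would
# fall back to boxed Python-level dispatch there.
vtable = build_vtable([(0x1234, "impl")], 0, 0, 0, 7)
assert lookup(vtable, 0x1234, 0, 0, 0, 7) == "impl"
assert lookup(vtable, 0x9999, 0, 0, 0, 7) is None
```

The success path is branch-predictable (one compare that almost always matches), which is the property the thread is after.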
Stefan From robertwb at gmail.com Tue Jun 5 11:16:34 2012 From: robertwb at gmail.com (Robert Bradshaw) Date: Tue, 5 Jun 2012 02:16:34 -0700 Subject: [Cython] Hash-based vtables In-Reply-To: <4FCDBE2E.2070205@behnel.de> References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> <4FCDBE2E.2070205@behnel.de> Message-ID: On Tue, Jun 5, 2012 at 1:07 AM, Stefan Behnel wrote: > Dag Sverre Seljebotn, 05.06.2012 00:07: >> The C FAQ says 'if you know the contents of your hash table up front you can devise a perfect hash', but no details, probably just hand-waving. >> >> 128 bits gives more entropy for perfect hashing: some but not much since each shift r is hardwired to one 64 bit subset. > > Perfect hashing can be done with any fixed size data set (although it's not > guaranteed to always be the most efficient solution). It doesn't matter if > you use 64 bits or 128 bits. If 4 bits is enough, go with that. The > advantage of perfect hashing of a fixed size data set is that the hash > table has no free space and a match is guaranteed to be exact. The hash function is f(h(sig)) where f is parameterized but must be *extremely* cheap and h is fixed without regard to the entry set. This is why having 128 bits for the output of h may be an advantage. > However, the problem in this specific case is that the caller and the > callee do not agree on the same set of entries, so there may be collisions > during the lookup (of a potentially very large set of signatures) that were > not anticipated in the perfect hash table layout (of the much smaller set > of provided signatures). Perfect hashing works here as well, but it looses > one of its main advantage over other hashing schemes. You then have to > compare the entries exactly after the lookup in order to make sure that you > didn't run into a collision, thus loosing time again that you just won with > the hashing. 
> > But at least you only have to do exactly one such comparison, so that's an > advantage over a hashing scheme that allows collisions also in the layout. > Maybe you can even handle mismatches more quickly by adding a dedicated > "empty" entry for them that most (all?) anticipated mismatches would hash to. The idea is that the comparison would be cheap, a single 128-bit compare. The whole point is to avoid branching in the success case. I agree with you about 64-bit collisions being too high a risk. One could re-introduce the encoding/interning if desired, but I think we're safe in assuming no accidental md5 collisions (but hadn't thought much about the malicious case; if you're allowed to dictate function pointers you'd better have another line of defense. Perhaps this needs to be considered more.) We could even use sha1, though I thought the previous benchmarks indicated that comparing 160 bits was non-negligibly more expensive than comparing just 64. - Robert From d.s.seljebotn at astro.uio.no Tue Jun 5 18:56:46 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Tue, 05 Jun 2012 18:56:46 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> <4FCDBE2E.2070205@behnel.de> Message-ID: <4FCE3A4E.7070202@astro.uio.no> On 06/05/2012 11:16 AM, Robert Bradshaw wrote: > On Tue, Jun 5, 2012 at 1:07 AM, Stefan Behnel wrote: >> Dag Sverre Seljebotn, 05.06.2012 00:07: >>> The C FAQ says 'if you know the contents of your hash table up front you can devise a perfect hash', but no details, probably just hand-waving. >>> >>> 128 bits gives more entropy for perfect hashing: some but not much since each shift r is hardwired to one 64 bit subset. >> >> Perfect hashing can be done with any fixed size data set (although it's not >> guaranteed to always be the most efficient solution). 
It doesn't matter if >> you use 64 bits or 128 bits. If 4 bits is enough, go with that. The >> advantage of perfect hashing of a fixed size data set is that the hash >> table has no free space and a match is guaranteed to be exact. > > The hash function is f(h(sig)) where f is parameterized but must be > *extremely* cheap and h is fixed without regard to the entry set. This > is why having 128 bits for the output of h may be an advantage. > >> However, the problem in this specific case is that the caller and the >> callee do not agree on the same set of entries, so there may be collisions >> during the lookup (of a potentially very large set of signatures) that were >> not anticipated in the perfect hash table layout (of the much smaller set >> of provided signatures). Perfect hashing works here as well, but it looses >> one of its main advantage over other hashing schemes. You then have to >> compare the entries exactly after the lookup in order to make sure that you >> didn't run into a collision, thus loosing time again that you just won with >> the hashing. Robert and I spent some time on those benchmarks; please read them before making statements like this. There are benchmarks both with a 64-bit comparison of an interned ID and with a comparison against the compile-time 64-bit hash (faster). All my benchmarks included some comparison after the lookup. Comparison is very cheap *if* it is the likely() path. Branch misses are what count. Perfect hashing, even with comparison, wins big-time in branch prediction. >> But at least you only have to do exactly one such comparison, so that's an >> advantage over a hashing scheme that allows collisions also in the layout. >> Maybe you can even handle mismatches more quickly by adding a dedicated >> "empty" entry for them that most (all?) anticipated mismatches would hash to. You mean, like what I did in the twoshift+fback benchmark? Getting a single branch miss makes it the slowest one.
But all other methods (the ones that don't collide) run slightly faster than with three shifts. > The idea is that the comparison would be cheap, a single 128-bit > compare. The whole point is to avoid branching in the success case. > > I agree with you about 64-bit collisions being too high a risk. One > could re-introduce the encoding/interning if desired, but I think > we're safe in assuming no accidental md5 collisions (but hadn't > thought much about the malicious case; if you're allowed to dictate > function pointers you'd better have another line of defense. Perhaps > this needs to be considered more.) We could even use sha1, though I I fail to understand the comment on security at all. Why not just use the *correct* signature to feed a function that intentionally segfaults (or does whatever else)? > thought the previous benchmarks indicated that comparing 160 bits was > non-negligibly more expensive than comparing just 64. Loading an interned ID from a global variable was certainly non-negligible too. Let's do some numbers on how many bits we need here: Ballpark estimate: Assume that 50 billion lines of Cython code will be written over the course of human history (that's like SAGE times 200,000). Now assume that for every 100 lines of code people write, there's an entirely new method declaration that has never, ever in all of human history been written in Cython before => 2**22 signatures will occur.
The total probability that a *single* collision (or more) will *ever* happen over the course of human history is:

 64 bit ID: 5e-7
 96 bit ID: 1e-16
128 bit ID: 3e-26
160 bit ID: 6e-36

Computed with, e.g.,:

sage: R=RealField(1000)
sage: n=R(2)**22
sage: 1 - exp(-n * (n-1) / 2 / R(2)**160)

http://en.wikipedia.org/wiki/Birthday_problem Dag From d.s.seljebotn at astro.uio.no Tue Jun 5 19:01:19 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Tue, 05 Jun 2012 19:01:19 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: <4FCDB478.3070000@behnel.de> References: <4FCD100B.7000008@astro.uio.no> <4FCDB478.3070000@behnel.de> Message-ID: <4FCE3B5F.9080603@astro.uio.no> On 06/05/2012 09:25 AM, Stefan Behnel wrote: > Dag Sverre Seljebotn, 04.06.2012 21:44: >> This can cause crashes/stack smashes >> etc. if there's lower-64bit-of-md5 collisions, but a) the >> probability is incredibly small, b) it would only matter in >> situations that should cause an AttributeError anyway, c) if we >> really care, we can always use an interning-like mechanism to >> validate on module loading that its hashes doesn't collide with >> other hashes (and raise an exception "Congratulations, you've >> discovered a phenomenal md5 collision, get in touch with cython >> devs and we'll work around it right away"). > > I'm not a big fan of such an attitude. If this happens at runtime, it can > induce any cost from cheap-at-test-time to hugely-expensive-in-production. > Thinking with my evil hat on, this can potentially be data triggered from > the outside (e.g. if a JIT compiler is involved at one end), thus possibly > even leading to a security hole. > > We should try to produce software that others can build a business on.
Well, I'd build a business on something that fails with a 5e-7 probability any day :-) (given that you trust my estimates in the other post; I think they were rather conservative myself) But I'll do benchmarks for 96-bit and 128 bit hash comparisons as soon as I can get around to it. Dag From d.s.seljebotn at astro.uio.no Tue Jun 5 19:09:37 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Tue, 05 Jun 2012 19:09:37 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: <4FCE3B5F.9080603@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> <4FCDB478.3070000@behnel.de> <4FCE3B5F.9080603@astro.uio.no> Message-ID: <4FCE3D51.20009@astro.uio.no> On 06/05/2012 07:01 PM, Dag Sverre Seljebotn wrote: > On 06/05/2012 09:25 AM, Stefan Behnel wrote: >> Dag Sverre Seljebotn, 04.06.2012 21:44: >>> This can cause crashes/stack smashes >>> etc. if there's lower-64bit-of-md5 collisions, but a) the >>> probability is incredibly small, b) it would only matter in >>> situations that should cause an AttributeError anyway, c) if we >>> really care, we can always use an interning-like mechanism to >>> validate on module loading that its hashes doesn't collide with >>> other hashes (and raise an exception "Congratulations, you've >>> discovered a phenomenal md5 collision, get in touch with cython >>> devs and we'll work around it right away"). >> >> I'm not a big fan of such an attitude. If this happens at runtime, it can >> induce any cost from cheap-at-test-time to >> hugely-expensive-in-production. >> Thinking with my evil hat on, this can potentially be data triggered from >> the outside (e.g. if a JIT compiler is involved at one end), thus >> possibly >> even leading to a security hole. >> >> We should try to produce software that others can build a business on. 
> > Well, I'd build a business on something that fails with a 5e-7 > probability any day :-) (given that you trust my estimates in the other > post; I think they were rather conservative myself) This was put the wrong way. The chance was 5e-7 that it would fail for anybody over the course of human history (and that was a rather pessimistic estimate). So a more "individual tack": Assume that the process contains 200 MB of method definitions alone, with each method definition being a 8 character string. (That should mean the executable should be several gigabytes :-)) That puts the probability of collision at 10^-34 for that process containing a 64-bit hash collision. Dag From markflorisson88 at gmail.com Tue Jun 5 20:02:04 2012 From: markflorisson88 at gmail.com (mark florisson) Date: Tue, 5 Jun 2012 19:02:04 +0100 Subject: [Cython] Hash-based vtables In-Reply-To: <4FCE3D51.20009@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> <4FCDB478.3070000@behnel.de> <4FCE3B5F.9080603@astro.uio.no> <4FCE3D51.20009@astro.uio.no> Message-ID: On 5 June 2012 18:09, Dag Sverre Seljebotn wrote: > On 06/05/2012 07:01 PM, Dag Sverre Seljebotn wrote: >> >> On 06/05/2012 09:25 AM, Stefan Behnel wrote: >>> >>> Dag Sverre Seljebotn, 04.06.2012 21:44: >>>> >>>> This can cause crashes/stack smashes >>>> etc. if there's lower-64bit-of-md5 collisions, but a) the >>>> probability is incredibly small, b) it would only matter in >>>> situations that should cause an AttributeError anyway, c) if we >>>> really care, we can always use an interning-like mechanism to >>>> validate on module loading that its hashes doesn't collide with >>>> other hashes (and raise an exception "Congratulations, you've >>>> discovered a phenomenal md5 collision, get in touch with cython >>>> devs and we'll work around it right away"). >>> >>> >>> I'm not a big fan of such an attitude. If this happens at runtime, it can >>> induce any cost from cheap-at-test-time to >>> hugely-expensive-in-production. 
>>> Thinking with my evil hat on, this can potentially be data triggered from >>> the outside (e.g. if a JIT compiler is involved at one end), thus >>> possibly >>> even leading to a security hole. >>> >>> We should try to produce software that others can build a business on. >> >> >> Well, I'd build a business on something that fails with a 5e-7 >> probability any day :-) (given that you trust my estimates in the other >> post; I think they were rather conservative myself) > > > This was put the wrong way. The chance was 5e-7 that it would fail for > anybody over the course of human history (and that was a rather pessimistic > estimate). > > So a more "individual tack": > > Assume that the process contains 200 MB of method definitions alone, with > each method definition being a 8 character string. (That should mean the > executable should be several gigabytes :-)) > > That puts the probability of collision at 10^-34 for that process containing > a 64-bit hash collision. > > > Dag > _______________________________________________ > cython-devel mailing list > cython-devel at python.org > http://mail.python.org/mailman/listinfo/cython-devel The point is not so much running into this problem accidentally, but maliciously. If user input from untrusted users can somehow determine the function signatures that are generated and called by a JIT, then a malicious user can find collisions offline and cause some fault in a valid user program. 
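[Editorial note: the attack Mark describes is cheap because of the birthday bound: finding some colliding pair among attacker-chosen signatures takes roughly 2**(bits/2) hash evaluations, i.e. about 2**32 for a 64-bit ID, which is feasible offline. A sketch of the search, scaled down to a 16-bit truncation of md5 so it finishes instantly; the signature strings are invented.]

```python
import hashlib

def truncated_hash(sig, nbytes=2):
    # First bytes of md5; 2 bytes (16 bits) stands in for the 64-bit
    # ID so the toy search terminates in milliseconds.
    return hashlib.md5(sig.encode()).digest()[:nbytes]

seen = {}
collision = None
for i in range(1 << 17):        # pigeonhole guarantees a hit within 2**16 + 1 distinct values
    sig = "method%d:dd" % i     # attacker-chosen signature strings
    h = truncated_hash(sig)
    if h in seen:
        collision = (seen[h], sig)
        break
    seen[h] = sig

print("colliding signatures:", collision)
```

At 64 bits the same search is on the order of 2**32 hash evaluations, well within reach offline; widening the ID to 128 bits pushes the search to roughly 2**64, which is the trade-off the thread goes on to weigh.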
From d.s.seljebotn at astro.uio.no Tue Jun 5 21:33:16 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Tue, 05 Jun 2012 21:33:16 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: References: <4FCD100B.7000008@astro.uio.no> <4FCDB478.3070000@behnel.de> <4FCE3B5F.9080603@astro.uio.no> <4FCE3D51.20009@astro.uio.no> Message-ID: <4FCE5EFC.30407@astro.uio.no> On 06/05/2012 08:02 PM, mark florisson wrote: > On 5 June 2012 18:09, Dag Sverre Seljebotn wrote: >> On 06/05/2012 07:01 PM, Dag Sverre Seljebotn wrote: >>> >>> On 06/05/2012 09:25 AM, Stefan Behnel wrote: >>>> >>>> Dag Sverre Seljebotn, 04.06.2012 21:44: >>>>> >>>>> This can cause crashes/stack smashes >>>>> etc. if there's lower-64bit-of-md5 collisions, but a) the >>>>> probability is incredibly small, b) it would only matter in >>>>> situations that should cause an AttributeError anyway, c) if we >>>>> really care, we can always use an interning-like mechanism to >>>>> validate on module loading that its hashes doesn't collide with >>>>> other hashes (and raise an exception "Congratulations, you've >>>>> discovered a phenomenal md5 collision, get in touch with cython >>>>> devs and we'll work around it right away"). >>>> >>>> >>>> I'm not a big fan of such an attitude. If this happens at runtime, it can >>>> induce any cost from cheap-at-test-time to >>>> hugely-expensive-in-production. >>>> Thinking with my evil hat on, this can potentially be data triggered from >>>> the outside (e.g. if a JIT compiler is involved at one end), thus >>>> possibly >>>> even leading to a security hole. >>>> >>>> We should try to produce software that others can build a business on. >>> >>> >>> Well, I'd build a business on something that fails with a 5e-7 >>> probability any day :-) (given that you trust my estimates in the other >>> post; I think they were rather conservative myself) >> >> >> This was put the wrong way. 
The chance was 5e-7 that it would fail for >> anybody over the course of human history (and that was a rather pessimistic >> estimate). >> >> So a more "individual tack": >> >> Assume that the process contains 200 MB of method definitions alone, with >> each method definition being a 8 character string. (That should mean the >> executable should be several gigabytes :-)) >> >> That puts the probability of collision at 10^-34 for that process containing >> a 64-bit hash collision. >> >> >> Dag >> _______________________________________________ >> cython-devel mailing list >> cython-devel at python.org >> http://mail.python.org/mailman/listinfo/cython-devel > > The point is not so much running into this problem accidentally, but > maliciously. If user input from untrusted users can somehow determine > the function signatures that are generated and called by a JIT, then a > malicious user can find collisions offline and cause some fault in a > valid user program. This took me a while to understand. So the idea is that you're in a completely managed environment (like Java), and you want to run untrusted code and have it not segfault or smash the stack. Eve then cleverly assembles a caller/callee pair with mismatching signatures but the same hash. Yes, in that situation 64 bits is perhaps not enough. But is this relevant to what we're trying to do here? We're discussing APIs to talk between Python C extension modules that already have unlimited powers. I'd think a "managed Cython" would be such a large change that one could easily change the hash size at that point? But I agree it's not as easily written off as I thought. 
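[Editorial note: the accidental-collision figures quoted in this thread follow from the standard birthday approximation, p ~= 1 - exp(-n(n-1)/2 / 2**bits). A quick Python check with n = 2**22 signatures reproduces them; note the 96-bit case comes out near 1e-16.]

```python
import math

n = 2.0 ** 22                   # ~4 million distinct signatures, as in the estimate above
probs = {}
for bits in (64, 96, 128, 160):
    # Birthday approximation for the chance of any collision among n IDs.
    probs[bits] = -math.expm1(-n * (n - 1) / 2.0 / 2.0 ** bits)
    print("%3d-bit ID: %.0e" % (bits, probs[bits]))
```

Using `expm1` keeps the tiny probabilities from underflowing to zero in double precision, which is why no arbitrary-precision field is needed here.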
Dag From d.s.seljebotn at astro.uio.no Tue Jun 5 22:10:04 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Tue, 05 Jun 2012 22:10:04 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> Message-ID: <4FCE679C.7000002@astro.uio.no> On 06/04/2012 11:43 PM, Robert Bradshaw wrote: > On Mon, Jun 4, 2012 at 1:55 PM, Dag Sverre Seljebotn > wrote: >> On 06/04/2012 09:44 PM, Dag Sverre Seljebotn wrote: >>> >>> Me and Robert had a long discussion on the NumFOCUS list about this >>> already, but I figured it was better to continue it and provide more >>> in-depth benchmark results here. >>> >>> It's basically a new idea of how to provide a vtable based on perfect >>> hashing, which should be a lot simpler to implement than what I first >>> imagined. >>> >>> I'll write down some context first, if you're familiar with this >>> skip ahead a bit.. >>> >>> This means that you can do fast dispatches *without* the messy >>> business of binding vtable slots at compile time. To be concrete, this >>> might e.g. take the form >>> >>> def f(obj): >>> obj.method(3.4) # try to find a vtable with "void method(double)" in it >>> >>> or, a more typed approach, >>> >>> # File A >>> cdef class MyImpl: >>> def double method(double x): return x * x >>> >>> # File B >>> # Here we never know about MyImpl, hence "duck-typed" >>> @cython.interface >>> class MyIntf: >>> def double method(double x): pass >>> >>> def f(MyIntf obj): >>> # obj *can* be MyImpl instance, or whatever else that supports >>> # that interface >>> obj.method(3.4) >>> >>> >>> Now, the idea to implement this is: >>> >>> a) Both caller and callee pre-hash name/argument string >>> "mymethod:iidd" to 64 bits of hash data (probably lower 64 bits of >>> md5) >>> >>> b) Callee (MyImpl) generates a vtable of its methods by *perfect* >>> hashing. 
What you do is define a final hash fh as a function >>> of the pre-hash ph, for instance >>> >>> fh = ((ph>> vtable.r1) ^ (ph>> vtable.r2) ^ (ph>> vtable.r3))& >>> vtable.m >>> >>> (Me and Robert are benchmarking different functions to use here.) By >>> playing with r1, r2, r3, you have 64**3 choices of hash function, and >>> will be able to pick a combination which gives *no* (or very few) >>> collisions. >>> >>> c) Caller then combines the pre-hash generated at compile-time, with >>> r1, r2, r3, m stored in the vtable header, in order to find the >>> final location in the hash-table. >>> >>> The exciting thing is that in benchmark, the performance penalty is >>> actually very slight over a C++-style v-table. (Of course you can >>> cache a proper vtable, but the fact that you get so close without >>> caring about caching means that this can be done much faster.) > > One advantage about caching a vtable is that one can possibly put in > adapters for non-exact matches. It also opens up the possibility of > putting in stubs to call def methods if they exist. This needs to be > fleshed out more, (another CEP :) but could provide for a > backwards-compatible easy first implementation. > >>> Back to my and Robert's discussion on benchmarks: >>> >>> I've uploaded benchmarks here: >>> >>> https://github.com/dagss/hashvtable/tree/master/dispatchbench >>> >>> I've changed the benchmark taking to give more robust numbers (at >>> least for me), you want to look at the 'min' column. >>> >>> I changed the benchmark a bit so that it benchmarks a *callsite*. >>> So we don't pass 'h' on the stack, but either a) looks it up in a global >>> variable (default), or b) it's a compile-time constant (immediate in >>> assembly) (compile with -DIMHASH). >>> >>> Similarly, the ID is either an "interned" global variable, or an >>> immediate (-DIMID). >>> >>> The results are very different on my machine depending on this aspect. 
>>> My conclusions: >>> >>> - Both three shifts with masking, two shifts with a "fallback slot" >>> (allowing for a single collision), three shifts, two shifts with >>> two masks allows for constructing good vtables. In the case of only >>> two shifts, one colliding method gets the twoshift+fback >>> performance and the rest gets the twoshift performance. >>> >>> - Performance is really more affected by whether hashes are >>> immediates or global variables than the hash function. This is in >>> contrast to the interning vs. key benchmarks -- so I think that if >>> we looked up the vtable through PyTypeObject, rather than getting >>> the vtable directly, the loads of the global variables could >>> potentially be masked by that. >>> >>> - My conclusion: Just use lower bits of md5 *both* for the hashing >>> and the ID-ing (don't bother with any interning), and compile the >>> thing as a 64-bit immediate. This can cause crashes/stack smashes >>> etc. if there's lower-64bit-of-md5 collisions, but a) the >>> probability is incredibly small, b) it would only matter in >>> situations that should cause an AttributeError anyway, c) if we >>> really care, we can always use an interning-like mechanism to >>> validate on module loading that its hashes doesn't collide with >>> other hashes (and raise an exception "Congratulations, you've >>> discovered a phenomenal md5 collision, get in touch with cython >>> devs and we'll work around it right away"). > > Due to the birthday paradox, this seems a bit risky. Maybe it's > because I regularly work with collections much bigger than 2^32, and I > suppose we're talking about unique method names and signatures here, > but still... I wonder what the penalty would be for checking the full > 128 bit hash. (Storing it could allow for greater entropy in the > optimal hash table search as well). Wonder no more. 
Here's the penalty for different bit-lengths, all compile-time constants:

    threeshift: min=6.08e-09  mean=6.11e-09  std=2.81e-11  val=1200000000.000000
  threeshift96: min=7.53e-09  mean=7.55e-09  std=1.96e-11  val=1200000000.000000
 threeshift128: min=6.95e-09  mean=6.97e-09  std=2.57e-11  val=1200000000.000000
 threeshift160: min=8.17e-09  mean=8.23e-09  std=4.06e-11  val=1200000000.000000

And for comparison, when loading the comparison IDs from global variable:

    threeshift: min=6.46e-09  mean=6.52e-09  std=4.95e-11  val=1200000000.000000
  threeshift96: min=8.07e-09  mean=8.16e-09  std=4.55e-11  val=1200000000.000000
 threeshift128: min=8.06e-09  mean=8.18e-09  std=6.71e-11  val=1200000000.000000
 threeshift160: min=9.71e-09  mean=9.83e-09  std=5.12e-11  val=1200000000.000000

So indeed, 64-bit hash < interning < 128 bit hash (at least on my Intel Nehalem Core i7 1.87 GHz). And the load of the global variable may in real life be hidden by other things going on in the function. And, you save vtable memory by having an interned char* and not saving the hash in the vtable. The benchmarks should be made more easily runnable so that we could run them on various systems, but it makes sense to first read up on and figure out which hash functions are really viable, to keep the number of numbers down. I just realized that I never pushed the changes I did to introduce -DIMHASH/-DIMID etc., but the benchmarks are pushed now. > We could also do a fallback table. Usually it'd be empty, Occasionally > it'd have one element in it. It'd always be possible to make this big > enough to avoid collisions in a worst-case scenario. If you do a fallback table it's as much code in the call site as linear probing... But when I played with the generation side, a failure to create a table at a given size would *always* be due to a single collision. This is what I did in the twoshift+fback benchmark. > Duplicate tables works as long as there aren't too many orthogonal > considerations. Is the GIL the only one?
What about "I can propagate > errors?" Now we're up to 4 tables... Would your decision of whether or not to dispatch to a function depend on whether or not it propagates errors? I'm thinking of the "with gil" function case, i.e. callee has: a) Function to call if you have the GIL b) GIL-acquiring wrapper and you want GIL-holding code to call a) and nogil code to call b). But one could just make the caller acquire the GIL if needed (which in that case is so expensive anyway that it can be made the unlikely() path). I can't think of other situations where you would pick which function to call based on flags. Dag From markflorisson88 at gmail.com Tue Jun 5 22:33:12 2012 From: markflorisson88 at gmail.com (mark florisson) Date: Tue, 5 Jun 2012 21:33:12 +0100 Subject: [Cython] Hash-based vtables In-Reply-To: <4FCE5EFC.30407@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> <4FCDB478.3070000@behnel.de> <4FCE3B5F.9080603@astro.uio.no> <4FCE3D51.20009@astro.uio.no> <4FCE5EFC.30407@astro.uio.no> Message-ID: On 5 June 2012 20:33, Dag Sverre Seljebotn wrote: > On 06/05/2012 08:02 PM, mark florisson wrote: >> >> On 5 June 2012 18:09, Dag Sverre Seljebotn >> ?wrote: >>> >>> On 06/05/2012 07:01 PM, Dag Sverre Seljebotn wrote: >>>> >>>> >>>> On 06/05/2012 09:25 AM, Stefan Behnel wrote: >>>>> >>>>> >>>>> Dag Sverre Seljebotn, 04.06.2012 21:44: >>>>>> >>>>>> >>>>>> This can cause crashes/stack smashes >>>>>> etc. if there's lower-64bit-of-md5 collisions, but a) the >>>>>> probability is incredibly small, b) it would only matter in >>>>>> situations that should cause an AttributeError anyway, c) if we >>>>>> really care, we can always use an interning-like mechanism to >>>>>> validate on module loading that its hashes doesn't collide with >>>>>> other hashes (and raise an exception "Congratulations, you've >>>>>> discovered a phenomenal md5 collision, get in touch with cython >>>>>> devs and we'll work around it right away"). 
>>>>> >>>>> >>>>> >>>>> I'm not a big fan of such an attitude. If this happens at runtime, it >>>>> can >>>>> induce any cost from cheap-at-test-time to >>>>> hugely-expensive-in-production. >>>>> Thinking with my evil hat on, this can potentially be data triggered >>>>> from >>>>> the outside (e.g. if a JIT compiler is involved at one end), thus >>>>> possibly >>>>> even leading to a security hole. >>>>> >>>>> We should try to produce software that others can build a business on. >>>> >>>> >>>> >>>> Well, I'd build a business on something that fails with a 5e-7 >>>> probability any day :-) (given that you trust my estimates in the other >>>> post; I think they were rather conservative myself) >>> >>> >>> >>> This was put the wrong way. The chance was 5e-7 that it would fail for >>> anybody over the course of human history (and that was a rather >>> pessimistic >>> estimate). >>> >>> So a more "individual tack": >>> >>> Assume that the process contains 200 MB of method definitions alone, with >>> each method definition being a 8 character string. (That should mean the >>> executable should be several gigabytes :-)) >>> >>> That puts the probability of collision at 10^-34 for that process >>> containing >>> a 64-bit hash collision. >>> >>> >>> Dag >>> _______________________________________________ >>> cython-devel mailing list >>> cython-devel at python.org >>> http://mail.python.org/mailman/listinfo/cython-devel >> >> >> The point is not so much running into this problem accidentally, but >> maliciously. If user input from untrusted users can somehow determine >> the function signatures that are generated and called by a JIT, then a >> malicious user can find collisions offline and cause some fault in a >> valid user program. > > > This took me a while to understand. So the idea is that you're in a > completely managed environment (like Java), and you want to run untrusted > code and have it not segfault or smash the stack. 
Eve then cleverly > assembles a caller/callee pair with mismatching signatures but the same > hash. > > Yes, in that situation 64 bits is perhaps not enough. > > But is this relevant to what we're trying to do here? We're discussing APIs > to talk between Python C extension modules that already have unlimited > powers. I'd think a "managed Cython" would be such a large change that one > could easily change the hash size at that point? > > But I agree it's not as easily written off as I thought. > > > Dag > _______________________________________________ > cython-devel mailing list > cython-devel at python.org > http://mail.python.org/mailman/listinfo/cython-devel It doesn't even necessarily have to be about running user code; a user could craft data input which causes such a situation. For instance, let's say we have a just-in-time specializer which specializes a function for the runtime input types, and the types depend on the user input. For instance, if we write a web application, we can post arrays described by a custom dtype, which draws pictures in some weird way for us. We can get it to specialize pretty much any array type, so that gives us a good opportunity to find collisions. From robertwb at gmail.com Tue Jun 5 22:50:23 2012 From: robertwb at gmail.com (Robert Bradshaw) Date: Tue, 5 Jun 2012 13:50:23 -0700 Subject: [Cython] Hash-based vtables In-Reply-To: <4FCE679C.7000002@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <4FCE679C.7000002@astro.uio.no> Message-ID: On Tue, Jun 5, 2012 at 1:10 PM, Dag Sverre Seljebotn wrote: > On 06/04/2012 11:43 PM, Robert Bradshaw wrote: >> >> On Mon, Jun 4, 2012 at 1:55 PM, Dag Sverre Seljebotn >> wrote: >>> >>> On 06/04/2012 09:44 PM, Dag Sverre Seljebotn wrote: >>>> >>>> >>>> Me and Robert had a long discussion on the NumFOCUS list about this >>>> already, but I figured it was better to continue it and provide more >>>> in-depth benchmark results here.
>>>> >>>> It's basically a new idea of how to provide a vtable based on perfect >>>> hashing, which should be a lot simpler to implement than what I first >>>> imagined. >>>> >>>> I'll write down some context first, if you're familiar with this >>>> skip ahead a bit.. >>>> >>>> This means that you can do fast dispatches *without* the messy >>>> business of binding vtable slots at compile time. To be concrete, this >>>> might e.g. take the form >>>> >>>> def f(obj): >>>> obj.method(3.4) # try to find a vtable with "void method(double)" in it >>>> >>>> or, a more typed approach, >>>> >>>> # File A >>>> cdef class MyImpl: >>>> def double method(double x): return x * x >>>> >>>> # File B >>>> # Here we never know about MyImpl, hence "duck-typed" >>>> @cython.interface >>>> class MyIntf: >>>> def double method(double x): pass >>>> >>>> def f(MyIntf obj): >>>> # obj *can* be MyImpl instance, or whatever else that supports >>>> # that interface >>>> obj.method(3.4) >>>> >>>> >>>> Now, the idea to implement this is: >>>> >>>> a) Both caller and callee pre-hash name/argument string >>>> "mymethod:iidd" to 64 bits of hash data (probably lower 64 bits of >>>> md5) >>>> >>>> b) Callee (MyImpl) generates a vtable of its methods by *perfect* >>>> hashing. What you do is define a final hash fh as a function >>>> of the pre-hash ph, for instance >>>> >>>> fh = ((ph>> vtable.r1) ^ (ph>> vtable.r2) ^ (ph>> vtable.r3))& >>>> vtable.m >>>> >>>> (Me and Robert are benchmarking different functions to use here.) By >>>> playing with r1, r2, r3, you have 64**3 choices of hash function, and >>>> will be able to pick a combination which gives *no* (or very few) >>>> collisions. >>>> >>>> c) Caller then combines the pre-hash generated at compile-time, with >>>> r1, r2, r3, m stored in the vtable header, in order to find the >>>> final location in the hash-table.
>>>> >>>> The exciting thing is that in benchmark, the performance penalty is >>>> actually very slight over a C++-style v-table. (Of course you can >>>> cache a proper vtable, but the fact that you get so close without >>>> caring about caching means that this can be done much faster.) >> >> >> One advantage about caching a vtable is that one can possibly put in >> adapters for non-exact matches. It also opens up the possibility of >> putting in stubs to call def methods if they exist. This needs to be >> fleshed out more, (another CEP :) but could provide for a >> backwards-compatible easy first implementation. >> >>>> Back to my and Robert's discussion on benchmarks: >>>> >>>> I've uploaded benchmarks here: >>>> >>>> https://github.com/dagss/hashvtable/tree/master/dispatchbench >>>> >>>> I've changed the benchmark taking to give more robust numbers (at >>>> least for me), you want to look at the 'min' column. >>>> >>>> I changed the benchmark a bit so that it benchmarks a *callsite*. >>>> So we don't pass 'h' on the stack, but either a) looks it up in a global >>>> variable (default), or b) it's a compile-time constant (immediate in >>>> assembly) (compile with -DIMHASH). >>>> >>>> Similarly, the ID is either an "interned" global variable, or an >>>> immediate (-DIMID). >>>> >>>> The results are very different on my machine depending on this aspect. >>>> My conclusions: >>>> >>>> - Both three shifts with masking, two shifts with a "fallback slot" >>>> (allowing for a single collision), three shifts, two shifts with >>>> two masks allows for constructing good vtables. In the case of only >>>> two shifts, one colliding method gets the twoshift+fback >>>> performance and the rest gets the twoshift performance. >>>> >>>> - Performance is really more affected by whether hashes are >>>> immediates or global variables than the hash function. This is in >>>> contrast to the interning vs. 
key benchmarks -- so I think that if >>>> we looked up the vtable through PyTypeObject, rather than getting >>>> the vtable directly, the loads of the global variables could >>>> potentially be masked by that. >>>> >>>> - My conclusion: Just use lower bits of md5 *both* for the hashing >>>> and the ID-ing (don't bother with any interning), and compile the >>>> thing as a 64-bit immediate. This can cause crashes/stack smashes >>>> etc. if there's lower-64bit-of-md5 collisions, but a) the >>>> probability is incredibly small, b) it would only matter in >>>> situations that should cause an AttributeError anyway, c) if we >>>> really care, we can always use an interning-like mechanism to >>>> validate on module loading that its hashes don't collide with >>>> other hashes (and raise an exception "Congratulations, you've >>>> discovered a phenomenal md5 collision, get in touch with cython >>>> devs and we'll work around it right away"). >> >> >> Due to the birthday paradox, this seems a bit risky. Maybe it's >> because I regularly work with collections much bigger than 2^32, and I >> suppose we're talking about unique method names and signatures here, >> but still... I wonder what the penalty would be for checking the full >> 128 bit hash. (Storing it could allow for greater entropy in the >> optimal hash table search as well). > > > Wonder no more. Here's the penalty for different bit-lengths, all > compile-time constants: > > threeshift: min=6.08e-09 mean=6.11e-09 std=2.81e-11 > val=1200000000.000000 > threeshift96: min=7.53e-09 mean=7.55e-09 std=1.96e-11 > val=1200000000.000000 > threeshift128: min=6.95e-09 mean=6.97e-09 std=2.57e-11 > val=1200000000.000000 > threeshift160: min=8.17e-09 mean=8.23e-09 std=4.06e-11 > val=1200000000.000000 > > And for comparison, when loading the comparison IDs from global variable: > >
threeshift96: min=8.07e-09 mean=8.16e-09 std=4.55e-11 > val=1200000000.000000 > threeshift128: min=8.06e-09 mean=8.18e-09 std=6.71e-11 > val=1200000000.000000 > threeshift160: min=9.71e-09 mean=9.83e-09 std=5.12e-11 > val=1200000000.000000 > > So indeed, > > 64-bit hash < interning < 128 bit hash > > (At least on my Intel Nehalem Core i7 1.87GHz) > > And the load of the global variable may in real life be hidden by other > things going on in the function. > > And, you save vtable memory by having an interned char* and not saving the > hash in the vtable. I'm OK with using the 64-bit hash with a macro to enable further checking. If it becomes an issue, we can partition the vtable into two separate structures (hash64/pointer/flags? + hash160/char*/metadata). That's probably overkill. With an eye to security, perhaps the spec should be sha1 (or sha2?, not sure if that ships with Python). > They should be made more easily runnable so that we could run them on > various systems, but it makes sense to first read up on and figure out which > hash functions are really viable, to keep the number of numbers down. > > I just realized that I never pushed the changes I did to introduce > -DIMHASH/-DIMID etc., but the benchmarks are pushed now. > > > >> We could also do a fallback table. Usually it'd be empty, occasionally >> it'd have one element in it. It'd always be possible to make this big >> enough to avoid collisions in a worst-case scenario. > > > If you do a fallback table it's as much code in the call site as linear > probing... Is linear probing that bad? It's an extra increment and compare in the miss case. > But when I played with the generation side, a failure to create a table at a > given size would *always* be due to a single collision. This is what I did > in the twoshift+fback benchmark. But it won't always be. One can always increase the size of the main table however, if two collisions are rare enough.
>> Duplicate tables works as long as there aren't too many orthogonal >> considerations. Is the GIL the only one? What about "I can propagate >> errors?" Now we're up to 4 tables... > > Would your decision of whether or not to dispatch to a function depend on > whether or not it propagates errors? > > I'm thinking of the "with gil" function case, i.e. callee has: > > ?a) Function to call if you have the GIL > ?b) GIL-acquiring wrapper > > and you want GIL-holding code to call a) and nogil code to call b). > > But one could just make the caller acquire the GIL if needed (which in that > case is so expensive anyway that it can be made the unlikely() path). Are you saying you'd add code to the call site to determine if it needs (and conditionally acquire) the GIL? > I can't think of other situations where you would pick which function to > call based on flags. If the caller doesn't propagate errors, it may want to have different codepaths depending on whether the callee propagates them. - Robert From d.s.seljebotn at astro.uio.no Tue Jun 5 23:41:15 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Tue, 05 Jun 2012 23:41:15 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <4FCE679C.7000002@astro.uio.no> Message-ID: <4FCE7CFB.7000205@astro.uio.no> On 06/05/2012 10:50 PM, Robert Bradshaw wrote: > On Tue, Jun 5, 2012 at 1:10 PM, Dag Sverre Seljebotn > wrote: >> On 06/04/2012 11:43 PM, Robert Bradshaw wrote: >>> >>> On Mon, Jun 4, 2012 at 1:55 PM, Dag Sverre Seljebotn >>> wrote: >>>> >>>> On 06/04/2012 09:44 PM, Dag Sverre Seljebotn wrote: >>>>> >>>>> >>>>> Me and Robert had a long discussion on the NumFOCUS list about this >>>>> already, but I figured it was better to continue it and provide more >>>>> in-depth benchmark results here. 
>>>>> >>>>> It's basically a new idea of how to provide a vtable based on perfect >>>>> hashing, which should be a lot simpler to implement than what I first >>>>> imagined. >>>>> >>>>> I'll write down some context first, if you're familiar with this >>>>> skip ahead a bit.. >>>>> >>>>> This means that you can do fast dispatches *without* the messy >>>>> business of binding vtable slots at compile time. To be concrete, this >>>>> might e.g. take the form >>>>> >>>>> def f(obj): >>>>> obj.method(3.4) # try to find a vtable with "void method(double)" in it >>>>> >>>>> or, a more typed approach, >>>>> >>>>> # File A >>>>> cdef class MyImpl: >>>>> def double method(double x): return x * x >>>>> >>>>> # File B >>>>> # Here we never know about MyImpl, hence "duck-typed" >>>>> @cython.interface >>>>> class MyIntf: >>>>> def double method(double x): pass >>>>> >>>>> def f(MyIntf obj): >>>>> # obj *can* be MyImpl instance, or whatever else that supports >>>>> # that interface >>>>> obj.method(3.4) >>>>> >>>>> >>>>> Now, the idea to implement this is: >>>>> >>>>> a) Both caller and callee pre-hash name/argument string >>>>> "mymethod:iidd" to 64 bits of hash data (probably lower 64 bits of >>>>> md5) >>>>> >>>>> b) Callee (MyImpl) generates a vtable of its methods by *perfect* >>>>> hashing. What you do is define a final hash fh as a function >>>>> of the pre-hash ph, for instance >>>>> >>>>> fh = ((ph>> vtable.r1) ^ (ph>> vtable.r2) ^ (ph>> vtable.r3))& >>>>> vtable.m >>>>> >>>>> (Me and Robert are benchmarking different functions to use here.) By >>>>> playing with r1, r2, r3, you have 64**3 choices of hash function, and >>>>> will be able to pick a combination which gives *no* (or very few) >>>>> collisions. >>>>> >>>>> c) Caller then combines the pre-hash generated at compile-time, with >>>>> r1, r2, r3, m stored in the vtable header, in order to find the >>>>> final location in the hash-table. 
>>>>> >>>>> The exciting thing is that in benchmark, the performance penalty is >>>>> actually very slight over a C++-style v-table. (Of course you can >>>>> cache a proper vtable, but the fact that you get so close without >>>>> caring about caching means that this can be done much faster.) >>> >>> >>> One advantage about caching a vtable is that one can possibly put in >>> adapters for non-exact matches. It also opens up the possibility of >>> putting in stubs to call def methods if they exist. This needs to be >>> fleshed out more, (another CEP :) but could provide for a >>> backwards-compatible easy first implementation. >>> >>>>> Back to my and Robert's discussion on benchmarks: >>>>> >>>>> I've uploaded benchmarks here: >>>>> >>>>> https://github.com/dagss/hashvtable/tree/master/dispatchbench >>>>> >>>>> I've changed the benchmark taking to give more robust numbers (at >>>>> least for me), you want to look at the 'min' column. >>>>> >>>>> I changed the benchmark a bit so that it benchmarks a *callsite*. >>>>> So we don't pass 'h' on the stack, but either a) looks it up in a global >>>>> variable (default), or b) it's a compile-time constant (immediate in >>>>> assembly) (compile with -DIMHASH). >>>>> >>>>> Similarly, the ID is either an "interned" global variable, or an >>>>> immediate (-DIMID). >>>>> >>>>> The results are very different on my machine depending on this aspect. >>>>> My conclusions: >>>>> >>>>> - Both three shifts with masking, two shifts with a "fallback slot" >>>>> (allowing for a single collision), three shifts, two shifts with >>>>> two masks allows for constructing good vtables. In the case of only >>>>> two shifts, one colliding method gets the twoshift+fback >>>>> performance and the rest gets the twoshift performance. >>>>> >>>>> - Performance is really more affected by whether hashes are >>>>> immediates or global variables than the hash function. This is in >>>>> contrast to the interning vs. 
key benchmarks -- so I think that if >>>>> we looked up the vtable through PyTypeObject, rather than getting >>>>> the vtable directly, the loads of the global variables could >>>>> potentially be masked by that. >>>>> >>>>> - My conclusion: Just use lower bits of md5 *both* for the hashing >>>>> and the ID-ing (don't bother with any interning), and compile the >>>>> thing as a 64-bit immediate. This can cause crashes/stack smashes >>>>> etc. if there's lower-64bit-of-md5 collisions, but a) the >>>>> probability is incredibly small, b) it would only matter in >>>>> situations that should cause an AttributeError anyway, c) if we >>>>> really care, we can always use an interning-like mechanism to >>>>> validate on module loading that its hashes doesn't collide with >>>>> other hashes (and raise an exception "Congratulations, you've >>>>> discovered a phenomenal md5 collision, get in touch with cython >>>>> devs and we'll work around it right away"). >>> >>> >>> Due to the birthday paradox, this seems a bit risky. Maybe it's >>> because I regularly work with collections much bigger than 2^32, and I >>> suppose we're talking about unique method names and signatures here, >>> but still... I wonder what the penalty would be for checking the full >>> 128 bit hash. (Storing it could allow for greater entropy in the >>> optimal hash table search as well). >> >> >> Wonder no more. 
Here's the penalty for different bit-lengths, all >> compile-time constants: >> >> threeshift: min=6.08e-09 mean=6.11e-09 std=2.81e-11 >> val=1200000000.000000 >> threeshift96: min=7.53e-09 mean=7.55e-09 std=1.96e-11 >> val=1200000000.000000 >> threeshift128: min=6.95e-09 mean=6.97e-09 std=2.57e-11 >> val=1200000000.000000 >> threeshift160: min=8.17e-09 mean=8.23e-09 std=4.06e-11 >> val=1200000000.000000 >> >> And for comparison, when loading the comparison IDs from global variable: >> >> threeshift: min=6.46e-09 mean=6.52e-09 std=4.95e-11 >> val=1200000000.000000 >> threeshift96: min=8.07e-09 mean=8.16e-09 std=4.55e-11 >> val=1200000000.000000 >> threeshift128: min=8.06e-09 mean=8.18e-09 std=6.71e-11 >> val=1200000000.000000 >> threeshift160: min=9.71e-09 mean=9.83e-09 std=5.12e-11 >> val=1200000000.000000 >> >> So indeed, >> >> 64-bit hash< interning< 128 bit hash >> >> (At least on my Intel Nehalem Core i7 1.87GhZ) >> >> And the load of the global variable may in real life be hidden by other >> things going on in the function. >> >> And, you save vtable memory by having an interned char* and not saving the >> hash in the vtable. > > I'm OK with using the 64-bit hash with a macro to enable further > checking. If it becomes an issue, we can partition the vtable into two > separate structures (hash64/pointer/flags? + hash160/char*/metadata). > That's probably overkill. With an eye to security, perhaps the spec > should be sha1 (or sha2?, not sure if that ships with Python). No, I like splitting up the table, I was assuming we'd stick the char* in a different table anyway. Cache is precious, and the second table would be completely cold in most situations. Is the goal then to avoid having to have an interning registry? Something that hasn't come up so far is that Cython doesn't know the exact types of external typedefs, so it can't generate the hash at Cythonize-time. 
I guess some support for build systems to probe for type sizes and compute the signature hashes in a separate header file would solve this -- with a fallback to computing them at runtime during module loading, if you're not using a supported build system. (But suddenly an interning registry doesn't look so horrible..) Really, I think a micro-benchmark is rather pessimistic about the performance of loading a global variable -- if more stuff happens around the call site then the load will likely be moved ahead and the latency hidden. Perhaps this might even be the case just for going the route through extensibletypeobject. >> They should be made more easily runnable so that we could run them on >> various systems, but it makes sense to first read up on and figure out which >> hash functions are really viable, to keep the number of numbers down. >> >> I just realized that I never pushed the changes I did to introduce >> -DIMHASH/-DIMID etc., but the benchmarks are pushed now. >> >> >> >>> We could also do a fallback table. Usually it'd be empty, occasionally >>> it'd have one element in it. It'd always be possible to make this big >>> enough to avoid collisions in a worst-case scenario. >> >> >> If you do a fallback table it's as much code in the call site as linear >> probing... > > Is linear probing that bad? It's an extra increment and compare in the > miss case. > >> But when I played with the generation side, a failure to create a table at a >> given size would *always* be due to a single collision. This is what I did >> in the twoshift+fback benchmark. > > But it won't always be. One can always increase the size of the main > table however, if two collisions are rare enough. Yes of course, I didn't test 100% fill of a 64-entry table. I was more concerned with making the table 128 or 256 rather than having to go to 512 :-) >>> Duplicate tables works as long as there aren't too many orthogonal >>> considerations. Is the GIL the only one?
What about "I can propagate >>> errors?" Now we're up to 4 tables... >> >> Would your decision of whether or not to dispatch to a function depend on >> whether or not it propagates errors? >> >> I'm thinking of the "with gil" function case, i.e. callee has: >> >> a) Function to call if you have the GIL >> b) GIL-acquiring wrapper >> >> and you want GIL-holding code to call a) and nogil code to call b). >> >> But one could just make the caller acquire the GIL if needed (which in that >> case is so expensive anyway that it can be made the unlikely() path). > > Are you saying you'd add code to the call site to determine if it > needs (and conditionally acquire) the GIL? Well, I'm saying it's an alternative, I'm not sure if it has merit. Basically shift the "with gil" responsibility to the caller in this case. > >> I can't think of other situations where you would pick which function to >> call based on flags. > > If the caller doesn't propagate errors, it may want to have different > codepaths depending on whether the callee propagates them. Not sure if I understand. Would you call a *different* incarnation of the callee depending on this, and need different function pointers for different callers? Otherwise you just check flags after the call and take the appropriate action, with a likely() around the likely one. You need flags, but not a different table. Dag From ian.h.bell at gmail.com Wed Jun 6 10:04:07 2012 From: ian.h.bell at gmail.com (Ian Bell) Date: Wed, 6 Jun 2012 01:04:07 -0700 Subject: [Cython] Resurrecting __dict__ for extension types Message-ID: As per a couple of discussions online ( http://mail.python.org/pipermail/cython-devel/2011-February/000122.html), it looks like at one point it was pretty close to being able to programmatically and automatically generate a __dict__ for extension types like for CPython classes. I have to manually code a function that does exactly what __dict__ should do, and it is a pain. 
I have some classes with tens of attributes, and that is already a big enough pain. This is especially useful to more easily enable deepcopy and pickling for classes. While on the pickling theme, it seems it really ought to be pretty straightforward to automatically pickle extension types. Don't you already have all the necessary information at compile time? This was on the wish list at one point if I am not mistaken and would be very useful to me and lots of other people. I'm finally loving coding in Cython and am finally making sense of how best to use extension types. Regards, Ian -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefan_ml at behnel.de Wed Jun 6 10:58:37 2012 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 06 Jun 2012 10:58:37 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: References: <4FCD100B.7000008@astro.uio.no> <4FCDB478.3070000@behnel.de> <4FCE3B5F.9080603@astro.uio.no> <4FCE3D51.20009@astro.uio.no> <4FCE5EFC.30407@astro.uio.no> Message-ID: <4FCF1BBD.9070709@behnel.de> mark florisson, 05.06.2012 22:33: > It doesn't even necessarily have to be about running user code, a user > could craft data input which causes such a situation. For instance, > let's say we have a just-in-time specializer which specializes a > function for the runtime input types, and the types depend on the user > input. For instance, if we write a web application we can post arrays > to described by a custom dtype, which draws pictures in some weird way > for us. We can get it to specialize pretty much any array type, so > that gives us a good opportunity to find collisions. Yes, and the bad thing is that a very high probability of having no collisions even in combination with the need for a huge amount of brute force work to find one is not enough. An attacker (or otherwise interested user) may just be lucky, and given how low in the application stack this will be used, such a bit of luck may have massive consequences. 
Stefan From d.s.seljebotn at astro.uio.no Wed Jun 6 11:11:15 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Wed, 06 Jun 2012 11:11:15 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: <4FCF1BBD.9070709@behnel.de> References: <4FCD100B.7000008@astro.uio.no> <4FCDB478.3070000@behnel.de> <4FCE3B5F.9080603@astro.uio.no> <4FCE3D51.20009@astro.uio.no> <4FCE5EFC.30407@astro.uio.no> <4FCF1BBD.9070709@behnel.de> Message-ID: <80b0aaa9-a1eb-4fba-b8fb-973766b20ed2@email.android.com> Stefan Behnel wrote: >mark florisson, 05.06.2012 22:33: >> It doesn't even necessarily have to be about running user code, a >user >> could craft data input which causes such a situation. For instance, >> let's say we have a just-in-time specializer which specializes a >> function for the runtime input types, and the types depend on the >user >> input. For instance, if we write a web application we can post arrays >> to described by a custom dtype, which draws pictures in some weird >way >> for us. We can get it to specialize pretty much any array type, so >> that gives us a good opportunity to find collisions. > >Yes, and the bad thing is that a very high probability of having no >collisions even in combination with the need for a huge amount of brute >force work to find one is not enough. An attacker (or otherwise >interested >user) may just be lucky, and given how low in the application stack >this >will be used, such a bit of luck may have massive consequences. Following that line of argument, I guess you keep your money in a mattress then? Our modern world is built around the assumption that people don't get *that* lucky. (I agree though that 64 bits is not enough for the security usecase! I'm just saying that 160 or 256 bits would be.) Dag > >Stefan >_______________________________________________ >cython-devel mailing list >cython-devel at python.org >http://mail.python.org/mailman/listinfo/cython-devel -- Sent from my Android phone with K-9 Mail. 
Please excuse my brevity. From markflorisson88 at gmail.com Wed Jun 6 11:16:00 2012 From: markflorisson88 at gmail.com (mark florisson) Date: Wed, 6 Jun 2012 10:16:00 +0100 Subject: [Cython] Hash-based vtables In-Reply-To: <80b0aaa9-a1eb-4fba-b8fb-973766b20ed2@email.android.com> References: <4FCD100B.7000008@astro.uio.no> <4FCDB478.3070000@behnel.de> <4FCE3B5F.9080603@astro.uio.no> <4FCE3D51.20009@astro.uio.no> <4FCE5EFC.30407@astro.uio.no> <4FCF1BBD.9070709@behnel.de> <80b0aaa9-a1eb-4fba-b8fb-973766b20ed2@email.android.com> Message-ID: On 6 June 2012 10:11, Dag Sverre Seljebotn wrote: > > > Stefan Behnel wrote: > >>mark florisson, 05.06.2012 22:33: >>> It doesn't even necessarily have to be about running user code, a >>user >>> could craft data input which causes such a situation. For instance, >>> let's say we have a just-in-time specializer which specializes a >>> function for the runtime input types, and the types depend on the >>user >>> input. For instance, if we write a web application we can post arrays >>> to described by a custom dtype, which draws pictures in some weird >>way >>> for us. We can get it to specialize pretty much any array type, so >>> that gives us a good opportunity to find collisions. >> >>Yes, and the bad thing is that a very high probability of having no >>collisions even in combination with the need for a huge amount of brute >>force work to find one is not enough. An attacker (or otherwise >>interested >>user) may just be lucky, and given how low in the application stack >>this >>will be used, such a bit of luck may have massive consequences. > > Following that line of argument, I guess you keep your money in a mattress then? Our modern world is built around the assumption that people don't get *that* lucky. > > (I agree though that 64 bits is not enough for the security usecase! I'm just saying that 160 or 256 bits would be.) > > Dag > I think we're arguing different things. 
You agree to the security problem, but Stefan was still emphasizing his old point. >> >>Stefan >>_______________________________________________ >>cython-devel mailing list >>cython-devel at python.org >>http://mail.python.org/mailman/listinfo/cython-devel > > -- > Sent from my Android phone with K-9 Mail. Please excuse my brevity. > _______________________________________________ > cython-devel mailing list > cython-devel at python.org > http://mail.python.org/mailman/listinfo/cython-devel From d.s.seljebotn at astro.uio.no Wed Jun 6 11:16:38 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Wed, 06 Jun 2012 11:16:38 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: <80b0aaa9-a1eb-4fba-b8fb-973766b20ed2@email.android.com> References: <4FCD100B.7000008@astro.uio.no> <4FCDB478.3070000@behnel.de> <4FCE3B5F.9080603@astro.uio.no> <4FCE3D51.20009@astro.uio.no> <4FCE5EFC.30407@astro.uio.no> <4FCF1BBD.9070709@behnel.de> <80b0aaa9-a1eb-4fba-b8fb-973766b20ed2@email.android.com> Message-ID: <4FCF1FF6.8070807@astro.uio.no> On 06/06/2012 11:11 AM, Dag Sverre Seljebotn wrote: > > > Stefan Behnel wrote: > >> mark florisson, 05.06.2012 22:33: >>> It doesn't even necessarily have to be about running user code, a >> user >>> could craft data input which causes such a situation. For instance, >>> let's say we have a just-in-time specializer which specializes a >>> function for the runtime input types, and the types depend on the >> user >>> input. For instance, if we write a web application we can post arrays >>> to described by a custom dtype, which draws pictures in some weird >> way >>> for us. We can get it to specialize pretty much any array type, so >>> that gives us a good opportunity to find collisions. >> >> Yes, and the bad thing is that a very high probability of having no >> collisions even in combination with the need for a huge amount of brute >> force work to find one is not enough. 
An attacker (or otherwise >> interested >> user) may just be lucky, and given how low in the application stack >> this >> will be used, such a bit of luck may have massive consequences. > > Following that line of argument, I guess you keep your money in a mattress then? Our modern world is built around the assumption that people don't get *that* lucky. > > (I agree though that 64 bits is not enough for the security usecase! I'm just saying that 160 or 256 bits would be.) (And just to be clear, my current stance is in favour of using interning for the ID comparison, in the other head of this thread. I just couldn't resist Stefan's bait.) Dag From d.s.seljebotn at astro.uio.no Wed Jun 6 22:41:44 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Wed, 06 Jun 2012 22:41:44 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> Message-ID: <4FCFC088.3000709@astro.uio.no> On 06/05/2012 12:30 AM, Robert Bradshaw wrote: > I just found http://cmph.sourceforge.net/ which looks quite > interesting. Though the resulting hash functions are supposedly cheap, > I have the feeling that branching is considered cheap in this context. Actually, this lead was *very* promising. I believe the very first reference I actually read through and didn't eliminate after the abstract totally swept away our home-grown solutions! 
"Hash & Displace" by Pagh (1999) is actually very simple, easy to understand, and fast both for generation and (the branch-free) lookup: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69.3753&rep=rep1&type=pdf The idea is: - Find a hash `g(x)` to partition the keys into `b` groups (the paper requires b > 2n, though I think in practice you can often get away with less) - Find a hash `f(x)` such that f is 1:1 within each group (which is easily achieved since groups only has a few elements) - For each group, from largest to smallest: Find a displacement `d[group]` so that `f(x) ^ d` doesn't cause collisions. It requires extra storage for the displacement table. However, I think 8 bits per element might suffice even for vtables of 512 or 1024 in size. Even with 16 bits it's rather negligible compared to the minimum-128-bit entries of the table. I benchmarked these hash functions: displace1: ((h >> r1) ^ d[h & 63]) & m1 displace2: ((h >> r1) ^ d[h & m2]) & m1 displace3: ((h >> r1) ^ d[(h >> r2) & m2]) & m1 Only the third one is truly in the spirit of the algorithm, but I think the first two should work well too (and when h is known compile-time, looking up d[h & 63] isn't harder than looking up r1 or m1). My computer is acting up and all my numbers today are slower than the earlier ones (yes, I've disabled turbo-mode in the BIOS for a year ago, and yes, I've pinned the CPU speed). 
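[Editor's note: a toy Python version of the hash-and-displace construction sketched above. It assumes g(x) takes the low bits and f(x) the high 32 bits of the 64-bit pre-hash; the fixed shift of 32 and all the names are illustrative, and this is not exactly any of the benchmarked displace1/2/3 variants.]

```python
import hashlib

def prehash(signature):
    # Lower 64 bits of md5, as in the scheme discussed earlier
    # (first 8 digest bytes, an arbitrary choice here).
    return int.from_bytes(hashlib.md5(signature.encode()).digest()[:8], "little")

def hash_and_displace(hashes, b, m):
    # b = number of displacement-table entries, m = vtable size;
    # both assumed to be powers of two here.
    # g(x) partitions by the low bits; f(x) uses the high 32 bits.
    groups = {}
    for h in hashes:
        groups.setdefault(h & (b - 1), []).append(h)
    d = [0] * b
    table = [None] * m
    # Place groups from largest to smallest, as the paper prescribes.
    for g in sorted(groups, key=lambda k: -len(groups[k])):
        for disp in range(m):
            slots = [((h >> 32) ^ disp) & (m - 1) for h in groups[g]]
            if len(set(slots)) == len(slots) and all(table[s] is None for s in slots):
                d[g] = disp
                for h, s in zip(groups[g], slots):
                    table[s] = h
                break
        else:
            return None  # failed; caller retries with larger b and/or m
    return d, table

def lookup(h, d, table, b, m):
    # The branch-free lookup: ((h >> r) ^ d[h & (b-1)]) & (m-1), with r = 32.
    return table[((h >> 32) ^ d[h & (b - 1)]) & (m - 1)]
```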
But here's today's numbers, compiled with -DIMHASH: direct: min=5.37e-09 mean=5.39e-09 std=1.96e-11 val=2400000000.000000 index: min=6.45e-09 mean=6.46e-09 std=1.15e-11 val=1800000000.000000 twoshift: min=6.99e-09 mean=7.00e-09 std=1.35e-11 val=1800000000.000000 threeshift: min=7.53e-09 mean=7.54e-09 std=1.63e-11 val=1800000000.000000 displace1: min=6.99e-09 mean=7.00e-09 std=1.66e-11 val=1800000000.000000 displace2: min=6.99e-09 mean=7.02e-09 std=2.77e-11 val=1800000000.000000 displace3: min=7.52e-09 mean=7.54e-09 std=1.19e-11 val=1800000000.000000 I did a dirty prototype of the table-finder as well and it works: https://github.com/dagss/hashvtable/blob/master/pagh99.py Dag From d.s.seljebotn at astro.uio.no Wed Jun 6 22:57:37 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Wed, 06 Jun 2012 22:57:37 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: <4FCFC088.3000709@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> <4FCFC088.3000709@astro.uio.no> Message-ID: <4FCFC441.40703@astro.uio.no> On 06/06/2012 10:41 PM, Dag Sverre Seljebotn wrote: > On 06/05/2012 12:30 AM, Robert Bradshaw wrote: >> I just found http://cmph.sourceforge.net/ which looks quite >> interesting. Though the resulting hash functions are supposedly cheap, >> I have the feeling that branching is considered cheap in this context. > > Actually, this lead was *very* promising. I believe the very first > reference I actually read through and didn't eliminate after the > abstract totally swept away our home-grown solutions! 
> > "Hash & Displace" by Pagh (1999) is actually very simple, easy to > understand, and fast both for generation and (the branch-free) lookup: > > http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69.3753&rep=rep1&type=pdf > > > The idea is: > > - Find a hash `g(x)` to partition the keys into `b` groups (the paper > requires b > 2n, though I think in practice you can often get away with > less) > > - Find a hash `f(x)` such that f is 1:1 within each group (which is > easily achieved since groups only has a few elements) > > - For each group, from largest to smallest: Find a displacement > `d[group]` so that `f(x) ^ d` doesn't cause collisions. > > It requires extra storage for the displacement table. However, I think 8 > bits per element might suffice even for vtables of 512 or 1024 in size. > Even with 16 bits it's rather negligible compared to the minimum-128-bit > entries of the table. > > I benchmarked these hash functions: > > displace1: ((h >> r1) ^ d[h & 63]) & m1 > displace2: ((h >> r1) ^ d[h & m2]) & m1 > displace3: ((h >> r1) ^ d[(h >> r2) & m2]) & m1 > > Only the third one is truly in the spirit of the algorithm, but I think > the first two should work well too (and when h is known compile-time, > looking up d[h & 63] isn't harder than looking up r1 or m1). > > My computer is acting up and all my numbers today are slower than the > earlier ones (yes, I've disabled turbo-mode in the BIOS for a year ago, > and yes, I've pinned the CPU speed). 
But here's today's numbers, > compiled with -DIMHASH: > > direct: min=5.37e-09 mean=5.39e-09 std=1.96e-11 val=2400000000.000000 > index: min=6.45e-09 mean=6.46e-09 std=1.15e-11 val=1800000000.000000 > twoshift: min=6.99e-09 mean=7.00e-09 std=1.35e-11 val=1800000000.000000 > threeshift: min=7.53e-09 mean=7.54e-09 std=1.63e-11 val=1800000000.000000 > displace1: min=6.99e-09 mean=7.00e-09 std=1.66e-11 val=1800000000.000000 > displace2: min=6.99e-09 mean=7.02e-09 std=2.77e-11 val=1800000000.000000 > displace3: min=7.52e-09 mean=7.54e-09 std=1.19e-11 val=1800000000.000000 > > > I did a dirty prototype of the table-finder as well and it works: > > https://github.com/dagss/hashvtable/blob/master/pagh99.py The paper obviously puts more effort into minimizing table size than into fast lookup. My hunch is that our choice should be ((h >> table.r) ^ table.d[h & m2]) & m1 and use an 8-bit d (because even if you have 1024 methods, you'd rather double the number of bins than have those 2 extra bits available for displacement options). Then keep incrementing the size of d and the number of table slots (in such an order that the total vtable size is minimized) until success. In practice this should almost always just increase the size of d, and keep the table size at the lowest 2**k that fits the slots (even for 64 methods or 128 methods :-)) Essentially we avoid the shift in the argument to d[] by making d larger.
Dag From robertwb at gmail.com Wed Jun 6 23:00:58 2012 From: robertwb at gmail.com (Robert Bradshaw) Date: Wed, 6 Jun 2012 14:00:58 -0700 Subject: [Cython] Hash-based vtables In-Reply-To: <4FCE7CFB.7000205@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <4FCE679C.7000002@astro.uio.no> <4FCE7CFB.7000205@astro.uio.no> Message-ID: On Tue, Jun 5, 2012 at 2:41 PM, Dag Sverre Seljebotn wrote: > On 06/05/2012 10:50 PM, Robert Bradshaw wrote: >> >> On Tue, Jun 5, 2012 at 1:10 PM, Dag Sverre Seljebotn >> ?wrote: >>> >>> On 06/04/2012 11:43 PM, Robert Bradshaw wrote: >>>> >>>> >>>> On Mon, Jun 4, 2012 at 1:55 PM, Dag Sverre Seljebotn >>>> ? ?wrote: >>>>> >>>>> >>>>> On 06/04/2012 09:44 PM, Dag Sverre Seljebotn wrote: >>>>>> >>>>>> >>>>>> >>>>>> Me and Robert had a long discussion on the NumFOCUS list about this >>>>>> already, but I figured it was better to continue it and provide more >>>>>> in-depth benchmark results here. >>>>>> >>>>>> It's basically a new idea of how to provide a vtable based on perfect >>>>>> hashing, which should be a lot simpler to implement than what I first >>>>>> imagined. >>>>>> >>>>>> I'll write down some context first, if you're familiar with this >>>>>> skip ahead a bit.. >>>>>> >>>>>> This means that you can do fast dispatches *without* the messy >>>>>> business of binding vtable slots at compile time. To be concrete, this >>>>>> might e.g. 
take the form >>>>>> >>>>>> def f(obj): >>>>>> obj.method(3.4) # try to find a vtable with "void method(double)" in >>>>>> it >>>>>> >>>>>> or, a more typed approach, >>>>>> >>>>>> # File A >>>>>> cdef class MyImpl: >>>>>> def double method(double x): return x * x >>>>>> >>>>>> # File B >>>>>> # Here we never know about MyImpl, hence "duck-typed" >>>>>> @cython.interface >>>>>> class MyIntf: >>>>>> def double method(double x): pass >>>>>> >>>>>> def f(MyIntf obj): >>>>>> # obj *can* be MyImpl instance, or whatever else that supports >>>>>> # that interface >>>>>> obj.method(3.4) >>>>>> >>>>>> >>>>>> Now, the idea to implement this is: >>>>>> >>>>>> a) Both caller and callee pre-hash name/argument string >>>>>> "mymethod:iidd" to 64 bits of hash data (probably lower 64 bits of >>>>>> md5) >>>>>> >>>>>> b) Callee (MyImpl) generates a vtable of its methods by *perfect* >>>>>> hashing. What you do is define a final hash fh as a function >>>>>> of the pre-hash ph, for instance >>>>>> >>>>>> fh = ((ph >> vtable.r1) ^ (ph >> vtable.r2) ^ (ph >> vtable.r3)) & vtable.m >>>>>> >>>>>> (Me and Robert are benchmarking different functions to use here.) By >>>>>> playing with r1, r2, r3, you have 64**3 choices of hash function, and >>>>>> will be able to pick a combination which gives *no* (or very few) >>>>>> collisions. >>>>>> >>>>>> c) Caller then combines the pre-hash generated at compile-time, with >>>>>> r1, r2, r3, m stored in the vtable header, in order to find the >>>>>> final location in the hash-table. >>>>>> >>>>>> The exciting thing is that in benchmarks, the performance penalty is >>>>>> actually very slight over a C++-style v-table. (Of course you can >>>>>> cache a proper vtable, but the fact that you get so close without >>>>>> caring about caching means that this can be done much faster.)
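Step (a) above is easy to pin down in a few lines of Python (a sketch only: the `name:argcodes` encoding and the choice of which 64 md5 bits count as "lower" are illustrative assumptions, not spec):

```python
import hashlib

def prehash(name, argcodes):
    """64-bit pre-hash of a signature string like 'mymethod:iidd'
    (lower 64 bits of its md5 digest, taken as the last 8 bytes)."""
    sig = ("%s:%s" % (name, argcodes)).encode("ascii")
    return int.from_bytes(hashlib.md5(sig).digest()[8:], "little")

def final_hash(ph, r1, r2, r3, m):
    """The 'threeshift' final hash computed at the call site from
    r1, r2, r3, m stored in the vtable header."""
    return ((ph >> r1) ^ (ph >> r2) ^ (ph >> r3)) & m
```

The caller computes `prehash` at compile time; only `final_hash` (three shifts, two xors, a mask) runs at dispatch time.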
It also opens up the possibility of >>>> putting in stubs to call def methods if they exist. This needs to be >>>> fleshed out more, (another CEP :) but could provide for a >>>> backwards-compatible easy first implementation. >>>> >>>>>> Back to my and Robert's discussion on benchmarks: >>>>>> >>>>>> I've uploaded benchmarks here: >>>>>> >>>>>> https://github.com/dagss/hashvtable/tree/master/dispatchbench >>>>>> >>>>>> I've changed the benchmark taking to give more robust numbers (at >>>>>> least for me), you want to look at the 'min' column. >>>>>> >>>>>> I changed the benchmark a bit so that it benchmarks a *callsite*. >>>>>> So we don't pass 'h' on the stack, but either a) looks it up in a >>>>>> global >>>>>> variable (default), or b) it's a compile-time constant (immediate in >>>>>> assembly) (compile with -DIMHASH). >>>>>> >>>>>> Similarly, the ID is either an "interned" global variable, or an >>>>>> immediate (-DIMID). >>>>>> >>>>>> The results are very different on my machine depending on this aspect. >>>>>> My conclusions: >>>>>> >>>>>> - Both three shifts with masking, two shifts with a "fallback slot" >>>>>> (allowing for a single collision), three shifts, two shifts with >>>>>> two masks allows for constructing good vtables. In the case of only >>>>>> two shifts, one colliding method gets the twoshift+fback >>>>>> performance and the rest gets the twoshift performance. >>>>>> >>>>>> - Performance is really more affected by whether hashes are >>>>>> immediates or global variables than the hash function. This is in >>>>>> contrast to the interning vs. key benchmarks -- so I think that if >>>>>> we looked up the vtable through PyTypeObject, rather than getting >>>>>> the vtable directly, the loads of the global variables could >>>>>> potentially be masked by that. >>>>>> >>>>>> - My conclusion: Just use lower bits of md5 *both* for the hashing >>>>>> and the ID-ing (don't bother with any interning), and compile the >>>>>> thing as a 64-bit immediate. 
This can cause crashes/stack smashes >>>>>> etc. if there's lower-64bit-of-md5 collisions, but a) the >>>>>> probability is incredibly small, b) it would only matter in >>>>>> situations that should cause an AttributeError anyway, c) if we >>>>>> really care, we can always use an interning-like mechanism to >>>>>> validate on module loading that its hashes don't collide with >>>>>> other hashes (and raise an exception "Congratulations, you've >>>>>> discovered a phenomenal md5 collision, get in touch with cython >>>>>> devs and we'll work around it right away"). >>>> >>>> >>>> >>>> Due to the birthday paradox, this seems a bit risky. Maybe it's >>>> because I regularly work with collections much bigger than 2^32, and I >>>> suppose we're talking about unique method names and signatures here, >>>> but still... I wonder what the penalty would be for checking the full >>>> 128 bit hash. (Storing it could allow for greater entropy in the >>>> optimal hash table search as well.) >>> >>> >>> >>> Wonder no more. Here's the penalty for different bit-lengths, all >>> compile-time constants: >>> >>> threeshift: min=6.08e-09 mean=6.11e-09 std=2.81e-11 >>> val=1200000000.000000 >>> threeshift96: min=7.53e-09 mean=7.55e-09 std=1.96e-11 >>> val=1200000000.000000 >>> threeshift128: min=6.95e-09 mean=6.97e-09 std=2.57e-11 >>> val=1200000000.000000 >>> threeshift160: min=8.17e-09 mean=8.23e-09 std=4.06e-11 >>> val=1200000000.000000 >>> >>> And for comparison, when loading the comparison IDs from a global variable: >>> >>> threeshift: min=6.46e-09 mean=6.52e-09 std=4.95e-11 >>> val=1200000000.000000 >>> threeshift96: min=8.07e-09 mean=8.16e-09 std=4.55e-11 >>> val=1200000000.000000 >>> threeshift128: min=8.06e-09 mean=8.18e-09 std=6.71e-11 >>> val=1200000000.000000 >>> threeshift160: min=9.71e-09 mean=9.83e-09 std=5.12e-11 >>> val=1200000000.000000 >>> >>> So indeed, >>> >>> 64-bit hash < interning < 128-bit hash >>> >>> (At least on my Intel Nehalem Core i7 1.87GHz.) >>> >>> And the load of the global variable may in real life be hidden by other >>> things going on in the function. >>> >>> And, you save vtable memory by having an interned char* and not saving >>> the >>> hash in the vtable. >> >> >> I'm OK with using the 64-bit hash with a macro to enable further >> checking. If it becomes an issue, we can partition the vtable into two >> separate structures (hash64/pointer/flags? + hash160/char*/metadata). >> That's probably overkill. With an eye to security, perhaps the spec >> should be sha1 (or sha2? not sure if that ships with Python). > > > No, I like splitting up the table, I was assuming we'd stick the char* in a > different table anyway. Cache is precious, and the second table would be > completely cold in most situations. > > Is the goal then to avoid having to have an interning registry? Yes, and to avoid invoking an expensive hash function at runtime in order to achieve good distribution. > Something that hasn't come up so far is that Cython doesn't know the exact > types of external typedefs, so it can't generate the hash at Cythonize-time. > I guess some support for build systems to probe for type sizes and compute > the signature hashes in a separate header file would solve this -- with a > fallback to computing them at runtime at module loading, if you're not using a > supported build system. (But suddenly an interning registry doesn't look so > horrible..) It all depends on how strict you want to be. It may be acceptable to let f(int) and f(long) not hash to the same value even if sizeof(int) == sizeof(long).
We could also promote all int types to long or long long, including extern types (assuming, with a C compile-time check, that external types declared up to "long" are <= sizeof(long)). Another option is to let the hash be md5(sig) + hashN(sizeof(extern_arg1), ..., sizeof(extern_argN)) where hashN is a macro. > Really, I think a micro-benchmark is rather pessimistic about the > performance of loading a global variable -- if more stuff happens around the > call site then the load will likely be moved ahead and the latency hidden. > Perhaps this might even be the case just for going the route through > extensibletypeobject. > > >>> They should be made more easily runnable so that we could run them on >>> various systems, but it makes sense to first read up on and figure out >>> which >>> hash functions are really viable, to keep the number of numbers down. >>> >>> I just realized that I never pushed the changes I did to introduce >>> -DIMHASH/-DIMID etc., but the benchmarks are pushed now. >>> >>> >>> >>>> We could also do a fallback table. Usually it'd be empty; occasionally >>>> it'd have one element in it. It'd always be possible to make this big >>>> enough to avoid collisions in a worst-case scenario. >>> >>> >>> >>> If you do a fallback table it's as much code in the call site as linear >>> probing... >> >> >> Is linear probing that bad? It's an extra increment and compare in the >> miss case. >> >>> But when I played with the generation side, a failure to create a table >>> at a >>> given size would *always* be due to a single collision. This is what I >>> did >>> in the twoshift+fback benchmark. >> >> >> But it won't always be. One can always increase the size of the main >> table however, if two collisions are rare enough. > > > Yes of course, I didn't test 100% fill of a 64-entry table.
I was more > concerned with making the table 128 or 256 rather than having to go to 512 > :-) > > >>>> Duplicate tables works as long as there aren't too many orthogonal >>>> considerations. Is the GIL the only one? What about "I can propagate >>>> errors?" Now we're up to 4 tables... >>> >>> >>> Would your decision of whether or not to dispatch to a function depend on >>> whether or not it propagates errors? >>> >>> I'm thinking of the "with gil" function case, i.e. callee has: >>> >>> ?a) Function to call if you have the GIL >>> ?b) GIL-acquiring wrapper >>> >>> and you want GIL-holding code to call a) and nogil code to call b). >>> >>> But one could just make the caller acquire the GIL if needed (which in >>> that >>> case is so expensive anyway that it can be made the unlikely() path). >> >> >> Are you saying you'd add code to the call site to determine if it >> needs (and conditionally acquire) the GIL? > > > Well, I'm saying it's an alternative, I'm not sure if it has merit. > Basically shift the "with gil" responsibility to the caller in this case. > > >> >>> I can't think of other situations where you would pick which function to >>> call based on flags. >> >> >> If the caller doesn't propagate errors, it may want to have different >> codepaths depending on whether the callee propagates them. > > > Not sure if I understand. Would you call a *different* incarnation of the > callee depending on this, and need different function pointers for different > callers? > > Otherwise you just check flags after the call and take the appropriate > action, with a likely() around the likely one. You need flags, but not a > different table. Fair enough. 
- Robert From robertwb at gmail.com Wed Jun 6 23:16:56 2012 From: robertwb at gmail.com (Robert Bradshaw) Date: Wed, 6 Jun 2012 14:16:56 -0700 Subject: [Cython] Hash-based vtables In-Reply-To: <4FCFC441.40703@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> <4FCFC088.3000709@astro.uio.no> <4FCFC441.40703@astro.uio.no> Message-ID: On Wed, Jun 6, 2012 at 1:57 PM, Dag Sverre Seljebotn wrote: > On 06/06/2012 10:41 PM, Dag Sverre Seljebotn wrote: >> >> On 06/05/2012 12:30 AM, Robert Bradshaw wrote: >>> >>> I just found http://cmph.sourceforge.net/ which looks quite >>> interesting. Though the resulting hash functions are supposedly cheap, >>> I have the feeling that branching is considered cheap in this context. >> >> >> Actually, this lead was *very* promising. I believe the very first >> reference I actually read through and didn't eliminate after the >> abstract totally swept away our home-grown solutions! >> >> "Hash & Displace" by Pagh (1999) is actually very simple, easy to >> understand, and fast both for generation and (the branch-free) lookup: >> >> >> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69.3753&rep=rep1&type=pdf >> >> >> The idea is: >> >> - Find a hash `g(x)` to partition the keys into `b` groups (the paper >> requires b > 2n, though I think in practice you can often get away with >> less) >> >> - Find a hash `f(x)` such that f is 1:1 within each group (which is >> easily achieved since groups only has a few elements) >> >> - For each group, from largest to smallest: Find a displacement >> `d[group]` so that `f(x) ^ d` doesn't cause collisions. >> >> It requires extra storage for the displacement table. However, I think 8 >> bits per element might suffice even for vtables of 512 or 1024 in size. >> Even with 16 bits it's rather negligible compared to the minimum-128-bit >> entries of the table. 
>> >> I benchmarked these hash functions: >> >> displace1: ((h >> r1) ^ d[h & 63]) & m1 >> displace2: ((h >> r1) ^ d[h & m2]) & m1 >> displace3: ((h >> r1) ^ d[(h >> r2) & m2]) & m1 >> >> Only the third one is truly in the spirit of the algorithm, but I think >> the first two should work well too (and when h is known compile-time, >> looking up d[h & 63] isn't harder than looking up r1 or m1). >> >> My computer is acting up and all my numbers today are slower than the >> earlier ones (yes, I've disabled turbo-mode in the BIOS for a year ago, >> and yes, I've pinned the CPU speed). But here's today's numbers, >> compiled with -DIMHASH: >> >> direct: min=5.37e-09 mean=5.39e-09 std=1.96e-11 val=2400000000.000000 >> index: min=6.45e-09 mean=6.46e-09 std=1.15e-11 val=1800000000.000000 >> twoshift: min=6.99e-09 mean=7.00e-09 std=1.35e-11 val=1800000000.000000 >> threeshift: min=7.53e-09 mean=7.54e-09 std=1.63e-11 val=1800000000.000000 >> displace1: min=6.99e-09 mean=7.00e-09 std=1.66e-11 val=1800000000.000000 >> displace2: min=6.99e-09 mean=7.02e-09 std=2.77e-11 val=1800000000.000000 >> displace3: min=7.52e-09 mean=7.54e-09 std=1.19e-11 val=1800000000.000000 >> >> >> I did a dirty prototype of the table-finder as well and it works: >> >> https://github.com/dagss/hashvtable/blob/master/pagh99.py > > > The paper obviously puts more effort on minimizing table size and not a fast > lookup. My hunch is that our choice should be > > ((h >> table.r) ^ table.d[h & m2]) & m1 > > and use 8-bits d (because even if you have 1024 methods, you'd rather double > the number of bins than those 2 extra bits available for displacement > options). > > Then keep incrementing the size of d and the number of table slots (in such > an order that the total vtable size is minimized) until success. 
In practice > this should almost always just increase the size of d, and keep the table > size at the lowest 2**k that fits the slots (even for 64 methods or 128 > methods :-)) > > Essentially we avoid the shift in the argument to d[] by making d larger. Nice. I'm surprised that the indirection on d doesn't cost us much; hopefully its size wouldn't be a big issue either. What kinds of densities were you achieving? Going back to the idea of linear probing on a cache miss, this has the advantage that one can write a brain-dead provider that sets m=0 and simply lists the methods instead of requiring a table optimizer. (Most tools, of course, would do the table optimization.) It also lets you get away with a "kind-of good" hash rather than requiring you search until you find a (larger?) perfect one. - Robert From d.s.seljebotn at astro.uio.no Wed Jun 6 23:36:09 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Wed, 06 Jun 2012 23:36:09 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> <4FCFC088.3000709@astro.uio.no> <4FCFC441.40703@astro.uio.no> Message-ID: <4FCFCD49.9030802@astro.uio.no> On 06/06/2012 11:16 PM, Robert Bradshaw wrote: > On Wed, Jun 6, 2012 at 1:57 PM, Dag Sverre Seljebotn > wrote: >> On 06/06/2012 10:41 PM, Dag Sverre Seljebotn wrote: >>> >>> On 06/05/2012 12:30 AM, Robert Bradshaw wrote: >>>> >>>> I just found http://cmph.sourceforge.net/ which looks quite >>>> interesting. Though the resulting hash functions are supposedly cheap, >>>> I have the feeling that branching is considered cheap in this context. >>> >>> >>> Actually, this lead was *very* promising. I believe the very first >>> reference I actually read through and didn't eliminate after the >>> abstract totally swept away our home-grown solutions! 
>>> >>> "Hash& Displace" by Pagh (1999) is actually very simple, easy to >>> understand, and fast both for generation and (the branch-free) lookup: >>> >>> >>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69.3753&rep=rep1&type=pdf >>> >>> >>> The idea is: >>> >>> - Find a hash `g(x)` to partition the keys into `b` groups (the paper >>> requires b> 2n, though I think in practice you can often get away with >>> less) >>> >>> - Find a hash `f(x)` such that f is 1:1 within each group (which is >>> easily achieved since groups only has a few elements) >>> >>> - For each group, from largest to smallest: Find a displacement >>> `d[group]` so that `f(x) ^ d` doesn't cause collisions. >>> >>> It requires extra storage for the displacement table. However, I think 8 >>> bits per element might suffice even for vtables of 512 or 1024 in size. >>> Even with 16 bits it's rather negligible compared to the minimum-128-bit >>> entries of the table. >>> >>> I benchmarked these hash functions: >>> >>> displace1: ((h>> r1) ^ d[h& 63])& m1 >>> displace2: ((h>> r1) ^ d[h& m2])& m1 >>> displace3: ((h>> r1) ^ d[(h>> r2)& m2])& m1 >>> >>> Only the third one is truly in the spirit of the algorithm, but I think >>> the first two should work well too (and when h is known compile-time, >>> looking up d[h& 63] isn't harder than looking up r1 or m1). >>> >>> My computer is acting up and all my numbers today are slower than the >>> earlier ones (yes, I've disabled turbo-mode in the BIOS for a year ago, >>> and yes, I've pinned the CPU speed). 
But here's today's numbers, >>> compiled with -DIMHASH: >>> >>> direct: min=5.37e-09 mean=5.39e-09 std=1.96e-11 val=2400000000.000000 >>> index: min=6.45e-09 mean=6.46e-09 std=1.15e-11 val=1800000000.000000 >>> twoshift: min=6.99e-09 mean=7.00e-09 std=1.35e-11 val=1800000000.000000 >>> threeshift: min=7.53e-09 mean=7.54e-09 std=1.63e-11 val=1800000000.000000 >>> displace1: min=6.99e-09 mean=7.00e-09 std=1.66e-11 val=1800000000.000000 >>> displace2: min=6.99e-09 mean=7.02e-09 std=2.77e-11 val=1800000000.000000 >>> displace3: min=7.52e-09 mean=7.54e-09 std=1.19e-11 val=1800000000.000000 >>> >>> >>> I did a dirty prototype of the table-finder as well and it works: >>> >>> https://github.com/dagss/hashvtable/blob/master/pagh99.py >> >> >> The paper obviously puts more effort on minimizing table size and not a fast >> lookup. My hunch is that our choice should be >> >> ((h>> table.r) ^ table.d[h& m2])& m1 >> >> and use 8-bits d (because even if you have 1024 methods, you'd rather double >> the number of bins than those 2 extra bits available for displacement >> options). >> >> Then keep incrementing the size of d and the number of table slots (in such >> an order that the total vtable size is minimized) until success. In practice >> this should almost always just increase the size of d, and keep the table >> size at the lowest 2**k that fits the slots (even for 64 methods or 128 >> methods :-)) >> >> Essentially we avoid the shift in the argument to d[] by making d larger. > > Nice. I'm surprised that the indirection on d doesn't cost us much; Well, table->d[const & const] compiles down to the same kind of code as table->m1. I guess I'm surprised too that displace2 doesn't penalize. > hopefully its size wouldn't be a big issue either. What kinds of > densities were you achieving? The algorithm is designed for 100% density in the table itself. (We can lift that to compensate for a small space of possible hash functions I guess.) 
I haven't done proper simulations yet, but I just tried |vtable|=128, |d|=128 from the command line and I had 15 successes or so before the first failure. That's with a 100% density in the vtable itself! (And when it fails, you increase |d| to get your success). The caveat is the space spent on d (it's small in comparison, but that's why this isn't too good to be true). A disadvantage might be that we may no longer have the opportunity to not make the table size a power of two (i.e. replace the mask with "if (likely(slot < n))"). I think for that to work one would need to replace the xor group with addition on Z_d. > Going back to the idea of linear probing on a cache miss, this has the > advantage that one can write a brain-dead provider that sets m=0 and > simply lists the methods instead of requiring a table optimizer. (Most > tools, of course, would do the table optimization.) It also lets you > get away with a "kind-of good" hash rather than requiring you search > until you find a (larger?) perfect one. Well, given that we can have 100% density, and generating the table is lightning fast, and the C code to generate the table is likely a 300 line utility... I'm not convinced. We should however make sure that *callers* can do a linear scan and use strcmp if they don't care about performance. 
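The slow-path caller mentioned at the end, a linear scan with string comparison and no hash machinery at all, is trivial; a Python stand-in (the entry layout is a guess for illustration; a C caller would walk the entries with strcmp):

```python
def lookup_slow(entries, signature):
    """Linear scan over vtable entries, comparing signature strings.
    entries: sequence of (signature, funcptr) pairs, with None for
    empty slots; returns the function pointer, or None on a miss."""
    for entry in entries:
        if entry is not None and entry[0] == signature:
            return entry[1]
    return None
```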
Dag From d.s.seljebotn at astro.uio.no Thu Jun 7 00:03:57 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Thu, 07 Jun 2012 00:03:57 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: <4FCFCD49.9030802@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> <4FCFC088.3000709@astro.uio.no> <4FCFC441.40703@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> Message-ID: Dag Sverre Seljebotn wrote: >On 06/06/2012 11:16 PM, Robert Bradshaw wrote: >> On Wed, Jun 6, 2012 at 1:57 PM, Dag Sverre Seljebotn >> wrote: >>> On 06/06/2012 10:41 PM, Dag Sverre Seljebotn wrote: >>>> >>>> On 06/05/2012 12:30 AM, Robert Bradshaw wrote: >>>>> >>>>> I just found http://cmph.sourceforge.net/ which looks quite >>>>> interesting. Though the resulting hash functions are supposedly >cheap, >>>>> I have the feeling that branching is considered cheap in this >context. >>>> >>>> >>>> Actually, this lead was *very* promising. I believe the very first >>>> reference I actually read through and didn't eliminate after the >>>> abstract totally swept away our home-grown solutions! >>>> >>>> "Hash& Displace" by Pagh (1999) is actually very simple, easy to >>>> understand, and fast both for generation and (the branch-free) >lookup: >>>> >>>> >>>> >http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69.3753&rep=rep1&type=pdf >>>> >>>> >>>> The idea is: >>>> >>>> - Find a hash `g(x)` to partition the keys into `b` groups (the >paper >>>> requires b> 2n, though I think in practice you can often get away >with >>>> less) >>>> >>>> - Find a hash `f(x)` such that f is 1:1 within each group (which is >>>> easily achieved since groups only has a few elements) >>>> >>>> - For each group, from largest to smallest: Find a displacement >>>> `d[group]` so that `f(x) ^ d` doesn't cause collisions. >>>> >>>> It requires extra storage for the displacement table. 
However, I >think 8 >>>> bits per element might suffice even for vtables of 512 or 1024 in >size. >>>> Even with 16 bits it's rather negligible compared to the >minimum-128-bit >>>> entries of the table. >>>> >>>> I benchmarked these hash functions: >>>> >>>> displace1: ((h>> r1) ^ d[h& 63])& m1 >>>> displace2: ((h>> r1) ^ d[h& m2])& m1 >>>> displace3: ((h>> r1) ^ d[(h>> r2)& m2])& m1 >>>> >>>> Only the third one is truly in the spirit of the algorithm, but I >think >>>> the first two should work well too (and when h is known >compile-time, >>>> looking up d[h& 63] isn't harder than looking up r1 or m1). >>>> >>>> My computer is acting up and all my numbers today are slower than >the >>>> earlier ones (yes, I've disabled turbo-mode in the BIOS for a year >ago, >>>> and yes, I've pinned the CPU speed). But here's today's numbers, >>>> compiled with -DIMHASH: >>>> >>>> direct: min=5.37e-09 mean=5.39e-09 std=1.96e-11 >val=2400000000.000000 >>>> index: min=6.45e-09 mean=6.46e-09 std=1.15e-11 >val=1800000000.000000 >>>> twoshift: min=6.99e-09 mean=7.00e-09 std=1.35e-11 >val=1800000000.000000 >>>> threeshift: min=7.53e-09 mean=7.54e-09 std=1.63e-11 >val=1800000000.000000 >>>> displace1: min=6.99e-09 mean=7.00e-09 std=1.66e-11 >val=1800000000.000000 >>>> displace2: min=6.99e-09 mean=7.02e-09 std=2.77e-11 >val=1800000000.000000 >>>> displace3: min=7.52e-09 mean=7.54e-09 std=1.19e-11 >val=1800000000.000000 >>>> >>>> >>>> I did a dirty prototype of the table-finder as well and it works: >>>> >>>> https://github.com/dagss/hashvtable/blob/master/pagh99.py >>> >>> >>> The paper obviously puts more effort on minimizing table size and >not a fast >>> lookup. My hunch is that our choice should be >>> >>> ((h>> table.r) ^ table.d[h& m2])& m1 >>> >>> and use 8-bits d (because even if you have 1024 methods, you'd >rather double >>> the number of bins than those 2 extra bits available for >displacement >>> options). 
>>> >>> Then keep incrementing the size of d and the number of table slots >(in such >>> an order that the total vtable size is minimized) until success. In >practice >>> this should almost always just increase the size of d, and keep the >table >>> size at the lowest 2**k that fits the slots (even for 64 methods or >128 >>> methods :-)) >>> >>> Essentially we avoid the shift in the argument to d[] by making d >larger. >> >> Nice. I'm surprised that the indirection on d doesn't cost us much; > >Well, table->d[const & const] compiles down to the same kind of code as > >table->m1. I guess I'm surprised too that displace2 doesn't penalize. > >> hopefully its size wouldn't be a big issue either. What kinds of >> densities were you achieving? > >The algorithm is designed for 100% density in the table itself. (We can > >lift that to compensate for a small space of possible hash functions I >guess.) > >I haven't done proper simulations yet, but I just tried |vtable|=128, >|d|=128 from the command line and I had 15 successes or so before the >first failure. That's with a 100% density in the vtable itself! (And >when it fails, you increase |d| to get your success). > >The caveat is the space spent on d (it's small in comparison, but >that's >why this isn't too good to be true). > >A disadvantage might be that we may no longer have the opportunity to >not make the table size a power of two (i.e. replace the mask with "if >(likely(slot < n))"). I think for that to work one would need to >replace >the xor group with addition on Z_d. Strike this paragraph; don't know what I was thinking... Dag > >> Going back to the idea of linear probing on a cache miss, this has >the >> advantage that one can write a brain-dead provider that sets m=0 and >> simply lists the methods instead of requiring a table optimizer. >(Most >> tools, of course, would do the table optimization.) 
It also lets you >> get away with a "kind-of good" hash rather than requiring you search >> until you find a (larger?) perfect one. > >Well, given that we can have 100% density, and generating the table is >lightning fast, and the C code to generate the table is likely a 300 >line utility... I'm not convinced. > >We should however make sure that *callers* can do a linear scan and use > >strcmp if they don't care about performance. > >Dag >_______________________________________________ >cython-devel mailing list >cython-devel at python.org >http://mail.python.org/mailman/listinfo/cython-devel -- Sent from my Android phone with K-9 Mail. Please excuse my brevity. From robertwb at gmail.com Thu Jun 7 00:26:42 2012 From: robertwb at gmail.com (Robert Bradshaw) Date: Wed, 6 Jun 2012 15:26:42 -0700 Subject: [Cython] Hash-based vtables In-Reply-To: <4FCFCD49.9030802@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> <4FCFC088.3000709@astro.uio.no> <4FCFC441.40703@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> Message-ID: On Wed, Jun 6, 2012 at 2:36 PM, Dag Sverre Seljebotn wrote: > On 06/06/2012 11:16 PM, Robert Bradshaw wrote: >> >> On Wed, Jun 6, 2012 at 1:57 PM, Dag Sverre Seljebotn >> ?wrote: >>> >>> On 06/06/2012 10:41 PM, Dag Sverre Seljebotn wrote: >>>> >>>> >>>> On 06/05/2012 12:30 AM, Robert Bradshaw wrote: >>>>> >>>>> >>>>> I just found http://cmph.sourceforge.net/ which looks quite >>>>> interesting. Though the resulting hash functions are supposedly cheap, >>>>> I have the feeling that branching is considered cheap in this context. >>>> >>>> >>>> >>>> Actually, this lead was *very* promising. I believe the very first >>>> reference I actually read through and didn't eliminate after the >>>> abstract totally swept away our home-grown solutions! 
>>>> "Hash & Displace" by Pagh (1999) is actually very simple, easy to
>>>> understand, and fast both for generation and (the branch-free) lookup:
>>>>
>>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69.3753&rep=rep1&type=pdf
>>>>
>>>> The idea is:
>>>>
>>>> - Find a hash `g(x)` to partition the keys into `b` groups (the paper
>>>> requires b > 2n, though I think in practice you can often get away with
>>>> less)
>>>>
>>>> - Find a hash `f(x)` such that f is 1:1 within each group (which is
>>>> easily achieved since groups only have a few elements)
>>>>
>>>> - For each group, from largest to smallest: Find a displacement
>>>> `d[group]` so that `f(x) ^ d` doesn't cause collisions.
>>>>
>>>> It requires extra storage for the displacement table. However, I think 8
>>>> bits per element might suffice even for vtables of 512 or 1024 in size.
>>>> Even with 16 bits it's rather negligible compared to the minimum-128-bit
>>>> entries of the table.
>>>>
>>>> I benchmarked these hash functions:
>>>>
>>>> displace1: ((h >> r1) ^ d[h & 63]) & m1
>>>> displace2: ((h >> r1) ^ d[h & m2]) & m1
>>>> displace3: ((h >> r1) ^ d[(h >> r2) & m2]) & m1
>>>>
>>>> Only the third one is truly in the spirit of the algorithm, but I think
>>>> the first two should work well too (and when h is known compile-time,
>>>> looking up d[h & 63] isn't harder than looking up r1 or m1).
>>>>
>>>> My computer is acting up and all my numbers today are slower than the
>>>> earlier ones (yes, I disabled turbo-mode in the BIOS a year ago,
>>>> and yes, I've pinned the CPU speed).
>>>> But here's today's numbers,
>>>> compiled with -DIMHASH:
>>>>
>>>> direct: min=5.37e-09 mean=5.39e-09 std=1.96e-11 val=2400000000.000000
>>>> index: min=6.45e-09 mean=6.46e-09 std=1.15e-11 val=1800000000.000000
>>>> twoshift: min=6.99e-09 mean=7.00e-09 std=1.35e-11 val=1800000000.000000
>>>> threeshift: min=7.53e-09 mean=7.54e-09 std=1.63e-11 val=1800000000.000000
>>>> displace1: min=6.99e-09 mean=7.00e-09 std=1.66e-11 val=1800000000.000000
>>>> displace2: min=6.99e-09 mean=7.02e-09 std=2.77e-11 val=1800000000.000000
>>>> displace3: min=7.52e-09 mean=7.54e-09 std=1.19e-11 val=1800000000.000000
>>>>
>>>> I did a dirty prototype of the table-finder as well and it works:
>>>>
>>>> https://github.com/dagss/hashvtable/blob/master/pagh99.py
>>>
>>> The paper obviously puts more effort into minimizing table size than
>>> into fast lookup. My hunch is that our choice should be
>>>
>>> ((h >> table.r) ^ table.d[h & m2]) & m1
>>>
>>> and use 8-bit d (because even if you have 1024 methods, you'd rather
>>> double the number of bins than spend those 2 extra bits on displacement
>>> options).
>>>
>>> Then keep incrementing the size of d and the number of table slots (in
>>> such an order that the total vtable size is minimized) until success. In
>>> practice this should almost always just increase the size of d, and keep
>>> the table size at the lowest 2**k that fits the slots (even for 64
>>> methods or 128 methods :-))
>>>
>>> Essentially we avoid the shift in the argument to d[] by making d larger.
>>
>> Nice. I'm surprised that the indirection on d doesn't cost us much;
>
> Well, table->d[const & const] compiles down to the same kind of code as
> table->m1. I guess I'm surprised too that displace2 doesn't penalize.
>
>> hopefully its size wouldn't be a big issue either. What kinds of
>> densities were you achieving?
>
> The algorithm is designed for 100% density in the table itself.
(We can lift
> that to compensate for a small space of possible hash functions I guess.)
>
> I haven't done proper simulations yet, but I just tried |vtable|=128,
> |d|=128 from the command line and I had 15 successes or so before the first
> failure. That's with a 100% density in the vtable itself! (And when it
> fails, you increase |d| to get your success).
>
> The caveat is the space spent on d (it's small in comparison, but that's why
> this isn't too good to be true).
>
> A disadvantage might be that we may no longer have the opportunity to not
> make the table size a power of two (i.e. replace the mask with "if
> (likely(slot < n))"). I think for that to work one would need to replace the
> xor group with addition on Z_d.
>
>> Going back to the idea of linear probing on a cache miss, this has the
>> advantage that one can write a brain-dead provider that sets m=0 and
>> simply lists the methods instead of requiring a table optimizer. (Most
>> tools, of course, would do the table optimization.) It also lets you
>> get away with a "kind-of good" hash rather than requiring you search
>> until you find a (larger?) perfect one.
>
> Well, given that we can have 100% density, and generating the table is
> lightning fast, and the C code to generate the table is likely a 300 line
> utility... I'm not convinced.

It goes from an extraordinarily simple spec (the table is, at minimum, a
func[2^k] with a couple of extra zero fields, whose struct can be
statically defined in the source by hand) to a, well, not complicated in
the absolute sense, but much more so than the definition above. It is
also variable-size, which makes allocating it globally/on a stack a pain
(I suppose one can choose an upper bound for |d| and |vtable|).

I am playing devil's advocate a bit here; it's probably just a (minor)
con, but worth noting at least.

> We should however make sure that *callers* can do a linear scan and use
> strcmp if they don't care about performance.

Yeah.
That's easier to ensure ;).

- Robert

From dieter at handshake.de Thu Jun 7 10:44:09 2012
From: dieter at handshake.de (Dieter Maurer)
Date: Thu, 7 Jun 2012 10:44:09 +0200
Subject: [Cython] Bug: bad C code generated for (some) "... and ... or ..."
 expressions
Message-ID: <20432.27097.470830.218794@localhost.localdomain>

"cython 0.13" generates bad C code for the attached "pyx" file. "cython"
itself recognizes that it did something wrong and emits ";" to the
generated file:

...
static __pyx_t_12cybug_and_or_pointer __pyx_f_12cybug_and_or_bug(PyObject *__pyx_v_o) {
  __pyx_t_12cybug_and_or_pointer __pyx_r;
  int __pyx_t_1;
  __pyx_t_12cybug_and_or_pointer __pyx_t_2;
  ;
  ;
...

-------------- next part --------------
A non-text attachment was scrubbed...
Name: cybug_and_or.pyx
Type: text/x-cython
Size: 164 bytes
Desc: "cython" source file triggering bad C code generation
URL: 
-------------- next part --------------
The error probably happens because it is difficult for "cython" to
determine the type of "and" and "or" expressions (if the operand types
differ). In a "cond and t or f" expression, however, the result type is
"type(t)" if "type(t) == type(f)", independent of "type(cond)". It might
be worthwhile to special-case this type of expression. It would,
however, be friendlier to output an instructive error message instead of
generating bad C code.
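For reference, the idiom in question can be illustrated at the Python level (the scrubbed cybug_and_or.pyx presumably contained a typed variant of something like this; the function names here are made up):

```python
# The pre-2.5 "cond and t or f" idiom: equivalent to "t if cond else f"
# *provided* t is always truthy -- which holds for non-NULL C pointers,
# the case the bug report is about.
def pick(cond, t, f):
    return cond and t or f

# The well-known pitfall: if t is falsy, f is returned regardless of cond.
# A conditional expression avoids it, and also gives the compiler a single
# expression whose result type is type(t) when type(t) == type(f).
def pick_safe(cond, t, f):
    return t if cond else f
```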
-- Dieter From d.s.seljebotn at astro.uio.no Thu Jun 7 12:20:59 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Thu, 07 Jun 2012 12:20:59 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> <4FCFC088.3000709@astro.uio.no> <4FCFC441.40703@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> Message-ID: <4FD0808B.5080300@astro.uio.no> On 06/07/2012 12:26 AM, Robert Bradshaw wrote: > On Wed, Jun 6, 2012 at 2:36 PM, Dag Sverre Seljebotn > wrote: >> On 06/06/2012 11:16 PM, Robert Bradshaw wrote: >>> >>> On Wed, Jun 6, 2012 at 1:57 PM, Dag Sverre Seljebotn >>> wrote: >>>> >>>> On 06/06/2012 10:41 PM, Dag Sverre Seljebotn wrote: >>>>> >>>>> >>>>> On 06/05/2012 12:30 AM, Robert Bradshaw wrote: >>>>>> >>>>>> >>>>>> I just found http://cmph.sourceforge.net/ which looks quite >>>>>> interesting. Though the resulting hash functions are supposedly cheap, >>>>>> I have the feeling that branching is considered cheap in this context. >>>>> >>>>> >>>>> >>>>> Actually, this lead was *very* promising. I believe the very first >>>>> reference I actually read through and didn't eliminate after the >>>>> abstract totally swept away our home-grown solutions! 
>>>>> >>>>> "Hash& Displace" by Pagh (1999) is actually very simple, easy to >>>>> >>>>> understand, and fast both for generation and (the branch-free) lookup: >>>>> >>>>> >>>>> >>>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69.3753&rep=rep1&type=pdf >>>>> >>>>> >>>>> The idea is: >>>>> >>>>> - Find a hash `g(x)` to partition the keys into `b` groups (the paper >>>>> requires b> 2n, though I think in practice you can often get away with >>>>> less) >>>>> >>>>> - Find a hash `f(x)` such that f is 1:1 within each group (which is >>>>> easily achieved since groups only has a few elements) >>>>> >>>>> - For each group, from largest to smallest: Find a displacement >>>>> `d[group]` so that `f(x) ^ d` doesn't cause collisions. >>>>> >>>>> It requires extra storage for the displacement table. However, I think 8 >>>>> bits per element might suffice even for vtables of 512 or 1024 in size. >>>>> Even with 16 bits it's rather negligible compared to the minimum-128-bit >>>>> entries of the table. >>>>> >>>>> I benchmarked these hash functions: >>>>> >>>>> displace1: ((h>> r1) ^ d[h& 63])& m1 >>>>> displace2: ((h>> r1) ^ d[h& m2])& m1 >>>>> displace3: ((h>> r1) ^ d[(h>> r2)& m2])& m1 >>>>> >>>>> >>>>> Only the third one is truly in the spirit of the algorithm, but I think >>>>> the first two should work well too (and when h is known compile-time, >>>>> looking up d[h& 63] isn't harder than looking up r1 or m1). >>>>> >>>>> >>>>> My computer is acting up and all my numbers today are slower than the >>>>> earlier ones (yes, I've disabled turbo-mode in the BIOS for a year ago, >>>>> and yes, I've pinned the CPU speed). 
But here's today's numbers, >>>>> compiled with -DIMHASH: >>>>> >>>>> direct: min=5.37e-09 mean=5.39e-09 std=1.96e-11 val=2400000000.000000 >>>>> index: min=6.45e-09 mean=6.46e-09 std=1.15e-11 val=1800000000.000000 >>>>> twoshift: min=6.99e-09 mean=7.00e-09 std=1.35e-11 val=1800000000.000000 >>>>> threeshift: min=7.53e-09 mean=7.54e-09 std=1.63e-11 >>>>> val=1800000000.000000 >>>>> displace1: min=6.99e-09 mean=7.00e-09 std=1.66e-11 val=1800000000.000000 >>>>> displace2: min=6.99e-09 mean=7.02e-09 std=2.77e-11 val=1800000000.000000 >>>>> displace3: min=7.52e-09 mean=7.54e-09 std=1.19e-11 val=1800000000.000000 >>>>> >>>>> >>>>> I did a dirty prototype of the table-finder as well and it works: >>>>> >>>>> https://github.com/dagss/hashvtable/blob/master/pagh99.py >>>> >>>> >>>> >>>> The paper obviously puts more effort on minimizing table size and not a >>>> fast >>>> lookup. My hunch is that our choice should be >>>> >>>> ((h>> table.r) ^ table.d[h& m2])& m1 >>>> >>>> >>>> and use 8-bits d (because even if you have 1024 methods, you'd rather >>>> double >>>> the number of bins than those 2 extra bits available for displacement >>>> options). >>>> >>>> Then keep incrementing the size of d and the number of table slots (in >>>> such >>>> an order that the total vtable size is minimized) until success. In >>>> practice >>>> this should almost always just increase the size of d, and keep the table >>>> size at the lowest 2**k that fits the slots (even for 64 methods or 128 >>>> methods :-)) >>>> >>>> Essentially we avoid the shift in the argument to d[] by making d larger. >>> >>> >>> Nice. I'm surprised that the indirection on d doesn't cost us much; >> >> >> Well, table->d[const& const] compiles down to the same kind of code as >> table->m1. I guess I'm surprised too that displace2 doesn't penalize. >> >> >>> hopefully its size wouldn't be a big issue either. What kinds of >>> densities were you achieving? 
OK, simulation results just in (for the displace2 hash), and they
exceeded my expectations.

I always fill the table with n=2^k keys, and fix b = n (b means |d|).
Then the failure rates are (top two are 100,000 simulations, the rest
are 1000 simulations):

n=   8 b=   8 failure-rate=0.0019 try-mean=4.40 try-max=65
n=  16 b=  16 failure-rate=0.0008 try-mean=5.02 try-max=65
n=  32 b=  32 failure-rate=0.0000 try-mean=5.67 try-max=25
n=  64 b=  64 failure-rate=0.0000 try-mean=6.60 try-max=29
n= 128 b= 128 failure-rate=0.0000 try-mean=7.64 try-max=22
n= 256 b= 256 failure-rate=0.0000 try-mean=8.66 try-max=37
n= 512 b= 512 failure-rate=0.0000 try-mean=9.57 try-max=26
n=1024 b=1024 failure-rate=0.0000 try-mean=10.66 try-max=34

Try-mean and try-max are how many r's needed to be tried before success,
so they give an indication of how much margin is left before failure.

For the ~1/1000 chance of failure for n=8 and n=16, we would proceed to
let b=2*n (100,000 simulations):

n=   8 b=  16 failure-rate=0.0001 try-mean=2.43 try-max=65
n=  16 b=  32 failure-rate=0.0000 try-mean=3.40 try-max=65

NOTE: The 512...2048 results were with 16-bit displacements; with 8-bit
displacements they mostly failed. So we either need to make each element
of d 16 bits, or, e.g., store 512 entries in a 1024-slot table (which
succeeded most of the time with 8-bit displacements). I'm +1 on 16-bit
displacements.

The algorithm is rather fast and concise:

https://github.com/dagss/hashvtable/blob/master/pagh99.py

>> The algorithm is designed for 100% density in the table itself. (We can lift
>> that to compensate for a small space of possible hash functions I guess.)
>>
>> I haven't done proper simulations yet, but I just tried |vtable|=128,
>> |d|=128 from the command line and I had 15 successes or so before the first
>> failure. That's with a 100% density in the vtable itself! (And when it
>> fails, you increase |d| to get your success).
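For readers who don't want to open pagh99.py, the construction being simulated amounts to roughly the following (a simplified sketch, not the actual prototype; it searches shift values r and per-bin 16-bit displacements for the displace2 hash, and the 64-bit md5 prehash is an assumption of the sketch):

```python
import hashlib

def build_table(keys):
    """Search for (r, d) such that

        slot = ((h >> r) ^ d[h & m2]) & m1

    is collision-free for the prehashes of `keys`, with
    |table| = |d| = n = the next power of two >= len(keys)."""
    n = 1
    while n < len(keys):
        n *= 2
    m1 = m2 = n - 1
    # 64-bit prehashes derived from md5, as in the thread.
    hashes = [int.from_bytes(hashlib.md5(k).digest()[:8], "little")
              for k in keys]
    # Partition into bins by the low bits; place the largest bins first.
    bins = {}
    for h in hashes:
        bins.setdefault(h & m2, []).append(h)
    groups = sorted(bins.items(), key=lambda kv: len(kv[1]), reverse=True)
    for r in range(64):  # try shifts until a perfect layout is found
        d = [0] * n
        used = set()
        for g, hs in groups:
            # Find a 16-bit displacement mapping this bin onto free,
            # distinct slots.
            for disp in range(1 << 16):
                slots = {((h >> r) ^ disp) & m1 for h in hs}
                if len(slots) == len(hs) and not (slots & used):
                    d[g] = disp
                    used |= slots
                    break
            else:
                break  # no displacement worked; try the next r
        else:
            return r, d
    raise ValueError("no table found; double |d| and retry")

def lookup_slot(key, r, d):
    # Valid here because |table| == |d|, so one mask serves as both m1 and m2.
    h = int.from_bytes(hashlib.md5(key).digest()[:8], "little")
    return ((h >> r) ^ d[h & (len(d) - 1)]) & (len(d) - 1)
```

Growing |d| on failure (the "increase |d| to get your success" step) would just mean rerunning the search with a larger bin mask m2.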
>> >> The caveat is the space spent on d (it's small in comparison, but that's why >> this isn't too good to be true). >> >> A disadvantage might be that we may no longer have the opportunity to not >> make the table size a power of two (i.e. replace the mask with "if >> (likely(slot< n))"). I think for that to work one would need to replace the >> xor group with addition on Z_d. >> >> >>> Going back to the idea of linear probing on a cache miss, this has the >>> advantage that one can write a brain-dead provider that sets m=0 and >>> simply lists the methods instead of requiring a table optimizer. (Most >>> tools, of course, would do the table optimization.) It also lets you >>> get away with a "kind-of good" hash rather than requiring you search >>> until you find a (larger?) perfect one. >> >> >> Well, given that we can have 100% density, and generating the table is >> lightning fast, and the C code to generate the table is likely a 300 line >> utility... I'm not convinced. > > It goes from an extraordinary simple spec (table is, at minimum, a > func[2^k] with a couple of extra zero fields, whose struct can be > statically defined in the source by hand) to a, well, not complicated > in the absolute sense, but much more so than the definition above. It > also is variable-size which makes allocating it globally/on a stack a > pain (I suppose one can choose an upper bound for |d| and |vtable|). > > I am a bit playing devil's advocate here, it's probably just a (minor) > con, but worth noting at least. If you were willing to go the interning route, so that you didn't need to fill the table with md5 hashes anyway, I'd say you'd have a stronger point :-) Given the results above, static allocation can at least be solved in a way that is probably user-friendly enough: PyHashVTable_16_16 mytable; ...init () { mytable.functions = { ... 
}; if (PyHashVTable_Ready((PyHashVTable*)mytable, 16, 16) == -1) return -1; } Now, with chance ~1/1000, you're going to get an exception saying "Please try PyHashVTable_16_32". (And since that's deterministic given the function definitions you always catch it at once.) Dag From d.s.seljebotn at astro.uio.no Thu Jun 7 12:35:37 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Thu, 07 Jun 2012 12:35:37 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: <4FD0808B.5080300@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> <4FCFC088.3000709@astro.uio.no> <4FCFC441.40703@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> <4FD0808B.5080300@astro.uio.no> Message-ID: <4FD083F9.2030006@astro.uio.no> On 06/07/2012 12:20 PM, Dag Sverre Seljebotn wrote: > On 06/07/2012 12:26 AM, Robert Bradshaw wrote: >> On Wed, Jun 6, 2012 at 2:36 PM, Dag Sverre Seljebotn >> wrote: >>> On 06/06/2012 11:16 PM, Robert Bradshaw wrote: >>>> >>>> On Wed, Jun 6, 2012 at 1:57 PM, Dag Sverre Seljebotn >>>> wrote: >>>>> >>>>> On 06/06/2012 10:41 PM, Dag Sverre Seljebotn wrote: >>>>>> >>>>>> >>>>>> On 06/05/2012 12:30 AM, Robert Bradshaw wrote: >>>>>>> >>>>>>> >>>>>>> I just found http://cmph.sourceforge.net/ which looks quite >>>>>>> interesting. Though the resulting hash functions are supposedly >>>>>>> cheap, >>>>>>> I have the feeling that branching is considered cheap in this >>>>>>> context. >>>>>> >>>>>> >>>>>> >>>>>> Actually, this lead was *very* promising. I believe the very first >>>>>> reference I actually read through and didn't eliminate after the >>>>>> abstract totally swept away our home-grown solutions! 
>>>>>> >>>>>> "Hash& Displace" by Pagh (1999) is actually very simple, easy to >>>>>> >>>>>> understand, and fast both for generation and (the branch-free) >>>>>> lookup: >>>>>> >>>>>> >>>>>> >>>>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69.3753&rep=rep1&type=pdf >>>>>> >>>>>> >>>>>> >>>>>> The idea is: >>>>>> >>>>>> - Find a hash `g(x)` to partition the keys into `b` groups (the paper >>>>>> requires b> 2n, though I think in practice you can often get away >>>>>> with >>>>>> less) >>>>>> >>>>>> - Find a hash `f(x)` such that f is 1:1 within each group (which is >>>>>> easily achieved since groups only has a few elements) >>>>>> >>>>>> - For each group, from largest to smallest: Find a displacement >>>>>> `d[group]` so that `f(x) ^ d` doesn't cause collisions. >>>>>> >>>>>> It requires extra storage for the displacement table. However, I >>>>>> think 8 >>>>>> bits per element might suffice even for vtables of 512 or 1024 in >>>>>> size. >>>>>> Even with 16 bits it's rather negligible compared to the >>>>>> minimum-128-bit >>>>>> entries of the table. >>>>>> >>>>>> I benchmarked these hash functions: >>>>>> >>>>>> displace1: ((h>> r1) ^ d[h& 63])& m1 >>>>>> displace2: ((h>> r1) ^ d[h& m2])& m1 >>>>>> displace3: ((h>> r1) ^ d[(h>> r2)& m2])& m1 >>>>>> >>>>>> >>>>>> Only the third one is truly in the spirit of the algorithm, but I >>>>>> think >>>>>> the first two should work well too (and when h is known compile-time, >>>>>> looking up d[h& 63] isn't harder than looking up r1 or m1). >>>>>> >>>>>> >>>>>> My computer is acting up and all my numbers today are slower than the >>>>>> earlier ones (yes, I've disabled turbo-mode in the BIOS for a year >>>>>> ago, >>>>>> and yes, I've pinned the CPU speed). 
But here's today's numbers, >>>>>> compiled with -DIMHASH: >>>>>> >>>>>> direct: min=5.37e-09 mean=5.39e-09 std=1.96e-11 val=2400000000.000000 >>>>>> index: min=6.45e-09 mean=6.46e-09 std=1.15e-11 val=1800000000.000000 >>>>>> twoshift: min=6.99e-09 mean=7.00e-09 std=1.35e-11 >>>>>> val=1800000000.000000 >>>>>> threeshift: min=7.53e-09 mean=7.54e-09 std=1.63e-11 >>>>>> val=1800000000.000000 >>>>>> displace1: min=6.99e-09 mean=7.00e-09 std=1.66e-11 >>>>>> val=1800000000.000000 >>>>>> displace2: min=6.99e-09 mean=7.02e-09 std=2.77e-11 >>>>>> val=1800000000.000000 >>>>>> displace3: min=7.52e-09 mean=7.54e-09 std=1.19e-11 >>>>>> val=1800000000.000000 >>>>>> >>>>>> >>>>>> I did a dirty prototype of the table-finder as well and it works: >>>>>> >>>>>> https://github.com/dagss/hashvtable/blob/master/pagh99.py >>>>> >>>>> >>>>> >>>>> The paper obviously puts more effort on minimizing table size and >>>>> not a >>>>> fast >>>>> lookup. My hunch is that our choice should be >>>>> >>>>> ((h>> table.r) ^ table.d[h& m2])& m1 >>>>> >>>>> >>>>> and use 8-bits d (because even if you have 1024 methods, you'd rather >>>>> double >>>>> the number of bins than those 2 extra bits available for displacement >>>>> options). >>>>> >>>>> Then keep incrementing the size of d and the number of table slots (in >>>>> such >>>>> an order that the total vtable size is minimized) until success. In >>>>> practice >>>>> this should almost always just increase the size of d, and keep the >>>>> table >>>>> size at the lowest 2**k that fits the slots (even for 64 methods or >>>>> 128 >>>>> methods :-)) >>>>> >>>>> Essentially we avoid the shift in the argument to d[] by making d >>>>> larger. >>>> >>>> >>>> Nice. I'm surprised that the indirection on d doesn't cost us much; >>> >>> >>> Well, table->d[const& const] compiles down to the same kind of code as >>> table->m1. I guess I'm surprised too that displace2 doesn't penalize. >>> >>> >>>> hopefully its size wouldn't be a big issue either. 
What kinds of >>>> densities were you achieving? > > OK, simulation results just in (for the displace2 hash), and they > exceeded my expectations. > > I always fill the table with n=2^k keys, and fix b = n (b means |d|). > Then the failure rates are (top two are 100,000 simulations, the rest > are 1000 simulations): > > n= 8 b= 8 failure-rate=0.0019 try-mean=4.40 try-max=65 > n= 16 b= 16 failure-rate=0.0008 try-mean=5.02 try-max=65 > n= 32 b= 32 failure-rate=0.0000 try-mean=5.67 try-max=25 > n= 64 b= 64 failure-rate=0.0000 try-mean=6.60 try-max=29 > n= 128 b= 128 failure-rate=0.0000 try-mean=7.64 try-max=22 > n= 256 b= 256 failure-rate=0.0000 try-mean=8.66 try-max=37 > n= 512 b= 512 failure-rate=0.0000 try-mean=9.57 try-max=26 > n=1024 b= 1024 failure-rate=0.0000 try-mean=10.66 try-max=34 > > Try-mean and try-max is how many r's needed to be tried before success, > so it gives an indication how much is left before failure. > > For the ~1/1000 chance of failure for n=8 and n=16, we would proceed to > let b=2*n (100,000 simulations): > > n= 8 b= 16 failure-rate=0.0001 try-mean=2.43 try-max=65 > n= 16 b= 32 failure-rate=0.0000 try-mean=3.40 try-max=65 > > NOTE: The 512...2048 results were with 16 bits displacements, with 8 bit > displacements they mostly failed. So we either need to make each element > of d 16 bits, or, e.g., store 512 entries in a 1024-slot table (which > succeeded most of the time with 8 bit displacements). I'm +1 on 16 bits > displacements. > > The algorithm is rather fast and concise: > > https://github.com/dagss/hashvtable/blob/master/pagh99.py > >>> The algorithm is designed for 100% density in the table itself. (We >>> can lift >>> that to compensate for a small space of possible hash functions I >>> guess.) >>> >>> I haven't done proper simulations yet, but I just tried |vtable|=128, >>> |d|=128 from the command line and I had 15 successes or so before the >>> first >>> failure. That's with a 100% density in the vtable itself! 
(And when it >>> fails, you increase |d| to get your success). >>> >>> The caveat is the space spent on d (it's small in comparison, but >>> that's why >>> this isn't too good to be true). >>> >>> A disadvantage might be that we may no longer have the opportunity to >>> not >>> make the table size a power of two (i.e. replace the mask with "if >>> (likely(slot< n))"). I think for that to work one would need to >>> replace the >>> xor group with addition on Z_d. >>> >>> >>>> Going back to the idea of linear probing on a cache miss, this has the >>>> advantage that one can write a brain-dead provider that sets m=0 and >>>> simply lists the methods instead of requiring a table optimizer. (Most >>>> tools, of course, would do the table optimization.) It also lets you >>>> get away with a "kind-of good" hash rather than requiring you search >>>> until you find a (larger?) perfect one. >>> >>> >>> Well, given that we can have 100% density, and generating the table is >>> lightning fast, and the C code to generate the table is likely a 300 >>> line >>> utility... I'm not convinced. >> >> It goes from an extraordinary simple spec (table is, at minimum, a >> func[2^k] with a couple of extra zero fields, whose struct can be >> statically defined in the source by hand) to a, well, not complicated >> in the absolute sense, but much more so than the definition above. It >> also is variable-size which makes allocating it globally/on a stack a >> pain (I suppose one can choose an upper bound for |d| and |vtable|). >> >> I am a bit playing devil's advocate here, it's probably just a (minor) >> con, but worth noting at least. > > If you were willing to go the interning route, so that you didn't need > to fill the table with md5 hashes anyway, I'd say you'd have a stronger > point :-) > > Given the results above, static allocation can at least be solved in a > way that is probably user-friendly enough: > > PyHashVTable_16_16 mytable; > > ...init () { > mytable.functions = { ... 
};
> if (PyHashVTable_Ready((PyHashVTable*)mytable, 16, 16) == -1) return -1;
> }
>
> Now, with chance ~1/1000, you're going to get an exception saying
> "Please try PyHashVTable_16_32". (And since that's deterministic given
> the function definitions you always catch it at once.)

PS. PyHashVTable_Ready would do the md5's and reorder the functions etc.
as well.

Dag

From d.s.seljebotn at astro.uio.no Thu Jun 7 12:45:52 2012
From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn)
Date: Thu, 07 Jun 2012 12:45:52 +0200
Subject: [Cython] Hash-based vtables
In-Reply-To: References: <4FCD100B.7000008@astro.uio.no>
 <4FCD20DC.6090906@astro.uio.no> <4FCE679C.7000002@astro.uio.no>
 <4FCE7CFB.7000205@astro.uio.no>
Message-ID: <4FD08660.9080104@astro.uio.no>

On 06/06/2012 11:00 PM, Robert Bradshaw wrote:
> On Tue, Jun 5, 2012 at 2:41 PM, Dag Sverre Seljebotn
> wrote:
>> Is the goal then to avoid having to have an interning registry?
>
> Yes, and to avoid invoking an expensive hash function at runtime in
> order to achieve good distribution.

I don't understand. Compilation of call-sites would always generate a
hash. You also need them while initializing/composing the hash table.

But the storage and comparison of the hash rather than an interned
string seems orthogonal to that.

If it weren't for the security concern I agree with you. But I think
Mark and Stefan make a good point. Since you could hand a JIT-ed vtable
(potentially the result of "trusted and verified user input") to a
Cython function, *all* call-sites should use the full 160 bits.

Interning solves this in a better way, and preserves vtable memory to boot.

A collision registry would work against a security breach but still
allow a DoS attack.

Our dependencies are already:

- md5
- Pagh99 algorithm

Why not throw in an interning registry as well ;-)

But then the end-result is pretty cool.
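For concreteness, the statically allocated PyHashVTable_16_16 idea from earlier in the thread, combined with the displace2 lookup, might end up looking roughly like this (a sketch only: field names and layout are illustrative, not a proposed spec):

```c
#include <stdint.h>

/* Hypothetical fixed-size variant: 16 table slots, 16 displacement bins,
 * 16-bit displacements (per the simulation results above). */
typedef struct {
    uint64_t r;              /* shift chosen by the table optimizer */
    uint64_t m1, m2;         /* slot mask and displacement-bin mask */
    uint16_t d[16];          /* displacement table */
    void    *functions[16];  /* function pointers, reordered by "Ready" */
} PyHashVTable_16_16;

/* The branch-free displace2 lookup: slot = ((h >> r) ^ d[h & m2]) & m1. */
static void *hashvtable_lookup(const PyHashVTable_16_16 *t, uint64_t h) {
    return t->functions[((h >> t->r) ^ t->d[h & t->m2]) & t->m1];
}
```

Because every field is fixed-size, such a struct can be declared globally or on the stack and filled in by a `_Ready`-style initializer at module load, which is the property being argued for here.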
>> Something that hasn't come up so far is that Cython doesn't know the exact
>> types of external typedefs, so it can't generate the hash at Cythonize-time.
>> I guess some support for build systems to probe for type sizes and compute
>> the signature hashes in a separate header file would solve this -- with a
>> fallback to computing them at runtime at module loading, if you're not
>> using a supported build system. (But suddenly an interning registry doesn't
>> look so horrible..)
>
> It all depends on how strict you want to be. It may be acceptable to
> let f(int) and f(long) not hash to the same value even if sizeof(int)
> == sizeof(long). We could also promote all int types to long or long
> long, including extern types (assuming, with a C-compile-time check,
> external types declared up to "long" are <= sizeof(long)). Another

Please no, I don't like any of those. We should not make the trouble
with external typedefs worse than it already is. (Part of me wants to
just declare that Cython is like Go with no implicit conversions to
avoid inheriting the ugly coercion rules of C anyway...)

> option is to let the hash be md5(sig) + hashN(sizeof(extern_arg1),
> sizeof(extern_argN)) where hashN is a macro.

Good idea. Would the following destroy all the nice properties of md5?
I guess I wouldn't use it for crypto any longer...:

hash("mymethod:iiZd") =
md5("mymethod") ^ md5("i\x1") ^ md5("i\x2") ^ md5("Z\x3") ^ md5("d\x4")

Dag

From d.s.seljebotn at astro.uio.no Thu Jun 7 12:47:39 2012
From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn)
Date: Thu, 07 Jun 2012 12:47:39 +0200
Subject: [Cython] Hash-based vtables
In-Reply-To: <4FD08660.9080104@astro.uio.no>
References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no>
 <4FCE679C.7000002@astro.uio.no> <4FCE7CFB.7000205@astro.uio.no>
 <4FD08660.9080104@astro.uio.no>
Message-ID: <4FD086CB.5090201@astro.uio.no>

On 06/07/2012 12:45 PM, Dag Sverre Seljebotn wrote:
> On 06/06/2012 11:00 PM, Robert Bradshaw wrote:
>> On Tue, Jun 5, 2012 at 2:41 PM, Dag Sverre Seljebotn
>> wrote:
>>> Is the goal then to avoid having to have an interning registry?
>>
>> Yes, and to avoid invoking an expensive hash function at runtime in
>> order to achieve good distribution.
>
> I don't understand. Compilation of call-sites would always generate a
> hash. You also need them while initializing/composing the hash table.
>
> But the storage and comparison of the hash rather than an interned
> string seems orthogonal to that.
>
> If it weren't for the security concern I agree with you. But I think
> Mark and Stefan make a good point. Since you could hand a JIT-ed vtable
> (potentially the result of "trusted and verified user input") to a
> Cython function, *all* call-sites should use the full 160 bits.
>
> Interning solves this in a better way, and preserves vtable memory to boot.

No, it's not necessarily *better* -- I meant, it's going to be faster
than the 160-bit compare. And I think throwing in a user option that
anybody actually needs to care about would be a failure here.

Dag

> A collision registry would work against a security breach but still
> allow a DoS attack.
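The xor-combined hash quoted above can be sketched as follows (the per-position byte tag is a detail worth noting: it is what keeps two identical argument codes from cancelling each other out under xor, and it makes argument order matter; the exact encoding here is an assumption, not the proposal):

```python
import hashlib

def sig_hash(name, argcodes):
    # hash("mymethod:iiZd") = md5("mymethod") ^ md5("i\x01") ^ md5("i\x02")
    #                           ^ md5("Z\x03") ^ md5("d\x04")
    h = int.from_bytes(hashlib.md5(name.encode()).digest(), "little")
    for pos, code in enumerate(argcodes, start=1):
        term = hashlib.md5(code.encode() + bytes([pos])).digest()
        h ^= int.from_bytes(term, "little")
    return h
```

A build system that only learns the size of an external typedef at C compile time could then xor in that one argument's term without recomputing the whole digest, which seems to be the point of the construction.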
> > Our dependencies are already: > > - md5 > - Pagh99 algorithm > > Why not throw in an interning registry as well ;-) > > But then the end-result is pretty cool. > >>> Something that hasn't come up so far is that Cython doesn't know the >>> exact >>> types of external typedefs, so it can't generate the hash at >>> Cythonize-time. >>> I guess some support for build systems to probe for type sizes and >>> compute >>> the signature hashes in a sepearate header file would solve this -- >>> with a >>> fallback to computing them runtime at module loading, if you're not >>> using a >>> supported build system. (But suddenly an interning registry doesn't >>> look so >>> horrible..) >> >> It all depends on how strict you want to be. It may be acceptable to >> let f(int) and f(long) not hash to the same value even if sizeof(int) >> == sizeof(long). We could also promote all int types to long or long >> long, including extern times (assuming, with a c-compile-time check, >> external types declared up to "long" are<= sizeof(long)). Another > > Please no, I don't like any of those. We should not make the trouble > with external typedefs worse than it already is. (Part of me wants to > just declare that Cython is like Go with no implicit conversions to > aovid inheriting the ugly coercion rules of C anyway...) > >> option is to let the hash be md5(sig) + hashN(sizeof(extern_arg1), >> sizeof(extern_argN)) where hashN is a macro. > > Good idea. Would the following destroy all the nice properties of md5? I > guess I wouldn't use it for crypto any longer...: > > hash("mymethod:iiZd") = > md5("mymethod") ^ md5("i\x1") ^ md5("i\x2") ^ md5("Z\x3") ^ md5("d\x4") > > Dag From dieter at handshake.de Thu Jun 7 13:32:43 2012 From: dieter at handshake.de (Dieter Maurer) Date: Thu, 7 Jun 2012 13:32:43 +0200 Subject: [Cython] Why does "__cinit__" insists on converting its arguments to Python objects? 
Message-ID: <20432.37211.743043.567461@localhost.localdomain>

The following cython source leads to a "Cannot convert 'pointer' to
Python object".

ctypedef void * pointer

cdef extern from "nonexistant.h":
  cdef pointer to_pointer(object)

cdef class C:
  cdef pointer p
  def __cinit__(self, pointer p):
    self.p = p

c = C(to_pointer(None))

Why does the constructor call try an implicit conversion to a Python
object even though it gets precisely the type indicated by its signature?

I am working on a binding for "libxmlsec". The behaviour above leads to
an unnatural mapping. Two examples:

1. "libxmlsec" has the concept of a key (used for digital signatures or
encryption), naturally mapped onto a "cdef class Key" encapsulating the
xmlsec key pointer. "libxmlsec" provides many functions to create keys -
naturally mapped onto class methods used as alternative constructors.
If "Cython" allowed C-level parameters for "__cinit__", they could look
like:

  cdef xmlSecKeyPtr xkey = ... some "libxmlsec" key generating function ...
  return Key(xkey)

With the restriction, this must look like:

  cdef Key key
  key.xkey = ... some "libxmlsec" key generating function ...
  return key

Not yet too bad, unless the constructor requires C-level arguments.

2. "libxmlsec" provides a whole bunch of transforms, handled in C code
via a set of so-called "TransformId"s. Each "TransformId" is generated
by a function. The natural way would look like:

  cdef class TransformId:
    cdef xmlSecTransformId tid
    def __cinit__(self, xmlSecTransformId tid):
      self.tid = tid

  TransformInclC14N = TransformId(xmlSecTransformInclC14NGetKlass())
  ... for all standard transforms ...

The restriction forces the introduction of a helper function:

  cdef class TransformId:
    cdef xmlSecTransformId tid

  cdef _mkti(xmlSecTransformId tid):
    cdef TransformId t = TransformId()
    t.tid = tid
    return t

  TransformInclC14N = _mkti(xmlSecTransformInclC14NGetKlass())
  ... for all standard transforms ...
-- Dieter From d.s.seljebotn at astro.uio.no Thu Jun 7 14:24:32 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Thu, 07 Jun 2012 14:24:32 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: <4FD0808B.5080300@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> <4FCFC088.3000709@astro.uio.no> <4FCFC441.40703@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> <4FD0808B.5080300@astro.uio.no> Message-ID: <4FD09D80.7020601@astro.uio.no> On 06/07/2012 12:20 PM, Dag Sverre Seljebotn wrote: > On 06/07/2012 12:26 AM, Robert Bradshaw wrote: >> On Wed, Jun 6, 2012 at 2:36 PM, Dag Sverre Seljebotn >> wrote: >>> On 06/06/2012 11:16 PM, Robert Bradshaw wrote: >>>> >>>> On Wed, Jun 6, 2012 at 1:57 PM, Dag Sverre Seljebotn >>>> wrote: >>>>> >>>>> On 06/06/2012 10:41 PM, Dag Sverre Seljebotn wrote: >>>>>> >>>>>> >>>>>> On 06/05/2012 12:30 AM, Robert Bradshaw wrote: >>>>>>> >>>>>>> >>>>>>> I just found http://cmph.sourceforge.net/ which looks quite >>>>>>> interesting. Though the resulting hash functions are supposedly >>>>>>> cheap, >>>>>>> I have the feeling that branching is considered cheap in this >>>>>>> context. >>>>>> >>>>>> >>>>>> >>>>>> Actually, this lead was *very* promising. I believe the very first >>>>>> reference I actually read through and didn't eliminate after the >>>>>> abstract totally swept away our home-grown solutions! 
>>>>>> >>>>>> "Hash& Displace" by Pagh (1999) is actually very simple, easy to >>>>>> >>>>>> understand, and fast both for generation and (the branch-free) >>>>>> lookup: >>>>>> >>>>>> >>>>>> >>>>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69.3753&rep=rep1&type=pdf >>>>>> >>>>>> >>>>>> >>>>>> The idea is: >>>>>> >>>>>> - Find a hash `g(x)` to partition the keys into `b` groups (the paper >>>>>> requires b> 2n, though I think in practice you can often get away >>>>>> with >>>>>> less) >>>>>> >>>>>> - Find a hash `f(x)` such that f is 1:1 within each group (which is >>>>>> easily achieved since groups only has a few elements) >>>>>> >>>>>> - For each group, from largest to smallest: Find a displacement >>>>>> `d[group]` so that `f(x) ^ d` doesn't cause collisions. >>>>>> >>>>>> It requires extra storage for the displacement table. However, I >>>>>> think 8 >>>>>> bits per element might suffice even for vtables of 512 or 1024 in >>>>>> size. >>>>>> Even with 16 bits it's rather negligible compared to the >>>>>> minimum-128-bit >>>>>> entries of the table. >>>>>> >>>>>> I benchmarked these hash functions: >>>>>> >>>>>> displace1: ((h>> r1) ^ d[h& 63])& m1 >>>>>> displace2: ((h>> r1) ^ d[h& m2])& m1 >>>>>> displace3: ((h>> r1) ^ d[(h>> r2)& m2])& m1 >>>>>> >>>>>> >>>>>> Only the third one is truly in the spirit of the algorithm, but I >>>>>> think >>>>>> the first two should work well too (and when h is known compile-time, >>>>>> looking up d[h& 63] isn't harder than looking up r1 or m1). >>>>>> >>>>>> >>>>>> My computer is acting up and all my numbers today are slower than the >>>>>> earlier ones (yes, I've disabled turbo-mode in the BIOS for a year >>>>>> ago, >>>>>> and yes, I've pinned the CPU speed). 
But here's today's numbers, >>>>>> compiled with -DIMHASH: >>>>>> >>>>>> direct: min=5.37e-09 mean=5.39e-09 std=1.96e-11 val=2400000000.000000 >>>>>> index: min=6.45e-09 mean=6.46e-09 std=1.15e-11 val=1800000000.000000 >>>>>> twoshift: min=6.99e-09 mean=7.00e-09 std=1.35e-11 >>>>>> val=1800000000.000000 >>>>>> threeshift: min=7.53e-09 mean=7.54e-09 std=1.63e-11 >>>>>> val=1800000000.000000 >>>>>> displace1: min=6.99e-09 mean=7.00e-09 std=1.66e-11 >>>>>> val=1800000000.000000 >>>>>> displace2: min=6.99e-09 mean=7.02e-09 std=2.77e-11 >>>>>> val=1800000000.000000 >>>>>> displace3: min=7.52e-09 mean=7.54e-09 std=1.19e-11 >>>>>> val=1800000000.000000 >>>>>> >>>>>> >>>>>> I did a dirty prototype of the table-finder as well and it works: >>>>>> >>>>>> https://github.com/dagss/hashvtable/blob/master/pagh99.py >>>>> >>>>> >>>>> >>>>> The paper obviously puts more effort on minimizing table size and >>>>> not a >>>>> fast >>>>> lookup. My hunch is that our choice should be >>>>> >>>>> ((h>> table.r) ^ table.d[h& m2])& m1 >>>>> >>>>> >>>>> and use 8-bits d (because even if you have 1024 methods, you'd rather >>>>> double >>>>> the number of bins than those 2 extra bits available for displacement >>>>> options). >>>>> >>>>> Then keep incrementing the size of d and the number of table slots (in >>>>> such >>>>> an order that the total vtable size is minimized) until success. In >>>>> practice >>>>> this should almost always just increase the size of d, and keep the >>>>> table >>>>> size at the lowest 2**k that fits the slots (even for 64 methods or >>>>> 128 >>>>> methods :-)) >>>>> >>>>> Essentially we avoid the shift in the argument to d[] by making d >>>>> larger. >>>> >>>> >>>> Nice. I'm surprised that the indirection on d doesn't cost us much; >>> >>> >>> Well, table->d[const& const] compiles down to the same kind of code as >>> table->m1. I guess I'm surprised too that displace2 doesn't penalize. >>> >>> >>>> hopefully its size wouldn't be a big issue either. 
What kinds of >>>> densities were you achieving? > > OK, simulation results just in (for the displace2 hash), and they > exceeded my expectations. > > I always fill the table with n=2^k keys, and fix b = n (b means |d|). > Then the failure rates are (top two are 100,000 simulations, the rest > are 1000 simulations): > > n= 8 b= 8 failure-rate=0.0019 try-mean=4.40 try-max=65 > n= 16 b= 16 failure-rate=0.0008 try-mean=5.02 try-max=65 > n= 32 b= 32 failure-rate=0.0000 try-mean=5.67 try-max=25 > n= 64 b= 64 failure-rate=0.0000 try-mean=6.60 try-max=29 > n= 128 b= 128 failure-rate=0.0000 try-mean=7.64 try-max=22 > n= 256 b= 256 failure-rate=0.0000 try-mean=8.66 try-max=37 > n= 512 b= 512 failure-rate=0.0000 try-mean=9.57 try-max=26 > n=1024 b= 1024 failure-rate=0.0000 try-mean=10.66 try-max=34 > > Try-mean and try-max is how many r's needed to be tried before success, > so it gives an indication how much is left before failure. > > For the ~1/1000 chance of failure for n=8 and n=16, we would proceed to > let b=2*n (100,000 simulations): > > n= 8 b= 16 failure-rate=0.0001 try-mean=2.43 try-max=65 > n= 16 b= 32 failure-rate=0.0000 try-mean=3.40 try-max=65 > > NOTE: The 512...2048 results were with 16 bits displacements, with 8 bit > displacements they mostly failed. So we either need to make each element > of d 16 bits, or, e.g., store 512 entries in a 1024-slot table (which > succeeded most of the time with 8 bit displacements). I'm +1 on 16 bits > displacements. > > The algorithm is rather fast and concise: > > https://github.com/dagss/hashvtable/blob/master/pagh99.py > >>> The algorithm is designed for 100% density in the table itself. (We >>> can lift >>> that to compensate for a small space of possible hash functions I >>> guess.) >>> >>> I haven't done proper simulations yet, but I just tried |vtable|=128, >>> |d|=128 from the command line and I had 15 successes or so before the >>> first >>> failure. That's with a 100% density in the vtable itself! 
(And when it >>> fails, you increase |d| to get your success). >>> >>> The caveat is the space spent on d (it's small in comparison, but >>> that's why >>> this isn't too good to be true). >>> >>> A disadvantage might be that we may no longer have the opportunity to >>> not >>> make the table size a power of two (i.e. replace the mask with "if >>> (likely(slot< n))"). I think for that to work one would need to >>> replace the >>> xor group with addition on Z_d. >>> >>> >>>> Going back to the idea of linear probing on a cache miss, this has the >>>> advantage that one can write a brain-dead provider that sets m=0 and >>>> simply lists the methods instead of requiring a table optimizer. (Most >>>> tools, of course, would do the table optimization.) It also lets you >>>> get away with a "kind-of good" hash rather than requiring you search >>>> until you find a (larger?) perfect one. >>> >>> >>> Well, given that we can have 100% density, and generating the table is >>> lightning fast, and the C code to generate the table is likely a 300 >>> line >>> utility... I'm not convinced. >> >> It goes from an extraordinary simple spec (table is, at minimum, a >> func[2^k] with a couple of extra zero fields, whose struct can be >> statically defined in the source by hand) to a, well, not complicated >> in the absolute sense, but much more so than the definition above. It >> also is variable-size which makes allocating it globally/on a stack a >> pain (I suppose one can choose an upper bound for |d| and |vtable|). >> >> I am a bit playing devil's advocate here, it's probably just a (minor) >> con, but worth noting at least. 
> > If you were willing to go the interning route, so that you didn't need > to fill the table with md5 hashes anyway, I'd say you'd have a stronger > point :-) Here's a good reason to demand perfect hashing in the callee: Suppose you want to first check the interface once, then keep using the vtable -- e.g, *if* we want Cython to raise TypeError on the interface coercion *and* we decide we don't want to mess with constructing C++-style vtables on the fly, then code like this: cdef f(SomeInterface obj): return obj.some_method(1.0) would simply expect that the vtable contained the method, and skip the ID comparison entirely. No comparison is faster than either 64-bit hash comparison and interned comparison. :-) I'm not saying the above decisions must be made, but the possibility seems reason enough to demand perfect hashing. Dag From robertwb at gmail.com Thu Jun 7 20:00:42 2012 From: robertwb at gmail.com (Robert Bradshaw) Date: Thu, 7 Jun 2012 11:00:42 -0700 Subject: [Cython] Why does "__cinit__" insists on converting its arguments to Python objects? In-Reply-To: <20432.37211.743043.567461@localhost.localdomain> References: <20432.37211.743043.567461@localhost.localdomain> Message-ID: Both __init__ and __cinit__ are passed the same arguments, and the former has Python calling conventions. (Also, we use Python's framework to allocate and construct the new object, so there's not a huge amount of flexibility here and working around this would be quite non-trivial). On Thu, Jun 7, 2012 at 4:32 AM, Dieter Maurer wrote: > The following cython source leads to a "Cannot convert 'pointer' to Python object". > > ctypedef void * pointer > > cdef extern from "nonexistant.h": > ?cdef pointer to_pointer(object) > > cdef class C: > ?cdef pointer p > > ?def __cinit__(self, pointer p): self.p = p > > c = C(to_pointer(None)) > > Why does the constructor call tries an implicit conversion to a > Python object even though it gets precisely the type indicated by > its signature? 
> > > I am working on a binding for "libxmlsec". The behaviour above leads > to an unnatural mapping. Two examples: > > 1. "libxmlsec" has the concept of a key (used for digital signatures or > ? encryption), naturally mapped onto a "cdef class Key" encapsulating > ? the xmlsec key pointer. > > ? "libxmlsec" provides many functions to create keys - naturally mapped > ? onto class methods used as alternative constructors. > ? Would "Cython" allow C level parameters for "__cinit__", > ? they could look like: > > ? ? cdef xmlSecKeyPtr xkey = ... some "libxmlsec" key generating function ... > ? ? return Key(xkey) > > ? With the restriction, this must look like: > > ? ? cdef Key key > ? ? key.xkey = ... some "libxmlsec" key generating function ... > ? ? return key > > ? Not yet too bad, unless the constructor requires C level arguments. > > 2. "libxmlsec" provides a whole bunch of transforms, handled in C code > ? via a set of so called "TransformId"s. Each "TransformId" is > ? generated by a function. > > ? The natural way would like: > > ? ? ?cdef class TransformId: > ? ? ? ?cdef xmlSecTransformId tid > ? ? ? ?def __cinit__(self, xmlSecTransformId tid): self.tid = tid > > ? ? ?TransformInclC14N = TransformId(xmlSecTransformInclC14NGetKlass()) > ? ? ?... for all standard transforms ... > > ? The restriction forces the introduction of a helper function: > > ? ? ?cdef class TransformId: > ? ? ? ?cdef xmlSecTransformId tid > > ? ? ?cdef _mkti(xmlSecTransformId tid): > ? ? ? ?cdef TransformId t = TransformId() > ? ? ? ?t.tid = tid > ? ? ? ?return t > > ? ? ?TransformInclC14N = _mkti(xmlSecTransformInclC14NGetKlass()) > ? ? ?... for all standard transforms ... 
> > > > > > -- > Dieter > _______________________________________________ > cython-devel mailing list > cython-devel at python.org > http://mail.python.org/mailman/listinfo/cython-devel From stefan_ml at behnel.de Fri Jun 8 08:50:47 2012 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 08 Jun 2012 08:50:47 +0200 Subject: [Cython] Bug: bad C code generated for (some) "... and ... or ..." expressions In-Reply-To: <20432.27097.470830.218794@localhost.localdomain> References: <20432.27097.470830.218794@localhost.localdomain> Message-ID: <4FD1A0C7.7080903@behnel.de> Hi, thanks for the report. Dieter Maurer, 07.06.2012 10:44: > "cython 0.13" generates bad C code for the attached "pyx" file. Could you try the latest release? I would at least expect an error instead of actually generating code. > "cython" itself recognizes that it did something wrong and emits ";" > to the generated file: > > ... > static __pyx_t_12cybug_and_or_pointer __pyx_f_12cybug_and_or_bug(PyObject *__pyx_v_o) { > __pyx_t_12cybug_and_or_pointer __pyx_r; > int __pyx_t_1; > __pyx_t_12cybug_and_or_pointer __pyx_t_2; > ; > ; > ... > This is generated from this Cython code: > cdef pointer bug(o): > return o is not None and to_pointer(o) or NULL The right way to implement this is: return to_pointer(o) if o is not None else NULL > The error probably happens because it is difficult for "cython" to > determine the type for "and" and "or" expressions (if the operand types > differ). In an "cond and t or f" expression, however, the result type > is "type(t)" if "type(t) == type(f)", independent of "type(cond)". Independent of the condition, yes. However, the types of the two expression results differ here, and the fact that you named your initial condition "cond" just hides the fact that it is not different from the other two parts (t and f) of the expression. The Python semantics of this kind of evaluation is more complex than you might think. 
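A pure-Python illustration of that complexity: whenever the middle operand is falsy, `cond and t or f` silently yields `f`, which is why it is not equivalent to the conditional expression (names here are hypothetical stand-ins for the pyx code):

```python
def to_pointer(o):
    return 0  # stand-in: a "pointer" that happens to be falsy (NULL-like)

o = object()

# The classic "cond and t or f" pitfall: t == 0 is falsy, so f wins.
result_and_or = o is not None and to_pointer(o) or -1

# The conditional expression evaluates the condition alone.
result_ternary = to_pointer(o) if o is not None else -1

assert result_and_or == -1   # wrong: to_pointer's 0 was discarded
assert result_ternary == 0   # right
```

So even in plain Python the two forms disagree, independent of any C typing issue; the typed Cython version merely exposes the problem at compile time.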
> It might not be worse to special case this type of expression.

-1

> It would however be more friendly to output an instructive
> error message instead of generating bad C code.

Absolutely.

Stefan

From dieter at handshake.de Fri Jun 8 13:38:22 2012
From: dieter at handshake.de (Dieter Maurer)
Date: Fri, 8 Jun 2012 13:38:22 +0200
Subject: [Cython] Bug: bad C code generated for (some) "... and ... or ..." expressions
In-Reply-To: <4FD1A0C7.7080903@behnel.de>
References: <20432.27097.470830.218794@localhost.localdomain> <4FD1A0C7.7080903@behnel.de>
Message-ID: <20433.58414.162418.381590@localhost.localdomain>

Stefan Behnel wrote at 2012-6-8 08:50 +0200:
>thanks for the report.
>
>Dieter Maurer, 07.06.2012 10:44:
>> "cython 0.13" generates bad C code for the attached "pyx" file.
>
>Could you try the latest release? I would at least expect an error instead
>of actually generating code.

The latest release on PyPI is "0.16". It behaves identically to
version "0.13": no error message; just wrongly generated C code
(C code containing ";" "statements").
-- Dieter From d.s.seljebotn at astro.uio.no Fri Jun 8 23:12:58 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Fri, 08 Jun 2012 23:12:58 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: <4FD083F9.2030006@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> <4FCFC088.3000709@astro.uio.no> <4FCFC441.40703@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> <4FD0808B.5080300@astro.uio.no> <4FD083F9.2030006@astro.uio.no> Message-ID: <4FD26ADA.5060401@astro.uio.no> On 06/07/2012 12:35 PM, Dag Sverre Seljebotn wrote: > On 06/07/2012 12:20 PM, Dag Sverre Seljebotn wrote: >> On 06/07/2012 12:26 AM, Robert Bradshaw wrote: >>> On Wed, Jun 6, 2012 at 2:36 PM, Dag Sverre Seljebotn >>> wrote: >>>> On 06/06/2012 11:16 PM, Robert Bradshaw wrote: >>>>> >>>>> On Wed, Jun 6, 2012 at 1:57 PM, Dag Sverre Seljebotn >>>>> wrote: >>>>>> >>>>>> On 06/06/2012 10:41 PM, Dag Sverre Seljebotn wrote: >>>>>>> >>>>>>> >>>>>>> On 06/05/2012 12:30 AM, Robert Bradshaw wrote: >>>>>>>> >>>>>>>> >>>>>>>> I just found http://cmph.sourceforge.net/ which looks quite >>>>>>>> interesting. Though the resulting hash functions are supposedly >>>>>>>> cheap, >>>>>>>> I have the feeling that branching is considered cheap in this >>>>>>>> context. >>>>>>> >>>>>>> >>>>>>> >>>>>>> Actually, this lead was *very* promising. I believe the very first >>>>>>> reference I actually read through and didn't eliminate after the >>>>>>> abstract totally swept away our home-grown solutions! 
>>>>>>> >>>>>>> "Hash& Displace" by Pagh (1999) is actually very simple, easy to >>>>>>> >>>>>>> understand, and fast both for generation and (the branch-free) >>>>>>> lookup: >>>>>>> >>>>>>> >>>>>>> >>>>>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69.3753&rep=rep1&type=pdf >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> The idea is: >>>>>>> >>>>>>> - Find a hash `g(x)` to partition the keys into `b` groups (the >>>>>>> paper >>>>>>> requires b> 2n, though I think in practice you can often get away >>>>>>> with >>>>>>> less) >>>>>>> >>>>>>> - Find a hash `f(x)` such that f is 1:1 within each group (which is >>>>>>> easily achieved since groups only has a few elements) >>>>>>> >>>>>>> - For each group, from largest to smallest: Find a displacement >>>>>>> `d[group]` so that `f(x) ^ d` doesn't cause collisions. >>>>>>> >>>>>>> It requires extra storage for the displacement table. However, I >>>>>>> think 8 >>>>>>> bits per element might suffice even for vtables of 512 or 1024 in >>>>>>> size. >>>>>>> Even with 16 bits it's rather negligible compared to the >>>>>>> minimum-128-bit >>>>>>> entries of the table. >>>>>>> >>>>>>> I benchmarked these hash functions: >>>>>>> >>>>>>> displace1: ((h>> r1) ^ d[h& 63])& m1 >>>>>>> displace2: ((h>> r1) ^ d[h& m2])& m1 >>>>>>> displace3: ((h>> r1) ^ d[(h>> r2)& m2])& m1 >>>>>>> >>>>>>> >>>>>>> Only the third one is truly in the spirit of the algorithm, but I >>>>>>> think >>>>>>> the first two should work well too (and when h is known >>>>>>> compile-time, >>>>>>> looking up d[h& 63] isn't harder than looking up r1 or m1). >>>>>>> >>>>>>> >>>>>>> My computer is acting up and all my numbers today are slower than >>>>>>> the >>>>>>> earlier ones (yes, I've disabled turbo-mode in the BIOS for a year >>>>>>> ago, >>>>>>> and yes, I've pinned the CPU speed). 
But here's today's numbers, >>>>>>> compiled with -DIMHASH: >>>>>>> >>>>>>> direct: min=5.37e-09 mean=5.39e-09 std=1.96e-11 >>>>>>> val=2400000000.000000 >>>>>>> index: min=6.45e-09 mean=6.46e-09 std=1.15e-11 val=1800000000.000000 >>>>>>> twoshift: min=6.99e-09 mean=7.00e-09 std=1.35e-11 >>>>>>> val=1800000000.000000 >>>>>>> threeshift: min=7.53e-09 mean=7.54e-09 std=1.63e-11 >>>>>>> val=1800000000.000000 >>>>>>> displace1: min=6.99e-09 mean=7.00e-09 std=1.66e-11 >>>>>>> val=1800000000.000000 >>>>>>> displace2: min=6.99e-09 mean=7.02e-09 std=2.77e-11 >>>>>>> val=1800000000.000000 >>>>>>> displace3: min=7.52e-09 mean=7.54e-09 std=1.19e-11 >>>>>>> val=1800000000.000000 >>>>>>> >>>>>>> >>>>>>> I did a dirty prototype of the table-finder as well and it works: >>>>>>> >>>>>>> https://github.com/dagss/hashvtable/blob/master/pagh99.py >>>>>> >>>>>> >>>>>> >>>>>> The paper obviously puts more effort on minimizing table size and >>>>>> not a >>>>>> fast >>>>>> lookup. My hunch is that our choice should be >>>>>> >>>>>> ((h>> table.r) ^ table.d[h& m2])& m1 >>>>>> >>>>>> >>>>>> and use 8-bits d (because even if you have 1024 methods, you'd rather >>>>>> double >>>>>> the number of bins than those 2 extra bits available for displacement >>>>>> options). >>>>>> >>>>>> Then keep incrementing the size of d and the number of table slots >>>>>> (in >>>>>> such >>>>>> an order that the total vtable size is minimized) until success. In >>>>>> practice >>>>>> this should almost always just increase the size of d, and keep the >>>>>> table >>>>>> size at the lowest 2**k that fits the slots (even for 64 methods or >>>>>> 128 >>>>>> methods :-)) >>>>>> >>>>>> Essentially we avoid the shift in the argument to d[] by making d >>>>>> larger. >>>>> >>>>> >>>>> Nice. I'm surprised that the indirection on d doesn't cost us much; >>>> >>>> >>>> Well, table->d[const& const] compiles down to the same kind of code as >>>> table->m1. I guess I'm surprised too that displace2 doesn't penalize. 
>>>> >>>> >>>>> hopefully its size wouldn't be a big issue either. What kinds of >>>>> densities were you achieving? >> >> OK, simulation results just in (for the displace2 hash), and they >> exceeded my expectations. >> >> I always fill the table with n=2^k keys, and fix b = n (b means |d|). >> Then the failure rates are (top two are 100,000 simulations, the rest >> are 1000 simulations): >> >> n= 8 b= 8 failure-rate=0.0019 try-mean=4.40 try-max=65 >> n= 16 b= 16 failure-rate=0.0008 try-mean=5.02 try-max=65 >> n= 32 b= 32 failure-rate=0.0000 try-mean=5.67 try-max=25 >> n= 64 b= 64 failure-rate=0.0000 try-mean=6.60 try-max=29 >> n= 128 b= 128 failure-rate=0.0000 try-mean=7.64 try-max=22 >> n= 256 b= 256 failure-rate=0.0000 try-mean=8.66 try-max=37 >> n= 512 b= 512 failure-rate=0.0000 try-mean=9.57 try-max=26 >> n=1024 b= 1024 failure-rate=0.0000 try-mean=10.66 try-max=34 >> >> Try-mean and try-max is how many r's needed to be tried before success, >> so it gives an indication how much is left before failure. >> >> For the ~1/1000 chance of failure for n=8 and n=16, we would proceed to >> let b=2*n (100,000 simulations): >> >> n= 8 b= 16 failure-rate=0.0001 try-mean=2.43 try-max=65 >> n= 16 b= 32 failure-rate=0.0000 try-mean=3.40 try-max=65 >> >> NOTE: The 512...2048 results were with 16 bits displacements, with 8 bit >> displacements they mostly failed. So we either need to make each element >> of d 16 bits, or, e.g., store 512 entries in a 1024-slot table (which >> succeeded most of the time with 8 bit displacements). I'm +1 on 16 bits >> displacements. >> >> The algorithm is rather fast and concise: >> >> https://github.com/dagss/hashvtable/blob/master/pagh99.py >> >>>> The algorithm is designed for 100% density in the table itself. (We >>>> can lift >>>> that to compensate for a small space of possible hash functions I >>>> guess.) 
>>>> >>>> I haven't done proper simulations yet, but I just tried |vtable|=128, >>>> |d|=128 from the command line and I had 15 successes or so before the >>>> first >>>> failure. That's with a 100% density in the vtable itself! (And when it >>>> fails, you increase |d| to get your success). >>>> >>>> The caveat is the space spent on d (it's small in comparison, but >>>> that's why >>>> this isn't too good to be true). >>>> >>>> A disadvantage might be that we may no longer have the opportunity to >>>> not >>>> make the table size a power of two (i.e. replace the mask with "if >>>> (likely(slot< n))"). I think for that to work one would need to >>>> replace the >>>> xor group with addition on Z_d. >>>> >>>> >>>>> Going back to the idea of linear probing on a cache miss, this has the >>>>> advantage that one can write a brain-dead provider that sets m=0 and >>>>> simply lists the methods instead of requiring a table optimizer. (Most >>>>> tools, of course, would do the table optimization.) It also lets you >>>>> get away with a "kind-of good" hash rather than requiring you search >>>>> until you find a (larger?) perfect one. >>>> >>>> >>>> Well, given that we can have 100% density, and generating the table is >>>> lightning fast, and the C code to generate the table is likely a 300 >>>> line >>>> utility... I'm not convinced. >>> >>> It goes from an extraordinary simple spec (table is, at minimum, a >>> func[2^k] with a couple of extra zero fields, whose struct can be >>> statically defined in the source by hand) to a, well, not complicated >>> in the absolute sense, but much more so than the definition above. It >>> also is variable-size which makes allocating it globally/on a stack a >>> pain (I suppose one can choose an upper bound for |d| and |vtable|). >>> >>> I am a bit playing devil's advocate here, it's probably just a (minor) >>> con, but worth noting at least. 
>> >> If you were willing to go the interning route, so that you didn't need >> to fill the table with md5 hashes anyway, I'd say you'd have a stronger >> point :-) >> >> Given the results above, static allocation can at least be solved in a >> way that is probably user-friendly enough: >> >> PyHashVTable_16_16 mytable; >> >> ...init () { >> mytable.functions = { ... }; >> if (PyHashVTable_Ready((PyHashVTable*)mytable, 16, 16) == -1) return -1; >> } >> >> Now, with chance ~1/1000, you're going to get an exception saying >> "Please try PyHashVTable_16_32". (And since that's deterministic given >> the function definitions you always catch it at once.) > > PS. PyHashVTable_Ready would do the md5's and reorder the functions etc. > as well. There's still the indirection through SEP 200 (extensibletype slots). We can get rid of that very easily by just making that table and the hash-vtable one and the same. (It could still either have interned string keys or ID keys depending on the least significant bit.) To wrap up, I think this has grown in complexity beyond the "simple SEP spec". It's at the point where you don't really want to have several libraries implementing the same simple spec, but instead use the same implementation. But I think the advantages are simply too good to give up on. So I think a viable route forward is to forget the CEP/SEP/pre-PEP-approach for now (which only works for semi-complicated ideas with simple implementations) and instead simply work more directly on a library. It would need to have a couple of different use modes: - A Python perfect-hasher for use when generating code, with only the a string interner based on CPython dicts and extensibletype metaclass as runtime dependencies (for use in Cython). This would only add some hundred source file lines... - A C implementation of the perfect hashing exposed through a PyPerfectHashTable_Ready(), for use in libraries written in C like NumPy/SciPy). 
This would need to bundle the md5 algorithm and a C implementation of the perfect hashing. And on the distribution axis: - Small C header-style implementation of a string interner and the extensibletype metaclass, rendezvousing through sys.modules - As part of the rendezvous, one would always try to __import__ the *real* run-time library. So if it is available in sys.path it overrides anything bundled with other libraries. That would provide a way forward for GIL-less string interning, a Python-side library for working with these tables and inspecting them, etc. Time to stop talking and start coding... Dag From robertwb at gmail.com Sat Jun 9 03:21:23 2012 From: robertwb at gmail.com (Robert Bradshaw) Date: Fri, 8 Jun 2012 18:21:23 -0700 Subject: [Cython] Hash-based vtables In-Reply-To: <4FD26ADA.5060401@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> <4FCFC088.3000709@astro.uio.no> <4FCFC441.40703@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> <4FD0808B.5080300@astro.uio.no> <4FD083F9.2030006@astro.uio.no> <4FD26ADA.5060401@astro.uio.no> Message-ID: On Fri, Jun 8, 2012 at 2:12 PM, Dag Sverre Seljebotn wrote: > On 06/07/2012 12:35 PM, Dag Sverre Seljebotn wrote: >> >> On 06/07/2012 12:20 PM, Dag Sverre Seljebotn wrote: >>> >>> On 06/07/2012 12:26 AM, Robert Bradshaw wrote: >>>> >>>> On Wed, Jun 6, 2012 at 2:36 PM, Dag Sverre Seljebotn >>>> wrote: >>>>> >>>>> On 06/06/2012 11:16 PM, Robert Bradshaw wrote: >>>>>> >>>>>> >>>>>> On Wed, Jun 6, 2012 at 1:57 PM, Dag Sverre Seljebotn >>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> On 06/06/2012 10:41 PM, Dag Sverre Seljebotn wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 06/05/2012 12:30 AM, Robert Bradshaw wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> I just found http://cmph.sourceforge.net/ which looks quite >>>>>>>>> interesting. 
Though the resulting hash functions are supposedly >>>>>>>>> cheap, >>>>>>>>> I have the feeling that branching is considered cheap in this >>>>>>>>> context. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Actually, this lead was *very* promising. I believe the very first >>>>>>>> reference I actually read through and didn't eliminate after the >>>>>>>> abstract totally swept away our home-grown solutions! >>>>>>>> >>>>>>>> "Hash& Displace" by Pagh (1999) is actually very simple, easy to >>>>>>>> >>>>>>>> understand, and fast both for generation and (the branch-free) >>>>>>>> lookup: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69.3753&rep=rep1&type=pdf >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> The idea is: >>>>>>>> >>>>>>>> - Find a hash `g(x)` to partition the keys into `b` groups (the >>>>>>>> paper >>>>>>>> requires b> 2n, though I think in practice you can often get away >>>>>>>> with >>>>>>>> less) >>>>>>>> >>>>>>>> - Find a hash `f(x)` such that f is 1:1 within each group (which is >>>>>>>> easily achieved since groups only has a few elements) >>>>>>>> >>>>>>>> - For each group, from largest to smallest: Find a displacement >>>>>>>> `d[group]` so that `f(x) ^ d` doesn't cause collisions. >>>>>>>> >>>>>>>> It requires extra storage for the displacement table. However, I >>>>>>>> think 8 >>>>>>>> bits per element might suffice even for vtables of 512 or 1024 in >>>>>>>> size. >>>>>>>> Even with 16 bits it's rather negligible compared to the >>>>>>>> minimum-128-bit >>>>>>>> entries of the table. 
>>>>>>>> >>>>>>>> I benchmarked these hash functions: >>>>>>>> >>>>>>>> displace1: ((h>> r1) ^ d[h& 63])& m1 >>>>>>>> displace2: ((h>> r1) ^ d[h& m2])& m1 >>>>>>>> displace3: ((h>> r1) ^ d[(h>> r2)& m2])& m1 >>>>>>>> >>>>>>>> >>>>>>>> Only the third one is truly in the spirit of the algorithm, but I >>>>>>>> think >>>>>>>> the first two should work well too (and when h is known >>>>>>>> compile-time, >>>>>>>> looking up d[h& 63] isn't harder than looking up r1 or m1). >>>>>>>> >>>>>>>> >>>>>>>> My computer is acting up and all my numbers today are slower than >>>>>>>> the >>>>>>>> earlier ones (yes, I've disabled turbo-mode in the BIOS for a year >>>>>>>> ago, >>>>>>>> and yes, I've pinned the CPU speed). But here's today's numbers, >>>>>>>> compiled with -DIMHASH: >>>>>>>> >>>>>>>> direct: min=5.37e-09 mean=5.39e-09 std=1.96e-11 >>>>>>>> val=2400000000.000000 >>>>>>>> index: min=6.45e-09 mean=6.46e-09 std=1.15e-11 val=1800000000.000000 >>>>>>>> twoshift: min=6.99e-09 mean=7.00e-09 std=1.35e-11 >>>>>>>> val=1800000000.000000 >>>>>>>> threeshift: min=7.53e-09 mean=7.54e-09 std=1.63e-11 >>>>>>>> val=1800000000.000000 >>>>>>>> displace1: min=6.99e-09 mean=7.00e-09 std=1.66e-11 >>>>>>>> val=1800000000.000000 >>>>>>>> displace2: min=6.99e-09 mean=7.02e-09 std=2.77e-11 >>>>>>>> val=1800000000.000000 >>>>>>>> displace3: min=7.52e-09 mean=7.54e-09 std=1.19e-11 >>>>>>>> val=1800000000.000000 >>>>>>>> >>>>>>>> >>>>>>>> I did a dirty prototype of the table-finder as well and it works: >>>>>>>> >>>>>>>> https://github.com/dagss/hashvtable/blob/master/pagh99.py >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> The paper obviously puts more effort on minimizing table size and >>>>>>> not a >>>>>>> fast >>>>>>> lookup. 
My hunch is that our choice should be >>>>>>> >>>>>>> ((h>> table.r) ^ table.d[h& m2])& m1 >>>>>>> >>>>>>> >>>>>>> and use 8-bits d (because even if you have 1024 methods, you'd rather >>>>>>> double >>>>>>> the number of bins than those 2 extra bits available for displacement >>>>>>> options). >>>>>>> >>>>>>> Then keep incrementing the size of d and the number of table slots >>>>>>> (in >>>>>>> such >>>>>>> an order that the total vtable size is minimized) until success. In >>>>>>> practice >>>>>>> this should almost always just increase the size of d, and keep the >>>>>>> table >>>>>>> size at the lowest 2**k that fits the slots (even for 64 methods or >>>>>>> 128 >>>>>>> methods :-)) >>>>>>> >>>>>>> Essentially we avoid the shift in the argument to d[] by making d >>>>>>> larger. >>>>>> >>>>>> >>>>>> >>>>>> Nice. I'm surprised that the indirection on d doesn't cost us much; >>>>> >>>>> >>>>> >>>>> Well, table->d[const& const] compiles down to the same kind of code as >>>>> table->m1. I guess I'm surprised too that displace2 doesn't penalize. >>>>> >>>>> >>>>>> hopefully its size wouldn't be a big issue either. What kinds of >>>>>> densities were you achieving? >>> >>> >>> OK, simulation results just in (for the displace2 hash), and they >>> exceeded my expectations. >>> >>> I always fill the table with n=2^k keys, and fix b = n (b means |d|). 
>>> Then the failure rates are (top two are 100,000 simulations, the rest >>> are 1000 simulations): >>> >>> n= 8 b= 8 failure-rate=0.0019 try-mean=4.40 try-max=65 >>> n= 16 b= 16 failure-rate=0.0008 try-mean=5.02 try-max=65 >>> n= 32 b= 32 failure-rate=0.0000 try-mean=5.67 try-max=25 >>> n= 64 b= 64 failure-rate=0.0000 try-mean=6.60 try-max=29 >>> n= 128 b= 128 failure-rate=0.0000 try-mean=7.64 try-max=22 >>> n= 256 b= 256 failure-rate=0.0000 try-mean=8.66 try-max=37 >>> n= 512 b= 512 failure-rate=0.0000 try-mean=9.57 try-max=26 >>> n=1024 b= 1024 failure-rate=0.0000 try-mean=10.66 try-max=34 >>> >>> Try-mean and try-max is how many r's needed to be tried before success, >>> so it gives an indication how much is left before failure. >>> >>> For the ~1/1000 chance of failure for n=8 and n=16, we would proceed to >>> let b=2*n (100,000 simulations): >>> >>> n= 8 b= 16 failure-rate=0.0001 try-mean=2.43 try-max=65 >>> n= 16 b= 32 failure-rate=0.0000 try-mean=3.40 try-max=65 >>> >>> NOTE: The 512...2048 results were with 16 bits displacements, with 8 bit >>> displacements they mostly failed. So we either need to make each element >>> of d 16 bits, or, e.g., store 512 entries in a 1024-slot table (which >>> succeeded most of the time with 8 bit displacements). I'm +1 on 16 bits >>> displacements. >>> >>> The algorithm is rather fast and concise: >>> >>> https://github.com/dagss/hashvtable/blob/master/pagh99.py >>> >>>>> The algorithm is designed for 100% density in the table itself. (We >>>>> can lift >>>>> that to compensate for a small space of possible hash functions I >>>>> guess.) >>>>> >>>>> I haven't done proper simulations yet, but I just tried |vtable|=128, >>>>> |d|=128 from the command line and I had 15 successes or so before the >>>>> first >>>>> failure. That's with a 100% density in the vtable itself! (And when it >>>>> fails, you increase |d| to get your success). 
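[Editor's sketch] The displace2 hash used in these simulations is branch-free on the lookup path: one shift, one displacement-table load, two masks. A minimal sketch of that slot computation (the parameter values r1, m1, m2 and the all-zero displacement table below are illustrative, not the benchmarked ones):

```python
# Illustrative parameters -- not the values used in the benchmarks above.
r1 = 13          # shift amount (table.r)
m1 = 0x3F        # slot mask: a 64-slot table
m2 = 0x3F        # displacement-index mask: |d| = 64
d = [0] * 64     # displacement table; all-zero here just for illustration

def displace2_slot(h):
    """slot = ((h >> r1) ^ d[h & m2]) & m1 -- no branches on the lookup path."""
    return ((h >> r1) ^ d[h & m2]) & m1

# With d all-zero this degenerates to a plain shift-and-mask:
assert displace2_slot(0xDEADBEEF) == (0xDEADBEEF >> 13) & 0x3F
```

The table builder's only job is to pick r1 and fill d so that every key of the vtable lands in a distinct slot.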
>>>>> >>>>> The caveat is the space spent on d (it's small in comparison, but >>>>> that's why >>>>> this isn't too good to be true). >>>>> >>>>> A disadvantage might be that we may no longer have the opportunity to >>>>> not >>>>> make the table size a power of two (i.e. replace the mask with "if >>>>> (likely(slot< n))"). I think for that to work one would need to >>>>> replace the >>>>> xor group with addition on Z_d. >>>>> >>>>> >>>>>> Going back to the idea of linear probing on a cache miss, this has the >>>>>> advantage that one can write a brain-dead provider that sets m=0 and >>>>>> simply lists the methods instead of requiring a table optimizer. (Most >>>>>> tools, of course, would do the table optimization.) It also lets you >>>>>> get away with a "kind-of good" hash rather than requiring you search >>>>>> until you find a (larger?) perfect one. >>>>> >>>>> >>>>> >>>>> Well, given that we can have 100% density, and generating the table is >>>>> lightning fast, and the C code to generate the table is likely a 300 >>>>> line >>>>> utility... I'm not convinced. >>>> >>>> >>>> It goes from an extraordinary simple spec (table is, at minimum, a >>>> func[2^k] with a couple of extra zero fields, whose struct can be >>>> statically defined in the source by hand) to a, well, not complicated >>>> in the absolute sense, but much more so than the definition above. It >>>> also is variable-size which makes allocating it globally/on a stack a >>>> pain (I suppose one can choose an upper bound for |d| and |vtable|). >>>> >>>> I am a bit playing devil's advocate here, it's probably just a (minor) >>>> con, but worth noting at least. 
>>> >>> >>> If you were willing to go the interning route, so that you didn't need >>> to fill the table with md5 hashes anyway, I'd say you'd have a stronger >>> point :-) >>> >>> Given the results above, static allocation can at least be solved in a >>> way that is probably user-friendly enough: >>> >>> PyHashVTable_16_16 mytable; >>> >>> ...init () { >>> mytable.functions = { ... }; >>> if (PyHashVTable_Ready((PyHashVTable*)mytable, 16, 16) == -1) return -1; >>> } >>> >>> Now, with chance ~1/1000, you're going to get an exception saying >>> "Please try PyHashVTable_16_32". (And since that's deterministic given >>> the function definitions you always catch it at once.) >> >> >> PS. PyHashVTable_Ready would do the md5's and reorder the functions etc. >> as well. > > > > There's still the indirection through SEP 200 (extensibletype slots). We can > get rid of that very easily by just making that table and the hash-vtable > one and the same. (It could still either have interned string keys or ID > keys depending on the least significant bit.) Or we can even forgo the interning for this table, and give up on partitioning the space numerically and just use the dns-style prefixing, e.g. "org.cython.X" belongs to us. There is value in the double indirection if this (or any of the other) lookup tables are meant to be modified over time. > To wrap up, I think this has grown in complexity beyond the "simple SEP > spec". It's at the point where you don't really want to have several > libraries implementing the same simple spec, but instead use the same > implementation. > > But I think the advantages are simply too good to give up on. > > So I think a viable route forward is to forget the CEP/SEP/pre-PEP-approach > for now (which only works for semi-complicated ideas with simple > implementations) and instead simply work more directly on a library. 
>> It would need to have a couple of different use modes: I prefer an enhancement proposal with a spec over a library, even if only a single library gets used in practice. I still think it's simple enough. Basically, we have the "lookup spec" and then a CEP for applying this to fast callables (agreeing on signatures, and what to do with extern types) and extensible type slots. > - A Python perfect-hasher for use when generating code, with only a > string interner based on CPython dicts and the extensibletype metaclass as > runtime dependencies (for use in Cython). This would only add some hundred > source file lines... > > - A C implementation of the perfect hashing exposed through a > PyPerfectHashTable_Ready(), for use in libraries written in C (like > NumPy/SciPy). This would need to bundle the md5 algorithm and a C > implementation of the perfect hashing. > > And on the distribution axis: > > - Small C header-style implementation of a string interner and the > extensibletype metaclass, rendezvousing through sys.modules > > - As part of the rendezvous, one would always try to __import__ the *real* > run-time library. So if it is available in sys.path it overrides anything > bundled with other libraries. That would provide a way forward for GIL-less > string interning, a Python-side library for working with these tables and > inspecting them, etc. Hmm, that's an interesting idea. I think we don't actually need interning, in which case the "full" library is only needed for introspection. > Time to stop talking and start coding... 
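[Editor's sketch] The "Python perfect-hasher" mentioned above exists as the pagh99.py prototype linked earlier in the thread; the core hash-and-displace construction it implements can be sketched as follows. The bin hash g(x) and in-bin hash f(x) here are illustrative stand-ins, and the outer retry loop plays the role of trying new r values:

```python
def build_displace_table(keys, n_slots, n_bins, max_tries=64):
    """Pagh-style hash & displace: partition keys into bins with g(x), then,
    largest bin first, pick a displacement d[bin] so that
    (f(x) ^ d[g(x)]) & (n_slots - 1) is collision-free across all keys.
    Returns (d, r) on success, or None after max_tries re-hashes."""
    mask = n_slots - 1
    for r in range(max_tries):
        f = lambda x: hash((x, r)) & 0xFFFFFFFF    # stand-in for f(x)
        g = lambda x: hash(x) & (n_bins - 1)       # stand-in for g(x)
        bins = {}
        for k in keys:
            bins.setdefault(g(k), []).append(k)
        d, taken, ok = [0] * n_bins, set(), True
        for b, members in sorted(bins.items(), key=lambda kv: -len(kv[1])):
            for disp in range(n_slots):            # try displacements for this bin
                slots = {(f(k) ^ disp) & mask for k in members}
                if len(slots) == len(members) and not (slots & taken):
                    d[b], taken = disp, taken | slots
                    break
            else:
                ok = False                         # no displacement fits: re-hash
                break
        if ok:
            return d, r
    return None

# 8 keys into an 8-slot table at 100% density:
keys = list(range(100, 108))
d, r = build_displace_table(keys, 8, 8)
slots = {((hash((k, r)) & 0xFFFFFFFF) ^ d[hash(k) & 7]) & 7 for k in keys}
assert len(slots) == 8   # perfect: every key in its own slot
```

As in the quoted simulations, failure for a given table size is handled by growing |d| or the slot count and retrying.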
> > > Dag > _______________________________________________ > cython-devel mailing list > cython-devel at python.org > http://mail.python.org/mailman/listinfo/cython-devel From d.s.seljebotn at astro.uio.no Sat Jun 9 07:45:55 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Sat, 09 Jun 2012 07:45:55 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> <4FCFC088.3000709@astro.uio.no> <4FCFC441.40703@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> <4FD0808B.5080300@astro.uio.no> <4FD083F9.2030006@astro.uio.no> <4FD26ADA.5060401@astro.uio.no> Message-ID: <4FD2E313.6040208@astro.uio.no> On 06/09/2012 03:21 AM, Robert Bradshaw wrote: > On Fri, Jun 8, 2012 at 2:12 PM, Dag Sverre Seljebotn > wrote: >> On 06/07/2012 12:35 PM, Dag Sverre Seljebotn wrote: >>> >>> On 06/07/2012 12:20 PM, Dag Sverre Seljebotn wrote: >>>> >>>> On 06/07/2012 12:26 AM, Robert Bradshaw wrote: >>>>> >>>>> On Wed, Jun 6, 2012 at 2:36 PM, Dag Sverre Seljebotn >>>>> wrote: >>>>>> >>>>>> On 06/06/2012 11:16 PM, Robert Bradshaw wrote: >>>>>>> >>>>>>> >>>>>>> On Wed, Jun 6, 2012 at 1:57 PM, Dag Sverre Seljebotn >>>>>>> wrote: >>>>>>>> >>>>>>>> >>>>>>>> On 06/06/2012 10:41 PM, Dag Sverre Seljebotn wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On 06/05/2012 12:30 AM, Robert Bradshaw wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> I just found http://cmph.sourceforge.net/ which looks quite >>>>>>>>>> interesting. Though the resulting hash functions are supposedly >>>>>>>>>> cheap, >>>>>>>>>> I have the feeling that branching is considered cheap in this >>>>>>>>>> context. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Actually, this lead was *very* promising. I believe the very first >>>>>>>>> reference I actually read through and didn't eliminate after the >>>>>>>>> abstract totally swept away our home-grown solutions! 
>>>>>>>>> >>>>>>>>> "Hash& Displace" by Pagh (1999) is actually very simple, easy to >>>>>>>>> >>>>>>>>> understand, and fast both for generation and (the branch-free) >>>>>>>>> lookup: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69.3753&rep=rep1&type=pdf >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> The idea is: >>>>>>>>> >>>>>>>>> - Find a hash `g(x)` to partition the keys into `b` groups (the >>>>>>>>> paper >>>>>>>>> requires b> 2n, though I think in practice you can often get away >>>>>>>>> with >>>>>>>>> less) >>>>>>>>> >>>>>>>>> - Find a hash `f(x)` such that f is 1:1 within each group (which is >>>>>>>>> easily achieved since groups only has a few elements) >>>>>>>>> >>>>>>>>> - For each group, from largest to smallest: Find a displacement >>>>>>>>> `d[group]` so that `f(x) ^ d` doesn't cause collisions. >>>>>>>>> >>>>>>>>> It requires extra storage for the displacement table. However, I >>>>>>>>> think 8 >>>>>>>>> bits per element might suffice even for vtables of 512 or 1024 in >>>>>>>>> size. >>>>>>>>> Even with 16 bits it's rather negligible compared to the >>>>>>>>> minimum-128-bit >>>>>>>>> entries of the table. >>>>>>>>> >>>>>>>>> I benchmarked these hash functions: >>>>>>>>> >>>>>>>>> displace1: ((h>> r1) ^ d[h& 63])& m1 >>>>>>>>> displace2: ((h>> r1) ^ d[h& m2])& m1 >>>>>>>>> displace3: ((h>> r1) ^ d[(h>> r2)& m2])& m1 >>>>>>>>> >>>>>>>>> >>>>>>>>> Only the third one is truly in the spirit of the algorithm, but I >>>>>>>>> think >>>>>>>>> the first two should work well too (and when h is known >>>>>>>>> compile-time, >>>>>>>>> looking up d[h& 63] isn't harder than looking up r1 or m1). >>>>>>>>> >>>>>>>>> >>>>>>>>> My computer is acting up and all my numbers today are slower than >>>>>>>>> the >>>>>>>>> earlier ones (yes, I've disabled turbo-mode in the BIOS for a year >>>>>>>>> ago, >>>>>>>>> and yes, I've pinned the CPU speed). 
But here's today's numbers, >>>>>>>>> compiled with -DIMHASH: >>>>>>>>> >>>>>>>>> direct: min=5.37e-09 mean=5.39e-09 std=1.96e-11 >>>>>>>>> val=2400000000.000000 >>>>>>>>> index: min=6.45e-09 mean=6.46e-09 std=1.15e-11 val=1800000000.000000 >>>>>>>>> twoshift: min=6.99e-09 mean=7.00e-09 std=1.35e-11 >>>>>>>>> val=1800000000.000000 >>>>>>>>> threeshift: min=7.53e-09 mean=7.54e-09 std=1.63e-11 >>>>>>>>> val=1800000000.000000 >>>>>>>>> displace1: min=6.99e-09 mean=7.00e-09 std=1.66e-11 >>>>>>>>> val=1800000000.000000 >>>>>>>>> displace2: min=6.99e-09 mean=7.02e-09 std=2.77e-11 >>>>>>>>> val=1800000000.000000 >>>>>>>>> displace3: min=7.52e-09 mean=7.54e-09 std=1.19e-11 >>>>>>>>> val=1800000000.000000 >>>>>>>>> >>>>>>>>> >>>>>>>>> I did a dirty prototype of the table-finder as well and it works: >>>>>>>>> >>>>>>>>> https://github.com/dagss/hashvtable/blob/master/pagh99.py >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> The paper obviously puts more effort on minimizing table size and >>>>>>>> not a >>>>>>>> fast >>>>>>>> lookup. My hunch is that our choice should be >>>>>>>> >>>>>>>> ((h>> table.r) ^ table.d[h& m2])& m1 >>>>>>>> >>>>>>>> >>>>>>>> and use 8-bits d (because even if you have 1024 methods, you'd rather >>>>>>>> double >>>>>>>> the number of bins than those 2 extra bits available for displacement >>>>>>>> options). >>>>>>>> >>>>>>>> Then keep incrementing the size of d and the number of table slots >>>>>>>> (in >>>>>>>> such >>>>>>>> an order that the total vtable size is minimized) until success. In >>>>>>>> practice >>>>>>>> this should almost always just increase the size of d, and keep the >>>>>>>> table >>>>>>>> size at the lowest 2**k that fits the slots (even for 64 methods or >>>>>>>> 128 >>>>>>>> methods :-)) >>>>>>>> >>>>>>>> Essentially we avoid the shift in the argument to d[] by making d >>>>>>>> larger. >>>>>>> >>>>>>> >>>>>>> >>>>>>> Nice. 
I'm surprised that the indirection on d doesn't cost us much; >>>>>> >>>>>> >>>>>> >>>>>> Well, table->d[const& const] compiles down to the same kind of code as >>>>>> table->m1. I guess I'm surprised too that displace2 doesn't penalize. >>>>>> >>>>>> >>>>>>> hopefully its size wouldn't be a big issue either. What kinds of >>>>>>> densities were you achieving? >>>> >>>> >>>> OK, simulation results just in (for the displace2 hash), and they >>>> exceeded my expectations. >>>> >>>> I always fill the table with n=2^k keys, and fix b = n (b means |d|). >>>> Then the failure rates are (top two are 100,000 simulations, the rest >>>> are 1000 simulations): >>>> >>>> n= 8 b= 8 failure-rate=0.0019 try-mean=4.40 try-max=65 >>>> n= 16 b= 16 failure-rate=0.0008 try-mean=5.02 try-max=65 >>>> n= 32 b= 32 failure-rate=0.0000 try-mean=5.67 try-max=25 >>>> n= 64 b= 64 failure-rate=0.0000 try-mean=6.60 try-max=29 >>>> n= 128 b= 128 failure-rate=0.0000 try-mean=7.64 try-max=22 >>>> n= 256 b= 256 failure-rate=0.0000 try-mean=8.66 try-max=37 >>>> n= 512 b= 512 failure-rate=0.0000 try-mean=9.57 try-max=26 >>>> n=1024 b= 1024 failure-rate=0.0000 try-mean=10.66 try-max=34 >>>> >>>> Try-mean and try-max is how many r's needed to be tried before success, >>>> so it gives an indication how much is left before failure. >>>> >>>> For the ~1/1000 chance of failure for n=8 and n=16, we would proceed to >>>> let b=2*n (100,000 simulations): >>>> >>>> n= 8 b= 16 failure-rate=0.0001 try-mean=2.43 try-max=65 >>>> n= 16 b= 32 failure-rate=0.0000 try-mean=3.40 try-max=65 >>>> >>>> NOTE: The 512...2048 results were with 16 bits displacements, with 8 bit >>>> displacements they mostly failed. So we either need to make each element >>>> of d 16 bits, or, e.g., store 512 entries in a 1024-slot table (which >>>> succeeded most of the time with 8 bit displacements). I'm +1 on 16 bits >>>> displacements. 
>>>> >>>> The algorithm is rather fast and concise: >>>> >>>> https://github.com/dagss/hashvtable/blob/master/pagh99.py >>>> >>>>>> The algorithm is designed for 100% density in the table itself. (We >>>>>> can lift >>>>>> that to compensate for a small space of possible hash functions I >>>>>> guess.) >>>>>> >>>>>> I haven't done proper simulations yet, but I just tried |vtable|=128, >>>>>> |d|=128 from the command line and I had 15 successes or so before the >>>>>> first >>>>>> failure. That's with a 100% density in the vtable itself! (And when it >>>>>> fails, you increase |d| to get your success). >>>>>> >>>>>> The caveat is the space spent on d (it's small in comparison, but >>>>>> that's why >>>>>> this isn't too good to be true). >>>>>> >>>>>> A disadvantage might be that we may no longer have the opportunity to >>>>>> not >>>>>> make the table size a power of two (i.e. replace the mask with "if >>>>>> (likely(slot< n))"). I think for that to work one would need to >>>>>> replace the >>>>>> xor group with addition on Z_d. >>>>>> >>>>>> >>>>>>> Going back to the idea of linear probing on a cache miss, this has the >>>>>>> advantage that one can write a brain-dead provider that sets m=0 and >>>>>>> simply lists the methods instead of requiring a table optimizer. (Most >>>>>>> tools, of course, would do the table optimization.) It also lets you >>>>>>> get away with a "kind-of good" hash rather than requiring you search >>>>>>> until you find a (larger?) perfect one. >>>>>> >>>>>> >>>>>> >>>>>> Well, given that we can have 100% density, and generating the table is >>>>>> lightning fast, and the C code to generate the table is likely a 300 >>>>>> line >>>>>> utility... I'm not convinced. 
>>>>> >>>>> >>>>> It goes from an extraordinary simple spec (table is, at minimum, a >>>>> func[2^k] with a couple of extra zero fields, whose struct can be >>>>> statically defined in the source by hand) to a, well, not complicated >>>>> in the absolute sense, but much more so than the definition above. It >>>>> also is variable-size which makes allocating it globally/on a stack a >>>>> pain (I suppose one can choose an upper bound for |d| and |vtable|). >>>>> >>>>> I am a bit playing devil's advocate here, it's probably just a (minor) >>>>> con, but worth noting at least. >>>> >>>> >>>> If you were willing to go the interning route, so that you didn't need >>>> to fill the table with md5 hashes anyway, I'd say you'd have a stronger >>>> point :-) >>>> >>>> Given the results above, static allocation can at least be solved in a >>>> way that is probably user-friendly enough: >>>> >>>> PyHashVTable_16_16 mytable; >>>> >>>> ...init () { >>>> mytable.functions = { ... }; >>>> if (PyHashVTable_Ready((PyHashVTable*)mytable, 16, 16) == -1) return -1; >>>> } >>>> >>>> Now, with chance ~1/1000, you're going to get an exception saying >>>> "Please try PyHashVTable_16_32". (And since that's deterministic given >>>> the function definitions you always catch it at once.) >>> >>> >>> PS. PyHashVTable_Ready would do the md5's and reorder the functions etc. >>> as well. >> >> >> >> There's still the indirection through SEP 200 (extensibletype slots). We can >> get rid of that very easily by just making that table and the hash-vtable >> one and the same. (It could still either have interned string keys or ID >> keys depending on the least significant bit.) > > Or we can even forgo the interning for this table, and give up on > partitioning the space numerically and just use the dns-style > prefixing, e.g. "org.cython.X" belongs to us. Huh? Isn't that when you *need* interning? Do you plan on key-encoding those kind of strings into 64 bits? 
(I think it would usually be "method:foo:ii->d" (or my current preference is "method:foo:i4i8->f8")) Partitioning the space numerically you'd just hash the number; "SEP 260: We use id 0x70040001, which has lower-64-md5 0xfa454a...ULL". > There is value in the double indirection if this (or any of the other) > lookup tables are meant to be modified over time. This isn't impossible with a hash table either. You just need to reallocate a little more often than what would be the case with a regular hash table, but not dramatically so (you need to rehash whenever the element to insert hashes into a "large" bin, which are rather few). I want the table to have a pointer to it, so that you can atomically swap it out. >> To wrap up, I think this has grown in complexity beyond the "simple SEP >> spec". It's at the point where you don't really want to have several >> libraries implementing the same simple spec, but instead use the same >> implementation. >> >> But I think the advantages are simply too good to give up on. >> >> So I think a viable route forward is to forget the CEP/SEP/pre-PEP-approach >> for now (which only works for semi-complicated ideas with simple >> implementations) and instead simply work more directly on a library. It >> would need to have a couple of different use modes: > > I prefer an enhancement proposal with a spec over a library, even if > only a single library gets used in practice. I still think it's simple > enough. Basically, we have the "lookup spec" and then a CEP for > applying this to fast callable (agreeing on signatures, and what to do > with extern types) and extensible type slots. OK. > >> - A Python perfect-hasher for use when generating code, with only the a >> string interner based on CPython dicts and extensibletype metaclass as >> runtime dependencies (for use in Cython). This would only add some hundred >> source file lines... 
>> >> - A C implementation of the perfect hashing exposed through a >> PyPerfectHashTable_Ready(), for use in libraries written in C like >> NumPy/SciPy). This would need to bundle the md5 algorithm and a C >> implementation of the perfect hashing. >> >> And on the distribution axis: >> >> - Small C header-style implementation of a string interner and the >> extensibletype metaclass, rendezvousing through sys.modules >> >> - As part of the rendezvous, one would always try to __import__ the *real* >> run-time library. So if it is available in sys.path it overrides anything >> bundled with other libraries. That would provide a way forward for GIL-less >> string interning, a Python-side library for working with these tables and >> inspecting them, etc. > > Hmm, that's an interesting idea. I think we don't actually need > interning, in which case the "full" library is only needed for > introspection. You don't believe the security concern is real then? Or do you want to pay the cost for a 160-bit SHA1 compare everywhere? I'd love to not do interning, but I see no way around it. BTW, a GIL-less interning library isn't rocket science. I ran khash.h through a preprocessor with KHASH_MAP_INIT_STR(str_to_entry, entry_t) and the result is 180 lines of code for the hash table. Then pythread.h provides the thread lock, another 50 lines for the interning logic (intern_literal, intern_heap_allocated, release_interned). It just seems a little redundant to ship such a thing in every Cython-generated file since we hold the GIL during module loading anyway. 
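[Editor's sketch] The interning logic described above (intern_literal / intern_heap_allocated returning one canonical pointer per distinct string) is, with the locking stripped away, just a map from each string to its first-seen canonical instance. A Python sketch of the idea only; the C version in the mail sits on khash.h and pythread.h, and locking and release_interned-style bookkeeping are omitted here:

```python
# Every equal signature string maps to one canonical object, so later
# signature comparisons become pointer (identity) compares instead of
# full string compares.
_interned = {}

def intern_signature(s):
    # First caller's object becomes canonical; later equal strings get it back.
    return _interned.setdefault(s, s)

a = intern_signature("method:foo:i4i8->f8")
b = intern_signature("".join(["method:foo:", "i4i8->f8"]))
assert a == b and a is b   # equal strings -> the very same object
```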
Dag From d.s.seljebotn at astro.uio.no Sat Jun 9 08:00:50 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Sat, 09 Jun 2012 08:00:50 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: <4FD2E313.6040208@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> <4FCFC088.3000709@astro.uio.no> <4FCFC441.40703@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> <4FD0808B.5080300@astro.uio.no> <4FD083F9.2030006@astro.uio.no> <4FD26ADA.5060401@astro.uio.no> <4FD2E313.6040208@astro.uio.no> Message-ID: <4FD2E692.4040404@astro.uio.no> On 06/09/2012 07:45 AM, Dag Sverre Seljebotn wrote: > On 06/09/2012 03:21 AM, Robert Bradshaw wrote: >> On Fri, Jun 8, 2012 at 2:12 PM, Dag Sverre Seljebotn >>> There's still the indirection through SEP 200 (extensibletype slots). >>> We can >>> get rid of that very easily by just making that table and the >>> hash-vtable >>> one and the same. (It could still either have interned string keys or ID >>> keys depending on the least significant bit.) >> >> Or we can even forgo the interning for this table, and give up on >> partitioning the space numerically and just use the dns-style >> prefixing, e.g. "org.cython.X" belongs to us. > > Huh? Isn't that when you *need* interning? Do you plan on key-encoding > those kind of strings into 64 bits? > > (I think it would usually be "method:foo:ii->d" (or my current > preference is "method:foo:i4i8->f8")) Well, I guess something like "org.cython.X" would happen often as well, in addition. Just put it all in the same table :-) > > Partitioning the space numerically you'd just hash the number; "SEP 260: > We use id 0x70040001, which has lower-64-md5 0xfa454a...ULL". The real use-case I see for this now is in having the PyArray_DATA etc. access pointers simply through compile-time constants the library can define on both ends. 
It could just do PyCustomSlots_Lookup(obj->ob_type, 0x70040001, 0xfa45323...ULL) specifically to get a function retrieving the data-pointer. PyArray_SHAPE would do PyCustomSlots_Lookup(obj->ob_type, 0x70040002, 0xbbad423...ULL) Also, I'd want PyExtensibleType_Object to have: { ... PyPerfectTable *tp_perfect_table; Py_ssize_t tp_perfect_table_obj_offset; } i.e. we allow for getting quickly to a table on the object in addition to the one on the type. Callbacks look up the one on the object first (before potentially checking for __call__ in the type); method-calling might ignore the one on the object. Dag > >> There is value in the double indirection if this (or any of the other) >> lookup tables are meant to be modified over time. > > This isn't impossible with a hash table either. You just need to > reallocate a little more often than what would be the case with a > regular hash table, but not dramatically so (you need to rehash whenever > the element to insert hashes into a "large" bin, which are rather few). > > I want the table to have a pointer to it, so that you can atomically > swap it out. > >>> To wrap up, I think this has grown in complexity beyond the "simple SEP >>> spec". It's at the point where you don't really want to have several >>> libraries implementing the same simple spec, but instead use the same >>> implementation. >>> >>> But I think the advantages are simply too good to give up on. >>> >>> So I think a viable route forward is to forget the >>> CEP/SEP/pre-PEP-approach >>> for now (which only works for semi-complicated ideas with simple >>> implementations) and instead simply work more directly on a library. It >>> would need to have a couple of different use modes: >> >> I prefer an enhancement proposal with a spec over a library, even if >> only a single library gets used in practice. I still think it's simple >> enough. 
Basically, we have the "lookup spec" and then a CEP for >> applying this to fast callable (agreeing on signatures, and what to do >> with extern types) and extensible type slots. > > OK. > >> >>> - A Python perfect-hasher for use when generating code, with only the a >>> string interner based on CPython dicts and extensibletype metaclass as >>> runtime dependencies (for use in Cython). This would only add some >>> hundred >>> source file lines... >>> >>> - A C implementation of the perfect hashing exposed through a >>> PyPerfectHashTable_Ready(), for use in libraries written in C like >>> NumPy/SciPy). This would need to bundle the md5 algorithm and a C >>> implementation of the perfect hashing. >>> >>> And on the distribution axis: >>> >>> - Small C header-style implementation of a string interner and the >>> extensibletype metaclass, rendezvousing through sys.modules >>> >>> - As part of the rendezvous, one would always try to __import__ the >>> *real* >>> run-time library. So if it is available in sys.path it overrides >>> anything >>> bundled with other libraries. That would provide a way forward for >>> GIL-less >>> string interning, a Python-side library for working with these tables >>> and >>> inspecting them, etc. >> >> Hmm, that's an interesting idea. I think we don't actually need >> interning, in which case the "full" library is only needed for >> introspection. > > You don't believe the security concern is real then? Or do you want to > pay the cost for a 160-bit SHA1 compare everywhere? > > I'd love to not do interning, but I see no way around it. > > BTW, a GIL-less interning library isn't rocket science. I ran khash.h > through a preprocessor with > > KHASH_MAP_INIT_STR(str_to_entry, entry_t) > > and the result is 180 lines of code for the hash table. Then pythread.h > provides the thread lock, another 50 lines for the interning logic > (intern_literal, intern_heap_allocated, release_interned). 
> > It just seems a little redundant to ship such a thing in every > Cython-generated file since we hold the GIL during module loading anyway. > > Dag > _______________________________________________ > cython-devel mailing list > cython-devel at python.org > http://mail.python.org/mailman/listinfo/cython-devel From d.s.seljebotn at astro.uio.no Sat Jun 9 08:02:07 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Sat, 09 Jun 2012 08:02:07 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: <4FD2E692.4040404@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> <4FCFC088.3000709@astro.uio.no> <4FCFC441.40703@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> <4FD0808B.5080300@astro.uio.no> <4FD083F9.2030006@astro.uio.no> <4FD26ADA.5060401@astro.uio.no> <4FD2E313.6040208@astro.uio.no> <4FD2E692.4040404@astro.uio.no> Message-ID: <4FD2E6DF.5000606@astro.uio.no> On 06/09/2012 08:00 AM, Dag Sverre Seljebotn wrote: > On 06/09/2012 07:45 AM, Dag Sverre Seljebotn wrote: >> On 06/09/2012 03:21 AM, Robert Bradshaw wrote: >>> On Fri, Jun 8, 2012 at 2:12 PM, Dag Sverre Seljebotn >>>> There's still the indirection through SEP 200 (extensibletype slots). >>>> We can >>>> get rid of that very easily by just making that table and the >>>> hash-vtable >>>> one and the same. (It could still either have interned string keys >>>> or ID >>>> keys depending on the least significant bit.) >>> >>> Or we can even forgo the interning for this table, and give up on >>> partitioning the space numerically and just use the dns-style >>> prefixing, e.g. "org.cython.X" belongs to us. >> >> Huh? Isn't that when you *need* interning? Do you plan on key-encoding >> those kind of strings into 64 bits? 
>> >> (I think it would usually be "method:foo:ii->d" (or my current >> preference is "method:foo:i4i8->f8")) > > Well, I guess something like "org.cython.X" would happen often as well, > in addition. Just put it all in the same table :-) > >> >> Partitioning the space numerically you'd just hash the number; "SEP 260: >> We use id 0x70040001, which has lower-64-md5 0xfa454a...ULL". > > The real use-case I see for this now is in having the PyArray_DATA etc. > access pointers simply through compile-time constants the library can > define on both ends. It could just do > > PyCustomSlots_Lookup(obj->ob_type, 0x70040001, 0xfa45323...ULL) > > specifically to get a function retrieving the data-pointer. > PyArray_SHAPE would do > > PyCustomSlots_Lookup(obj->ob_type, 0x70040002, 0xbbad423...ULL) Argh. I meant 0x70040002 | 1, of course ;-) DS > > Also, I'd want PyExtensibleType_Object to have: > > { > ... > PyPerfectTable *tp_perfect_table; > Py_ssize_t tp_perfect_table_obj_offset; > } > > i.e. we allow for getting quickly to a table on the object in addition > to the one on the type. > > Callbacks look up the one on the object first (before potentially > checking for __call__ in the type); method-calling might ignore the one > on the object. > > Dag > >> >>> There is value in the double indirection if this (or any of the other) >>> lookup tables are meant to be modified over time. >> >> This isn't impossible with a hash table either. You just need to >> reallocate a little more often than what would be the case with a >> regular hash table, but not dramatically so (you need to rehash whenever >> the element to insert hashes into a "large" bin, which are rather few). >> >> I want the table to have a pointer to it, so that you can atomically >> swap it out. >> >>>> To wrap up, I think this has grown in complexity beyond the "simple SEP >>>> spec". 
It's at the point where you don't really want to have several >>>> libraries implementing the same simple spec, but instead use the same >>>> implementation. >>>> >>>> But I think the advantages are simply too good to give up on. >>>> >>>> So I think a viable route forward is to forget the >>>> CEP/SEP/pre-PEP-approach >>>> for now (which only works for semi-complicated ideas with simple >>>> implementations) and instead simply work more directly on a library. It >>>> would need to have a couple of different use modes: >>> >>> I prefer an enhancement proposal with a spec over a library, even if >>> only a single library gets used in practice. I still think it's simple >>> enough. Basically, we have the "lookup spec" and then a CEP for >>> applying this to fast callable (agreeing on signatures, and what to do >>> with extern types) and extensible type slots. >> >> OK. >> >>> >>>> - A Python perfect-hasher for use when generating code, with only the a >>>> string interner based on CPython dicts and extensibletype metaclass as >>>> runtime dependencies (for use in Cython). This would only add some >>>> hundred >>>> source file lines... >>>> >>>> - A C implementation of the perfect hashing exposed through a >>>> PyPerfectHashTable_Ready(), for use in libraries written in C like >>>> NumPy/SciPy). This would need to bundle the md5 algorithm and a C >>>> implementation of the perfect hashing. >>>> >>>> And on the distribution axis: >>>> >>>> - Small C header-style implementation of a string interner and the >>>> extensibletype metaclass, rendezvousing through sys.modules >>>> >>>> - As part of the rendezvous, one would always try to __import__ the >>>> *real* >>>> run-time library. So if it is available in sys.path it overrides >>>> anything >>>> bundled with other libraries. That would provide a way forward for >>>> GIL-less >>>> string interning, a Python-side library for working with these tables >>>> and >>>> inspecting them, etc. 
>>> >>> Hmm, that's an interesting idea. I think we don't actually need >>> interning, in which case the "full" library is only needed for >>> introspection. >> >> You don't believe the security concern is real then? Or do you want to >> pay the cost for a 160-bit SHA1 compare everywhere? >> >> I'd love to not do interning, but I see no way around it. >> >> BTW, a GIL-less interning library isn't rocket science. I ran khash.h >> through a preprocessor with >> >> KHASH_MAP_INIT_STR(str_to_entry, entry_t) >> >> and the result is 180 lines of code for the hash table. Then pythread.h >> provides the thread lock, another 50 lines for the interning logic >> (intern_literal, intern_heap_allocated, release_interned). >> >> It just seems a little redundant to ship such a thing in every >> Cython-generated file since we hold the GIL during module loading anyway. >> >> Dag >> _______________________________________________ >> cython-devel mailing list >> cython-devel at python.org >> http://mail.python.org/mailman/listinfo/cython-devel > From robertwb at gmail.com Sun Jun 10 09:00:44 2012 From: robertwb at gmail.com (Robert Bradshaw) Date: Sun, 10 Jun 2012 00:00:44 -0700 Subject: [Cython] Hash-based vtables In-Reply-To: <4FD2E313.6040208@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> <4FCFC088.3000709@astro.uio.no> <4FCFC441.40703@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> <4FD0808B.5080300@astro.uio.no> <4FD083F9.2030006@astro.uio.no> <4FD26ADA.5060401@astro.uio.no> <4FD2E313.6040208@astro.uio.no> Message-ID: On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn wrote: > On 06/09/2012 03:21 AM, Robert Bradshaw wrote: >> >> On Fri, Jun 8, 2012 at 2:12 PM, Dag Sverre Seljebotn >>> There's still the indirection through SEP 200 (extensibletype slots). We >>> can >>> get rid of that very easily by just making that table and the hash-vtable >>> one and the same. 
(It could still either have interned string keys or ID >>> keys depending on the least significant bit.) >> >> >> Or we can even forgo the interning for this table, and give up on >> partitioning the space numerically and just use the dns-style >> prefixing, e.g. "org.cython.X" belongs to us. > > > Huh? Isn't that when you *need* interning? Do you plan on key-encoding those > kind of strings into 64 bits? No, use 64-bits of a cryptographically-secure hash. > (I think it would usually be "method:foo:ii->d" (or my current preference is > "method:foo:i4i8->f8")) Yeah, I was assuming methods wouldn't be specific to Cython. (Putting sizes in the format makes a lot of sense for persistent storage, but I think it's safe to assume that a long in the provider == a long in the consumer, and this would mean the hashes would have to be computed after (some) C compilation). > Partitioning the space numerically you'd just hash the number; "SEP 260: We > use id 0x70040001, which has lower-64-md5 0xfa454a...ULL". But why bother with the id? >> There is value in the double indirection if this (or any of the other) >> lookup tables are meant to be modified over time. > > > This isn't impossible with a hash table either. You just need to reallocate > a little more often than what would be the case with a regular hash table, > but not dramatically so (you need to rehash whenever the element to insert > hashes into a "large" bin, which are rather few). > > I want the table to have a pointer to it, so that you can atomically swap it > out. I think that's worth a level of indirection. >>> To wrap up, I think this has grown in complexity beyond the "simple SEP >>> spec". It's at the point where you don't really want to have several >>> libraries implementing the same simple spec, but instead use the same >>> implementation. >>> >>> But I think the advantages are simply too good to give up on. 
>>> >>> So I think a viable route forward is to forget the >>> CEP/SEP/pre-PEP-approach >>> for now (which only works for semi-complicated ideas with simple >>> implementations) and instead simply work more directly on a library. It >>> would need to have a couple of different use modes: >> >> >> I prefer an enhancement proposal with a spec over a library, even if >> only a single library gets used in practice. I still think it's simple >> enough. Basically, we have the "lookup spec" and then a CEP for >> applying this to fast callable (agreeing on signatures, and what to do >> with extern types) and extensible type slots. > > > OK. > > >> >>> ?- A Python perfect-hasher for use when generating code, with only the a >>> string interner based on CPython dicts and extensibletype metaclass as >>> runtime dependencies (for use in Cython). This would only add some >>> hundred >>> source file lines... >>> >>> ?- A C implementation of the perfect hashing exposed through a >>> PyPerfectHashTable_Ready(), for use in libraries written in C like >>> NumPy/SciPy). This would need to bundle the md5 algorithm and a C >>> implementation of the perfect hashing. >>> >>> And on the distribution axis: >>> >>> ?- Small C header-style implementation of a string interner and the >>> extensibletype metaclass, rendezvousing through sys.modules >>> >>> ?- As part of the rendezvous, one would always try to __import__ the >>> *real* >>> run-time library. So if it is available in sys.path it overrides anything >>> bundled with other libraries. That would provide a way forward for >>> GIL-less >>> string interning, a Python-side library for working with these tables and >>> inspecting them, etc. >> >> >> Hmm, that's an interesting idea. I think we don't actually need >> interning, in which case the "full" library is only needed for >> introspection. > > > You don't believe the security concern is real then? Or do you want to pay > the cost for a 160-bit SHA1 compare everywhere? 
> > I'd love to not do interning, but I see no way around it. No, I want to use the lower 64 bits by default, but always have the top 96 bits around to allow using this mechanism in "secure" mode at a slight penalty. md5 is out because there are known collisions. (Yes, sha-1 may succumb sooner rather than later, theoretical weaknesses have been shown, so we could look to using something else (hopefully still shipped with Python)). > BTW, a GIL-less interning library isn't rocket science. I ran khash.h > through a preprocessor with > > KHASH_MAP_INIT_STR(str_to_entry, entry_t) > > and the result is 180 lines of code for the hash table. Then pythread.h > provides the thread lock, another 50 lines for the interning logic > (intern_literal, intern_heap_allocated, release_interned). > > It just seems a little redundant to ship such a thing in every > Cython-generated file since we hold the GIL during module loading anyway. It's the rendezvous on the global state that's more messy than the locking (though we do already require that for the metaclass approach of detecting types with extended slots). 
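The khash-plus-pythread interner Dag describes can be made concrete. Below is a much-simplified stand-in: a fixed-size open-addressing table with no locking, resizing, or deallocation. `intern_string` and `str_hash` are illustrative names rather than a proposed API, and FNV-1a merely stands in for whatever hash the real library would pick.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified sketch of a string interner: a fixed-size open-addressing
 * table mapping string contents to one canonical heap copy.  The real
 * thing would be khash.h plus a pythread.h lock; this stand-in ignores
 * locking, resizing and deallocation entirely. */

#define INTERN_SLOTS 1024            /* power of two so we can mask */

static const char *intern_table[INTERN_SLOTS];

static uint64_t str_hash(const char *s)
{
    uint64_t h = 14695981039346656037ULL;  /* FNV-1a, illustrative only */
    for (; *s; s++) {
        h ^= (unsigned char)*s;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Return the canonical pointer for the contents of `s`: equal contents
 * always yield the same pointer, so later checks are pointer compares. */
const char *intern_string(const char *s)
{
    uint64_t i = str_hash(s) & (INTERN_SLOTS - 1);
    while (intern_table[i]) {
        if (strcmp(intern_table[i], s) == 0)
            return intern_table[i];          /* already interned */
        i = (i + 1) & (INTERN_SLOTS - 1);    /* linear probing */
    }
    char *copy = malloc(strlen(s) + 1);      /* make the canonical copy */
    strcpy(copy, s);
    intern_table[i] = copy;
    return intern_table[i];
}
```

Because equal contents map to one canonical pointer, a signature match at a call site then reduces to a single pointer compare.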
- Robert From d.s.seljebotn at astro.uio.no Sun Jun 10 09:14:43 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Sun, 10 Jun 2012 09:14:43 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> <4FCFC088.3000709@astro.uio.no> <4FCFC441.40703@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> <4FD0808B.5080300@astro.uio.no> <4FD083F9.2030006@astro.uio.no> <4FD26ADA.5060401@astro.uio.no> <4FD2E313.6040208@astro.uio.no> Message-ID: <6c423841-b888-478d-8b89-148f3e9bd60e@email.android.com> Robert Bradshaw wrote: >On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn > wrote: >> On 06/09/2012 03:21 AM, Robert Bradshaw wrote: >>> >>> On Fri, Jun 8, 2012 at 2:12 PM, Dag Sverre Seljebotn >>>> There's still the indirection through SEP 200 (extensibletype >slots). We >>>> can >>>> get rid of that very easily by just making that table and the >hash-vtable >>>> one and the same. (It could still either have interned string keys >or ID >>>> keys depending on the least significant bit.) >>> >>> >>> Or we can even forgo the interning for this table, and give up on >>> partitioning the space numerically and just use the dns-style >>> prefixing, e.g. "org.cython.X" belongs to us. >> >> >> Huh? Isn't that when you *need* interning? Do you plan on >key-encoding those >> kind of strings into 64 bits? > >No, use 64-bits of a a cryptographically-secure hash. > >> (I think it would usually be "method:foo:ii->d" (or my current >preference is >> "method:foo:i4i8->f8")) > >Yeah, I was assuming methods wouldn't be specific to Cython. (Putting >sizes in the format makes a lot of sense for persistent storage, but I >think it's safe to assume that a"long in the provider == a long in the >consumer, and this would mean the hashes would have to be computed >after (some) C compilation). 
> >> Partitioning the space numerically you'd just hash the number; "SEP >260: We >> use id 0x70040001, which has lower-64-md5 0xfa454a...ULL". > >But why bother with the id? > >>> There is value in the double indirection if this (or any of the >other) >>> lookup tables are meant to be modified over time. >> >> >> This isn't impossible with a hash table either. You just need to >reallocate >> a little more often than what would be the case with a regular hash >table, >> but not dramatically so (you need to rehash whenever the element to >insert >> hashes into a "large" bin, which are rather few). >> >> I want the table to have a pointer to it, so that you can atomically >swap it >> out. > >I think that's worth a level of indirection. > >>>> To wrap up, I think this has grown in complexity beyond the "simple >SEP >>>> spec". It's at the point where you don't really want to have >several >>>> libraries implementing the same simple spec, but instead use the >same >>>> implementation. >>>> >>>> But I think the advantages are simply too good to give up on. >>>> >>>> So I think a viable route forward is to forget the >>>> CEP/SEP/pre-PEP-approach >>>> for now (which only works for semi-complicated ideas with simple >>>> implementations) and instead simply work more directly on a >library. It >>>> would need to have a couple of different use modes: >>> >>> >>> I prefer an enhancement proposal with a spec over a library, even if >>> only a single library gets used in practice. I still think it's >simple >>> enough. Basically, we have the "lookup spec" and then a CEP for >>> applying this to fast callable (agreeing on signatures, and what to >do >>> with extern types) and extensible type slots. >> >> >> OK. >> >> >>> >>>> ?- A Python perfect-hasher for use when generating code, with only >the a >>>> string interner based on CPython dicts and extensibletype metaclass >as >>>> runtime dependencies (for use in Cython). 
This would only add some >>>> hundred >>>> source file lines... >>>> >>>> - A C implementation of the perfect hashing exposed through a >>>> PyPerfectHashTable_Ready(), for use in libraries written in C like >>>> NumPy/SciPy). This would need to bundle the md5 algorithm and a C >>>> implementation of the perfect hashing. >>>> >>>> And on the distribution axis: >>>> >>>> - Small C header-style implementation of a string interner and the >>>> extensibletype metaclass, rendezvousing through sys.modules >>>> >>>> - As part of the rendezvous, one would always try to __import__ >the >>>> *real* >>>> run-time library. So if it is available in sys.path it overrides >anything >>>> bundled with other libraries. That would provide a way forward for >>>> GIL-less >>>> string interning, a Python-side library for working with these >tables and >>>> inspecting them, etc. >>> >>> >>> Hmm, that's an interesting idea. I think we don't actually need >>> interning, in which case the "full" library is only needed for >>> introspection. >> >> >> You don't believe the security concern is real then? Or do you want >to pay >> the cost for a 160-bit SHA1 compare everywhere? >> >> I'd love to not do interning, but I see no way around it. > >No, I want to use the lower 64 bits by default, but always have the >top 96 bits around to allow using this mechanism in "secure" mode at a >slight penalty. md5 is out because there are known collisions. (Yes, >sha-1 may succumb sooner rather than later, theoretical weaknesses >have been shown, so we could look to using something else (hopefully >still shipped with Python)). But very few users are going to know about this. What are the odds that a user who decides to trigger JIT-compilation with function signatures that vary based on the input will know about the option and turn it on and also recompile all his/her C extension modules? In practice, such an option would always stay at its default value. 
If we leave it to secure by default and start teaching it to users from the start...but that's a big price to pay. And if you *do* want to run in secure mode, it will be a lot slower than interning. Dag > >> BTW, a GIL-less interning library isn't rocket science. I ran khash.h >> through a preprocessor with >> >> KHASH_MAP_INIT_STR(str_to_entry, entry_t) >> >> and the result is 180 lines of code for the hash table. Then >pythread.h >> provides the thread lock, another 50 lines for the interning logic >> (intern_literal, intern_heap_allocated, release_interned). >> >> It just seems a little redundant to ship such a thing in every >> Cython-generated file since we hold the GIL during module loading >anyway. > >It's the rendezvous on the global state that's more messy then the >locking (though we do already require that for the metaclass approach >of detecting types with extended slots). > >- Robert >_______________________________________________ >cython-devel mailing list >cython-devel at python.org >http://mail.python.org/mailman/listinfo/cython-devel -- Sent from my Android phone with K-9 Mail. Please excuse my brevity. 
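The fast-vs-secure trade-off under discussion — compare only the lower 64 bits of the signature hash by default, and consult the top 96 bits only in "secure" mode — could look roughly like this. The layout and names are assumptions for illustration, not part of any spec.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch: each entry keeps the full 160-bit signature hash, split into
 * the low 64 bits (used for table placement and the default compare)
 * and the top 96 bits (checked only in "secure" mode). */
typedef struct {
    uint64_t lo;      /* low 64 bits: the default compare */
    uint8_t  hi[12];  /* top 96 bits: secure mode only */
} sig_hash_t;

int sig_match(const sig_hash_t *a, const sig_hash_t *b, int secure)
{
    if (a->lo != b->lo)
        return 0;                          /* fast path: one 64-bit compare */
    if (!secure)
        return 1;                          /* default mode stops here */
    return memcmp(a->hi, b->hi, 12) == 0;  /* the "slight penalty" */
}
```

This shows why the penalty is slight: the extra 12-byte memcmp only runs when the low 64 bits already agree.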
From robertwb at gmail.com Sun Jun 10 09:34:21 2012 From: robertwb at gmail.com (Robert Bradshaw) Date: Sun, 10 Jun 2012 00:34:21 -0700 Subject: [Cython] Hash-based vtables In-Reply-To: <6c423841-b888-478d-8b89-148f3e9bd60e@email.android.com> References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> <4FCFC088.3000709@astro.uio.no> <4FCFC441.40703@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> <4FD0808B.5080300@astro.uio.no> <4FD083F9.2030006@astro.uio.no> <4FD26ADA.5060401@astro.uio.no> <4FD2E313.6040208@astro.uio.no> <6c423841-b888-478d-8b89-148f3e9bd60e@email.android.com> Message-ID: On Sun, Jun 10, 2012 at 12:14 AM, Dag Sverre Seljebotn wrote: > > > Robert Bradshaw wrote: > >>On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn >> wrote: >>> On 06/09/2012 03:21 AM, Robert Bradshaw wrote: >>>> >>>> On Fri, Jun 8, 2012 at 2:12 PM, Dag Sverre Seljebotn >>>>> There's still the indirection through SEP 200 (extensibletype >>slots). We >>>>> can >>>>> get rid of that very easily by just making that table and the >>hash-vtable >>>>> one and the same. (It could still either have interned string keys >>or ID >>>>> keys depending on the least significant bit.) >>>> >>>> >>>> Or we can even forgo the interning for this table, and give up on >>>> partitioning the space numerically and just use the dns-style >>>> prefixing, e.g. "org.cython.X" belongs to us. >>> >>> >>> Huh? Isn't that when you *need* interning? Do you plan on >>key-encoding those >>> kind of strings into 64 bits? >> >>No, use 64-bits of a a cryptographically-secure hash. >> >>> (I think it would usually be "method:foo:ii->d" (or my current >>preference is >>> "method:foo:i4i8->f8")) >> >>Yeah, I was assuming methods wouldn't be specific to Cython. 
(Putting >>sizes in the format makes a lot of sense for persistent storage, but I >>think it's safe to assume that a"long in the provider == a long in the >>consumer, and this would mean the hashes would have to be computed >>after (some) C compilation). >> >>> Partitioning the space numerically you'd just hash the number; "SEP >>260: We >>> use id 0x70040001, which has lower-64-md5 0xfa454a...ULL". >> >>But why bother with the id? >> >>>> There is value in the double indirection if this (or any of the >>other) >>>> lookup tables are meant to be modified over time. >>> >>> >>> This isn't impossible with a hash table either. You just need to >>reallocate >>> a little more often than what would be the case with a regular hash >>table, >>> but not dramatically so (you need to rehash whenever the element to >>insert >>> hashes into a "large" bin, which are rather few). >>> >>> I want the table to have a pointer to it, so that you can atomically >>swap it >>> out. >> >>I think that's worth a level of indirection. >> >>>>> To wrap up, I think this has grown in complexity beyond the "simple >>SEP >>>>> spec". It's at the point where you don't really want to have >>several >>>>> libraries implementing the same simple spec, but instead use the >>same >>>>> implementation. >>>>> >>>>> But I think the advantages are simply too good to give up on. >>>>> >>>>> So I think a viable route forward is to forget the >>>>> CEP/SEP/pre-PEP-approach >>>>> for now (which only works for semi-complicated ideas with simple >>>>> implementations) and instead simply work more directly on a >>library. It >>>>> would need to have a couple of different use modes: >>>> >>>> >>>> I prefer an enhancement proposal with a spec over a library, even if >>>> only a single library gets used in practice. I still think it's >>simple >>>> enough. 
Basically, we have the "lookup spec" and then a CEP for >>>> applying this to fast callable (agreeing on signatures, and what to >>do >>>> with extern types) and extensible type slots. >>> >>> >>> OK. >>> >>> >>>> >>>>> ?- A Python perfect-hasher for use when generating code, with only >>the a >>>>> string interner based on CPython dicts and extensibletype metaclass >>as >>>>> runtime dependencies (for use in Cython). This would only add some >>>>> hundred >>>>> source file lines... >>>>> >>>>> ?- A C implementation of the perfect hashing exposed through a >>>>> PyPerfectHashTable_Ready(), for use in libraries written in C like >>>>> NumPy/SciPy). This would need to bundle the md5 algorithm and a C >>>>> implementation of the perfect hashing. >>>>> >>>>> And on the distribution axis: >>>>> >>>>> ?- Small C header-style implementation of a string interner and the >>>>> extensibletype metaclass, rendezvousing through sys.modules >>>>> >>>>> ?- As part of the rendezvous, one would always try to __import__ >>the >>>>> *real* >>>>> run-time library. So if it is available in sys.path it overrides >>anything >>>>> bundled with other libraries. That would provide a way forward for >>>>> GIL-less >>>>> string interning, a Python-side library for working with these >>tables and >>>>> inspecting them, etc. >>>> >>>> >>>> Hmm, that's an interesting idea. I think we don't actually need >>>> interning, in which case the "full" library is only needed for >>>> introspection. >>> >>> >>> You don't believe the security concern is real then? Or do you want >>to pay >>> the cost for a 160-bit SHA1 compare everywhere? >>> >>> I'd love to not do interning, but I see no way around it. >> >>No, I want to use the lower 64 bits by default, but always have the >>top 96 bits around to allow using this mechanism in "secure" mode at a >>slight penalty. md5 is out because there are known collisions. 
(Yes, >>sha-1 may succumb sooner rather than later, theoretical weaknesses >>have been shown, so we could look to using something else (hopefully >>still shipped with Python)). > > But very few users are going to know about this. What are the odds that a user who decides to trigger JIT-compilation with function signatures that vary based on the input will know about the option and turn it on and also recompile all his/her C extension modules? > > In practice, such an option would always stay at its default value. If we leave it to secure by default and start teaching it to users from the start...but that's a big price to pay. Yes, it's not ideal from this perspective. > And if you *do* want to run in secure mode, it will be a lot slower than interning. Are you thinking that the 64-bit interned pointer would be used as the hash? In this case all hashtables would have to be constructed at runtime, which means it needs to be really, really cheap (well under a millisecond, I'm sure Sage has >1000 classes, >10000 methods it imports at startup). Also I'm not sure how the very-uneven distribution would play out for constructing perfect hashtables (perhaps it won't hurt, there's likely to be long runs of consecutive values in some cases). 
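The "very-uneven distribution" worry is easy to illustrate: heap pointers are aligned, so their low bits carry no entropy, and masking them directly into table slots piles everything into the same bucket. Running the pointer through a cheap invertible mixer restores the spread; splitmix64's finalizer is used below purely as an example of such a mixer, not as anything the thread settled on.

```c
#include <assert.h>
#include <stdint.h>

/* Aligned pointer-like values all share low bits of zero, so
 * `ptr & mask` collides; a cheap invertible bit mixer (splitmix64's
 * finalizer) spreads the entropy back across all 64 bits. */
uint64_t mix64(uint64_t z)
{
    z ^= z >> 30; z *= 0xbf58476d1ce4e5b9ULL;
    z ^= z >> 27; z *= 0x94d049bb133111ebULL;
    z ^= z >> 31;
    return z;
}
```

Since the mixer is invertible, distinct pointers are guaranteed distinct 64-bit outputs; whether the *masked* slots then distribute well enough for perfect hashing is exactly what a benchmark would have to show.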
- Robert From d.s.seljebotn at astro.uio.no Sun Jun 10 10:00:36 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Sun, 10 Jun 2012 10:00:36 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> <4FCFC088.3000709@astro.uio.no> <4FCFC441.40703@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> <4FD0808B.5080300@astro.uio.no> <4FD083F9.2030006@astro.uio.no> <4FD26ADA.5060401@astro.uio.no> <4FD2E313.6040208@astro.uio.no> <6c423841-b888-478d-8b89-148f3e9bd60e@email.android.com> Message-ID: <4FD45424.9040909@astro.uio.no> On 06/10/2012 09:34 AM, Robert Bradshaw wrote: > On Sun, Jun 10, 2012 at 12:14 AM, Dag Sverre Seljebotn > wrote: >> >> >> Robert Bradshaw wrote: >> >>> On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn >>> wrote: >>>> I'd love to not do interning, but I see no way around it. >>> >>> No, I want to use the lower 64 bits by default, but always have the >>> top 96 bits around to allow using this mechanism in "secure" mode at a >>> slight penalty. md5 is out because there are known collisions. (Yes, >>> sha-1 may succumb sooner rather than later, theoretical weaknesses >>> have been shown, so we could look to using something else (hopefully >>> still shipped with Python). >> >> But very few users are going to know about this. What's the odds that the user who decide to trigger JIT-compilation with function signatures that varies based on the input will know about the option and turn it on and also recompile all his/her C extension modules? >> >> In practice, such an option would always stay at its default value. If we leave it to secure by default and start teaching it to users from the start...but that's a big price to pay. > > Yes, it's not ideal from this perspective. > >> And if you *do* want to run in secure mode, it will be a lot slower than interning. 
> > Are you thinking that the 64-bit interned pointer would be used as the > hash? In this case all hashtables would have to be constructed at > runtime, which means it needs to be really, really cheap (well under a > millisecond, I'm sure Sage has >1000 classes, >10000 methods it imports > at startup). Also I'm not sure how the very-uneven distribution would > play out for constructing perfect hashtables (perhaps it won't hurt, > there's likely to be long runs of consecutive values in some cases). No, I'm thinking that callsites need both the 64-bit interned char* and the 64-bit hash of the *contents*. They use the hash to figure out the position, then compare by ID. The hash is not stored in callees, it's discarded after figuring out the table layout. (There was this idea that if the char* has least significant bit set, we'd hash it directly rather than dereference it, but let's ignore that for now.) I don't think under a millisecond is unfeasible to hash smallish tables -- we could put the pointer through a cheap hash to create more entropy (for the perfect hashing, being able to select a hash function through the >>r is important, so you can't just use the pointer directly -- but there are functions cheaper than md5, e.g., in here: http://code.google.com/p/ulib/) That would save us a register and make the instructions shorter in some places I guess...I think it's really minuscule, it's not like the effect of load of a global variable. But if you like this approach I can benchmark C-written hashtable creation and see. 
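The "select a hash function through the >>r" idea can be sketched as a parameter search at table-build time: try shifts until `(h ^ (h >> r)) & mask` places every entry in a distinct slot, after which a lookup is a single probe plus one interned-pointer compare. This is a toy version — the real construction would also handle displacements and table resizing — and the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NSLOTS 8  /* power of two >= number of entries */

/* Candidate placement: xor the hash with a right-shifted copy of
 * itself, then mask down to the table size. */
unsigned slot_of(uint64_t h, unsigned r)
{
    return (unsigned)((h ^ (h >> r)) & (NSLOTS - 1));
}

/* Search for a shift r in [1, 63] that maps every hash to a distinct
 * slot; returns -1 if no shift works at this table size. */
int find_shift(const uint64_t *hashes, int n)
{
    for (unsigned r = 1; r < 64; r++) {
        unsigned char used[NSLOTS];
        memset(used, 0, sizeof used);
        int ok = 1;
        for (int i = 0; i < n; i++) {
            unsigned s = slot_of(hashes[i], r);
            if (used[s]) { ok = 0; break; }  /* collision: try next r */
            used[s] = 1;
        }
        if (ok)
            return (int)r;
    }
    return -1;
}
```

The caller then bakes the found `r` into the table header, and every probe at a call site costs one shift, one xor, and one mask.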
Dag From robertwb at gmail.com Sun Jun 10 10:23:47 2012 From: robertwb at gmail.com (Robert Bradshaw) Date: Sun, 10 Jun 2012 01:23:47 -0700 Subject: [Cython] Hash-based vtables In-Reply-To: <4FD45424.9040909@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> <4FCFC088.3000709@astro.uio.no> <4FCFC441.40703@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> <4FD0808B.5080300@astro.uio.no> <4FD083F9.2030006@astro.uio.no> <4FD26ADA.5060401@astro.uio.no> <4FD2E313.6040208@astro.uio.no> <6c423841-b888-478d-8b89-148f3e9bd60e@email.android.com> <4FD45424.9040909@astro.uio.no> Message-ID: On Sun, Jun 10, 2012 at 1:00 AM, Dag Sverre Seljebotn wrote: > On 06/10/2012 09:34 AM, Robert Bradshaw wrote: >> >> On Sun, Jun 10, 2012 at 12:14 AM, Dag Sverre Seljebotn >> ?wrote: >>> >>> >>> >>> Robert Bradshaw ?wrote: >>> >>>> On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn >>>> ?wrote: >>>>> >>>>> I'd love to not do interning, but I see no way around it. >>>> >>>> >>>> No, I want to use the lower 64 bits by default, but always have the >>>> top 96 bits around to allow using this mechanism in "secure" mode at a >>>> slight penalty. md5 is out because there are known collisions. (Yes, >>>> sha-1 may succumb sooner rather than later, theoretical weaknesses >>>> have been shown, so we could look to using something else (hopefully >>>> still shipped with Python). >>> >>> >>> But very few users are going to know about this. What's the odds that the >>> user who decide to trigger JIT-compilation with function signatures that >>> varies based on the input will know about the option and turn it on and also >>> recompile all his/her C extension modules? >>> >>> In practice, such an option would always stay at its default value. If we >>> leave it to secure by default and start teaching it to users from the >>> start...but that's a big price to pay. >> >> >> Yes, it's not ideal from this perspective. 
>> >>> And if you *do* want to run in secure mode, it will be a lot slower than >>> interning. >> >> >> Are you thinking that the 64-bit interned pointer would be used as the >> hash? In this case all hashtables would have to be constructed at >> runtime, which means it needs to be really, really cheap (well under a >> milisecond, I'm sure Sage has>1000 classes,>10000 methods it imports >> at startup). Also I'm not sure how the very-uneven distribution would >> play out for constructing perfect hastables (perhaps it won't hurt, >> there's likely to be long runs of consecutive values in some cases. > > > No, I'm thinking that callsites need both the 64-bit interned char* and the > 64-bit hash of the *contents*. They use the hash to figure out the position, > then compare by ID. Ah, I missed that bit. OK, yes, that could work well. > The hash is not stored in callees, it's discarded after figuring out the > table layout. > > (There was this idea that if the char* has least significant bit set, we'd > hash it directly rather than dereference it, but let's ignore that for now.) (For the purpose of this discussion, it's part of the "interning" step.) > I don't think under a millisecond is unfeasible to hash smallish tables -- > we could put the pointer through a cheap hash to create more entropy (for > the perfect hashing, being able to select a hash function through the >>r is > important, so you can't just use the pointer directly -- but there are > functions cheaper than md5, e.g, in here: http://code.google.com/p/ulib/) Just a sec, we're not hashing pointers, but the full signature itself, right? For our hash function we need (1) Collision free on 64-bits (for non-malicious use). (2) Good distribution (including for short strings, which is harder to come by). (2b) Any small subset of bits should have property (2). (3) Ideally easy to reference (e.g. "md5" is better than "these 100 lines of C code"). 
Cheap runtime construction is still ideal, but much less of an issue if hashes (and perfect tables) can be constructed at compile time, which I think this scheme allows. > That would save us a register and make the instructions shorter in some > places I guess...I think it's really miniscule, it's not like the effect of > load of a global variable. But if you like this approach I can benchmark > C-written hashtable creation and see. This will have value in and of itself (both the implementation and the benchmarks). - Robert From d.s.seljebotn at astro.uio.no Sun Jun 10 10:26:55 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Sun, 10 Jun 2012 10:26:55 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: <4FD45424.9040909@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> <4FCFC088.3000709@astro.uio.no> <4FCFC441.40703@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> <4FD0808B.5080300@astro.uio.no> <4FD083F9.2030006@astro.uio.no> <4FD26ADA.5060401@astro.uio.no> <4FD2E313.6040208@astro.uio.no> <6c423841-b888-478d-8b89-148f3e9bd60e@email.android.com> <4FD45424.9040909@astro.uio.no> Message-ID: <0c97966b-4c3a-4577-9673-726a31c49e23@email.android.com> Dag Sverre Seljebotn wrote: >On 06/10/2012 09:34 AM, Robert Bradshaw wrote: >> On Sun, Jun 10, 2012 at 12:14 AM, Dag Sverre Seljebotn >> wrote: >>> >>> >>> Robert Bradshaw wrote: >>> >>>> On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn >>>> wrote: >>>>> I'd love to not do interning, but I see no way around it. >>>> >>>> No, I want to use the lower 64 bits by default, but always have the >>>> top 96 bits around to allow using this mechanism in "secure" mode >at a >>>> slight penalty. md5 is out because there are known collisions. 
>(Yes, >>>> sha-1 may succumb sooner rather than later, theoretical weaknesses >>>> have been shown, so we could look to using something else >(hopefully >>>> still shipped with Python). >>> >>> But very few users are going to know about this. What's the odds >that the user who decide to trigger JIT-compilation with function >signatures that varies based on the input will know about the option >and turn it on and also recompile all his/her C extension modules? >>> >>> In practice, such an option would always stay at its default value. >If we leave it to secure by default and start teaching it to users from >the start...but that's a big price to pay. >> >> Yes, it's not ideal from this perspective. >> >>> And if you *do* want to run in secure mode, it will be a lot slower >than interning. >> >> Are you thinking that the 64-bit interned pointer would be used as >the >> hash? In this case all hashtables would have to be constructed at >> runtime, which means it needs to be really, really cheap (well under >a >> milisecond, I'm sure Sage has>1000 classes,>10000 methods it imports >> at startup). Also I'm not sure how the very-uneven distribution would >> play out for constructing perfect hastables (perhaps it won't hurt, >> there's likely to be long runs of consecutive values in some cases. > >No, I'm thinking that callsites need both the 64-bit interned char* and > >the 64-bit hash of the *contents*. They use the hash to figure out the >position, then compare by ID. > >The hash is not stored in callees, it's discarded after figuring out >the >table layout. > >(There was this idea that if the char* has least significant bit set, >we'd hash it directly rather than dereference it, but let's ignore that > >for now.) 
> >I don't think under a millisecond is unfeasible to hash smallish tables > >-- we could put the pointer through a cheap hash to create more entropy > >(for the perfect hashing, being able to select a hash function through >the >>r is important, so you can't just use the pointer directly -- but > >there are functions cheaper than md5, e.g., in here: >http://code.google.com/p/ulib/) > >That would save us a register and make the instructions shorter in some > >places I guess...I think it's really minuscule, it's not like the >effect >of load of a global variable. But if you like this approach I can >benchmark C-written hashtable creation and see. I don't know what I was thinking. The callsite can't hash every time, and the pointer doesn't contain enough entropy for perfect hashing, so hashing the pointer has only disadvantages. I really think the call site should have both a hash and a separate interned ID. And if the caller knows the entry should be there, it can skip the ID check and only needs the hash. That makes the table pretty slick for non-smart callers too, it would be (id, flags, ptr)-entries, and callers could either do strcmp or interning, with or without hashing. (I realize the information would be there in your proposal too, but this would be slimmer). Dag > >Dag >_______________________________________________ >cython-devel mailing list >cython-devel at python.org >http://mail.python.org/mailman/listinfo/cython-devel -- Sent from my Android phone with K-9 Mail. Please excuse my brevity. 
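The slim (id, flags, ptr) entry layout Dag proposes supports both kinds of caller on the same field: a "smart" caller that interned its signature compares pointers, while a non-smart caller strcmps the contents. The sketch below uses linear scans in place of the hash probe, and all names are illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the proposed slim entry: (id, flags, ptr). */
typedef struct {
    const char *id;    /* interned signature, e.g. "method:foo:i4i8->f8" */
    uint64_t    flags; /* e.g. a nogil bit, checked under a mask */
    void       *ptr;   /* the native-code entry point */
} table_entry_t;

/* Smart caller: it interned its own signature, so an identity compare
 * on the id field suffices. */
void *lookup_interned(const table_entry_t *t, int n, const char *interned_sig)
{
    for (int i = 0; i < n; i++)
        if (t[i].id == interned_sig)
            return t[i].ptr;
    return 0;
}

/* Non-smart caller: no interning, fall back to comparing contents. */
void *lookup_strcmp(const table_entry_t *t, int n, const char *sig)
{
    for (int i = 0; i < n; i++)
        if (strcmp(t[i].id, sig) == 0)
            return t[i].ptr;
    return 0;
}
```

Both paths read the same table, which is the point: the provider publishes one layout and each consumer picks the compare it can afford.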
From d.s.seljebotn at astro.uio.no Sun Jun 10 10:43:29 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Sun, 10 Jun 2012 10:43:29 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> <4FCFC088.3000709@astro.uio.no> <4FCFC441.40703@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> <4FD0808B.5080300@astro.uio.no> <4FD083F9.2030006@astro.uio.no> <4FD26ADA.5060401@astro.uio.no> <4FD2E313.6040208@astro.uio.no> <6c423841-b888-478d-8b89-148f3e9bd60e@email.android.com> <4FD45424.9040909@astro.uio.no> Message-ID: <4FD45E31.8060506@astro.uio.no> On 06/10/2012 10:23 AM, Robert Bradshaw wrote: > On Sun, Jun 10, 2012 at 1:00 AM, Dag Sverre Seljebotn > wrote: >> On 06/10/2012 09:34 AM, Robert Bradshaw wrote: >>> >>> On Sun, Jun 10, 2012 at 12:14 AM, Dag Sverre Seljebotn >>> wrote: >>>> >>>> >>>> >>>> Robert Bradshaw wrote: >>>> >>>>> On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn >>>>> wrote: >>>>>> >>>>>> I'd love to not do interning, but I see no way around it. >>>>> >>>>> >>>>> No, I want to use the lower 64 bits by default, but always have the >>>>> top 96 bits around to allow using this mechanism in "secure" mode at a >>>>> slight penalty. md5 is out because there are known collisions. (Yes, >>>>> sha-1 may succumb sooner rather than later, theoretical weaknesses >>>>> have been shown, so we could look to using something else (hopefully >>>>> still shipped with Python). >>>> >>>> >>>> But very few users are going to know about this. What's the odds that the >>>> user who decide to trigger JIT-compilation with function signatures that >>>> varies based on the input will know about the option and turn it on and also >>>> recompile all his/her C extension modules? >>>> >>>> In practice, such an option would always stay at its default value. 
If we >>>> leave it to secure by default and start teaching it to users from the >>>> start...but that's a big price to pay. >>> >>> >>> Yes, it's not ideal from this perspective. >>> >>>> And if you *do* want to run in secure mode, it will be a lot slower than >>>> interning. >>> >>> >>> Are you thinking that the 64-bit interned pointer would be used as the >>> hash? In this case all hashtables would have to be constructed at >>> runtime, which means it needs to be really, really cheap (well under a >>> milisecond, I'm sure Sage has>1000 classes,>10000 methods it imports >>> at startup). Also I'm not sure how the very-uneven distribution would >>> play out for constructing perfect hastables (perhaps it won't hurt, >>> there's likely to be long runs of consecutive values in some cases. >> >> >> No, I'm thinking that callsites need both the 64-bit interned char* and the >> 64-bit hash of the *contents*. They use the hash to figure out the position, >> then compare by ID. > > Ah, I missed that bit. OK, yes, that could work well. Ah, we've been talking past one another for some time then. OK, let's settle on that. > >> The hash is not stored in callees, it's discarded after figuring out the >> table layout. >> >> (There was this idea that if the char* has least significant bit set, we'd >> hash it directly rather than dereference it, but let's ignore that for now.) > > (For the purpose of this discussion, it's part of the "interning" step.) > >> I don't think under a millisecond is unfeasible to hash smallish tables -- >> we could put the pointer through a cheap hash to create more entropy (for >> the perfect hashing, being able to select a hash function through the>>r is >> important, so you can't just use the pointer directly -- but there are >> functions cheaper than md5, e.g, in here: http://code.google.com/p/ulib/) > > Just a sec, we're not hashing pointers, but the full signature itself, > right? 
For our hash function we need > > (1) Collision free on 64-bits (for non-malicious use). > (2) Good distribution (including for short strings, which is harder to come by). > (2b) Any small subset of bits should have property (2). > (3) Ideally easy to reference (e.g. "md5" is better than "these 100 > lines of C code"). > > Cheap runtime construction is still ideal, but much less of an issue > if hashes (and perfect tables) can be constructed at compile time, > which I think this scheme allows. Yes, 64 bits of md5 then? ulib contains "100 lines of C code" for md5 anyway, if one doesn't want to go through Python hashlib (I imagine e.g. hashlib might be unavailable somewhere as it relies on openssl and there's license war going on vs. gnutls and so on. And the md5 module is deprecated.). > >> That would save us a register and make the instructions shorter in some >> places I guess...I think it's really miniscule, it's not like the effect of >> load of a global variable. But if you like this approach I can benchmark >> C-written hashtable creation and see. > > This will have value in and of itself (both the implementation and the > benchmarks). Will do (eventually, less spare time in coming week). About signatures, a problem I see with following the C typing is that the signature "ill" wouldn't hash the same as "iii" on 32-bit Windows and "iqq" on 32-bit Linux, and so on. I think that would be really bad. "l" must be banished -- but then one might as well do "i4i8i8". Designing a signature hash where you select between these at compile-time is perhaps doable but does generate a lot of code and makes everything complicated. I think we should just start off with hashing at module load time when sizes are known, and then work with heuristics and/or build system integration to improve on that afterwards. 
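The "64 bits of md5" agreed on above is easy to pin down; what the spec would still have to fix is which 64 bits and in which byte order, so the concrete choice below (the first 8 digest bytes, little-endian) is an assumption for illustration only:

```python
import hashlib

def sig_hash64(signature: str) -> int:
    """64-bit pre-hash of a signature string as truncated md5."""
    digest = hashlib.md5(signature.encode("ascii")).digest()
    # Which 8 bytes and which byte order the spec would pick is open;
    # the low 8 bytes, little-endian, are assumed here.
    return int.from_bytes(digest[:8], "little")
```

This matches Robert's criteria: md5 is trivially easy to reference, it distributes well even on short strings, any small subset of its bits inherits that distribution, and the remaining 64 bits stay available for a "secure" mode at a slight penalty.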
Dag From robertwb at gmail.com Sun Jun 10 11:53:12 2012 From: robertwb at gmail.com (Robert Bradshaw) Date: Sun, 10 Jun 2012 02:53:12 -0700 Subject: [Cython] Hash-based vtables In-Reply-To: <4FD45E31.8060506@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> <4FCD20DC.6090906@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> <4FCFC088.3000709@astro.uio.no> <4FCFC441.40703@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> <4FD0808B.5080300@astro.uio.no> <4FD083F9.2030006@astro.uio.no> <4FD26ADA.5060401@astro.uio.no> <4FD2E313.6040208@astro.uio.no> <6c423841-b888-478d-8b89-148f3e9bd60e@email.android.com> <4FD45424.9040909@astro.uio.no> <4FD45E31.8060506@astro.uio.no> Message-ID: On Sun, Jun 10, 2012 at 1:43 AM, Dag Sverre Seljebotn wrote: > On 06/10/2012 10:23 AM, Robert Bradshaw wrote: >> >> On Sun, Jun 10, 2012 at 1:00 AM, Dag Sverre Seljebotn >> ?wrote: >>> >>> On 06/10/2012 09:34 AM, Robert Bradshaw wrote: >>>> >>>> >>>> On Sun, Jun 10, 2012 at 12:14 AM, Dag Sverre Seljebotn >>>> ? ?wrote: >>>>> >>>>> >>>>> >>>>> >>>>> Robert Bradshaw ? ?wrote: >>>>> >>>>>> On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn >>>>>> ? ?wrote: >>>>>>> >>>>>>> >>>>>>> I'd love to not do interning, but I see no way around it. >>>>>> >>>>>> >>>>>> >>>>>> No, I want to use the lower 64 bits by default, but always have the >>>>>> top 96 bits around to allow using this mechanism in "secure" mode at a >>>>>> slight penalty. md5 is out because there are known collisions. (Yes, >>>>>> sha-1 may succumb sooner rather than later, theoretical weaknesses >>>>>> have been shown, so we could look to using something else (hopefully >>>>>> still shipped with Python). >>>>> >>>>> >>>>> >>>>> But very few users are going to know about this. 
What's the odds that >>>>> the >>>>> user who decide to trigger JIT-compilation with function signatures >>>>> that >>>>> varies based on the input will know about the option and turn it on and >>>>> also >>>>> recompile all his/her C extension modules? >>>>> >>>>> In practice, such an option would always stay at its default value. If >>>>> we >>>>> leave it to secure by default and start teaching it to users from the >>>>> start...but that's a big price to pay. >>>> >>>> >>>> >>>> Yes, it's not ideal from this perspective. >>>> >>>>> And if you *do* want to run in secure mode, it will be a lot slower >>>>> than >>>>> interning. >>>> >>>> >>>> >>>> Are you thinking that the 64-bit interned pointer would be used as the >>>> hash? In this case all hashtables would have to be constructed at >>>> runtime, which means it needs to be really, really cheap (well under a >>>> milisecond, I'm sure Sage has>1000 classes,>10000 methods it imports >>>> at startup). Also I'm not sure how the very-uneven distribution would >>>> play out for constructing perfect hastables (perhaps it won't hurt, >>>> there's likely to be long runs of consecutive values in some cases. >>> >>> >>> >>> No, I'm thinking that callsites need both the 64-bit interned char* and >>> the >>> 64-bit hash of the *contents*. They use the hash to figure out the >>> position, >>> then compare by ID. >> >> >> Ah, I missed that bit. OK, yes, that could work well. > > > Ah, we've been talking past one another for some time then. OK, let's settle > on that. > > >> >>> The hash is not stored in callees, it's discarded after figuring out the >>> table layout. >>> >>> (There was this idea that if the char* has least significant bit set, >>> we'd >>> hash it directly rather than dereference it, but let's ignore that for >>> now.) >> >> >> (For the purpose of this discussion, it's part of the "interning" step.) 
>> >>> I don't think under a millisecond is unfeasible to hash smallish tables >>> -- >>> we could put the pointer through a cheap hash to create more entropy (for >>> the perfect hashing, being able to select a hash function through the>>r >>> is >>> important, so you can't just use the pointer directly -- but there are >>> functions cheaper than md5, e.g, in here: http://code.google.com/p/ulib/) >> >> >> Just a sec, we're not hashing pointers, but the full signature itself, >> right? For our hash function we need >> >> (1) Collision free on 64-bits (for non-malicious use). >> (2) Good distribution (including for short strings, which is harder to >> come by). >> (2b) Any small subset of bits should have property (2). >> (3) Ideally easy to reference (e.g. "md5" is better than "these 100 >> lines of C code"). >> >> Cheap runtime construction is still ideal, but much less of an issue >> if hashes (and perfect tables) can be constructed at compile time, >> which I think this scheme allows. > > > Yes, 64 bits of md5 then? +1 for me. > ulib contains "100 lines of C code" for md5 > anyway, if one doesn't want to go through Python hashlib (I imagine e.g. > hashlib might be unavailable somewhere as it relies on openssl and there's > license war going on vs. gnutls and so on. And the md5 module is > deprecated.). Just the interface, right? (hashlib should be used instead...) >>> That would save us a register and make the instructions shorter in some >>> places I guess...I think it's really miniscule, it's not like the effect >>> of >>> load of a global variable. But if you like this approach I can benchmark >>> C-written hashtable creation and see. >> >> >> This will have value in and of itself (both the implementation and the >> benchmarks). > > > Will do (eventually, less spare time in coming week). 
>
> About signatures, a problem I see with following the C typing is that
> the signature "ill" wouldn't hash the same as "iii" on 32-bit Windows
> and "iqq" on 32-bit Linux, and so on. I think that would be really bad.

This is why I suggested promotion for scalars (divide ints into
<=sizeof(long) and sizeof(long) < x <= sizeof(long long)), checked at C
compile time, though I guess you consider that evil. I don't consider
not matching really bad, just kind of bad.

> "l" must be banished -- but then one might as well do "i4i8i8".
>
> Designing a signature hash where you select between these at
> compile-time is perhaps doable but does generate a lot of code and
> makes everything complicated.

It especially gets messy when you're trying to pre-compute tables.

> I think we should just start off with hashing at module load time when
> sizes are known, and then work with heuristics and/or build system
> integration to improve on that afterwards.

Finding 10,000 optimal tables at runtime had better be really cheap
then, for Sage's sake :).

- Robert

From pav at iki.fi  Mon Jun 11 21:08:33 2012
From: pav at iki.fi (Pauli Virtanen)
Date: Mon, 11 Jun 2012 21:08:33 +0200
Subject: [Cython] Failure with asarray(memoryview) on Python 2.4
Message-ID:

Hi,

This doesn't work on Python 2.4 (works on >= 2.5):

------------
cimport numpy as np
import numpy as np

def _foo():
    cdef double[:] a
    a = np.array([1.0])
    return np.asarray(a)

def foo():
    print _foo()
------------

Spotted when using Cython 0.16 in Scipy. Results in:

Python 2.4.6 (#1, Nov 20 2010, 00:52:41)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import fail
>>> fail.foo()
Traceback (most recent call last):
  File "", line 1, in ?
  File "fail.pyx", line 10, in fail.foo (fail.c:1776)
    print _foo()
  File "fail.pyx", line 7, in fail._foo (fail.c:1715)
    return np.asarray(a)
  File "/usr/local/stow/python-easy-install//lib/python2.4/site-packages/numpy/core/numeric.py",
line 235, in asarray
    return array(a, dtype, copy=False, order=order)
  File "stringsource", line 366, in
View.MemoryView.memoryview.__getitem__ (fail.c:5975)
  File "stringsource", line 650, in View.MemoryView._unellipsify
(fail.c:9236)
TypeError: Cannot index with type ''

From pav at iki.fi  Mon Jun 11 21:12:38 2012
From: pav at iki.fi (Pauli Virtanen)
Date: Mon, 11 Jun 2012 21:12:38 +0200
Subject: [Cython] Cython 0.16 & compilation failure on MinGW?
Message-ID:

Hi,

With Scipy we ran into a compilation failure on MinGW in Cython code:

http://projects.scipy.org/scipy/ticket/1673

interpnd.c:10580: error: initializer element is not constant
interpnd.c:10580: error: (near initialization for
`__pyx_CyFunctionType_type.tp_call')

Can be fixed like this:

...
+static PyObject *__Pyx_PyCFunction_Call_wrap(PyObject *a, PyObject *b,
PyObject *c)
+{
+    return __Pyx_PyCFunction_Call(a, b, c);
+}
 static PyTypeObject __pyx_CyFunctionType_type = {
     PyVarObject_HEAD_INIT(0, 0)
     __Pyx_NAMESTR("cython_function_or_method"),
@@ -10577,7 +10581,7 @@ static PyTypeObject __pyx_CyFunctionType_type = {
     0,
     0,
     0,
-    __Pyx_PyCFunction_Call,
+    __Pyx_PyCFunction_Call_wrap,
     0,
     0,
     0,
...

It's a bit surprising to me that you cannot use the function from the
Python headers as a static initializer on that platform...

-- 
Pauli Virtanen

From markflorisson88 at gmail.com  Mon Jun 11 21:16:37 2012
From: markflorisson88 at gmail.com (mark florisson)
Date: Mon, 11 Jun 2012 20:16:37 +0100
Subject: [Cython] Failure with asarray(memoryview) on Python 2.4
In-Reply-To:
References:
Message-ID:

On 11 June 2012 20:08, Pauli Virtanen wrote:
> Hi,
>
> This doesn't work on Python 2.4 (works on >= 2.5):
>
> ------------
> cimport numpy as np
> import numpy as np
>
> def _foo():
> ?
?cdef double[:] a > ? ?a = np.array([1.0]) > ? ?return np.asarray(a) > > def foo(): > ? ?print _foo() > ------------ > > Spotted when using Cython 1.6 in Scipy. Results to: > > Python 2.4.6 (#1, Nov 20 2010, 00:52:41) > [GCC 4.4.5] on linux2 > Type "help", "copyright", "credits" or "license" for more information. >>>> import fail >>>> fail.foo() > Traceback (most recent call last): > ?File "", line 1, in ? > ?File "fail.pyx", line 10, in fail.foo (fail.c:1776) > ? ?print _foo() > ?File "fail.pyx", line 7, in fail._foo (fail.c:1715) > ? ?return np.asarray(a) > ?File > "/usr/local/stow/python-easy-install//lib/python2.4/site-packages/numpy/core/numeric.py", > line 235, in asarray > ? ?return array(a, dtype, copy=False, order=order) > ?File "stringsource", line 366, in > View.MemoryView.memoryview.__getitem__ (fail.c:5975) > ?File "stringsource", line 650, in View.MemoryView._unellipsify > (fail.c:9236) > TypeError: Cannot index with type '' > > _______________________________________________ > cython-devel mailing list > cython-devel at python.org > http://mail.python.org/mailman/listinfo/cython-devel Hey Pauli, Yeah, there was some weird bug with PyIndex_Check() not operating properly. Could you retry with the latest master? Mark From markflorisson88 at gmail.com Mon Jun 11 21:17:49 2012 From: markflorisson88 at gmail.com (mark florisson) Date: Mon, 11 Jun 2012 20:17:49 +0100 Subject: [Cython] Cython 1.6 & compilation failure on MinGW? In-Reply-To: References: Message-ID: On 11 June 2012 20:12, Pauli Virtanen wrote: > Hi, > > We ran with Scipy to a compilation failure on MinGW in Cython code: > > http://projects.scipy.org/scipy/ticket/1673 > > interpnd.c:10580: error: initializer element is not constant > interpnd.c:10580: error: (near initialization for > `__pyx_CyFunctionType_type.tp_call') > > Can be fixed like this: > > ... > +static PyObject *__Pyx_PyCFunction_Call_wrap(PyObject *a, PyObject *b, > PyObject *c) > +{ > + ? 
?return __Pyx_PyCFunction_Call(a, b, c); > +} > ?static PyTypeObject __pyx_CyFunctionType_type = { > ? ? PyVarObject_HEAD_INIT(0, 0) > ? ? __Pyx_NAMESTR("cython_function_or_method"), > @@ -10577,7 +10581,7 @@ static PyTypeObject __pyx_CyFunctionType_type = { > ? ? 0, > ? ? 0, > ? ? 0, > - ? ?__Pyx_PyCFunction_Call, > + ? ?__Pyx_PyCFunction_Call_wrap, > ? ? 0, > ? ? 0, > ? ? 0, > ... > > > It's a bit surprising to me that you cannot use the function from the > Python headers as a static initializer on that platform... > > -- > Pauli Virtanen > > _______________________________________________ > cython-devel mailing list > cython-devel at python.org > http://mail.python.org/mailman/listinfo/cython-devel Thanks, could you provide a pull request? That makes it easier to merge and assign credit. From pav at iki.fi Mon Jun 11 21:23:53 2012 From: pav at iki.fi (Pauli Virtanen) Date: Mon, 11 Jun 2012 21:23:53 +0200 Subject: [Cython] Failure with asarray(memoryview) on Python 2.4 In-Reply-To: References: Message-ID: Hi, 11.06.2012 21:16, mark florisson kirjoitti: [clip] > Yeah, there was some weird bug with PyIndex_Check() not operating > properly. Could you retry with the latest master? Doesn't seem to work in 5a0effd0 :( Traceback (most recent call last): File "", line 1, in ? File "fail.pyx", line 10, in fail.foo (fail.c:1807) print _foo() File "fail.pyx", line 7, in fail._foo (fail.c:1747) return np.asarray(a) File "/usr/local/stow/python-easy-install//lib/python2.4/site-packages/numpy/core/numeric.py", line 235, in asarray return array(a, dtype, copy=False, order=order) File "stringsource", line 366, in View.MemoryView.memoryview.__getitem__ (fail.c:6019) File "stringsource", line 650, in View.MemoryView._unellipsify (fail.c:9199) TypeError: Cannot index with type '' Cheers, Pauli From pav at iki.fi Mon Jun 11 21:24:56 2012 From: pav at iki.fi (Pauli Virtanen) Date: Mon, 11 Jun 2012 21:24:56 +0200 Subject: [Cython] Cython 1.6 & compilation failure on MinGW? 
In-Reply-To: References: Message-ID: 11.06.2012 21:17, mark florisson kirjoitti: [clip] > Thanks, could you provide a pull request? That makes it easier to > merge and assign credit. Ok, I'll try to not only just complain :) BRB, Pauli From pav at iki.fi Mon Jun 11 21:27:18 2012 From: pav at iki.fi (Pauli Virtanen) Date: Mon, 11 Jun 2012 21:27:18 +0200 Subject: [Cython] Cython 1.6 & compilation failure on MinGW? In-Reply-To: References: Message-ID: 11.06.2012 21:17, mark florisson kirjoitti: [clip] > Thanks, could you provide a pull request? That makes it easier to > merge and assign credit. Ok, this one seemed to already have been fixed in Cython master. Pauli From pav at iki.fi Mon Jun 11 21:55:57 2012 From: pav at iki.fi (Pauli Virtanen) Date: Mon, 11 Jun 2012 21:55:57 +0200 Subject: [Cython] Failure with asarray(memoryview) on Python 2.4 In-Reply-To: References: Message-ID: 11.06.2012 21:23, Pauli Virtanen kirjoitti: > Hi, > > 11.06.2012 21:16, mark florisson kirjoitti: > [clip] >> Yeah, there was some weird bug with PyIndex_Check() not operating >> properly. Could you retry with the latest master? > > Doesn't seem to work in 5a0effd0 :( [clip] Uhh, Numpy header arrayobject.h -> npy_common.h contains this #if (PY_VERSION_HEX < 0x02050000) ... #undef PyIndex_Check #define PyIndex_Check(op) 0 ... which nicely overrides the fixed PyIndex_Check defined by Cython. Time to fix that, I guess. I don't see reasonable ways to work around this in Cython... Pauli From markflorisson88 at gmail.com Mon Jun 11 22:02:26 2012 From: markflorisson88 at gmail.com (mark florisson) Date: Mon, 11 Jun 2012 21:02:26 +0100 Subject: [Cython] Failure with asarray(memoryview) on Python 2.4 In-Reply-To: References: Message-ID: On 11 June 2012 20:55, Pauli Virtanen wrote: > 11.06.2012 21:23, Pauli Virtanen kirjoitti: >> Hi, >> >> 11.06.2012 21:16, mark florisson kirjoitti: >> [clip] >>> Yeah, there was some weird bug with PyIndex_Check() not operating >>> properly. 
Could you retry with the latest master? >> >> Doesn't seem to work in 5a0effd0 ?:( > [clip] > > Uhh, Numpy header arrayobject.h -> npy_common.h contains this > > #if (PY_VERSION_HEX < 0x02050000) > ... > #undef PyIndex_Check > #define PyIndex_Check(op) 0 > ... > > which nicely overrides the fixed PyIndex_Check defined by Cython. > Time to fix that, I guess. > > I don't see reasonable ways to work around this in Cython... > > ? ? ? ?Pauli > > _______________________________________________ > cython-devel mailing list > cython-devel at python.org > http://mail.python.org/mailman/listinfo/cython-devel Ah, thanks! Stefan and I were kind of baffled by PyIndex_Check failing, I guess we should have run cpp on our source :) From d.s.seljebotn at astro.uio.no Tue Jun 12 13:01:45 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Tue, 12 Jun 2012 13:01:45 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: References: <4FCD100B.7000008@astro.uio.no> <048eeb04-aa8b-4e12-9a9b-5d552d39984b@email.android.com> <4FCFC088.3000709@astro.uio.no> <4FCFC441.40703@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> <4FD0808B.5080300@astro.uio.no> <4FD083F9.2030006@astro.uio.no> <4FD26ADA.5060401@astro.uio.no> <4FD2E313.6040208@astro.uio.no> <6c423841-b888-478d-8b89-148f3e9bd60e@email.android.com> <4FD45424.9040909@astro.uio.no> <4FD45E31.8060506@astro.uio.no> Message-ID: <4FD72199.7010803@astro.uio.no> On 06/10/2012 11:53 AM, Robert Bradshaw wrote: > On Sun, Jun 10, 2012 at 1:43 AM, Dag Sverre Seljebotn >> About signatures, a problem I see with following the C typing is that the >> signature "ill" wouldn't hash the same as "iii" on 32-bit Windows and "iqq" >> on 32-bit Linux, and so on. I think that would be really bad. > > This is why I suggested promotion for scalars (divide ints into > <=sizeof(long) and sizeof(long)< x<= sizeof(long long)), checked at > C compile time, though I guess you consider that evil. 
I don't > consider not matching really bad, just kind of bad. Right. At least a convention for promotion of scalars would be good anyway. Even MSVC supports stdint.h these days; basing ourselves on the random behaviour of "long" seems a bit outdated to me. "ssize_t" would be better motivated I feel. Many linear algebra libraries use 32-bit matrix indices by default, but can be swapped to 64-bit indices (this holds for many LAPACK implementations and most sparse linear algebra). So often there will at least be one typedef that is either 32 bits or 64 bits without the Cython compiler knowing. Promoting to a single type "[u]int64" is the only one that removes possible combinatorial explosion if you have multiple external typedefs that you don't know the size of (although I guess that's rather theoretical). Anyway, runtime table generation is quite fast, see below. > >> "l" must be banished -- but then one might as well do "i4i8i8". >> >> Designing a signature hash where you select between these at compile-time is >> perhaps doable but does generate a lot of code and makes everything >> complicated. > > It especially gets messy when you're trying to pre-compute tables. > >> I think we should just start off with hashing at module load >> time when sizes are known, and then work with heuristics and/or build system >> integration to improve on that afterwards. > > Finding 10,000 optimal tables at runtime better be really cheap than > for Sage's sake :). The code is highly unpolished as I write this, but it works so here's some preliminary benchmarks. Assuming the 64-bit pre-hashes are already computed, hashing a 64-slot table varies between 5 and 10 us (microseconds) depending on the set of hashes. Computing md5's with C code from ulib (not hashlib/OpenSSL) takes ~400ns per hash, so 26 us for the 64-slot table => it dominates! The crapwow64 hash takes ~10-20 ns, for ~1 us per 64-slot table. Admittedly, that's with hand-written non-portable assembly for the crapwow64. 
Assuming 10 000 64-slot tables we're looking at something like 0.3-0.4 seconds for loading Sage using md5, or 0.1 seconds using crapwow64. https://github.com/dagss/pyextensibletype/blob/master/include/perfecthash.h http://www.team5150.com/~andrew/noncryptohashzoo/CrapWow64.html Dag From d.s.seljebotn at astro.uio.no Tue Jun 12 19:21:48 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Tue, 12 Jun 2012 19:21:48 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: <4FD72199.7010803@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> <4FCFC088.3000709@astro.uio.no> <4FCFC441.40703@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> <4FD0808B.5080300@astro.uio.no> <4FD083F9.2030006@astro.uio.no> <4FD26ADA.5060401@astro.uio.no> <4FD2E313.6040208@astro.uio.no> <6c423841-b888-478d-8b89-148f3e9bd60e@email.android.com> <4FD45424.9040909@astro.uio.no> <4FD45E31.8060506@astro.uio.no> <4FD72199.7010803@astro.uio.no> Message-ID: <4FD77AAC.6080905@astro.uio.no> On 06/12/2012 01:01 PM, Dag Sverre Seljebotn wrote: > On 06/10/2012 11:53 AM, Robert Bradshaw wrote: >> On Sun, Jun 10, 2012 at 1:43 AM, Dag Sverre Seljebotn >>> About signatures, a problem I see with following the C typing is that >>> the >>> signature "ill" wouldn't hash the same as "iii" on 32-bit Windows and >>> "iqq" >>> on 32-bit Linux, and so on. I think that would be really bad. >> >> This is why I suggested promotion for scalars (divide ints into >> <=sizeof(long) and sizeof(long)< x<= sizeof(long long)), checked at >> C compile time, though I guess you consider that evil. I don't >> consider not matching really bad, just kind of bad. > > Right. At least a convention for promotion of scalars would be good anyway. > > Even MSVC supports stdint.h these days; basing ourselves on the random > behaviour of "long" seems a bit outdated to me. "ssize_t" would be > better motivated I feel. 
> > Many linear algebra libraries use 32-bit matrix indices by default, but > can be swapped to 64-bit indices (this holds for many LAPACK > implementations and most sparse linear algebra). So often there will at > least be one typedef that is either 32 bits or 64 bits without the > Cython compiler knowing. > > Promoting to a single type "[u]int64" is the only one that removes > possible combinatorial explosion if you have multiple external typedefs > that you don't know the size of (although I guess that's rather > theoretical). > > Anyway, runtime table generation is quite fast, see below. > >> >>> "l" must be banished -- but then one might as well do "i4i8i8". >>> >>> Designing a signature hash where you select between these at >>> compile-time is >>> perhaps doable but does generate a lot of code and makes everything >>> complicated. >> >> It especially gets messy when you're trying to pre-compute tables. >> >>> I think we should just start off with hashing at module load >>> time when sizes are known, and then work with heuristics and/or build >>> system >>> integration to improve on that afterwards. >> >> Finding 10,000 optimal tables at runtime better be really cheap than >> for Sage's sake :). > > The code is highly unpolished as I write this, but it works so here's > some preliminary benchmarks. > > Assuming the 64-bit pre-hashes are already computed, hashing a 64-slot > table varies between 5 and 10 us (microseconds) depending on the set of > hashes. > > Computing md5's with C code from ulib (not hashlib/OpenSSL) takes ~400ns > per hash, so 26 us for the 64-slot table => it dominates! > > The crapwow64 hash takes ~10-20 ns, for ~1 us per 64-slot table. > Admittedly, that's with hand-written non-portable assembly for the > crapwow64. > > Assuming 10 000 64-slot tables we're looking at something like 0.3-0.4 > seconds for loading Sage using md5, or 0.1 seconds using crapwow64. 
>
> https://github.com/dagss/pyextensibletype/blob/master/include/perfecthash.h
>
> http://www.team5150.com/~andrew/noncryptohashzoo/CrapWow64.html

Look: A big advantage of the hash-vtables is that subclasses stay
ABI-compatible with superclasses, and don't need recompilation when a
superclass adds or removes methods.

=> Finding the hash table must happen at run-time in a lot of cases
anyway, so I feel Robert's chase for compile-time table building is
moot.

I feel this would also need to automatically fall back to
heap-allocated tables if the statically allocated one can't be used.
Which is good to have in the very few cases where a perfect table can't
be found too.

One thing that makes me feel uneasy about the relatively unexplored
crapwow64 is that we really don't want collisions in the 64-bit
prehashes within a single table (which would raise an exception --
which I think is OK from a security perspective; you can always have a
MemoryError at any point too, so programmers should not expose class
creation to attackers without being able to deal with it failing).

For the record, I found another md5 implementation that's a bit faster;
the first one here is "sphlib" and the second is "ulib":

In [2]: %timeit extensibletype.extensibletype.md5bench2(10**3)
1000 loops, best of 3: 237 us per loop

In [3]: %timeit extensibletype.extensibletype.md5bench(10**3)
1000 loops, best of 3: 374 us per loop

http://www.saphir2.com/sphlib/

It's really only for extremely large projects like Sage where this can
be noticed in any way.
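Dag's proposal to hash at module load time, when sizes are known, amounts to normalizing the signature string once at import before any hashing. A sketch of that step — the one-letter codes and the "i4"/"i8" spellings follow the thread's informal notation, and `struct.calcsize` stands in for C's `sizeof`:

```python
import struct

# Native C sizes, known once the module is loaded (sizeof in C).
WIDTH = {code: struct.calcsize(code) for code in "hilq"}

def normalize_sig(sig: str) -> str:
    """Rewrite platform-dependent integer codes into fixed-width ones
    ('i4', 'i8', ...), so that e.g. 'ill' and its fixed-width
    equivalent hash identically on any one platform."""
    return "".join("i%d" % WIDTH[c] if c in WIDTH else c for c in sig)
```

On LP64 Linux `normalize_sig("ill")` yields `"i4i8i8"`, while on 32-bit Windows it yields `"i4i4i4"` — exactly the load-time resolution of `"l"` discussed in the thread, with the table then built from the normalized string.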
Dag From robertwb at gmail.com Tue Jun 12 20:12:02 2012 From: robertwb at gmail.com (Robert Bradshaw) Date: Tue, 12 Jun 2012 11:12:02 -0700 Subject: [Cython] Hash-based vtables In-Reply-To: <4FD77AAC.6080905@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> <4FCFC088.3000709@astro.uio.no> <4FCFC441.40703@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> <4FD0808B.5080300@astro.uio.no> <4FD083F9.2030006@astro.uio.no> <4FD26ADA.5060401@astro.uio.no> <4FD2E313.6040208@astro.uio.no> <6c423841-b888-478d-8b89-148f3e9bd60e@email.android.com> <4FD45424.9040909@astro.uio.no> <4FD45E31.8060506@astro.uio.no> <4FD72199.7010803@astro.uio.no> <4FD77AAC.6080905@astro.uio.no> Message-ID: On Tue, Jun 12, 2012 at 10:21 AM, Dag Sverre Seljebotn wrote: > On 06/12/2012 01:01 PM, Dag Sverre Seljebotn wrote: >> >> On 06/10/2012 11:53 AM, Robert Bradshaw wrote: >>> >>> On Sun, Jun 10, 2012 at 1:43 AM, Dag Sverre Seljebotn >>>> >>>> About signatures, a problem I see with following the C typing is that >>>> the >>>> signature "ill" wouldn't hash the same as "iii" on 32-bit Windows and >>>> "iqq" >>>> on 32-bit Linux, and so on. I think that would be really bad. >>> >>> >>> This is why I suggested promotion for scalars (divide ints into >>> <=sizeof(long) and sizeof(long)< x<= sizeof(long long)), checked at >>> C compile time, though I guess you consider that evil. I don't >>> consider not matching really bad, just kind of bad. >> >> >> Right. At least a convention for promotion of scalars would be good >> anyway. >> >> Even MSVC supports stdint.h these days; basing ourselves on the random >> behaviour of "long" seems a bit outdated to me. "ssize_t" would be >> better motivated I feel. >> >> Many linear algebra libraries use 32-bit matrix indices by default, but >> can be swapped to 64-bit indices (this holds for many LAPACK >> implementations and most sparse linear algebra). 
So often there will at >> least be one typedef that is either 32 bits or 64 bits without the >> Cython compiler knowing. >> >> Promoting to a single type "[u]int64" is the only one that removes >> possible combinatorial explosion if you have multiple external typedefs >> that you don't know the size of (although I guess that's rather >> theoretical). >> >> Anyway, runtime table generation is quite fast, see below. >> >>> >>>> "l" must be banished -- but then one might as well do "i4i8i8". >>>> >>>> Designing a signature hash where you select between these at >>>> compile-time is >>>> perhaps doable but does generate a lot of code and makes everything >>>> complicated. >>> >>> >>> It especially gets messy when you're trying to pre-compute tables. >>> >>>> I think we should just start off with hashing at module load >>>> time when sizes are known, and then work with heuristics and/or build >>>> system >>>> integration to improve on that afterwards. >>> >>> >>> Finding 10,000 optimal tables at runtime better be really cheap than >>> for Sage's sake :). >> >> >> The code is highly unpolished as I write this, but it works so here's >> some preliminary benchmarks. >> >> Assuming the 64-bit pre-hashes are already computed, hashing a 64-slot >> table varies between 5 and 10 us (microseconds) depending on the set of >> hashes. >> >> Computing md5's with C code from ulib (not hashlib/OpenSSL) takes ~400ns >> per hash, so 26 us for the 64-slot table => it dominates! >> >> The crapwow64 hash takes ~10-20 ns, for ~1 us per 64-slot table. >> Admittedly, that's with hand-written non-portable assembly for the >> crapwow64. >> >> Assuming 10 000 64-slot tables we're looking at something like 0.3-0.4 >> seconds for loading Sage using md5, or 0.1 seconds using crapwow64. 
>> >> https://github.com/dagss/pyextensibletype/blob/master/include/perfecthash.h >> >> http://www.team5150.com/~andrew/noncryptohashzoo/CrapWow64.html > > Look: A big advantage of the hash-vtables is that subclasses stay > ABI-compatible with superclasses, and don't need recompilation when > superclasses adds or removes methods. > > => Finding the hash table must happen at run-time in a lot of cases anyway, > so I feel Robert's chase for a compile-time table building is moot. > > I feel this would also need to trigger automatically heap-allocated tables > if the statically allocated. Which is good to have in the very few cases > where a perfect table can't be found too. Finding the hash table at runtime should be supported, but the *vast* majority of method sets is known at compile time. 0.4 seconds is a huge overhead to just add to Sage (yes, it's an exception, but an important one), and though crapwow64 helps I'd rather rely on a known, good standard hash. I need to actually look at Sage to see what the impact would be. Also, most tables would probably have 2 entries in them (e.g. a typed one and an all-object one). long int will continue to be an important type as long as it's the default for int literals and Python's "fast" ints (whether in type or implementation), so we can't just move to stdint. I also don't like that the form of the table (and whether certain signatures match) is platform-dependent: the less variance we have from one platform to the next, the better. On an orthogonal note, sizeof(long)-sensitive tables need not be entirely at odds with compile-time table compilation, as most functions will probably have 0 or 1 parameters that are of unknown size, so we could spit out 1 or 2 statically compiled tables and generate the rest on the fly. I still would rather have fixed Cython-compile time tables though. 
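For concreteness, the fixed-width encoding floated above ("i4i8i8"-style codes instead of platform-dependent letters like "l") could look roughly like the following. This is a minimal Python sketch; the codes and helper names are hypothetical, not part of any agreed spec, and the widths would come from sizeof() checks at C compile time:

```python
def promote_int(width_in_bytes):
    """Bucket an integer parameter by its byte width (known at C compile
    time via sizeof) instead of by its C type letter, so a typedef that is
    32-bit on one platform and 64-bit on another still gets a stable code
    on each platform."""
    if width_in_bytes <= 4:
        return "i4"
    if width_in_bytes <= 8:
        return "i8"
    raise ValueError("unsupported integer width: %d" % width_in_bytes)

def encode_signature(widths):
    """Concatenate the promoted codes of all integer arguments."""
    return "".join(promote_int(w) for w in widths)

# "ill" where long is 32-bit has widths (4, 4, 4);
# the same declaration where long is 64-bit has widths (4, 8, 8), i.e. "iqq".
assert encode_signature([4, 4, 4]) == "i4i4i4"
assert encode_signature([4, 8, 8]) == "i4i8i8"
```

Under such a scheme the two platforms can still produce different strings for a varying typedef, but each string depends only on the actual widths, never on whether the type was spelled long or long long.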
> One thing that makes me feel uneasy about the relatively > unexplored crapwow64 is that we really don't want collisions in the 64-bit > prehashes within a single table (which would raise an exception -- which I > think is OK from a security perspective, you can always have a MemoryError > at any point too, so programmers should not expose class creation to > attackers without being able to deal with it failing). > > For the record, I found another md5 implementation that's a bit faster; > first one is "sphlib" and second is "ulib": > > In [2]: %timeit extensibletype.extensibletype.md5bench2(10**3) > 1000 loops, best of 3: 237 us per loop > > In [3]: %timeit extensibletype.extensibletype.md5bench(10**3) > 1000 loops, best of 3: 374 us per loop > > http://www.saphir2.com/sphlib/ > > It's really only for extremely large projects like Sage where this can be > noticed in any way. > > > Dag > _______________________________________________ > cython-devel mailing list > cython-devel at python.org > http://mail.python.org/mailman/listinfo/cython-devel From stefan_ml at behnel.de Tue Jun 12 16:13:03 2012 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 12 Jun 2012 16:13:03 +0200 Subject: [Cython] "__pyx_dynamic_args" undeclared in fused types code Message-ID: <4FD74E6F.1070001@behnel.de> Hi, after the merge of the "_fused_dispatch_rebased" branch, I get C compile errors in a simple fused types example:

"""
from cython cimport integral

# define a fused type for different containers
ctypedef fused container:
    list
    tuple
    object

# define a generic function using the above types
cpdef sum(container items, integral start = 0):
    cdef integral item, result
    result = start
    for item in items:
        result += item
    return result

def test():
    cdef int x = 1, y = 2

    # call [list,int] specialisation implicitly
    print( sum([1,2,3,4], x) )

    # calls [object,long] specialisation explicitly
    print( sum[object,long]([1,2,3,4], y) )
"""

The C compiler complains that "__pyx_dynamic_args" is 
undeclared - supposedly something should have been passed into the function but wasn't. Mark, could you take a look? Stefan From d.s.seljebotn at astro.uio.no Tue Jun 12 21:46:22 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Tue, 12 Jun 2012 21:46:22 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: References: <4FCD100B.7000008@astro.uio.no> <4FCFC441.40703@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> <4FD0808B.5080300@astro.uio.no> <4FD083F9.2030006@astro.uio.no> <4FD26ADA.5060401@astro.uio.no> <4FD2E313.6040208@astro.uio.no> <6c423841-b888-478d-8b89-148f3e9bd60e@email.android.com> <4FD45424.9040909@astro.uio.no> <4FD45E31.8060506@astro.uio.no> <4FD72199.7010803@astro.uio.no> <4FD77AAC.6080905@astro.uio.no> Message-ID: <4FD79C8E.9030009@astro.uio.no> On 06/12/2012 08:12 PM, Robert Bradshaw wrote: > On Tue, Jun 12, 2012 at 10:21 AM, Dag Sverre Seljebotn > wrote: >> On 06/12/2012 01:01 PM, Dag Sverre Seljebotn wrote: >>> >>> On 06/10/2012 11:53 AM, Robert Bradshaw wrote: >>>> >>>> On Sun, Jun 10, 2012 at 1:43 AM, Dag Sverre Seljebotn >>>>> >>>>> About signatures, a problem I see with following the C typing is that >>>>> the >>>>> signature "ill" wouldn't hash the same as "iii" on 32-bit Windows and >>>>> "iqq" >>>>> on 32-bit Linux, and so on. I think that would be really bad. >>>> >>>> >>>> This is why I suggested promotion for scalars (divide ints into >>>> <=sizeof(long) and sizeof(long)< x<= sizeof(long long)), checked at >>>> C compile time, though I guess you consider that evil. I don't >>>> consider not matching really bad, just kind of bad. >>> >>> >>> Right. At least a convention for promotion of scalars would be good >>> anyway. >>> >>> Even MSVC supports stdint.h these days; basing ourselves on the random >>> behaviour of "long" seems a bit outdated to me. "ssize_t" would be >>> better motivated I feel. 
>>> >>> Many linear algebra libraries use 32-bit matrix indices by default, but >>> can be swapped to 64-bit indices (this holds for many LAPACK >>> implementations and most sparse linear algebra). So often there will at >>> least be one typedef that is either 32 bits or 64 bits without the >>> Cython compiler knowing. >>> >>> Promoting to a single type "[u]int64" is the only one that removes >>> possible combinatorial explosion if you have multiple external typedefs >>> that you don't know the size of (although I guess that's rather >>> theoretical). >>> >>> Anyway, runtime table generation is quite fast, see below. >>> >>>> >>>>> "l" must be banished -- but then one might as well do "i4i8i8". >>>>> >>>>> Designing a signature hash where you select between these at >>>>> compile-time is >>>>> perhaps doable but does generate a lot of code and makes everything >>>>> complicated. >>>> >>>> >>>> It especially gets messy when you're trying to pre-compute tables. >>>> >>>>> I think we should just start off with hashing at module load >>>>> time when sizes are known, and then work with heuristics and/or build >>>>> system >>>>> integration to improve on that afterwards. >>>> >>>> >>>> Finding 10,000 optimal tables at runtime better be really cheap than >>>> for Sage's sake :). >>> >>> >>> The code is highly unpolished as I write this, but it works so here's >>> some preliminary benchmarks. >>> >>> Assuming the 64-bit pre-hashes are already computed, hashing a 64-slot >>> table varies between 5 and 10 us (microseconds) depending on the set of >>> hashes. >>> >>> Computing md5's with C code from ulib (not hashlib/OpenSSL) takes ~400ns >>> per hash, so 26 us for the 64-slot table => it dominates! >>> >>> The crapwow64 hash takes ~10-20 ns, for ~1 us per 64-slot table. >>> Admittedly, that's with hand-written non-portable assembly for the >>> crapwow64. 
>>> >>> Assuming 10 000 64-slot tables we're looking at something like 0.3-0.4 >>> seconds for loading Sage using md5, or 0.1 seconds using crapwow64. >>> >>> >>> https://github.com/dagss/pyextensibletype/blob/master/include/perfecthash.h >>> >>> http://www.team5150.com/~andrew/noncryptohashzoo/CrapWow64.html >> >> >> Look: A big advantage of the hash-vtables is that subclasses stay >> ABI-compatible with superclasses, and don't need recompilation when >> superclasses adds or removes methods. >> >> => Finding the hash table must happen at run-time in a lot of cases anyway, >> so I feel Robert's chase for a compile-time table building is moot. >> >> I feel this would also need to trigger automatically heap-allocated tables >> if the statically allocated. Which is good to have in the very few cases >> where a perfect table can't be found too. > > Finding the hash table at runtime should be supported, but the *vast* > majority of methods sets is known at compile time. 0.4 seconds is a > huge overhead to just add to Sage (yes, it's an exception, but an > important one), and though crapwow64 helps I'd rather rely on a known, > good standard hash. I need to actually look at Sage to see what the > impact would be. Also, most tables would probably have 2 entries in > them (e.g. a typed one and an all-object one). Hopefully 0.4 was a severe overestimate once one actually looks at this. What's loaded at startup -- is it the pyx files in sage/devel/sage? My count (just cloned from github.com/sagemath/sage): $ find -name '*.pyx' -exec grep 'cdef class' {} \; | wc -l 641 And I doubt that *all* of that is loaded at Sage startup, you need to do some manual importing for at least some of those classes? So it's probably closer to 0.01-0.02 seconds than 0.4 even with md5? About the *vast* majority of method sets being known: That may be the case for old code, but keep in mind that that situation might deteriorate. 
Once we have hash-based vtables, declaring methods of cdef classes in pxd files could become optional (and only be there to help callers, incl. subclasses, determine the signature). So any method that's only used in the superclass and is therefore not declared in the pxd file would consistently trigger a run-time build of the table of subclasses; the compile-time generated table would be useless then. (OTOH, as duck-typing becomes the norm, more cdef classes will be without superclasses...) > long int will continue to be an important type as long as it's the > default for int literals and Python's "fast" ints (whether in type or > implementation), so we can't just move to stdint. I also don't like > that the form of the table (and whether certain signatures match) > being platform-dependent: the less variance we have from one platform > to the next is better. Perhaps in Sage there's a lot of use of "long" and therefore this would make Sage code vary less between platforms. But for much NumPy-using code you'd typically use int32 or int64, and since long is 32 bits on 32-bit Windows and 64 bits on Linux/Mac, choosing long sort of maximises inter-platform variation of signatures... > On an orthogonal note, sizeof(long)-sensitive tables need not be > entirely at odds with compile-time table compilation, as most > functions will probably have 0 or 1 parameters that are of unknown > size, so we could spit out 1 or 2 statically compiled tables and do > generate the rest on the fly. I still would rather have fixed > Cython-compile time tables though. Well, I'd "rather have" that as well if it worked every time. But there's no use designing a feature which works great unless you use the fftw_complex type (can be 64 or 128 bits). Or works great unless you use 64-bit LAPACK. Or works great unless you have a superclass with a partially defined pxd file. 
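As a rough illustration of the run-time table generation being debated here, the following is a toy Python sketch, assuming 64-bit md5-based pre-hashes as in the benchmarks above. The slot formula and the parameter search are simplified stand-ins for what the perfecthash.h code linked above actually does:

```python
import hashlib

def prehash(signature):
    # Stand-in pre-hash: first 8 bytes of md5(signature) as a uint64.
    return int.from_bytes(hashlib.md5(signature.encode()).digest()[:8], "big")

def find_table(signatures):
    """Search for an odd multiplier m such that
    slot(h) = ((h * m) mod 2**64) >> 58 is collision-free over all
    pre-hashes, then place each signature in its slot of a 64-entry
    table: the 'find a perfect table at module load time' step, in toy form."""
    assert len(signatures) <= 64
    hashes = [prehash(s) for s in signatures]
    for m in range(1, 1 << 16, 2):  # odd multipliers only
        slots = [((h * m) & 0xFFFFFFFFFFFFFFFF) >> 58 for h in hashes]
        if len(set(slots)) == len(slots):  # perfect: no two share a slot
            table = [None] * 64
            for sig, slot in zip(signatures, slots):
                table[slot] = sig
            return m, table
    raise RuntimeError("no perfect multiplier found; fall back to a bigger table")
```

A caller then recomputes slot(prehash(sig)) with the stored m and compares the signature found in that slot: one multiply, one shift, one comparison.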
Since one implementation of a concept is simpler than two, then as long as run-time generation code must always be there (or at least, be there in the common cases x, y, and z), the reasons should be very good for adding a compile-time implementation. Sage taking 0.4 seconds extra would indeed be a very good reason, but I don't believe it. So when you can get around to it it'd be great to have the actual number of classes (and ideally an estimate for number of methods per class). Dag From d.s.seljebotn at astro.uio.no Tue Jun 12 22:00:35 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Tue, 12 Jun 2012 22:00:35 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: <4FD79C8E.9030009@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> <4FD0808B.5080300@astro.uio.no> <4FD083F9.2030006@astro.uio.no> <4FD26ADA.5060401@astro.uio.no> <4FD2E313.6040208@astro.uio.no> <6c423841-b888-478d-8b89-148f3e9bd60e@email.android.com> <4FD45424.9040909@astro.uio.no> <4FD45E31.8060506@astro.uio.no> <4FD72199.7010803@astro.uio.no> <4FD77AAC.6080905@astro.uio.no> <4FD79C8E.9030009@astro.uio.no> Message-ID: <4FD79FE3.4000102@astro.uio.no> On 06/12/2012 09:46 PM, Dag Sverre Seljebotn wrote: > On 06/12/2012 08:12 PM, Robert Bradshaw wrote: >> On Tue, Jun 12, 2012 at 10:21 AM, Dag Sverre Seljebotn >> wrote: >>> On 06/12/2012 01:01 PM, Dag Sverre Seljebotn wrote: >>>> >>>> On 06/10/2012 11:53 AM, Robert Bradshaw wrote: >>>>> >>>>> On Sun, Jun 10, 2012 at 1:43 AM, Dag Sverre Seljebotn >>>>>> >>>>>> About signatures, a problem I see with following the C typing is that >>>>>> the >>>>>> signature "ill" wouldn't hash the same as "iii" on 32-bit Windows and >>>>>> "iqq" >>>>>> on 32-bit Linux, and so on. I think that would be really bad. 
>>>>> >>>>> >>>>> This is why I suggested promotion for scalars (divide ints into >>>>> <=sizeof(long) and sizeof(long)< x<= sizeof(long long)), checked at >>>>> C compile time, though I guess you consider that evil. I don't >>>>> consider not matching really bad, just kind of bad. >>>> >>>> >>>> Right. At least a convention for promotion of scalars would be good >>>> anyway. >>>> >>>> Even MSVC supports stdint.h these days; basing ourselves on the random >>>> behaviour of "long" seems a bit outdated to me. "ssize_t" would be >>>> better motivated I feel. >>>> >>>> Many linear algebra libraries use 32-bit matrix indices by default, but >>>> can be swapped to 64-bit indices (this holds for many LAPACK >>>> implementations and most sparse linear algebra). So often there will at >>>> least be one typedef that is either 32 bits or 64 bits without the >>>> Cython compiler knowing. >>>> >>>> Promoting to a single type "[u]int64" is the only one that removes >>>> possible combinatorial explosion if you have multiple external typedefs >>>> that you don't know the size of (although I guess that's rather >>>> theoretical). >>>> >>>> Anyway, runtime table generation is quite fast, see below. >>>> >>>>> >>>>>> "l" must be banished -- but then one might as well do "i4i8i8". >>>>>> >>>>>> Designing a signature hash where you select between these at >>>>>> compile-time is >>>>>> perhaps doable but does generate a lot of code and makes everything >>>>>> complicated. >>>>> >>>>> >>>>> It especially gets messy when you're trying to pre-compute tables. >>>>> >>>>>> I think we should just start off with hashing at module load >>>>>> time when sizes are known, and then work with heuristics and/or build >>>>>> system >>>>>> integration to improve on that afterwards. >>>>> >>>>> >>>>> Finding 10,000 optimal tables at runtime better be really cheap than >>>>> for Sage's sake :). 
>>>> >>>> >>>> The code is highly unpolished as I write this, but it works so here's >>>> some preliminary benchmarks. >>>> >>>> Assuming the 64-bit pre-hashes are already computed, hashing a 64-slot >>>> table varies between 5 and 10 us (microseconds) depending on the set of >>>> hashes. >>>> >>>> Computing md5's with C code from ulib (not hashlib/OpenSSL) takes >>>> ~400ns >>>> per hash, so 26 us for the 64-slot table => it dominates! >>>> >>>> The crapwow64 hash takes ~10-20 ns, for ~1 us per 64-slot table. >>>> Admittedly, that's with hand-written non-portable assembly for the >>>> crapwow64. >>>> >>>> Assuming 10 000 64-slot tables we're looking at something like 0.3-0.4 >>>> seconds for loading Sage using md5, or 0.1 seconds using crapwow64. >>>> >>>> >>>> https://github.com/dagss/pyextensibletype/blob/master/include/perfecthash.h >>>> >>>> >>>> http://www.team5150.com/~andrew/noncryptohashzoo/CrapWow64.html >>> >>> >>> Look: A big advantage of the hash-vtables is that subclasses stay >>> ABI-compatible with superclasses, and don't need recompilation when >>> superclasses adds or removes methods. >>> >>> => Finding the hash table must happen at run-time in a lot of cases >>> anyway, >>> so I feel Robert's chase for a compile-time table building is moot. >>> >>> I feel this would also need to trigger automatically heap-allocated >>> tables >>> if the statically allocated. Which is good to have in the very few cases >>> where a perfect table can't be found too. >> >> Finding the hash table at runtime should be supported, but the *vast* >> majority of methods sets is known at compile time. 0.4 seconds is a >> huge overhead to just add to Sage (yes, it's an exception, but an >> important one), and though crapwow64 helps I'd rather rely on a known, >> good standard hash. I need to actually look at Sage to see what the >> impact would be. Also, most tables would probably have 2 entries in >> them (e.g. a typed one and an all-object one). 
> > Hopefully 0.4 was a severe overestimate once one actually looks at this. > > What's loaded at startup -- is it the pyx files in sage/devel/sage? My > count (just cloned from github.com/sagemath/sage): > > $ find -name '*.pyx' -exec grep 'cdef class' {} \; | wc -l > 641 > > And I doubt that *all* of that is loaded at Sage startup, you need to do > some manual importing for at least some of those classes? So it's > probably closer to 0.01-0.02 seconds than 0.4 even with md5? > > About the *vast* majority of method sets being known: That may be the > case for old code, but keep in mind that that situation might > deteriorate. Once we have hash-based vtables, declaring methods of cdef > classes in pxd files could become optional (and only be there to help > callers, incl. subclasses, determine the signature). So any method > that's only used in the superclass and is therefore not declared in the > pxd file would consistently trigger a run-time build of the table of > subclasses; the compile-time generated table would be useless then. > > (OTOH, as duck-typing becomes the norm, more cdef classes will be > without superclasses...) > >> long int will continue to be an important type as long as it's the >> default for int literals and Python's "fast" ints (whether in type or >> implementation), so we can't just move to stdint. I also don't like >> that the form of the table (and whether certain signatures match) >> being platform-dependent: the less variance we have from one platform >> to the next is better. > > Perhaps in Sage there's a lot of use of "long" and therefore this would > make Sage code vary less between platforms. > > But for much NumPy-using code you'd typically use int32 or int64, and > since long is 32 bits on 32-bit Windows and 64 bits on Linux/Mac, > choosing long sort of maximises inter-platform variation of signatures... > Also, promotion can't be used for pointers, buffers, ndarray dtypes... 
I don't mind heuristics that work in 99.9% of the cases. Heuristics that work in 80% of the cases seem more like a time drain though. But if there's indeed a problem with Sage load times, and a particular set of heuristics allows us to overcome what is otherwise a blocker for attaching these tables to cdef classes, then sure. Dag >> On an orthogonal note, sizeof(long)-sensitive tables need not be >> entirely at odds with compile-time table compilation, as most >> functions will probably have 0 or 1 parameters that are of unknown >> size, so we could spit out 1 or 2 statically compiled tables and do >> generate the rest on the fly. I still would rather have fixed >> Cython-compile time tables though. > > Well, I'd "rather have" that as well if it worked every time. > > But there's no use designing a feature which works great unless you use > the fftw_complex type (can be 64 or 128 bits). Or works great unless you > use 64-bit LAPACK. Or works great unless you have a superclass with a > partially defined pxd file. > > Since one implementation of a concept is simpler than two, then as long > as run-time generation code must always be there (or at least, be there > in the common cases x, y, and z), the reasons should be very good for > adding a compile-time implementation. > > Sage taking 0.4 seconds extra would indeed be a very good reason, but I > don't believe it. So when you can get around to it it'd be great to have > the actual number of classes (and ideally an estimate for number of > methods per class). 
> > Dag > _______________________________________________ > cython-devel mailing list > cython-devel at python.org > http://mail.python.org/mailman/listinfo/cython-devel From markflorisson88 at gmail.com Wed Jun 13 17:26:05 2012 From: markflorisson88 at gmail.com (mark florisson) Date: Wed, 13 Jun 2012 16:26:05 +0100 Subject: [Cython] "__pyx_dynamic_args" undeclared in fused types code In-Reply-To: <4FD74E6F.1070001@behnel.de> References: <4FD74E6F.1070001@behnel.de> Message-ID: On Jun 12, 2012 8:15 PM, "Stefan Behnel" wrote: > > Hi, > > after the merge of the "_fused_dispatch_rebased" branch, I get C compile > errors in a simple fused types example: > > """ > from cython cimport integral > > # define a fused type for different containers > ctypedef fused container: > list > tuple > object > > # define a generic function using the above types > cpdef sum(container items, integral start = 0): > cdef integral item, result > result = start > for item in items: > result += item > return result > > def test(): > cdef int x = 1, y = 2 > > # call [list,int] specialisation implicitly > print( sum([1,2,3,4], x) ) > > # calls [object,long] specialisation explicitly > print( sum[object,long]([1,2,3,4], y) ) > """ > > The C compiler complains that "__pyx_dynamic_args" is undeclared - > supposedly something should have been passed into the function but wasn't. > > Mark, could you take a look? > > Stefan > _______________________________________________ > cython-devel mailing list > cython-devel at python.org > http://mail.python.org/mailman/listinfo/cython-devel Thanks for pointing that out Stefan, I'll get that fixed for 0.17. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From stefan_ml at behnel.de Mon Jun 18 16:12:08 2012 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 18 Jun 2012 16:12:08 +0200 Subject: [Cython] new FFI library for Python Message-ID: <4FDF3738.9040006@behnel.de> Hi, the PyPy folks have come up with a new FFI library (called cffi) for CPython (and eventually PyPy, obviously). http://cffi.readthedocs.org/ It borrows from LuaJIT's FFI in that it parses C declarations at runtime. It then builds a C extension to access the external code, i.e. it requires a C compiler at runtime (when running in CPython). Just thought this might be interesting. Stefan From redbrain at gcc.gnu.org Mon Jun 18 17:26:09 2012 From: redbrain at gcc.gnu.org (Philip Herron) Date: Mon, 18 Jun 2012 16:26:09 +0100 Subject: [Cython] new FFI library for Python In-Reply-To: <4FDF3738.9040006@behnel.de> References: <4FDF3738.9040006@behnel.de> Message-ID: On 18 June 2012 15:12, Stefan Behnel wrote: > Hi, > > the PyPy folks have come up with a new FFI library (called cffi) for > CPython (and eventually PyPy, obviously). > > http://cffi.readthedocs.org/ > > It borrows from LuaJIT's FFI in that it parses C declarations at runtime. > It then builds a C extension to access the external code, i.e. it requires > a C compiler at runtime (when running in CPython). > > Just thought this might be interesting. 
> > Stefan > _______________________________________________ > cython-devel mailing list > cython-devel at python.org > http://mail.python.org/mailman/listinfo/cython-devel I have been using libffi in my gccpy runtime; I wonder why they decided to make a new one and not use libffi. --Phil From stefan_ml at behnel.de Mon Jun 18 18:39:19 2012 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 18 Jun 2012 18:39:19 +0200 Subject: [Cython] new FFI library for Python In-Reply-To: References: <4FDF3738.9040006@behnel.de> Message-ID: <4FDF59B7.4030509@behnel.de> Philip Herron, 18.06.2012 17:26: > On 18 June 2012 15:12, Stefan Behnel wrote: >> the PyPy folks have come up with a new FFI library (called cffi) for >> CPython (and eventually PyPy, obviously). >> >> http://cffi.readthedocs.org/ >> >> It borrows from LuaJIT's FFI in that it parses C declarations at runtime. >> It then builds a C extension to access the external code, i.e. it requires >> a C compiler at runtime (when running in CPython). >> >> Just thought this might be interesting. > > I have been using libffi in my gccpy runtime wonder why they decided > to make a new one and not use libffi Isn't libffi RPython? That's enough of a reason, I'd say. Stefan From stefan_ml at behnel.de Mon Jun 18 21:46:37 2012 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 18 Jun 2012 21:46:37 +0200 Subject: [Cython] new FFI library for Python In-Reply-To: <4FDF3738.9040006@behnel.de> References: <4FDF3738.9040006@behnel.de> Message-ID: <4FDF859D.2090008@behnel.de> Stefan Behnel, 18.06.2012 16:12: > the PyPy folks have come up with a new FFI library (called cffi) for > CPython (and eventually PyPy, obviously). > > http://cffi.readthedocs.org/ > > It borrows from LuaJIT's FFI in that it parses C declarations at runtime. > It then builds a C extension to access the external code, i.e. it requires > a C compiler at runtime (when running in CPython). > > Just thought this might be interesting. 
The code is here, BTW: https://bitbucket.org/cffi/cffi/ One interesting feature is that they seem to support different backends. There's apparently one for libffi and one for ctypes so far. Another one based on Cython would be cool. Even the existing ffi backend implementation would have looked better in Cython; it's currently some 3000 lines of C code. And Cython could certainly benefit from an ffi backend itself for a couple of tasks; this topic has come up before a couple of times. Stefan From sturla at molden.no Tue Jun 19 13:25:02 2012 From: sturla at molden.no (Sturla Molden) Date: Tue, 19 Jun 2012 13:25:02 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: <4FD79C8E.9030009@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> <4FD0808B.5080300@astro.uio.no> <4FD083F9.2030006@astro.uio.no> <4FD26ADA.5060401@astro.uio.no> <4FD2E313.6040208@astro.uio.no> <6c423841-b888-478d-8b89-148f3e9bd60e@email.android.com> <4FD45424.9040909@astro.uio.no> <4FD45E31.8060506@astro.uio.no> <4FD72199.7010803@astro.uio.no> <4FD77AAC.6080905@astro.uio.no> <4FD79C8E.9030009@astro.uio.no> Message-ID: <4FE0618E.5020009@molden.no> On 12.06.2012 21:46, Dag Sverre Seljebotn wrote: > But for much NumPy-using code you'd typically use int32 or int64, and > since long is 32 bits on 32-bit Windows and 64 bits on Linux/Mac, > choosing long sort of maximises inter-platform variation of signatures... The size of a long is compiler dependent, not OS dependent. Most C compilers for Windows use 32 bit long, also on 64-bit Windows for AMD64. The reason is that the AMD64 architecture natively uses a "64-bit pointer with a 32-bit offset". So indexing with a 64-bit offset could incur some extra overhead. (I don't know how much, if any at all.) On IA64 the C compilers for Windows use 64 bit long, because the native offset size is 64 bit. The C standard specifies that a long is "at least 32 bits". 
Any code that assumes a specific sizeof(long), or that a long is 64-bits, does not follow the C standard. Sturla From sturla at molden.no Tue Jun 19 13:58:53 2012 From: sturla at molden.no (Sturla Molden) Date: Tue, 19 Jun 2012 13:58:53 +0200 Subject: [Cython] new FFI library for Python In-Reply-To: <4FDF3738.9040006@behnel.de> References: <4FDF3738.9040006@behnel.de> Message-ID: <4FE0697D.3020306@molden.no> On 18.06.2012 16:12, Stefan Behnel wrote: > the PyPy folks have come up with a new FFI library (called cffi) for > CPython (and eventually PyPy, obviously). It looks like ctypes albeit with a smaller API. (C definitions as text strings instead of Python objects.) Sometimes I think Python and an FFI would always suffice. But in practice Cython's __dealloc__ can be indispensable, as opposed to a Python __del__ method which can be unreliable. And Python's module loader mostly takes care of the common problem of DLL hell. With an FFI like ctypes or cffi, we don't have the RAII-like cleanup that __dealloc__ provides, and loading the DLLs suffers from all the nastiness of DLL hell. Sturla From robertwb at gmail.com Tue Jun 19 21:01:55 2012 From: robertwb at gmail.com (Robert Bradshaw) Date: Tue, 19 Jun 2012 12:01:55 -0700 Subject: [Cython] new FFI library for Python In-Reply-To: <4FE0697D.3020306@molden.no> References: <4FDF3738.9040006@behnel.de> <4FE0697D.3020306@molden.no> Message-ID: On Tue, Jun 19, 2012 at 4:58 AM, Sturla Molden wrote: > On 18.06.2012 16:12, Stefan Behnel wrote: > >> the PyPy folks have come up with a new FFI library (called cffi) for >> CPython (and eventually PyPy, obviously). > > > It looks like ctypes albeit with a smaller API. (C definitions as text > strings instead of Python objects.) > > Sometimes I think Python and a ffi would always suffice. But in practice > Cython's __dealloc__ can be indispensible, as opposed to a Python __del__ > method which can be unreliable. 
> And Python's module loader mostly takes care > of the common problem of DLL hell. This also assumes you're always able and willing to use/write a library written in a lower-level language like C or Fortran to actually do your heavy lifting. Cython (ideally) allows you to write your actual number-crunching code without learning an (entirely) new language. - Robert From stefan_ml at behnel.de Thu Jun 21 10:59:39 2012 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 21 Jun 2012 10:59:39 +0200 Subject: [Cython] buffer shape incompatible with memoryview shape Message-ID: <4FE2E27B.8010102@behnel.de> Hi, I find this worth fixing for 0.17: http://trac.cython.org/cython_trac/ticket/780 Stefan From markflorisson88 at gmail.com Thu Jun 21 12:38:11 2012 From: markflorisson88 at gmail.com (mark florisson) Date: Thu, 21 Jun 2012 11:38:11 +0100 Subject: [Cython] buffer shape incompatible with memoryview shape In-Reply-To: <4FE2E27B.8010102@behnel.de> References: <4FE2E27B.8010102@behnel.de> Message-ID: On 21 June 2012 09:59, Stefan Behnel wrote: > Hi, > > I find this worth fixing for 0.17: > > http://trac.cython.org/cython_trac/ticket/780 > > Stefan > _______________________________________________ > cython-devel mailing list > cython-devel at python.org > http://mail.python.org/mailman/listinfo/cython-devel It seems that arrays are compared as pointer values, so it doesn't even compare sensibly anyway. You can easily work around it by writing (<object> memoryview).shape though. I think these shape/strides/suboffset arrays should have a special type and coerce to tuples when coercing to an object. Feel free to work on that, it wouldn't really require touching much or any of the memoryview code, it's not really on my priority list right now. BTW, Stefan, how do we start Jenkins on the sage server? It's been down for weeks now. 
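mark's proposed coercion -- shape/strides/suboffset arrays comparing by value once they reach object context -- can be illustrated in plain Python, with a ctypes array standing in for the C-level Py_ssize_t shape[] (toy names, not Cython's actual implementation):

```python
import ctypes

def shape_as_tuple(c_shape):
    """Coerce a C Py_ssize_t[] array to a Python tuple, which compares
    by value -- the behaviour proposed above for object context."""
    return tuple(c_shape)

ShapeArray = ctypes.c_ssize_t * 2
a = ShapeArray(3, 4)  # shape of one 3x4 buffer
b = ShapeArray(3, 4)  # shape of another 3x4 buffer

# The raw arrays are distinct objects, which is all a pointer-level
# comparison of the two C arrays can see:
assert a is not b

# Coerced to tuples, the shapes compare by value, as users expect:
assert shape_as_tuple(a) == shape_as_tuple(b) == (3, 4)
```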
From stefan_ml at behnel.de Thu Jun 21 13:00:11 2012 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 21 Jun 2012 13:00:11 +0200 Subject: [Cython] Jenkins status In-Reply-To: References: <4FE2E27B.8010102@behnel.de> Message-ID: <4FE2FEBB.8050506@behnel.de> mark florisson, 21.06.2012 12:38: > BTW, Stefan, how do we start Jenkins on the sage server? It's been > down for weeks now. It seems like the sage.math server would be happy about a restart. I'll trigger the ML. Stefan From d.s.seljebotn at astro.uio.no Thu Jun 21 13:10:05 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Thu, 21 Jun 2012 13:10:05 +0200 Subject: [Cython] buffer shape incompatible with memoryview shape In-Reply-To: <4FE2E27B.8010102@behnel.de> References: <4FE2E27B.8010102@behnel.de> Message-ID: <4FE3010D.4060700@astro.uio.no> On 06/21/2012 10:59 AM, Stefan Behnel wrote: > Hi, > > I find this worth fixing for 0.17: > > http://trac.cython.org/cython_trac/ticket/780 > I'm not sure about the timeline here. The object<->memoryview semantics haven't even been hammered down yet; does "mview.customattr" trigger an AttributeError, SyntaxError or fall back to some underlying object (constructing it if necessary)? Until that happens, memoryviews are an experimental feature and present for development purposes mostly, so it's not like this is a big bug that would bite end-users. Thinking about those semantics is much more important... 
Dag From stefan_ml at behnel.de Thu Jun 21 13:36:01 2012 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 21 Jun 2012 13:36:01 +0200 Subject: [Cython] buffer shape incompatible with memoryview shape In-Reply-To: <4FE3010D.4060700@astro.uio.no> References: <4FE2E27B.8010102@behnel.de> <4FE3010D.4060700@astro.uio.no> Message-ID: <4FE30721.2070902@behnel.de> Dag Sverre Seljebotn, 21.06.2012 13:10: > On 06/21/2012 10:59 AM, Stefan Behnel wrote: >> I find this worth fixing for 0.17: >> >> http://trac.cython.org/cython_trac/ticket/780 > > I'm not sure about the timeline here. > > The object<->memoryview semantics haven't even been hammered down yet; does > "mview.customattr" trigger an AttributeError, SyntaxError or fall back to > some underlying object (constructing it if necesarry). > > Until that happens, memoryviews are an experimental feature and present for > development purposes mostly, so it's not like this is a big bug that would > bite end-users. Thinking about those semantics is much more important... Absolutely. I ran into this when I gave a Cython+NumPy course and this was the first thing that the attendants tried when I asked them to validate that two input arrays have the same size before adding them. It's the one obvious way to do it, and it fails miserably. I think it should be fixed, and I think it should be fixed soon because it feels really low-level and complicated, especially to new users. 
Stefan From d.s.seljebotn at astro.uio.no Thu Jun 21 14:05:30 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Thu, 21 Jun 2012 14:05:30 +0200 Subject: [Cython] buffer shape incompatible with memoryview shape In-Reply-To: <4FE30721.2070902@behnel.de> References: <4FE2E27B.8010102@behnel.de> <4FE3010D.4060700@astro.uio.no> <4FE30721.2070902@behnel.de> Message-ID: <4FE30E0A.8020003@astro.uio.no> On 06/21/2012 01:36 PM, Stefan Behnel wrote: > Dag Sverre Seljebotn, 21.06.2012 13:10: >> On 06/21/2012 10:59 AM, Stefan Behnel wrote: >>> I find this worth fixing for 0.17: >>> >>> http://trac.cython.org/cython_trac/ticket/780 >> >> I'm not sure about the timeline here. >> >> The object<->memoryview semantics haven't even been hammered down yet; does >> "mview.customattr" trigger an AttributeError, SyntaxError or fall back to >> some underlying object (constructing it if necesarry). >> >> Until that happens, memoryviews are an experimental feature and present for >> development purposes mostly, so it's not like this is a big bug that would >> bite end-users. Thinking about those semantics is much more important... > > Absolutely. > > I ran into this when I gave a Cython+NumPy course and this was the first > thing that the attendants tried when I asked them to validate that two > input arrays have the same size before adding them. It's the one obvious > way to do it, and it fails miserably. I think it should be fixed, and I > think it should be fixed soon because it feels really low-level and > complicated, especially to new users. Can you clarify a bit -- did you give this course using np.ndarray[double, ndim=2], or double[:, :]? They're really very separate under the hood and the fix is different. Or, did you actually use object[double, ndim=2] like in the bug report? (Did me and Mark get around to propose deprecating this one on the list?) 
Dag From stefan_ml at behnel.de Thu Jun 21 14:59:19 2012 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 21 Jun 2012 14:59:19 +0200 Subject: [Cython] buffer shape incompatible with memoryview shape In-Reply-To: <4FE30E0A.8020003@astro.uio.no> References: <4FE2E27B.8010102@behnel.de> <4FE3010D.4060700@astro.uio.no> <4FE30721.2070902@behnel.de> <4FE30E0A.8020003@astro.uio.no> Message-ID: <4FE31AA7.5040101@behnel.de> Dag Sverre Seljebotn, 21.06.2012 14:05: > On 06/21/2012 01:36 PM, Stefan Behnel wrote: >>> On 06/21/2012 10:59 AM, Stefan Behnel wrote: >>>> I find this worth fixing for 0.17: >>>> >>>> http://trac.cython.org/cython_trac/ticket/780 >>> >> I ran into this when I gave a Cython+NumPy course and this was the first >> thing that the attendants tried when I asked them to validate that two >> input arrays have the same size before adding them. It's the one obvious >> way to do it, and it fails miserably. I think it should be fixed, and I >> think it should be fixed soon because it feels really low-level and >> complicated, especially to new users. > > Can you clarify a bit -- did you give this course using np.ndarray[double, > ndim=2], or double[:, :]? They're really very separate under the hood and > the fix is different. > > Or, did you actually use object[double, ndim=2] like in the bug report? > (Did me and Mark get around to propose deprecating this one on the list?) IIRC, we used object[double, ndim=2] for both and I also tried it with a memory view as in the bug report. I thought that using "object" was the preferred way to do it? At least, it doesn't restrict the type of the buffer exporter, which I consider a good thing. 
Stefan From d.s.seljebotn at astro.uio.no Thu Jun 21 15:06:57 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Thu, 21 Jun 2012 15:06:57 +0200 Subject: [Cython] buffer shape incompatible with memoryview shape In-Reply-To: <4FE31AA7.5040101@behnel.de> References: <4FE2E27B.8010102@behnel.de> <4FE3010D.4060700@astro.uio.no> <4FE30721.2070902@behnel.de> <4FE30E0A.8020003@astro.uio.no> <4FE31AA7.5040101@behnel.de> Message-ID: <4FE31C71.8060904@astro.uio.no> On 06/21/2012 02:59 PM, Stefan Behnel wrote: > Dag Sverre Seljebotn, 21.06.2012 14:05: >> On 06/21/2012 01:36 PM, Stefan Behnel wrote: >>>> On 06/21/2012 10:59 AM, Stefan Behnel wrote: >>>>> I find this worth fixing for 0.17: >>>>> >>>>> http://trac.cython.org/cython_trac/ticket/780 >>>> >>> I ran into this when I gave a Cython+NumPy course and this was the first >>> thing that the attendants tried when I asked them to validate that two >>> input arrays have the same size before adding them. It's the one obvious >>> way to do it, and it fails miserably. I think it should be fixed, and I >>> think it should be fixed soon because it feels really low-level and >>> complicated, especially to new users. >> >> Can you clarify a bit -- did you give this course using np.ndarray[double, >> ndim=2], or double[:, :]? They're really very separate under the hood and >> the fix is different. >> >> Or, did you actually use object[double, ndim=2] like in the bug report? >> (Did me and Mark get around to propose deprecating this one on the list?) > > IIRC, we used object[double, ndim=2] for both and I also tried it with a > memory view as in the bug report. I thought that using "object" was the > preferred way to do it? At least, it doesn't restrict the type of the > buffer exporter, which I consider a good thing. That's a very theoretical argument as NumPy arrays are in practice the only exporter. I always teach np.ndarray[double...]. I've never told anyone about object[...], I don't think it's in much use. 
For starters it's going to be horribly inefficient unless you also add "mode='strided'" within the brackets. My proposal (and Mark's I think) is: Since the memoryviews will neatly cover the general exporter case, and since the [] syntax is much overloaded already (used for C++ templates too), we should deprecate object[...] no matter what else happens. Depending on what's decided for np.ndarray[...], we have: Case A): Deprecate both np.ndarray[...] and object[...] Case B): Only deprecate object[...], keep np.ndarray[...] (e.g., through a decorator used in numpy.pxd on the ndarray type). So rather than having a trailing [] mean buffers unless it means something else (like C++ templates), we instead make np.ndarray a "template", through special compiler support. Dag From stefan_ml at behnel.de Thu Jun 21 15:34:27 2012 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 21 Jun 2012 15:34:27 +0200 Subject: [Cython] buffer shape incompatible with memoryview shape In-Reply-To: <4FE31C71.8060904@astro.uio.no> References: <4FE2E27B.8010102@behnel.de> <4FE3010D.4060700@astro.uio.no> <4FE30721.2070902@behnel.de> <4FE30E0A.8020003@astro.uio.no> <4FE31AA7.5040101@behnel.de> <4FE31C71.8060904@astro.uio.no> Message-ID: <4FE322E3.40706@behnel.de> Dag Sverre Seljebotn, 21.06.2012 15:06: > On 06/21/2012 02:59 PM, Stefan Behnel wrote: >> Dag Sverre Seljebotn, 21.06.2012 14:05: >>> On 06/21/2012 01:36 PM, Stefan Behnel wrote: >>>>> On 06/21/2012 10:59 AM, Stefan Behnel wrote: >>>>>> I find this worth fixing for 0.17: >>>>>> >>>>>> http://trac.cython.org/cython_trac/ticket/780 >>>>> >>>> I ran into this when I gave a Cython+NumPy course and this was the first >>>> thing that the attendants tried when I asked them to validate that two >>>> input arrays have the same size before adding them. It's the one obvious >>>> way to do it, and it fails miserably. 
I think it should be fixed, and I >>>> think it should be fixed soon because it feels really low-level and >>>> complicated, especially to new users. >>> >>> Can you clarify a bit -- did you give this course using np.ndarray[double, >>> ndim=2], or double[:, :]? They're really very separate under the hood and >>> the fix is different. >>> >>> Or, did you actually use object[double, ndim=2] like in the bug report? >>> (Did me and Mark get around to propose deprecating this one on the list?) >> >> IIRC, we used object[double, ndim=2] for both and I also tried it with a >> memory view as in the bug report. I thought that using "object" was the >> preferred way to do it? At least, it doesn't restrict the type of the >> buffer exporter, which I consider a good thing. > > That's a very theoretical argument as NumPy arrays are in practice the only > exporter. Except for, say, bytes objects, array.array and user implemented types, that is. lxml has buffer support for its serialised XSLT output, for example. > I always teach np.ndarray[double...]. I've never told anyone about > object[...], I don't think it's in much use. For starters it's going to be > horribly inefficient unless you also add "mode='strided'" within the brackets. Ah, good to know. > My proposal (and Mark's I think) is: > > Since the memoryviews will neatly cover the general exporter case, and > since the [] syntax is much overloaded already (used for C++ templates > too), we should deprecate object[...] no matter what else happens. > > Depending on what's decided for np.ndarray[...], we have: > > Case A): Deprecate both np.ndarray[...] and object[...] > > Case B): Only deprecate object[...], keep np.ndarray[...] (e.g., through a > decorator used in numpy.pxd on the ndarray type). So rather than having a > trailing [] mean buffers unless it means something else (like C++ > templates), we instead make np.ndarray a "template", through special > compiler support. 
What's the point in technically deprecating them if we can't remove them without breaking code? Wouldn't it be better to deprecate them only in the docs? Stefan From markflorisson88 at gmail.com Thu Jun 21 16:24:11 2012 From: markflorisson88 at gmail.com (mark florisson) Date: Thu, 21 Jun 2012 15:24:11 +0100 Subject: [Cython] buffer shape incompatible with memoryview shape In-Reply-To: <4FE322E3.40706@behnel.de> References: <4FE2E27B.8010102@behnel.de> <4FE3010D.4060700@astro.uio.no> <4FE30721.2070902@behnel.de> <4FE30E0A.8020003@astro.uio.no> <4FE31AA7.5040101@behnel.de> <4FE31C71.8060904@astro.uio.no> <4FE322E3.40706@behnel.de> Message-ID: On 21 June 2012 14:34, Stefan Behnel wrote: > Dag Sverre Seljebotn, 21.06.2012 15:06: >> On 06/21/2012 02:59 PM, Stefan Behnel wrote: >>> Dag Sverre Seljebotn, 21.06.2012 14:05: >>>> On 06/21/2012 01:36 PM, Stefan Behnel wrote: >>>>>> On 06/21/2012 10:59 AM, Stefan Behnel wrote: >>>>>>> I find this worth fixing for 0.17: >>>>>>> >>>>>>> http://trac.cython.org/cython_trac/ticket/780 >>>>>> >>>>> I ran into this when I gave a Cython+NumPy course and this was the first >>>>> thing that the attendants tried when I asked them to validate that two >>>>> input arrays have the same size before adding them. It's the one obvious >>>>> way to do it, and it fails miserably. I think it should be fixed, and I >>>>> think it should be fixed soon because it feels really low-level and >>>>> complicated, especially to new users. >>>> >>>> Can you clarify a bit -- did you give this course using np.ndarray[double, >>>> ndim=2], or double[:, :]? They're really very separate under the hood and >>>> the fix is different. >>>> >>>> Or, did you actually use object[double, ndim=2] like in the bug report? >>>> (Did me and Mark get around to propose deprecating this one on the list?) >>> >>> IIRC, we used object[double, ndim=2] for both and I also tried it with a >>> memory view as in the bug report. 
I thought that using "object" was the >>> preferred way to do it? At least, it doesn't restrict the type of the >>> buffer exporter, which I consider a good thing. >> >> That's a very theoretical argument as NumPy arrays are in practice the only >> exporter. > > Except for, say, bytes objects, array.array and user implemented types, > that is. lxml has buffer support for its serialised XSLT output, for example. > You can already easily obtain a pointer from a bytes object, which is already 1d anyways :) Whether buffers on array.array are useful is still questionable given their variably-sized nature. >> I always teach np.ndarray[double...]. I've never told anyone about >> object[...], I don't think it's in much use. For starters it's going to be >> horribly inefficient unless you also add "mode='strided'" within the brackets. > > Ah, good to know. > > >> My proposal (and Mark's I think) is: >> >> Since the memoryviews will neatly cover the general exporter case, and >> since the [] syntax is much overloaded already (used for C++ templates >> too), we should deprecate object[...] no matter what else happens. > I agree with deprecating the object[] syntax, I think memoryviews should prove themselves a bit more for e.g. 0.17, before we start deprecating np.ndarray. >> Depending on what's decided for np.ndarray[...], we have: >> >> Case A): Deprecate both np.ndarray[...] and object[...] >> >> Case B): Only deprecate object[...], keep np.ndarray[...] (e.g., through a >> decorator used in numpy.pxd on the ndarray type). So rather than having a >> trailing [] mean buffers unless it means something else (like C++ >> templates), we instead make np.ndarray a "template", through special >> compiler support. > > What's the point in technically deprecating them if we can't remove them > without breaking code? Wouldn't it be better to deprecate them only in the > docs? 
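Mark's remark that a bytes object is "already 1d anyways" can be seen from plain Python, where the built-in memoryview accepts any buffer exporter, a bytes object included. A small illustration (not Cython code, just the same buffer protocol from the Python side):

```python
# A bytes object is itself a 1-D, read-only buffer exporter, so no special
# buffer machinery is needed to view its data.
data = b"hello"
view = memoryview(data)

assert view.ndim == 1
assert view.readonly
assert view.itemsize == 1
assert bytes(view[1:4]) == b"ell"  # slicing the view copies no data
```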
> > Stefan > _______________________________________________ > cython-devel mailing list > cython-devel at python.org > http://mail.python.org/mailman/listinfo/cython-devel From markflorisson88 at gmail.com Thu Jun 21 16:24:22 2012 From: markflorisson88 at gmail.com (mark florisson) Date: Thu, 21 Jun 2012 15:24:22 +0100 Subject: [Cython] buffer shape incompatible with memoryview shape In-Reply-To: <4FE30E0A.8020003@astro.uio.no> References: <4FE2E27B.8010102@behnel.de> <4FE3010D.4060700@astro.uio.no> <4FE30721.2070902@behnel.de> <4FE30E0A.8020003@astro.uio.no> Message-ID: On 21 June 2012 13:05, Dag Sverre Seljebotn wrote: > On 06/21/2012 01:36 PM, Stefan Behnel wrote: >> >> Dag Sverre Seljebotn, 21.06.2012 13:10: >>> >>> On 06/21/2012 10:59 AM, Stefan Behnel wrote: >>>> >>>> I find this worth fixing for 0.17: >>>> >>>> http://trac.cython.org/cython_trac/ticket/780 >>> >>> >>> I'm not sure about the timeline here. >>> >>> The object<->memoryview semantics haven't even been hammered down yet; >>> does >>> "mview.customattr" trigger an AttributeError, SyntaxError or fall back to >>> some underlying object (constructing it if necessary). >>> >>> Until that happens, memoryviews are an experimental feature and present >>> for >>> development purposes mostly, so it's not like this is a big bug that >>> would >>> bite end-users. Thinking about those semantics is much more important... >> >> >> Absolutely. >> >> I ran into this when I gave a Cython+NumPy course and this was the first >> thing that the attendants tried when I asked them to validate that two >> input arrays have the same size before adding them. It's the one obvious >> way to do it, and it fails miserably. I think it should be fixed, and I >> think it should be fixed soon because it feels really low-level and >> complicated, especially to new users. > > > Can you clarify a bit -- did you give this course using np.ndarray[double, > ndim=2], or double[:, :]? 
They're really very separate under the hood and > the fix is different. I think we should support both, although it seems a bit of a shame to fix something just a while before deprecating it :) Anyway, both fixes are really straightforward anyway. > Or, did you actually use object[double, ndim=2] like in the bug report? (Did > me and Mark get around to propose deprecating this one on the list?) > > Dag > > _______________________________________________ > cython-devel mailing list > cython-devel at python.org > http://mail.python.org/mailman/listinfo/cython-devel From stefan_ml at behnel.de Fri Jun 22 19:51:42 2012 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 22 Jun 2012 19:51:42 +0200 Subject: [Cython] Jenkins status In-Reply-To: <4FE2FEBB.8050506@behnel.de> References: <4FE2E27B.8010102@behnel.de> <4FE2FEBB.8050506@behnel.de> Message-ID: <4FE4B0AE.4090509@behnel.de> Stefan Behnel, 21.06.2012 13:00: > mark florisson, 21.06.2012 12:38: >> BTW, Stefan, how do we start Jenkins on the sage server? It's been >> down for weeks now. > > It seems like the sage.math server would be happy about a restart. I'll > trigger the ML. Jenkins is back up and building. Stefan From stefan_ml at behnel.de Fri Jun 22 22:30:17 2012 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 22 Jun 2012 22:30:17 +0200 Subject: [Cython] Test failures in Jenkins Message-ID: <4FE4D5D9.4090507@behnel.de> Hi, Jenkins found a couple of test failures. I haven't looked through them yet, but if anything looks familiar or obvious to someone, please go ahead and fix it. https://sage.math.washington.edu:8091/hudson/job/cython-devel-tests/430/ Stefan From stefan_ml at behnel.de Sat Jun 23 10:11:30 2012 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 23 Jun 2012 10:11:30 +0200 Subject: [Cython] new Jenkins setup Message-ID: <4FE57A32.40001@behnel.de> Hi, I moved the Jenkins installation out of the USB disk and into my home directory. 
The USB disk has proven very fragile in the past, so this will make us more independent from reboots and disk failures. To keep the builds fast, the workspaces have moved into a ramdisk, which is limited to 20 GB. This is less than the previous directory size, so I changed the job configs to delete redundant data after the builds, namely the unpacked CPython directories and the Cython installation. Those are still available in the job workspaces in form of the archives that the build jobs copy over at the beginning. So, to reproduce test failures on the Jenkins server, you can just unpack them manually. Currently, we are way below the limit (<5GB), but the developer branches haven't been built yet. It looks like no job takes more than 1GB when it runs, so the 6 active jobs that we run in parallel will not take more than 6GB in total. That leaves some 14 GB for the resident jobs. Those tend to stay around 30-100MB each (the Cython matrix jobs are more like 30-50MB per configuration), so we can keep quite a lot of jobs in the ramdisk. I'll keep an eye on them from time to time, but I think we'll be fine with that for a while. Still, please take a bit of care when you make changes to build jobs that you do not leave unnecessarily large sets of redundant data lying around at the end. Anything that we clearly no longer need after the build and that can be deleted will keep space free for running jobs and things like ccache. Stefan From stefan_ml at behnel.de Mon Jun 25 12:31:55 2012 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 25 Jun 2012 12:31:55 +0200 Subject: [Cython] static type checking in Python Message-ID: <4FE83E1B.4050409@behnel.de> Hi, there's some work going on regarding static type analysis and checking of Python programs, here's the mailing list for it: https://groups.google.com/group/python-static-type-checking I think this is somewhat related to Cython. 
After all, they are trying to figure out static type information from source code - although apparently rather in order to find bugs than to speed things up. But the one doesn't necessarily exclude the other. Stefan From drsalists at gmail.com Mon Jun 25 20:58:50 2012 From: drsalists at gmail.com (Dan Stromberg) Date: Mon, 25 Jun 2012 18:58:50 +0000 Subject: [Cython] new FFI library for Python In-Reply-To: <4FDF3738.9040006@behnel.de> References: <4FDF3738.9040006@behnel.de> Message-ID: Is it related to Common Lisp's CFFI? If not, it might be confusing to have two things with the same name, similar purposes, but not really the same thing. http://common-lisp.net/project/cffi/ On Mon, Jun 18, 2012 at 2:12 PM, Stefan Behnel wrote: > Hi, > > the PyPy folks have come up with a new FFI library (called cffi) for > CPython (and eventually PyPy, obviously). > > http://cffi.readthedocs.org/ > > It borrows from LuaJIT's FFI in that it parses C declarations at runtime. > It then builds a C extension to access the external code, i.e. it requires > a C compiler at runtime (when running in CPython). > > Just thought this might be interesting. > > Stefan > _______________________________________________ > cython-devel mailing list > cython-devel at python.org > http://mail.python.org/mailman/listinfo/cython-devel From stefan_ml at behnel.de Mon Jun 25 21:33:02 2012 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 25 Jun 2012 21:33:02 +0200 Subject: [Cython] new FFI library for Python In-Reply-To: References: <4FDF3738.9040006@behnel.de> Message-ID: <4FE8BCEE.1040902@behnel.de> Dan Stromberg, 25.06.2012 20:58: > Is it related to Common Lisp's CFFI? If not, it might be confusing to > have two things with the same name, similar purposes, but not really > the same thing. > http://common-lisp.net/project/cffi/ I think "cffi" for "C foreign function interface" is just the one obvious name for such a thing. 
Stefan From stefan_ml at behnel.de Tue Jun 26 22:36:51 2012 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 26 Jun 2012 22:36:51 +0200 Subject: [Cython] planning for 0.17 Message-ID: <4FEA1D63.5090902@behnel.de> Hi, I'd like to get an idea of what's still open for 0.17. Mark mentioned some open memoryview issues on his list and I know that there are still issues with PyPy, some of which could get fixed in a reasonable time frame. Also, Jenkins isn't all that happy yet. https://sage.math.washington.edu:8091/hudson/job/cython-devel-tests/ What's the current state of the master branch for everyone? Anything that you're working on and/or that you think should go in but isn't yet? I would like to see 0.17 released some time next month, if possible. I don't currently see any real blockers, so that might be doable. The release notes look ok so far, but the bug tracker list is really short in comparison. Please add to both as you see fit. http://wiki.cython.org/ReleaseNotes-0.17 http://trac.cython.org/cython_trac/query?status=closed&group=component&order=id&col=id&col=summary&col=milestone&col=status&col=type&col=priority&col=component&milestone=0.17&desc=1 Stefan From vitja.makarov at gmail.com Wed Jun 27 06:29:45 2012 From: vitja.makarov at gmail.com (Vitja Makarov) Date: Wed, 27 Jun 2012 08:29:45 +0400 Subject: [Cython] planning for 0.17 In-Reply-To: <4FEA1D63.5090902@behnel.de> References: <4FEA1D63.5090902@behnel.de> Message-ID: 2012/6/27 Stefan Behnel : > Hi, > > I'd like to get an idea of what's still open for 0.17. > > Mark mentioned some open memoryview issues on his list and I know that > there are still issues with PyPy, some of which could get fixed in a > reasonable time frame. Also, Jenkins isn't all that happy yet. > > https://sage.math.washington.edu:8091/hudson/job/cython-devel-tests/ > > What's the current state of the master branch for everyone? Anything that > you're working on and/or that you think should go in but isn't yet? > I'm ok with it. 
> I would like to see 0.17 released some time next month, if possible. I > don't currently see any real blockers, so that might be doable. > > The release notes look ok so far, but the bug tracker list is really short > in comparison. Please add to both as you see fit. > > http://wiki.cython.org/ReleaseNotes-0.17 > > http://trac.cython.org/cython_trac/query?status=closed&group=component&order=id&col=id&col=summary&col=milestone&col=status&col=type&col=priority&col=component&milestone=0.17&desc=1 > I've updated T766's milestone from 0.16 to 0.17 as it didn't get into 0.16 release. -- vitja. From stefan_ml at behnel.de Wed Jun 27 09:40:12 2012 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 27 Jun 2012 09:40:12 +0200 Subject: [Cython] planning for 0.17 In-Reply-To: References: <4FEA1D63.5090902@behnel.de> Message-ID: <4FEAB8DC.4020200@behnel.de> Vitja Makarov, 27.06.2012 06:29: > I've updated T766's milestone from 0.16 to 0.17 as it didn't get into > 0.16 release. Could you add it to the release notes then? Stefan From vitja.makarov at gmail.com Wed Jun 27 11:17:49 2012 From: vitja.makarov at gmail.com (Vitja Makarov) Date: Wed, 27 Jun 2012 13:17:49 +0400 Subject: [Cython] planning for 0.17 In-Reply-To: <4FEAB8DC.4020200@behnel.de> References: <4FEA1D63.5090902@behnel.de> <4FEAB8DC.4020200@behnel.de> Message-ID: 2012/6/27 Stefan Behnel : > Vitja Makarov, 27.06.2012 06:29: >> I've updated T766's milestone from 0.16 to 0.17 as it didn't get into >> 0.16 release. > > Could you add it to the release notes then? > I think it's too minor a change to be listed in release notes. -- vitja. From markflorisson88 at gmail.com Wed Jun 27 11:54:00 2012 From: markflorisson88 at gmail.com (mark florisson) Date: Wed, 27 Jun 2012 10:54:00 +0100 Subject: [Cython] planning for 0.17 In-Reply-To: <4FEA1D63.5090902@behnel.de> References: <4FEA1D63.5090902@behnel.de> Message-ID: On 26 June 2012 21:36, Stefan Behnel wrote: > Hi, > > I'd like to get an idea of what's still open for 0.17. 
> > Mark mentioned some open memoryview issues on his list and I know that > there are still issues with PyPy, some of which could get fixed in a > reasonable time frame. Also, Jenkins isn't all that happy yet. > > https://sage.math.washington.edu:8091/hudson/job/cython-devel-tests/ > > What's the current state of the master branch for everyone? Anything that > you're working on and/or that you think should go in but isn't yet? > > I would like to see 0.17 released some time next month, if possible. I > don't currently see any real blockers, so that might be doable. > > The release notes look ok so far, but the bug tracker list is really short > in comparison. Please add to both as you see fit. > > http://wiki.cython.org/ReleaseNotes-0.17 > > http://trac.cython.org/cython_trac/query?status=closed&group=component&order=id&col=id&col=summary&col=milestone&col=status&col=type&col=priority&col=component&milestone=0.17&desc=1 > > Stefan > _______________________________________________ > cython-devel mailing list > cython-devel at python.org > http://mail.python.org/mailman/listinfo/cython-devel Hey, Sounds good, I'll have a look at the memoryview tests. One is due to numpy headers redefining PyIndex_Check (though I thought I fixed that previously). Defaults for fused def functions may also fail in some cases, I'll try to fix that as well, or issue an error otherwise for now. That said, I'm busy with a dissertation and some other stuff, so if anyone would like to pick up the release for 0.17, I'd be much obliged. I can't test it right now, but I don't understand the following in the release notes (regarding array.array): "Note that only the buffer syntax is supported for these arrays. To use memoryviews with them, use the buffer syntax to unpack the buffer first.". Why is that, it implements __getbuffer__ right? So it shouldn't matter whether you use memoryviews or buffer syntax, both use __Pyx_GetBuffer(). 
Mark From stefan_ml at behnel.de Wed Jun 27 13:59:39 2012 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 27 Jun 2012 13:59:39 +0200 Subject: [Cython] planning for 0.17 In-Reply-To: References: <4FEA1D63.5090902@behnel.de> Message-ID: <4FEAF5AB.8040900@behnel.de> mark florisson, 27.06.2012 11:54: > if anyone would like to pick up the release for 0.17, I'd be much > obliged. I think I can handle it. :) > I can't test it right now, but I don't understand the following in the > release notes (regarding array.array): "Note that only the buffer > syntax is supported for these arrays. To use memoryviews with them, > use the buffer syntax to unpack the buffer first.". Why is that, it > implements __getbuffer__ right? So it shouldn't matter whether you use > memoryviews or buffer syntax, both use __Pyx_GetBuffer(). The problem is that arrayarray.pxd is only used when the exporter is typed. This means that you can't do this: def func(int[:] arr): pass func(array.array('i', [1,2,3])) but it will work if func() is defined like this: def func(array.array arr): cdef int[:] view = arr I admit that the wording in the release notes is wrong, I wrote it because I initially thought that you had to do this: def func(array.array[int] arr): cdef int[:] view = arr But no, you don't have to use the buffer interface, you just have to type the variable. I'll update the release notes. Works better in Py3, obviously. Stefan From markflorisson88 at gmail.com Wed Jun 27 14:17:47 2012 From: markflorisson88 at gmail.com (mark florisson) Date: Wed, 27 Jun 2012 13:17:47 +0100 Subject: [Cython] planning for 0.17 In-Reply-To: <4FEAF5AB.8040900@behnel.de> References: <4FEA1D63.5090902@behnel.de> <4FEAF5AB.8040900@behnel.de> Message-ID: On 27 June 2012 12:59, Stefan Behnel wrote: > mark florisson, 27.06.2012 11:54: >> if anyone would like to pick up the release for 0.17, I'd be much >> obliged. > > I think I can handle it. :) > Great, thanks! 
>> I can't test it right now, but I don't understand the following in the >> release notes (regarding array.array): "Note that only the buffer >> syntax is supported for these arrays. To use memoryviews with them, >> use the buffer syntax to unpack the buffer first.". Why is that, it >> implements __getbuffer__ right? So it shouldn't matter whether you use >> memoryviews or buffer syntax, both use __Pyx_GetBuffer(). > > The problem is that arrayarray.pxd is only used when the exporter is typed. > This means that you can't do this: > >    def func(int[:] arr): pass > >    func(array.array('i', [1,2,3])) That works for me, as long as array is cimported from cpython (as 'array' or some other name). It will patch __Pyx_GetBuffer with a typecheck and a call to its __getbuffer__ method. > but it will work if func() is defined like this: > >    def func(array.array arr): >        cdef int[:] view = arr > > I admit that the wording in the release notes is wrong, I wrote it because > I initially thought that you had to do this: > >    def func(array.array[int] arr): >        cdef int[:] view = arr > > But no, you don't have to use the buffer interface, you just have to type > the variable. I'll update the release notes. > > Works better in Py3, obviously. 
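The premise of this exchange — that array.array exports its data via __getbuffer__ — can be checked from plain Python with the built-in memoryview, which goes through the same buffer protocol (a pure-Python illustration, not the Cython __Pyx_GetBuffer code path being discussed):

```python
from array import array

# array.array implements the buffer protocol, so any buffer consumer
# (here the built-in memoryview) can unpack it without copying.
arr = array('i', [1, 2, 3])
view = memoryview(arr)

assert view.format == 'i'
assert view.ndim == 1
assert view.tolist() == [1, 2, 3]

view[0] = 42                 # the view is writable and shares storage
assert arr[0] == 42
```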
> > Stefan > _______________________________________________ > cython-devel mailing list > cython-devel at python.org > http://mail.python.org/mailman/listinfo/cython-devel From stefan_ml at behnel.de Wed Jun 27 14:48:04 2012 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 27 Jun 2012 14:48:04 +0200 Subject: [Cython] planning for 0.17 In-Reply-To: References: <4FEA1D63.5090902@behnel.de> <4FEAF5AB.8040900@behnel.de> Message-ID: <4FEB0104.6050106@behnel.de> mark florisson, 27.06.2012 14:17: > On 27 June 2012 12:59, Stefan Behnel wrote: >> mark florisson, 27.06.2012 11:54: >>> I can't test it right now, but I don't understand the following in the >>> release notes (regarding array.array): "Note that only the buffer >>> syntax is supported for these arrays. To use memoryviews with them, >>> use the buffer syntax to unpack the buffer first.". Why is that, it >>> implements __getbuffer__ right? So it shouldn't matter whether you use >>> memoryviews or buffer syntax, both use __Pyx_GetBuffer(). >> >> The problem is that arrayarray.pxd is only used when the exporter is typed. >> This means that you can't do this: >> >> def func(int[:] arr): pass >> >> func(array.array('i', [1,2,3])) > > That works for me, as long and array is cimported from cpython (as > 'array' or some other name). It will patch __Pyx_GetBuffer with a > typecheck and a call to its __getbuffer__ method. Hmm, interesting. I keep learning. I'll add tests for that. For the memoryview_type and array_type checks, wouldn't a type identity test be enough instead of a PyObject_TypeCheck() ? 
Stefan

From markflorisson88 at gmail.com Wed Jun 27 15:03:56 2012
From: markflorisson88 at gmail.com (mark florisson)
Date: Wed, 27 Jun 2012 14:03:56 +0100
Subject: [Cython] planning for 0.17
In-Reply-To: <4FEB0104.6050106@behnel.de>
References: <4FEA1D63.5090902@behnel.de> <4FEAF5AB.8040900@behnel.de> <4FEB0104.6050106@behnel.de>
Message-ID:

On 27 June 2012 13:48, Stefan Behnel wrote:
> mark florisson, 27.06.2012 14:17:
>> On 27 June 2012 12:59, Stefan Behnel wrote:
>>> mark florisson, 27.06.2012 11:54:
>>>> I can't test it right now, but I don't understand the following in the
>>>> release notes (regarding array.array): "Note that only the buffer
>>>> syntax is supported for these arrays. To use memoryviews with them,
>>>> use the buffer syntax to unpack the buffer first.". Why is that, it
>>>> implements __getbuffer__ right? So it shouldn't matter whether you use
>>>> memoryviews or buffer syntax, both use __Pyx_GetBuffer().
>>>
>>> The problem is that arrayarray.pxd is only used when the exporter is typed.
>>> This means that you can't do this:
>>>
>>>     def func(int[:] arr): pass
>>>
>>>     func(array.array('i', [1,2,3]))
>>
>> That works for me, as long as array is cimported from cpython (as
>> 'array' or some other name). It will patch __Pyx_GetBuffer with a
>> typecheck and a call to its __getbuffer__ method.
>
> Hmm, interesting. I keep learning. I'll add tests for that.
>
> For the memoryview_type and array_type checks, wouldn't a type identity
> test be enough instead of a PyObject_TypeCheck()?
>
> Stefan

Well, you want it to work for subclasses as well. I think the only
thing that doesn't work (pre-2.6), is overriding __getbuffer__ in a
subclass outside of the module or pxd.
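The distinction at issue here — an exact type-identity test versus PyObject_TypeCheck() — is the C-level analogue of `type(x) is T` versus `isinstance(x, T)`. A small plain-Python illustration of why subclasses need the latter (the subclass is hypothetical, purely for demonstration):

```python
from array import array

class MyArray(array):
    """A subclass that still exports the same buffer."""

base = array('i', [1, 2, 3])
sub = MyArray('i', [1, 2, 3])

# A type-identity test (a plain pointer comparison at the C level)
# rejects the subclass:
assert type(base) is array
assert type(sub) is not array

# PyObject_TypeCheck() walks tp_base, i.e. behaves like isinstance(),
# so subclass instances keep working:
assert isinstance(sub, array)
```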
For memoryviews, since each module has a different memoryview type, I
inject a capsule in tp_dict, which __Pyx_GetBuffer checks for (it's
called __pyx_getbuffer and __pyx_releasebuffer).

From stefan_ml at behnel.de Wed Jun 27 15:09:32 2012
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 27 Jun 2012 15:09:32 +0200
Subject: [Cython] planning for 0.17
In-Reply-To:
References: <4FEA1D63.5090902@behnel.de> <4FEAF5AB.8040900@behnel.de> <4FEB0104.6050106@behnel.de>
Message-ID: <4FEB060C.5060002@behnel.de>

mark florisson, 27.06.2012 15:03:
> On 27 June 2012 13:48, Stefan Behnel wrote:
>> mark florisson, 27.06.2012 14:17:
>>> That works for me, as long as array is cimported from cpython (as
>>> 'array' or some other name). It will patch __Pyx_GetBuffer with a
>>> typecheck and a call to its __getbuffer__ method.
>>
>> For the memoryview_type and array_type checks, wouldn't a type identity
>> test be enough instead of a PyObject_TypeCheck()?
>
> Well, you want it to work for subclasses as well.

Fine with me.

> I think the only
> thing that doesn't work (pre-2.6), is overriding __getbuffer__ in a
> subclass outside of the module or pxd. For memoryviews, since each
> module has a different memoryview type, I inject a capsule in tp_dict,
> which __Pyx_GetBuffer checks for (it's called __pyx_getbuffer and
> __pyx_releasebuffer).

I'm sure it'll be a lot of fun to rip that out when we finally drop
support for Python 2.5 ...

Stefan

From dieter at handshake.de Thu Jun 28 09:04:14 2012
From: dieter at handshake.de (Dieter Maurer)
Date: Thu, 28 Jun 2012 09:04:14 +0200
Subject: [Cython] Feature request: generate signature information for use by "inspect"
Message-ID: <20460.494.30787.680552@localhost.localdomain>

Python's "inspect" module is a great help to get valuable information
about a package. Many higher level tools (e.g. the "help" builtin
and "pydoc") are based on it.
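As background: for functions defined in Python source, inspect can recover the signature directly from the code object — it is this metadata that C-implemented functions historically could not provide. A plain-Python illustration:

```python
import inspect

def example(a, b=1, *args, **kwargs):
    """An ordinary Python-level function."""
    return a + b

# The signature is recovered from the function's code object:
sig = inspect.signature(example)
assert str(sig) == '(a, b=1, *args, **kwargs)'

# help() and pydoc build on exactly this kind of introspection data,
# which is why a C-implemented function without it shows up opaque.
assert inspect.getdoc(example) == 'An ordinary Python-level function.'
```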
I have just recognized a deficiency of "cython" generated modules with
respect to "inspect" support:

"inspect" cannot determine the signatures for Python functions defined
in "Cython" source.

I understand that this might be a limitation of Python's "C" interface.
In this case, I suggest enhancing the function's docstring with
signature information.

I now manually transform my docstrings

    def <name>(<args>):
        """
        <description>
        """

into:

    def <name>(<args>):
        """<name>(<args>) -> <return type>:
        <description>
        """

and would be happy to get something similar automatically.

-- Dieter

From stefan_ml at behnel.de Thu Jun 28 09:25:32 2012
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 28 Jun 2012 09:25:32 +0200
Subject: [Cython] Feature request: generate signature information for use by "inspect"
In-Reply-To: <20460.494.30787.680552@localhost.localdomain>
References: <20460.494.30787.680552@localhost.localdomain>
Message-ID: <4FEC06EC.2000107@behnel.de>

Dieter Maurer, 28.06.2012 09:04:
> Python's "inspect" module is a great help to get valuable information
> about a package. Many higher level tools (e.g. the "help" builtin
> and "pydoc") are based on it.
>
> I have just recognized a deficiency of "cython" generated
> modules with respect to "inspect" support:
>
> "inspect" cannot determine the signatures for Python functions
> defined in "Cython" source.
>
> I understand that this might be a limitation of Python's "C"
> interface.

Correct, although Cython goes to great lengths to enable introspection
of Cython implemented functions and classes (admittedly, we could still
do more...)

> In this case, I suggest enhancing the
> function's docstring with signature information.
>
> I now manually transform my docstrings
>
>     def <name>(<args>):
>         """
>         <description>
>         """
>
> into:
>
>     def <name>(<args>):
>         """<name>(<args>) -> <return type>:
>         <description>
>         """
>
> and would be happy to get something similar automatically.

And the time machine strikes again. You can use the "embedsignature"
compiler option for that.

http://docs.cython.org/src/reference/compilation.html?highlight=embedsignature#compiler-directives

Stefan

From dieter at handshake.de Thu Jun 28 11:12:03 2012
From: dieter at handshake.de (Dieter Maurer)
Date: Thu, 28 Jun 2012 11:12:03 +0200
Subject: [Cython] Feature request: generate signature information for use by "inspect"
In-Reply-To: <4FEC06EC.2000107@behnel.de>
References: <20460.494.30787.680552@localhost.localdomain> <4FEC06EC.2000107@behnel.de>
Message-ID: <20460.8163.513094.324883@localhost.localdomain>

Stefan Behnel wrote at 2012-6-28 09:25 +0200:
>Dieter Maurer, 28.06.2012 09:04:
>> ...
>> In this case, I suggest enhancing the
>> function's docstring with signature information.
>>
>> I now manually transform my docstrings
>>
>>     def <name>(<args>):
>>         """
>>         <description>
>>         """
>>
>> into:
>>
>>     def <name>(<args>):
>>         """<name>(<args>) -> <return type>:
>>         <description>
>>         """
>>
>> and would be happy to get something similar automatically.
>
>And the time machine strikes again. You can use the "embedsignature"
>compiler option for that.
>
>http://docs.cython.org/src/reference/compilation.html?highlight=embedsignature#compiler-directives

Thank you! I missed this part of the documentation.

-- Dieter

From robertwb at gmail.com Thu Jun 28 10:59:59 2012
From: robertwb at gmail.com (Robert Bradshaw)
Date: Thu, 28 Jun 2012 01:59:59 -0700
Subject: [Cython] Automatic C++ conversions
Message-ID:

I've been looking at how painful it is to constantly convert between
Python objects and strings in C++. Yes, it's easy to write a utility,
but this should be as natural (if not more so, as the length is
explicit) as bytes <-> char*. Several other of the libcpp classes
(vector, map) have natural Python analogues too.

What would people think about making it possible to declare these in a
C++ file? Being able to make arbitrary mappings anywhere between types
is contextless global state that I'd rather avoid, but perhaps special
methods defined on the class such as

    cdef extern from "<string>" namespace "std":
        cdef cppclass string:
            def __object__(string s):
                return s.c_str()[:s.size()]
            def __create__(object o):
                return string(o, len(o))
            ...

(names open to suggestions) Then one could write

    cdef extern from *:
        string c_func(string)

    def f(x):
        return c_func(x)

- Robert

From stefan_ml at behnel.de Thu Jun 28 11:54:49 2012
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 28 Jun 2012 11:54:49 +0200
Subject: [Cython] Automatic C++ conversions
In-Reply-To:
References:
Message-ID: <4FEC29E9.9030609@behnel.de>

Robert Bradshaw, 28.06.2012 10:59:
> I've been looking at how painful it is to constantly convert between
> Python objects and strings in C++.

You mean std::string (as I think it's called)? Can't we just special
case that in the same way that we special case char* and friends?
Basically just one type more in that list.
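Robert's parenthetical — that the length is explicit — is the heart of the std::string case: a NUL-terminated char* cannot round-trip arbitrary bytes, while a (pointer, length) pair can. The effect can be demonstrated from Python with ctypes; this illustrates the C-level behaviour, not Cython's actual coercion code:

```python
import ctypes

raw = b"a\x00b"  # bytes with an embedded NUL
buf = ctypes.create_string_buffer(raw, len(raw))

# Reading NUL-terminated, as a plain char* coercion must
# (ctypes.string_at without a size does a strlen): data is lost.
assert ctypes.string_at(buf) == b"a"

# Reading with an explicit length, as a std::string coercion can
# (think string(ptr, size)): all three bytes survive.
assert ctypes.string_at(buf, len(raw)) == b"a\x00b"
```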
And it would give you efficient encoding/decoding more or less for free.

I mean, well, it would likely break existing code to start doing that (in
the same way that we broke code by enabling type inference for convertible
pointers), but as long as it helps more than it breaks ...

> Yes, it's easy to write a utility,
> but this should be as natural (if not more so, as the length is
> explicit) as bytes <-> char*. Several other of the libcpp classes
> (vector, map) have natural Python analogues too.

And you would want to enable coercion to those, too? Have a vector copy
into a Python list automatically? (Although that's trivially done with a
list comprehension, maybe the other way is more interesting...)

I think, as long as there is one obvious mapping for a given type, I
wouldn't mind letting Cython apply it automatically.

> What would people think about making it possible to declare these in a
> C++ file? Being able to make arbitrary mappings anywhere between types
> is contextless global state that I'd rather avoid, but perhaps special
> methods defined on the class such as
>
>     cdef extern from "<string>" namespace "std":
>         cdef cppclass string:
>             def __object__(string s):
>                 return s.c_str()[:s.size()]
>             def __create__(object o):
>                 return string(o, len(o))
>             ...
>
> (names open to suggestions) Then one could write
>
>     cdef extern from *:
>         string c_func(string)
>
>     def f(x):
>         return c_func(x)

Admittedly, it fits somewhat more naturally into C++ classes than generally
into C, although we could allow the same thing in ctypedefs.

However, I'm reluctant to introduce something like this as long as we can
get away with built-in auto-coercion.
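For concreteness, the protocol Robert sketches can be modeled in plain Python. Everything here is hypothetical — __object__ and __create__ are his tentative names, and nothing like this exists in Cython:

```python
# Toy model of the proposed per-class coercion protocol:
# __object__ converts the wrapped value to a Python object,
# __create__ builds the wrapper from a Python object.
class CppString:
    def __init__(self, data=b""):
        self._data = bytes(data)

    def __object__(self):
        # C++ -> Python, length-explicit (think s.c_str(), s.size())
        return self._data

    @classmethod
    def __create__(cls, o):
        # Python -> C++ (think string(<char*>o, len(o)))
        return cls(o)

def to_py(value):
    """What an auto-generated coercion call site would expand to."""
    return value.__object__() if hasattr(value, "__object__") else value

s = CppString.__create__(b"abc")
assert to_py(s) == b"abc"
assert to_py(b"plain") == b"plain"  # non-wrapped values pass through
```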
Stefan

From robertwb at gmail.com Thu Jun 28 12:07:13 2012
From: robertwb at gmail.com (Robert Bradshaw)
Date: Thu, 28 Jun 2012 03:07:13 -0700
Subject: [Cython] Automatic C++ conversions
In-Reply-To: <4FEC29E9.9030609@behnel.de>
References: <4FEC29E9.9030609@behnel.de>
Message-ID:

On Thu, Jun 28, 2012 at 2:54 AM, Stefan Behnel wrote:
> Robert Bradshaw, 28.06.2012 10:59:
>> I've been looking at how painful it is to constantly convert between
>> Python objects and strings in C++.
>
> You mean std::string (as I think it's called)? Can't we just special case
> that in the same way that we special case char* and friends?

Yes, we could. If we do that it'd make sense to special case list and
vector and pair and map and set as well, though perhaps those are
special enough to hard code them, and it makes the language simpler to
not have more special methods.

> Basically just
> one type more in that list. And it would give you efficient
> encoding/decoding more or less for free.
>
> I mean, well, it would likely break existing code to start doing that (in
> the same way that we broke code by enabling type inference for convertible
> pointers), but as long as it helps more than it breaks ...

I don't think it'd be backwards incompatible; currently it's just an error.

>> Yes, it's easy to write a utility,
>> but this should be as natural (if not more so, as the length is
>> explicit) as bytes <-> char*. Several other of the libcpp classes
>> (vector, map) have natural Python analogues too.
>
> And you would want to enable coercion to those, too? Have a vector copy
> into a Python list automatically? (Although that's trivially done with a
> list comprehension, maybe the other way is more interesting...)
>
> I think, as long as there is one obvious mapping for a given type, I
> wouldn't mind letting Cython apply it automatically.
>
>
>> What would people think about making it possible to declare these in a
>> C++ file?
>> Being able to make arbitrary mappings anywhere between types
>> is contextless global state that I'd rather avoid, but perhaps special
>> methods defined on the class such as
>>
>>     cdef extern from "<string>" namespace "std":
>>         cdef cppclass string:
>>             def __object__(string s):
>>                 return s.c_str()[:s.size()]
>>             def __create__(object o):
>>                 return string(o, len(o))
>>             ...
>>
>> (names open to suggestions) Then one could write
>>
>>     cdef extern from *:
>>         string c_func(string)
>>
>>     def f(x):
>>         return c_func(x)
>
> Admittedly, it fits somewhat more naturally into C++ classes than generally
> into C, although we could allow the same thing in ctypedefs.
>
> However, I'm reluctant to introduce something like this as long as we can
> get away with built-in auto-coercion.
>
> Stefan

From stefan_ml at behnel.de Thu Jun 28 14:10:07 2012
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 28 Jun 2012 14:10:07 +0200
Subject: [Cython] Automatic C++ conversions
In-Reply-To:
References: <4FEC29E9.9030609@behnel.de>
Message-ID: <4FEC499F.5050505@behnel.de>

Robert Bradshaw, 28.06.2012 12:07:
> On Thu, Jun 28, 2012 at 2:54 AM, Stefan Behnel wrote:
>> Robert Bradshaw, 28.06.2012 10:59:
>>> I've been looking at how painful it is to constantly convert between
>>> Python objects and strings in C++.
>>
>> You mean std::string (as I think it's called)? Can't we just special case
>> that in the same way that we special case char* and friends?
>
> Yes, we could.

Then I think it makes sense to do that. Basically, the std::string type
would set its is_string flag and then we need the actual coercion code
for it.
> If we do that it'd make sense to special case list and
> vector and pair and map and set as well, though perhaps those are
> special enough to hard code them, and it makes the language simpler to
> not have more special methods.

Ok, then it's

    std::string <=> bytes
    std::vector <=> list
    std::map <=> dict
    std::set <=> set

Potentially also:

    std::pair => tuple (maybe 2-tuple => std::pair with a runtime length test?)

What about allowing list(<C++ container>) etc.? As long as the item type
can be coerced at compile time, this should be doable:

    <C++ container> => Python iterator

and it would even be easy to implement in Cython code using a generator
function. The other direction (Python iterator => <C++ container>) would
be trickier but could also be made to work when the C++ item type on the
LHS of the assignment that triggers the coercion is known at compile time.

We might want to look for a way to make these coercions a "thing" in the
code (maybe through a registry or dedicated class) rather than adding
special casing code everywhere.

I think a CEP would be a good way to specify the above coercions. I also
think that this is large enough a feature to openly ask for sponsorship.

>> Basically just
>> one type more in that list. And it would give you efficient
>> encoding/decoding more or less for free.
>>
>> I mean, well, it would likely break existing code to start doing that (in
>> the same way that we broke code by enabling type inference for convertible
>> pointers), but as long as it helps more than it breaks ...
>
> I don't think it'd be backwards incompatible; currently it's just an error.

Ah, right, sorry. I got confused. Assignments to an untyped variable
inherit the type of the RHS, so only typed assignments would be impacted,
and those are currently errors, sure. Nothing in the way then.

Stefan

From stefan_ml at behnel.de Fri Jun 29 07:45:05 2012
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 29 Jun 2012 07:45:05 +0200
Subject: [Cython] [cython-users] C++: how to handle failures of 'new'?
In-Reply-To:
References: <4FECA49B.8090404@behnel.de>
Message-ID: <4FED40E1.1040301@behnel.de>

[moving this to cython-devel as it's getting technical]

Robert Bradshaw, 28.06.2012 21:46:
> On Thu, Jun 28, 2012 at 11:38 AM, Stefan Behnel wrote:
>> currently, when I write "new CppClass()" in Cython, it generates a straight
>> call to the "new" operator. It doesn't do any error handling. And the
>> current documentation doesn't even mention this case.
>>
>> Is there a "standard" way to handle this? It seems that C++ has different
>> ways to deal with failures here but raises an exception by default. Would
>> you declare the constructor(s) with an "except +MemoryError"? Is there a
>> reason Cython shouldn't be doing this automatically (if nothing else was
>> declared)?
>
> I think it certainly makes sense to declare the default constructor as
> "except +" (and std::bad_alloc should become MemoryError),

Right. The code in the constructor can raise other exceptions that must
also be handled properly. An explicit "except +" will handle that.

> but whether
> to implicitly annotate declared constructors is less clear, especially
> as there's no way to un-annotate them.

I agree, but sadly, it's the default behaviour that is wrong. I'm sure we
made lots of users run into this trap already. I fixed the documentation
for now, but the bottom line is that we require users to take care of
proper declarations themselves. Otherwise, the code that we generate is
incorrect, although it's 100% certain that an allocation error can occur,
even if the constructor code doesn't raise any exceptions itself.

Apparently, changing the behaviour of the "new" operator requires a special
annotation "std::nothrow", which then returns NULL on allocation failures.
You can pass that from Cython by hacking up a cname, e.g.

    Rectangle "(std::nothrow) Rectangle" (int w, int h)

I'm sure there are users out there who figured this out (I mean, I did...)
and use it in their code, so I agree that this isn't easy to handle
because Cython simply wouldn't know what the actual error behaviour is
for a given constructor and how to correctly detect an error.

This problem applies only to heap allocation in that form. However, stack
allocation and the new exttype field allocation suffer from similar
problems when the default constructor raises an exception. Exttype fields
are a particularly nasty case because the user has no control over the
allocation. A C++ exception in the C++ class constructor would terminate
the exttype constructor unexpectedly and thus leak resources (in the best
case - no idea how CPython reacts if you throw a C++ exception through its
type instantiation code).

Similarly, a C++ exception in the constructor of a stack allocated object
would then originate from the function entry code and potentially hit the
Python function wrapper etc. Again, potentially leaking resources or worse.

To me, this sounds like we should do something about it. At least for the
implicit calls to the default constructor, we should generate "except +"
code automatically because there is no other way to handle them safely.

Stefan

From robertwb at gmail.com Fri Jun 29 11:08:21 2012
From: robertwb at gmail.com (Robert Bradshaw)
Date: Fri, 29 Jun 2012 02:08:21 -0700
Subject: [Cython] [cython-users] C++: how to handle failures of 'new'?
In-Reply-To: <4FED40E1.1040301@behnel.de>
References: <4FECA49B.8090404@behnel.de> <4FED40E1.1040301@behnel.de>
Message-ID:

On Thu, Jun 28, 2012 at 10:45 PM, Stefan Behnel wrote:
> [moving this to cython-devel as it's getting technical]
>
> Robert Bradshaw, 28.06.2012 21:46:
>> On Thu, Jun 28, 2012 at 11:38 AM, Stefan Behnel wrote:
>>> currently, when I write "new CppClass()" in Cython, it generates a straight
>>> call to the "new" operator. It doesn't do any error handling. And the
>>> current documentation doesn't even mention this case.
>>>
>>> Is there a "standard" way to handle this?
>>> It seems that C++ has different
>>> ways to deal with failures here but raises an exception by default. Would
>>> you declare the constructor(s) with an "except +MemoryError"? Is there a
>>> reason Cython shouldn't be doing this automatically (if nothing else was
>>> declared)?
>>
>> I think it certainly makes sense to declare the default constructor as
>> "except +" (and std::bad_alloc should become MemoryError),
>
> Right. The code in the constructor can raise other exceptions that must
> also be handled properly. An explicit "except +" will handle that.
>
>
>> but whether
>> to implicitly annotate declared constructors is less clear, especially
>> as there's no way to un-annotate them.
>
> I agree, but sadly, it's the default behaviour that is wrong. I'm sure we
> made lots of users run into this trap already. I fixed the documentation
> for now, but the bottom line is that we require users to take care of
> proper declarations themselves. Otherwise, the code that we generate is
> incorrect, although it's 100% certain that an allocation error can occur,
> even if the constructor code doesn't raise any exceptions itself.

This is always the case.

> Apparently, changing the behaviour of the "new" operator requires a special
> annotation "std::nothrow", which then returns NULL on allocation failures.
> You can pass that from Cython by hacking up a cname, e.g.
>
>     Rectangle "(std::nothrow) Rectangle" (int w, int h)
>
> I'm sure there are users out there who figured this out (I mean, I did...)
> and use it in their code, so I agree that this isn't easy to handle because
> Cython simply wouldn't know what the actual error behaviour is for a given
> constructor and how to correctly detect an error.
>
> This problem applies only to heap allocation in that form. However, stack
> allocation and the new exttype field allocation suffer from similar
> problems when the default constructor raises an exception.
> Exttype fields
> are a particularly nasty case because the user has no control over the
> allocation. A C++ exception in the C++ class constructor would terminate
> the exttype constructor unexpectedly and thus leak resources (in the best
> case - no idea how CPython reacts if you throw a C++ exception through its
> type instantiation code).

If the default constructor raises an exception then it should be declared
(to not do so is an error on the user's part). New raising bad_alloc is a
bit of a special case, but doesn't apply to the stack or exttype
allocations.

> Similarly, a C++ exception in the constructor of a stack allocated object
> would then originate from the function entry code and potentially hit the
> Python function wrapper etc. Again, potentially leaking resources or worse.
>
> To me, this sounds like we should do something about it. At least for the
> implicit calls to the default constructor, we should generate "except +"
> code automatically because there is no other way to handle them safely.

If no constructor is declared, it should be "except +" just to be safe,
but otherwise I don't see how this is any different than forgetting to
declare exceptions on any other function.

Unfortunately catching exceptions (with custom per-object handling) on a
set of stack allocated objects seems difficult if not impossible (without
resorting to ugly hacks like using placement new everywhere).

- Robert

From dieter at handshake.de Fri Jun 29 11:25:53 2012
From: dieter at handshake.de (Dieter Maurer)
Date: Fri, 29 Jun 2012 11:25:53 +0200
Subject: [Cython] Potential bug: hole in "C <-> Python" conversion
Message-ID: <20461.29857.232429.210278@localhost.localdomain>

I have

    cdef extern from *:
        ctypedef char const_unsigned_char "const unsigned char"

    cdef const_unsigned_char *c_data = data

which leads to "Cannot convert Python object to 'const_unsigned_char *'",
while "cdef char *c_data = data" works.
Should the "ctypedef char const_unsigned_char" not ensure
that "char" and "const_unsigned_char" are used as synonyms?

-- Dieter

From stefan_ml at behnel.de Fri Jun 29 11:42:26 2012
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 29 Jun 2012 11:42:26 +0200
Subject: [Cython] Potential bug: hole in "C <-> Python" conversion
In-Reply-To: <20461.29857.232429.210278@localhost.localdomain>
References: <20461.29857.232429.210278@localhost.localdomain>
Message-ID: <4FED7882.1050008@behnel.de>

Dieter Maurer, 29.06.2012 11:25:
> I have
>
>     cdef extern from *:
>         ctypedef char const_unsigned_char "const unsigned char"

This is an incorrect declaration. "char" != "unsigned char".

>     cdef const_unsigned_char *c_data = data
>
> leads to "Cannot convert Python object to 'const_unsigned_char *'"
> while "cdef char *c_data = data" works.
>
> Should the "ctypedef char const_unsigned_char" not ensure
> that "char" and "const_unsigned_char" are used as synonyms?

I assume you are not using the latest Cython (0.17pre) from github, are
you? It should have a fix for this.

Also note that libc.string contains declarations for "const char*" and
friends.

Stefan

From dieter at handshake.de Fri Jun 29 12:18:46 2012
From: dieter at handshake.de (Dieter Maurer)
Date: Fri, 29 Jun 2012 12:18:46 +0200
Subject: [Cython] Potential bug: hole in "C <-> Python" conversion
In-Reply-To: <4FED7882.1050008@behnel.de>
References: <20461.29857.232429.210278@localhost.localdomain> <4FED7882.1050008@behnel.de>
Message-ID: <20461.33030.605969.344092@localhost.localdomain>

Stefan Behnel wrote at 2012-6-29 11:42 +0200:
>Dieter Maurer, 29.06.2012 11:25:
>> I have
>>
>>     cdef extern from *:
>>         ctypedef char const_unsigned_char "const unsigned char"
>
>This is an incorrect declaration. "char" != "unsigned char".

You are right. I cheat to get Cython to convert between "unsigned char*"
and "bytes" in the same way as it does for "char *".
For this conversion, there is no real difference between "char *" and
"unsigned char *" (apart from a C level warning about a pointer of a bad
type passed to "PyString_FromStringAndSize").

>> cdef const_unsigned_char *c_data = data
>>
>> leads to "Cannot convert Python object to 'const_unsigned_char *'"
>> while "cdef char *c_data = data" works.
>>
>> Should the "ctypedef char const_unsigned_char" not ensure
>> that "char" and "const_unsigned_char" are used as synonyms?
>
>I assume you are not using the latest Cython (0.17pre) from github, are
>you? It should have a fix for this.

You are right. I am using the "cython" version which comes with my
operating system ("cython 0.13").

Very good, if the latest "Cython" behaves better :-)

>Also note that libc.string contains declarations for "const char*" and friends.

Unformatunately, I need "const unsigned char*" and "const xmlChar *"
(where "xmlChar" is defined as "unsigned char"). I used the "libc.string"
definitions as a blueprint for mine.

-- Dieter

From stefan_ml at behnel.de Fri Jun 29 13:07:27 2012
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 29 Jun 2012 13:07:27 +0200
Subject: [Cython] Potential bug: hole in "C <-> Python" conversion
In-Reply-To: <20461.33030.605969.344092@localhost.localdomain>
References: <20461.29857.232429.210278@localhost.localdomain> <4FED7882.1050008@behnel.de> <20461.33030.605969.344092@localhost.localdomain>
Message-ID: <4FED8C6F.5030204@behnel.de>

Dieter Maurer, 29.06.2012 12:18:
> Stefan Behnel wrote at 2012-6-29 11:42 +0200:
>> Also note that libc.string contains declarations for "const char*" and friends.
>
> Unformatunately

Nice word, took me a while to make my brain split the characters
correctly. ;)

> I need "const unsigned char*" and "const xmlChar *"
> (where "xmlChar" is defined as "unsigned char").

Ah, right, libxml2 - an excellent example.
lxml is still suffering from the decision of its initial author to ignore
C compiler warnings ("for now") and use plain char* instead. Lesson
learned: DON'T DO THAT!

I recently started cleaning that up (which is why Cython now understands
and coerces "unsigned char*" as well), but you wouldn't believe how much
work it is to get "const" right after the fact if you have a sufficiently
large code base. The current (udiff) patch in my patch queue is some 3000
lines and still growing, but at least the compiler warnings look like
they'd soon fit on a single page. That's about the point where I need to
start tackling the really tough problems.

> I used the "libc.string" definitions as a blueprint for mine.

Sure, as long as the types are correct. lxml will have them declared in
tree.pxd at some point.

BTW, you might want to upgrade to a more recent Cython in any case. 0.13
is almost two years old and lacks a lot of nice language features. lxml
2.4 will use Cython 0.17.

Stefan

From stefan_ml at behnel.de Fri Jun 29 14:02:29 2012
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 29 Jun 2012 14:02:29 +0200
Subject: [Cython] Potential bug: hole in "C <-> Python" conversion
In-Reply-To: <4FED8C6F.5030204@behnel.de>
References: <20461.29857.232429.210278@localhost.localdomain> <4FED7882.1050008@behnel.de> <20461.33030.605969.344092@localhost.localdomain> <4FED8C6F.5030204@behnel.de>
Message-ID: <4FED9955.1070307@behnel.de>

Stefan Behnel, 29.06.2012 13:07:
> Dieter Maurer, 29.06.2012 12:18:
>> I need "const unsigned char*" and "const xmlChar *"
>> (where "xmlChar" is defined as "unsigned char").
>
> Ah, right, libxml2 - an excellent example. lxml is still suffering from the
> decision of its initial author to ignore C compiler warnings ("for now")
> and use plain char* instead. Lesson learned: DON'T DO THAT!

I added a doc section about using "const" with "char*".
https://sage.math.washington.edu:8091/hudson/job/cython-docs/doclinks/1/src/tutorial/strings.html#dealing-with-const

Stefan

From robertwb at gmail.com Sat Jun 30 00:38:29 2012
From: robertwb at gmail.com (Robert Bradshaw)
Date: Fri, 29 Jun 2012 15:38:29 -0700
Subject: [Cython] Automatic C++ conversions
In-Reply-To: <4FEC499F.5050505@behnel.de>
References: <4FEC29E9.9030609@behnel.de> <4FEC499F.5050505@behnel.de>
Message-ID:

On Thu, Jun 28, 2012 at 5:10 AM, Stefan Behnel wrote:
> Robert Bradshaw, 28.06.2012 12:07:
>> On Thu, Jun 28, 2012 at 2:54 AM, Stefan Behnel wrote:
>>> Robert Bradshaw, 28.06.2012 10:59:
>>>> I've been looking at how painful it is to constantly convert between
>>>> Python objects and strings in C++.
>>>
>>> You mean std::string (as I think it's called)? Can't we just special case
>>> that in the same way that we special case char* and friends?
>>
>> Yes, we could.
>
> Then I think it makes sense to do that. Basically, the std::string type
> would set its is_string flag and then we need the actual coercion code
> for it.

I just leveraged the object <-> char* conversion in our utility code.

>> If we do that it'd make sense to special case list and
>> vector and pair and map and set as well, though perhaps those are
>> special enough to hard code them, and it makes the language simpler to
>> not have more special methods.
>
> Ok, then it's
>
>     std::string <=> bytes
>     std::vector <=> list
>     std::map <=> dict
>     std::set <=> set
>
> Potentially also:
>
>     std::pair => tuple (maybe 2-tuple => std::pair with a runtime length test?)

I implemented

    std::string <=> bytes
    std::map <=> dict
    iterable => std::vector => list
    iterable => std::list => list
    iterable => std::set => set
    2-iterable => std::pair => 2-tuple

> What about allowing list(<C++ container>) etc.? As long as the item type
> can be coerced at compile time, this should be doable:
>
>     <C++ container> => Python iterator
>
> and it would even be easy to implement in Cython code using a generator
> function.
The tricky part is memory management; one would have to make sure the
iterable is valid as long as the Python object is around (whereas it's
usually bound to the lifetime of its container).

Even more useful, however, would be supporting the "for ... in" syntax
for C++ iterators, which I plan to implement soon if no one beats me
to it.

> The other direction (Python iterator => <C++ container>) would be
> trickier but could also be made to work when the C++ item type on the LHS
> of the assignment that triggers the coercion is known at compile time.

Yes, this would actually probably be easier.

> We might want to look for a way to make these coercions a "thing" in the
> code (maybe through a registry or dedicated class) rather than adding
> special casing code everywhere.

Perhaps, but that's a rather vague idea with less immediate benefit.
The list of obvious cases to support turns out to be rather clear and
small. (We already have the from/to_py_function framework.)

> I think a CEP would be a good way to specify the above coercions.

User-extensibility would be a larger topic and would certainly deserve
a CEP, though I'm not claiming we want to support it.

> I also think that this is large enough a feature to openly ask for sponsorship.

That depends on the CEP.

- Robert

From stefan_ml at behnel.de Sat Jun 30 01:06:16 2012
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 30 Jun 2012 01:06:16 +0200
Subject: [Cython] Automatic C++ conversions
In-Reply-To:
References: <4FEC29E9.9030609@behnel.de> <4FEC499F.5050505@behnel.de>
Message-ID: <4FEE34E8.9050807@behnel.de>

Robert Bradshaw, 30.06.2012 00:38:
> I implemented
>
>     std::string <=> bytes
>     std::map <=> dict
>     iterable => std::vector => list
>     iterable => std::list => list
>     iterable => std::set => set
>     2-iterable => std::pair => 2-tuple

Very cool.

>> What about allowing list(<C++ container>) etc.? As long as the item type
>> can be coerced at compile time, this should be doable:
>>
>>     <C++ container> => Python iterator
>>
>> and it would even be easy to implement in Cython code using a generator
>> function.
>
> The tricky part is memory management; one would have to make sure the
> iterable is valid as long as the Python object is around (whereas it's
> usually bound to the lifetime of its container).

Ok, that's a problem then. We won't normally have any control over the
container. That makes for-in a much more interesting solution.

> Even more useful, however, would be supporting the "for ... in" syntax
> for C++ iterators, which I plan to implement soon if no one beats me
> to it.

Yes, that'll be a warmly appreciated feature, I guess. Please go ahead. :)

>> The other direction (Python iterator => <C++ container>) would be
>> trickier but could also be made to work when the C++ item type on the LHS
>> of the assignment that triggers the coercion is known at compile time.
>
> Yes, this would actually probably be easier.

I'm not currently sure about the details, at least the memory management
should be easy. But given that we have the container coercions now, this
might be a feature of minor interest anyway.

>> We might want to look for a way to make these coercions a "thing" in the
>> code (maybe through a registry or dedicated class) rather than adding
>> special casing code everywhere.
>
> Perhaps, but that's a rather vague idea with less immediate benefit.
> The list of obvious cases to support turns out to be rather clear and
> small. (We already have the from/to_py_function framework.)

Right. From your code, it turned out to be substantially more local than
I thought.

>> I think a CEP would be a good way to specify the above coercions.
>
> User-extensibility would be a larger topic and would certainly deserve
> a CEP, though I'm not claiming we want to support it.
>
>> I also think that this is large enough a feature to openly ask for sponsorship.
>
> That depends on the CEP.

I think we can continue to postpone this until we actually find a use
case where it provides a substantial benefit over what we have now.
Similar feature requests have come up several times in the past, but so
far, we always got away without it.

Stefan

From robertwb at gmail.com Sat Jun 30 01:20:21 2012
From: robertwb at gmail.com (Robert Bradshaw)
Date: Fri, 29 Jun 2012 16:20:21 -0700
Subject: [Cython] Automatic C++ conversions
In-Reply-To: <4FEE34E8.9050807@behnel.de>
References: <4FEC29E9.9030609@behnel.de> <4FEC499F.5050505@behnel.de> <4FEE34E8.9050807@behnel.de>
Message-ID:

On Fri, Jun 29, 2012 at 4:06 PM, Stefan Behnel wrote:
> Robert Bradshaw, 30.06.2012 00:38:
>> I implemented
>>
>>     std::string <=> bytes
>>     std::map <=> dict
>>     iterable => std::vector => list
>>     iterable => std::list => list
>>     iterable => std::set => set
>>     2-iterable => std::pair => 2-tuple
>
> Very cool.
>
>>> What about allowing list(<C++ container>) etc.? As long as the item type
>>> can be coerced at compile time, this should be doable:
>>>
>>>     <C++ container> => Python iterator
>>>
>>> and it would even be easy to implement in Cython code using a generator
>>> function.
>>
>> The tricky part is memory management; one would have to make sure the
>> iterable is valid as long as the Python object is around (whereas it's
>> usually bound to the lifetime of its container).
>
> Ok, that's a problem then. We won't normally have any control over the
> container. That makes for-in a much more interesting solution.
>
>> Even more useful, however, would be supporting the "for ... in" syntax
>> for C++ iterators, which I plan to implement soon if no one beats me
>> to it.
>
> Yes, that'll be a warmly appreciated feature, I guess. Please go ahead. :)
>
>>> The other direction (Python iterator => <C++ container>) would be
>>> trickier but could also be made to work when the C++ item type on the LHS
>>> of the assignment that triggers the coercion is known at compile time.
>>
>> Yes, this would actually probably be easier.
> > I'm not currently sure about the details, at least the memory management > should be easy. But given that we have the container coercions now, this > might be a feature of minor interest anyway. > > >>> We might want to look for a way to make these coercions a "thing" in the >>> code (maybe through a registry or dedicated class) rather than adding >>> special casing code everywhere. >> >> Perhaps, but that's a rather vague idea with less immediate benefit. >> The list of obvious cases to support turns out to be rather clear and >> small. (We already have the from/to_py_function framework.) > > Right. From your code, it turned out to be substantially more local than I > thought. And kudos to Mark for templatized cython utility code so I didn't have to re-implement all that iterating in C. >>> I think a CEP would be a good way to specify the above coercions. >> >> Though user-extensibility would be a larger topic and certainly >> deserve a CEP, though I'm not claiming we want to support it. >> >>> I also think that this is large enough a feature to openly ask for sponsorship. >> >> That depends on the CEP. > > I think we can continue to postpone this until we actually find a use case > where it provides a substantial benefit over what we have now. Similar > feature requests have come up several times in the past, but so far, we > always got away without it. 100% agree with you here. 
- Robert From d.s.seljebotn at astro.uio.no Sat Jun 30 12:57:49 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Sat, 30 Jun 2012 12:57:49 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: References: <4FCD100B.7000008@astro.uio.no> <4FCFC441.40703@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> <4FD0808B.5080300@astro.uio.no> <4FD083F9.2030006@astro.uio.no> <4FD26ADA.5060401@astro.uio.no> <4FD2E313.6040208@astro.uio.no> <6c423841-b888-478d-8b89-148f3e9bd60e@email.android.com> <4FD45424.9040909@astro.uio.no> <4FD45E31.8060506@astro.uio.no> <4FD72199.7010803@astro.uio.no> <4FD77AAC.6080905@astro.uio.no> Message-ID: <4FEEDBAD.2000507@astro.uio.no> My time is rather limited but I'm slowly trying to get another SEP 200 in place. Something that hit me, when I tried to make up my mind about whether to have (key, ptr) entries or (key, flags, ptr), is that the fast hash table entries can actually be arbitrary size. So we could make the table itself void *table[n] and then n would be a power of 2 (TBD: benchmark cost of allowing other sizes). Since we have the d[i] displacements, it's no problem at all to construct displacements to account for variable-size entries. Proposal: C-source for an un-initialized table (signature string is placeholder and not up for discussion now): { "3:method:foo:i4i4->i4", (void*)EXCEPT_STAR_FLAG, &foo_method, "2:numpy:SHAPE", &get_shape_method, "2:fieldoffset:barfield", (void*)5, 0 /*padding to n=2^k*/ } I.e. all keys are prepended by the number of slots they use. So methods get to use 3 sizeof(void*) slots since they need the flags, but entries that don't need flags use only 2 slots. (In this case, "numpy:SHAPE" is a protocol defined by NumPy and so doesn't need any flags; or the flags are stored under "numpy:FLAGS" by that protocol.) Then, PyExtensibleType_Ready parses this and rearranges the table to a perfect hash-table. 
As part of that, it parses the string literal keys and interns them, so that the number of slots becomes available in a more coder-friendly manner: typedef struct { uint64_t hash; /* lower 64 bits of md5 */ uint32_t strlen; /* we allow \0 in key */ uint8_t nslots; /* set to 3 for first example */ char *key; /* set to "method:foo:i4i4->i4" */ } fasttable_key_t; Then, the interned keys for the table are the fasttable_key_t* pointers. Storing the hash inside the key has two pros: - Caching the md5 work (provided the interner uses a faster hash function to go from string to Lookup would happen like this: typedef struct { fasttable_key_t *key; uintptr_t flags; void *funcptr; } method_t; (method_t*)PyCustomSlots_Find(mykey, mykey->hash); /* or, faster: */ (method_t*)PyCustomSlots_Find(mykey, 0x45343453453fabaULL); If you want to scan the table linearly (to avoid having to bother with getting an interned key), you would scan a table of void*, and for every entry cast the key to fasttable_key_t* and check nslots for how much to skip to get to the next entry. Too complicated? Dag From d.s.seljebotn at astro.uio.no Sat Jun 30 13:01:07 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Sat, 30 Jun 2012 13:01:07 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: <4FEEDBAD.2000507@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> <4FD0808B.5080300@astro.uio.no> <4FD083F9.2030006@astro.uio.no> <4FD26ADA.5060401@astro.uio.no> <4FD2E313.6040208@astro.uio.no> <6c423841-b888-478d-8b89-148f3e9bd60e@email.android.com> <4FD45424.9040909@astro.uio.no> <4FD45E31.8060506@astro.uio.no> <4FD72199.7010803@astro.uio.no> <4FD77AAC.6080905@astro.uio.no> <4FEEDBAD.2000507@astro.uio.no> Message-ID: <4FEEDC73.4040900@astro.uio.no> On 06/30/2012 12:57 PM, Dag Sverre Seljebotn wrote: > My time is rather limited but I'm slowly trying to get another SEP 200 > in place. 
> > Something that hit me, when I tried to make up my mind about whether to > have (key, ptr) entries or (key, flags, ptr), is that the fast hash > table entries can actually be arbitrary size. So we could make the table > itself > > void *table[n] > > and then n would be a power of 2 (TBD: benchmark cost of allowing other > sizes). Since we have the d[i] displacements, it's no problem at all to > construct displacements to account for variable-size entries. > > Proposal: > > C-source for an un-initialized table (signature string is placeholder > and not up for discussion now): > > { "3:method:foo:i4i4->i4", (void*)EXCEPT_STAR_FLAG, &foo_method, > "2:numpy:SHAPE", &get_shape_method, > "2:fieldoffset:barfield", (void*)5, 0 /*padding to n=2^k*/ } > > I.e. all keys are prepended by the number of slots they use. So methods > get to use 3 sizeof(void*) slots since they need the flags, but entries > that don't need flags use only 2 slots. (In this case, "numpy:SHAPE" is > a protocol defined by NumPy and so doesn't need any flags; or the flags > are stored under "numpy:FLAGS" by that protocol.) > > Then, PyExtensibleType_Ready parses this and rearranges the table to a > perfect hash-table. As part of that, it parses the string literal keys > and interns them, so that the number of slots becomes available in a > more coder-friendly manner: > > typedef { > uint64_t hash; /* lower-64 bit of md5 */ > uint32_t strlen; /* we allow \0 in key */ > uint8_t nslots; /* set to 3 for first example */ > char *key; /* set to "method:foo:i4i4->i4" */ > } fasttable_key_t; > > Then, the interned keys for the table is the fasttable_key_t*. > > Storing the hash inside the key has two pros: > - Caching the md5 work (provided the interner uses a faster hash > function to go from string to Sorry. 
Storing the hash inside the key has two pros: - Caching the md5 work (provided the interner uses a faster hash function to go from string to key "object") - You don't always have to store both key and hash in global variables (see below), you can dereference the key for the hash if you want to. Dag > > Lookup would happen like this: > > typedef { > fasttable_key_t *key; > uintptr_t flags; > void *funcptr > } method_t; > > (method_t*)PyCustomSlots_Find(mykey, mykey->hash); > /* or, faster: */ > (method_t*)PyCustomSlots_Find(mykey, 0x45343453453fabaULL); > > If you want to scan the table linearly (to avoid having to bother with > getting an interned key), you would scan a table of void*, and for every > entry cast the key to fasttable_key_t* and check nslots for how much to > skip to get to the next entry. > > Too complicated? > > Dag From d.s.seljebotn at astro.uio.no Sat Jun 30 13:19:25 2012 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Sat, 30 Jun 2012 13:19:25 +0200 Subject: [Cython] Hash-based vtables In-Reply-To: <4FEEDC73.4040900@astro.uio.no> References: <4FCD100B.7000008@astro.uio.no> <4FCFCD49.9030802@astro.uio.no> <4FD0808B.5080300@astro.uio.no> <4FD083F9.2030006@astro.uio.no> <4FD26ADA.5060401@astro.uio.no> <4FD2E313.6040208@astro.uio.no> <6c423841-b888-478d-8b89-148f3e9bd60e@email.android.com> <4FD45424.9040909@astro.uio.no> <4FD45E31.8060506@astro.uio.no> <4FD72199.7010803@astro.uio.no> <4FD77AAC.6080905@astro.uio.no> <4FEEDBAD.2000507@astro.uio.no> <4FEEDC73.4040900@astro.uio.no> Message-ID: <4FEEE0BD.8080302@astro.uio.no> On 06/30/2012 01:01 PM, Dag Sverre Seljebotn wrote: > On 06/30/2012 12:57 PM, Dag Sverre Seljebotn wrote: >> My time is rather limited but I'm slowly trying to get another SEP 200 >> in place. >> >> Something that hit me, when I tried to make up my mind about whether to >> have (key, ptr) entries or (key, flags, ptr), is that the fast hash >> table entries can actually be arbitrary size. 
So we could make the table >> itself >> >> void *table[n] >> >> and then n would be a power of 2 (TBD: benchmark cost of allowing other >> sizes). Since we have the d[i] displacements, it's no problem at all to >> construct displacements to account for variable-size entries. >> >> Proposal: >> >> C-source for an un-initialized table (signature string is placeholder >> and not up for discussion now): >> >> { "3:method:foo:i4i4->i4", (void*)EXCEPT_STAR_FLAG, &foo_method, >> "2:numpy:SHAPE", &get_shape_method, >> "2:fieldoffset:barfield", (void*)5, 0 /*padding to n=2^k*/ } >> >> I.e. all keys are prepended by the number of slots they use. So methods >> get to use 3 sizeof(void*) slots since they need the flags, but entries >> that don't need flags use only 2 slots. (In this case, "numpy:SHAPE" is >> a protocol defined by NumPy and so doesn't need any flags; or the flags >> are stored under "numpy:FLAGS" by that protocol.) >> >> Then, PyExtensibleType_Ready parses this and rearranges the table to a >> perfect hash-table. As part of that, it parses the string literal keys >> and interns them, so that the number of slots becomes available in a >> more coder-friendly manner: >> >> typedef { >> uint64_t hash; /* lower-64 bit of md5 */ >> uint32_t strlen; /* we allow \0 in key */ >> uint8_t nslots; /* set to 3 for first example */ >> char *key; /* set to "method:foo:i4i4->i4" */ >> } fasttable_key_t; An idea that could be entertained is make the hash e.g. sha-256, and store the entire 256 bits here, so that the interning procedure didn't need to strcmp the entire key string on hash collisions. I imagine that interface definitions etc. could make for rather large string keys, and one would also want to use the sha-256 to point to other interfaces. OTOH, one needs to run through the entire key to construct the cheaper hash needed to go from char* to fasttable_key_t* anyway, so perhaps there's not much point in this, it's only a factor 2x or 3x. 
Dag >> >> Then, the interned keys for the table is the fasttable_key_t*. >> >> Storing the hash inside the key has two pros: >> - Caching the md5 work (provided the interner uses a faster hash >> function to go from string to > > Sorry. > > Storing the hash inside the key has two pros: > > - Caching the md5 work (provided the interner uses a faster hash > function to go from string to key "object") > > - You don't always have to store both key and hash in global variables > (see below), you can dereference the key for the hash if you want to. > > Dag > >> >> Lookup would happen like this: >> >> typedef { >> fasttable_key_t *key; >> uintptr_t flags; >> void *funcptr >> } method_t; >> >> (method_t*)PyCustomSlots_Find(mykey, mykey->hash); >> /* or, faster: */ >> (method_t*)PyCustomSlots_Find(mykey, 0x45343453453fabaULL); >> >> If you want to scan the table linearly (to avoid having to bother with >> getting an interned key), you would scan a table of void*, and for every >> entry cast the key to fasttable_key_t* and check nslots for how much to >> skip to get to the next entry. >> >> Too complicated? >> >> Dag > > _______________________________________________ > cython-devel mailing list > cython-devel at python.org > http://mail.python.org/mailman/listinfo/cython-devel