[Cython] Hash-based vtables

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Sun Jun 10 09:14:43 CEST 2012



Robert Bradshaw <robertwb at gmail.com> wrote:

>On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn
><d.s.seljebotn at astro.uio.no> wrote:
>> On 06/09/2012 03:21 AM, Robert Bradshaw wrote:
>>>
>>> On Fri, Jun 8, 2012 at 2:12 PM, Dag Sverre Seljebotn
>>>> There's still the indirection through SEP 200 (extensibletype
>slots). We
>>>> can
>>>> get rid of that very easily by just making that table and the
>hash-vtable
>>>> one and the same. (It could still either have interned string keys
>or ID
>>>> keys depending on the least significant bit.)
>>>
>>>
>>> Or we can even forgo the interning for this table, and give up on
>>> partitioning the space numerically and just use the dns-style
>>> prefixing, e.g. "org.cython.X" belongs to us.
>>
>>
>> Huh? Isn't that when you *need* interning? Do you plan on
>key-encoding those
>> kind of strings into 64 bits?
>
>No, use 64-bits of a a cryptographically-secure hash.
>
>> (I think it would usually be "method:foo:ii->d" (or my current
>preference is
>> "method:foo:i4i8->f8"))
>
>Yeah, I was assuming methods wouldn't be specific to Cython. (Putting
>sizes in the format makes a lot of sense for persistent storage, but I
>think it's safe to assume that a"long in the provider == a long in the
>consumer, and this would mean the hashes would have to be computed
>after (some) C compilation).
>
>> Partitioning the space numerically you'd just hash the number; "SEP
>260: We
>> use id 0x70040001, which has lower-64-md5 0xfa454a...ULL".
>
>But why bother with the id?
>
>>> There is value in the double indirection if this (or any of the
>other)
>>> lookup tables are meant to be modified over time.
>>
>>
>> This isn't impossible with a hash table either. You just need to
>reallocate
>> a little more often than what would be the case with a regular hash
>table,
>> but not dramatically so (you need to rehash whenever the element to
>insert
>> hashes into a "large" bin, which are rather few).
>>
>> I want the table to have a pointer to it, so that you can atomically
>swap it
>> out.
>
>I think that's worth a level of indirection.
>
>>>> To wrap up, I think this has grown in complexity beyond the "simple
>SEP
>>>> spec". It's at the point where you don't really want to have
>several
>>>> libraries implementing the same simple spec, but instead use the
>same
>>>> implementation.
>>>>
>>>> But I think the advantages are simply too good to give up on.
>>>>
>>>> So I think a viable route forward is to forget the
>>>> CEP/SEP/pre-PEP-approach
>>>> for now (which only works for semi-complicated ideas with simple
>>>> implementations) and instead simply work more directly on a
>library. It
>>>> would need to have a couple of different use modes:
>>>
>>>
>>> I prefer an enhancement proposal with a spec over a library, even if
>>> only a single library gets used in practice. I still think it's
>simple
>>> enough. Basically, we have the "lookup spec" and then a CEP for
>>> applying this to fast callable (agreeing on signatures, and what to
>do
>>> with extern types) and extensible type slots.
>>
>>
>> OK.
>>
>>
>>>
>>>>  - A Python perfect-hasher for use when generating code, with only
>the a
>>>> string interner based on CPython dicts and extensibletype metaclass
>as
>>>> runtime dependencies (for use in Cython). This would only add some
>>>> hundred
>>>> source file lines...
>>>>
>>>>  - A C implementation of the perfect hashing exposed through a
>>>> PyPerfectHashTable_Ready(), for use in libraries written in C like
>>>> NumPy/SciPy). This would need to bundle the md5 algorithm and a C
>>>> implementation of the perfect hashing.
>>>>
>>>> And on the distribution axis:
>>>>
>>>>  - Small C header-style implementation of a string interner and the
>>>> extensibletype metaclass, rendezvousing through sys.modules
>>>>
>>>>  - As part of the rendezvous, one would always try to __import__
>the
>>>> *real*
>>>> run-time library. So if it is available in sys.path it overrides
>anything
>>>> bundled with other libraries. That would provide a way forward for
>>>> GIL-less
>>>> string interning, a Python-side library for working with these
>tables and
>>>> inspecting them, etc.
>>>
>>>
>>> Hmm, that's an interesting idea. I think we don't actually need
>>> interning, in which case the "full" library is only needed for
>>> introspection.
>>
>>
>> You don't believe the security concern is real then? Or do you want
>to pay
>> the cost for a 160-bit SHA1 compare everywhere?
>>
>> I'd love to not do interning, but I see no way around it.
>
>No, I want to use the lower 64 bits by default, but always have the
>top 96 bits around to allow using this mechanism in "secure" mode at a
>slight penalty. md5 is out because there are known collisions. (Yes,
>sha-1 may succumb sooner rather than later, theoretical weaknesses
>have been shown, so we could look to using something else (hopefully
>still shipped with Python).

But very few users are going to know about this. What's the odds that the user who decide to trigger JIT-compilation with function signatures that varies based on the input will know about the option and turn it on and also recompile all his/her C extension modules?

In practice, such an option would always stay at its default value. If we leave it to secure by default and start teaching it to users from the start...but that's a big price to pay.

And if you *do* want to run in secure mode, it will be a lot slower than interning.

Dag

>
>> BTW, a GIL-less interning library isn't rocket science. I ran khash.h
>> through a preprocessor with
>>
>> KHASH_MAP_INIT_STR(str_to_entry, entry_t)
>>
>> and the result is 180 lines of code for the hash table. Then
>pythread.h
>> provides the thread lock, another 50 lines for the interning logic
>> (intern_literal, intern_heap_allocated, release_interned).
>>
>> It just seems a little redundant to ship such a thing in every
>> Cython-generated file since we hold the GIL during module loading
>anyway.
>
>It's the rendezvous on the global state that's more messy then the
>locking (though we do already require that for the metaclass approach
>of detecting types with extended slots).
>
>- Robert
>_______________________________________________
>cython-devel mailing list
>cython-devel at python.org
>http://mail.python.org/mailman/listinfo/cython-devel

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.


More information about the cython-devel mailing list