[Cython] Hash-based vtables

Sun Jun 10 09:00:44 CEST 2012

On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn
<d.s.seljebotn at astro.uio.no> wrote:
> On 06/09/2012 03:21 AM, Robert Bradshaw wrote:
>>
>> On Fri, Jun 8, 2012 at 2:12 PM, Dag Sverre Seljebotn
>>> There's still the indirection through SEP 200 (extensibletype slots). We
>>> can
>>> get rid of that very easily by just making that table and the hash-vtable
>>> one and the same. (It could still either have interned string keys or ID
>>> keys depending on the least significant bit.)
>>
>>
>> Or we can even forgo the interning for this table, and give up on
>> partitioning the space numerically and just use the dns-style
>> prefixing, e.g. "org.cython.X" belongs to us.
>
>
> Huh? Isn't that when you *need* interning? Do you plan on key-encoding those
> kind of strings into 64 bits?

No, use 64-bits of a a cryptographically-secure hash.

> (I think it would usually be "method:foo:ii->d" (or my current preference is
> "method:foo:i4i8->f8"))

Yeah, I was assuming methods wouldn't be specific to Cython. (Putting
sizes in the format makes a lot of sense for persistent storage, but I
think it's safe to assume that a"long in the provider == a long in the
consumer, and this would mean the hashes would have to be computed
after (some) C compilation).

> Partitioning the space numerically you'd just hash the number; "SEP 260: We
> use id 0x70040001, which has lower-64-md5 0xfa454a...ULL".

But why bother with the id?

>> There is value in the double indirection if this (or any of the other)
>> lookup tables are meant to be modified over time.
>
>
> This isn't impossible with a hash table either. You just need to reallocate
> a little more often than what would be the case with a regular hash table,
> but not dramatically so (you need to rehash whenever the element to insert
> hashes into a "large" bin, which are rather few).
>
> I want the table to have a pointer to it, so that you can atomically swap it
> out.

I think that's worth a level of indirection.

>>> To wrap up, I think this has grown in complexity beyond the "simple SEP
>>> spec". It's at the point where you don't really want to have several
>>> libraries implementing the same simple spec, but instead use the same
>>> implementation.
>>>
>>> But I think the advantages are simply too good to give up on.
>>>
>>> So I think a viable route forward is to forget the
>>> CEP/SEP/pre-PEP-approach
>>> for now (which only works for semi-complicated ideas with simple
>>> implementations) and instead simply work more directly on a library. It
>>> would need to have a couple of different use modes:
>>
>>
>> I prefer an enhancement proposal with a spec over a library, even if
>> only a single library gets used in practice. I still think it's simple
>> enough. Basically, we have the "lookup spec" and then a CEP for
>> applying this to fast callable (agreeing on signatures, and what to do
>> with extern types) and extensible type slots.
>
>
> OK.
>
>
>>
>>>  - A Python perfect-hasher for use when generating code, with only the a
>>> string interner based on CPython dicts and extensibletype metaclass as
>>> runtime dependencies (for use in Cython). This would only add some
>>> hundred
>>> source file lines...
>>>
>>>  - A C implementation of the perfect hashing exposed through a
>>> PyPerfectHashTable_Ready(), for use in libraries written in C like
>>> NumPy/SciPy). This would need to bundle the md5 algorithm and a C
>>> implementation of the perfect hashing.
>>>
>>> And on the distribution axis:
>>>
>>>  - Small C header-style implementation of a string interner and the
>>> extensibletype metaclass, rendezvousing through sys.modules
>>>
>>>  - As part of the rendezvous, one would always try to __import__ the
>>> *real*
>>> run-time library. So if it is available in sys.path it overrides anything
>>> bundled with other libraries. That would provide a way forward for
>>> GIL-less
>>> string interning, a Python-side library for working with these tables and
>>> inspecting them, etc.
>>
>>
>> Hmm, that's an interesting idea. I think we don't actually need
>> interning, in which case the "full" library is only needed for
>> introspection.
>
>
> You don't believe the security concern is real then? Or do you want to pay
> the cost for a 160-bit SHA1 compare everywhere?
>
> I'd love to not do interning, but I see no way around it.

No, I want to use the lower 64 bits by default, but always have the
top 96 bits around to allow using this mechanism in "secure" mode at a
slight penalty. md5 is out because there are known collisions. (Yes,
sha-1 may succumb sooner rather than later, theoretical weaknesses
have been shown, so we could look to using something else (hopefully
still shipped with Python).

> BTW, a GIL-less interning library isn't rocket science. I ran khash.h
> through a preprocessor with
>
> KHASH_MAP_INIT_STR(str_to_entry, entry_t)
>
> and the result is 180 lines of code for the hash table. Then pythread.h
> provides the thread lock, another 50 lines for the interning logic
> (intern_literal, intern_heap_allocated, release_interned).
>
> It just seems a little redundant to ship such a thing in every
> Cython-generated file since we hold the GIL during module loading anyway.

It's the rendezvous on the global state that's more messy then the
locking (though we do already require that for the metaclass approach
of detecting types with extended slots).

- Robert