[Cython] Hash-based vtables
Dag Sverre Seljebotn
d.s.seljebotn at astro.uio.no
Sat Jun 9 08:00:50 CEST 2012
On 06/09/2012 07:45 AM, Dag Sverre Seljebotn wrote:
> On 06/09/2012 03:21 AM, Robert Bradshaw wrote:
>> On Fri, Jun 8, 2012 at 2:12 PM, Dag Sverre Seljebotn
>>> There's still the indirection through SEP 200 (extensibletype slots).
>>> We can
>>> get rid of that very easily by just making that table and the
>>> one and the same. (It could still either have interned string keys or ID
>>> keys depending on the least significant bit.)
>> Or we can even forgo the interning for this table, and give up on
>> partitioning the space numerically and just use the dns-style
>> prefixing, e.g. "org.cython.X" belongs to us.
> Huh? Isn't that when you *need* interning? Do you plan on key-encoding
> those kind of strings into 64 bits?
> (I think it would usually be "method:foo:ii->d" (or my current
> preference is "method:foo:i4i8->f8"))
Well, I guess something like "org.cython.X" would happen often as well,
in addition. Just put it all in the same table :-)
> Partitioning the space numerically you'd just hash the number; "SEP 260:
> We use id 0x70040001, which has lower-64-md5 0xfa454a...ULL".
The real use-case I see for this now is in having the PyArray_DATA etc.
access pointers simply through compile-time constants the library can
define on both ends. It could just do
PyCustomSlots_Lookup(obj->ob_type, 0x70040001, 0xfa45323...ULL)
specifically to get a function retrieving the data-pointer.
PyArray_SHAPE would do
PyCustomSlots_Lookup(obj->ob_type, 0x70040002, 0xbbad423...ULL)
Also, I'd want PyExtensibleType_Object to have:
i.e. we allow for getting quickly to a table on the object in addition
to the one on the type.
Callbacks look up the one on the object first (before potentially
checking for __call__ in the type); method-calling might ignore the one
on the object.
>> There is value in the double indirection if this (or any of the other)
>> lookup tables are meant to be modified over time.
> This isn't impossible with a hash table either. You just need to
> reallocate a little more often than what would be the case with a
> regular hash table, but not dramatically so (you need to rehash whenever
> the element to insert hashes into a "large" bin, which are rather few).
> I want the table to have a pointer to it, so that you can atomically
> swap it out.
>>> To wrap up, I think this has grown in complexity beyond the "simple SEP
>>> spec". It's at the point where you don't really want to have several
>>> libraries implementing the same simple spec, but instead use the same
>>> But I think the advantages are simply too good to give up on.
>>> So I think a viable route forward is to forget the
>>> for now (which only works for semi-complicated ideas with simple
>>> implementations) and instead simply work more directly on a library. It
>>> would need to have a couple of different use modes:
>> I prefer an enhancement proposal with a spec over a library, even if
>> only a single library gets used in practice. I still think it's simple
>> enough. Basically, we have the "lookup spec" and then a CEP for
>> applying this to fast callable (agreeing on signatures, and what to do
>> with extern types) and extensible type slots.
>>> - A Python perfect-hasher for use when generating code, with only the a
>>> string interner based on CPython dicts and extensibletype metaclass as
>>> runtime dependencies (for use in Cython). This would only add some
>>> source file lines...
>>> - A C implementation of the perfect hashing exposed through a
>>> PyPerfectHashTable_Ready(), for use in libraries written in C like
>>> NumPy/SciPy). This would need to bundle the md5 algorithm and a C
>>> implementation of the perfect hashing.
>>> And on the distribution axis:
>>> - Small C header-style implementation of a string interner and the
>>> extensibletype metaclass, rendezvousing through sys.modules
>>> - As part of the rendezvous, one would always try to __import__ the
>>> run-time library. So if it is available in sys.path it overrides
>>> bundled with other libraries. That would provide a way forward for
>>> string interning, a Python-side library for working with these tables
>>> inspecting them, etc.
>> Hmm, that's an interesting idea. I think we don't actually need
>> interning, in which case the "full" library is only needed for
> You don't believe the security concern is real then? Or do you want to
> pay the cost for a 160-bit SHA1 compare everywhere?
> I'd love to not do interning, but I see no way around it.
> BTW, a GIL-less interning library isn't rocket science. I ran khash.h
> through a preprocessor with
> KHASH_MAP_INIT_STR(str_to_entry, entry_t)
> and the result is 180 lines of code for the hash table. Then pythread.h
> provides the thread lock, another 50 lines for the interning logic
> (intern_literal, intern_heap_allocated, release_interned).
> It just seems a little redundant to ship such a thing in every
> Cython-generated file since we hold the GIL during module loading anyway.
> cython-devel mailing list
> cython-devel at python.org
More information about the cython-devel