Mailman 3 Multi-index Containers Library - Python-ideas

Multi-index Containers Library

SF Markus Elfring

Dec. 1, 2014

12:13 p.m.

Hello, I find a data structure like it is provided by the "Boost Multi-index Containers Library" interesting for efficient data processing. http://www.boost.org/doc/libs/1_57_0/libs/multi_index/doc/index.html How are the chances to integrate a class library with similar functionality into the Python software infrastructure? https://bugs.python.org/issue22957 Regards, Markus

Show replies by date

David Wilson

December 2014

2:24 p.m.

On Mon, Dec 01, 2014 at 01:13:41PM +0100, SF Markus Elfring wrote:

...

Hello,

I find a data structure like it is provided by the "Boost Multi-index Containers Library" interesting for efficient data processing. http://www.boost.org/doc/libs/1_57_0/libs/multi_index/doc/index.html

This came up on #python just last week, and I've personally encountered a desire for something similar a bunch of times. Probably, though, the ideal interface to something like this would need some time to evolve before it would be useful to include in the stdlib, along with some motivating use cases -- at least, it would be nice if there were one or two third party implementations to compare before settling on an interface. I can't think of any existing implementations that operate on purely in-memory structures. Boost::multiindex is pretty awesome, but its interface is also somewhat complicated, and a 1:1 translation wouldn't make for a pretty Python experience. I thought about this a little bit last week, initially starting with a "MultiSet" type interface, where the "default" interface for the container emulated a regular set object. But that would prevent having sets of things unless they were hashable (instead of, say, merely having properties that were orderable). In applications where I'd like to use a structure like this, often the element type is mutable. My second idea was for something like an "IndexedList" type, where the default interface mimicked the built-in list. This would allow generic use of the container with code that wasn't expecting it, and avoid the hashability issue from before. Perhaps something like: lst = IndexedList() lst.add_unique_index('id', lambda o: o['id']) lst.add_index('age', lambda o: o['age']) lst.append({'id': 1, 'age': 50}) lst.append({'id': 1, 'age': 50}) # KeyError lst.append({'id': 2, 'age': 51}) assert lst.indices['id'][1]['age'] == 50 slice = lst.indices['age'][40:50] # No idea what would be returned here, some kind of view perhaps. assert len(slice) == 2 Following on from the blist thread a few months back, this is another example of a container that would be easy to construct if Python had some kind of built-in ordered mapping type. David

...

How are the chances to integrate a class library with similar functionality into the Python software infrastructure? https://bugs.python.org/issue22957

Regards, Markus _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

David Wilson

3:03 p.m.

...

before it would be useful to include in the stdlib, along with some motivating use cases

One example would be a continuous auction, like a stock exchange order book. In that case, efficient enumeration is desirable by all of account ID, order ID, or (price, time). book = IndexedList() book.add_unique_index('order_id', lambda o: o.id) book.add_index('account_id', lambda o: o.account_id) book.add_index('price_time', lambda o: (o.price, o.time)) def add_order(order): book.append(order) def del_order(order): del book.indices['order_id'][order.id] def close_account(account_id): del book.indices['account_id'][account_id] def iter_matches(limit_price): for (price, time), order in book.indices['price_time']: if price > limit_price: break yield order Again it's not very clear how the magical 'indices' object slices would work. David

Chris Angelico

3:13 p.m.

On Tue, Dec 2, 2014 at 2:03 AM, David Wilson <dw+python-ideas@hmmz.org> wrote:

...

...
before it would be useful to include in the stdlib, along with some motivating use cases

One example would be a continuous auction, like a stock exchange order book. In that case, efficient enumeration is desirable by all of account ID, order ID, or (price, time).

For small numbers of entries, it'd be simpler to just sort and filter on demand; for large numbers of entries, you should probably be using a database, which will have these sorts of facilities. Is there a mid-range where it's better to keep it all in memory, but it's too slow to sort on demand? ChrisA

Jeff Hardy

9:59 a.m.

On Mon, Dec 1, 2014 at 3:13 PM, Chris Angelico <rosuav@gmail.com> wrote:

...

On Tue, Dec 2, 2014 at 2:03 AM, David Wilson <dw+python-ideas@hmmz.org> wrote:

...
...
before it would be useful to include in the stdlib, along with some motivating use cases

One example would be a continuous auction, like a stock exchange order book. In that case, efficient enumeration is desirable by all of account ID, order ID, or (price, time).

For small numbers of entries, it'd be simpler to just sort and filter on demand; for large numbers of entries, you should probably be using a database, which will have these sorts of facilities. Is there a mid-range where it's better to keep it all in memory, but it's too slow to sort on demand?

My current work project keeps ~8 GB of data in RAM (and looking at 64-128GB servers to get us through the next 3 years). Sorting on demand would be way too slow but it doesn't need to be in a database either - it can be reconstructed from an event stream, and running a DB server is extra ops. Using an in-memory DB like SQLite is an unnecessary extra layer. Currently it just has maps as needed to speed up queries, and I don't think it fits the use case for an MI container, but other projects might. With the amount of RAM in modern machines "keep it all in memory" is viable for lots of use cases with otherwise large values of N. - Jeff

SF Markus Elfring

1:10 p.m.

...

Sorting on demand would be way too slow but it doesn't need to be in a database either - it can be reconstructed from an event stream, and running a DB server is extra ops.

I am looking also for ways to omit persistent storage for a specific data processing task. How often will simple index management be sufficient for some software applications?

...

Using an in-memory DB like SQLite is an unnecessary extra layer.

How many in-memory data bases can be reused by the Python software infrastructure? Would you like to point any more class libraries out? Regards, Markus

David Townshend

12:18 p.m.

FWIW, a couple of years ago I wrote an implementation of a multi-index container with an ORM-ish interface ( http://norman.readthedocs.org/en/latest/), more for my own amusement than any practical purposes. However, I found that in my work the applications were actually rather limited and generally better served by sqlalchemy + appropriate db, or pandas (depending on use case). I haven't touched the code in quite a while and there may be some bugs that I never discovered, but it might still be of some use. David On Wed, Dec 3, 2014 at 3:10 PM, SF Markus Elfring < elfring@users.sourceforge.net> wrote:

...

...
Sorting on demand would be way too slow but it doesn't need to be in a database either - it can be reconstructed from an event stream, and running a DB server is extra ops.

I am looking also for ways to omit persistent storage for a specific data processing task. How often will simple index management be sufficient for some software applications?

...
Using an in-memory DB like SQLite is an unnecessary extra layer.

How many in-memory data bases can be reused by the Python software infrastructure? Would you like to point any more class libraries out?

Regards, Markus _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

Todd

3:49 p.m.

On Dec 1, 2014 4:03 PM, "David Wilson" <dw+python-ideas@hmmz.org> wrote:

...

...
before it would be useful to include in the stdlib, along with some motivating use cases

One example would be a continuous auction, like a stock exchange order book. In that case, efficient enumeration is desirable by all of account ID, order ID, or (price, time).

book = IndexedList() book.add_unique_index('order_id', lambda o: o.id) book.add_index('account_id', lambda o: o.account_id) book.add_index('price_time', lambda o: (o.price, o.time))

def add_order(order): book.append(order)

def del_order(order): del book.indices['order_id'][order.id]

def close_account(account_id): del book.indices['account_id'][account_id]

def iter_matches(limit_price): for (price, time), order in book.indices['price_time']: if price > limit_price: break yield order

Again it's not very clear how the magical 'indices' object slices would work.

You can also use pandas for this.

SF Markus Elfring

3:12 p.m.

...

...
http://www.boost.org/doc/libs/1_57_0/libs/multi_index/doc/index.html

This came up on #python just last week, and I've personally encountered a desire for something similar a bunch of times.

How challenging is it to map powerful Boost classes to the Python programming language?

...

Probably, though, the ideal interface to something like this would need some time to evolve before it would be useful to include in the stdlib,

I am curious on how this could happen ...

...

along with some motivating use cases

Would you like to look at any more examples I could show?

...

-- at least, it would be nice if there were one or two third party implementations to compare before settling on an interface. I can't think of any existing implementations that operate on purely in-memory structures.

Thanks for your feedback.

...

Boost::multiindex is pretty awesome, but its interface is also somewhat complicated, and a 1:1 translation wouldn't make for a pretty Python experience.

Will this story become another promising software development adventure? Regards, Markus

Grant Jenks

8:14 p.m.

As the author of the sortedcontainers project (similar to blist, but pure-Python and often faster), I thought this problem would be an interesting application of the library so here goes. I'll define an IndexedList as a container of (pk, item) pairs with at least one unique index mapping pk -> pair like so: from sortedcontainers import SortedListWithKey, SortedDict from operator import itemgetter class IndexedList(object): def __init__(self, *args, **kwargs): self.pk = 0 self.indices = {} self.uniques = {'_pk': SortedDict()} self.keys = {'_pk': itemgetter(0)} Different types of indices are used based on the unique requirement. In general, I'd use a list since an index might contain duplicates but with the unique constraint, I can use a dict: def add_unique_index(self, name, key): key = lambda pair: key(pair[1]) self.uniques[name] = SortedDict(zip(map(key, self), self)) self.keys[name] = key When we have to use a list, we'll store (key, pair) tuples and keep them sorted by the key. def add_index(self, name, key): key = lambda pair: key(pair[1]) items = zip(map(key, self), self) self.indices[name] = SortedListWithKey(items, key=itemgetter(0)) When we iterate, we'll return the pairs: def __iter__(self): return self.uniques['_pk'].values() The append operation means creating a new pair and updating the indices: def append(self, item): self.pk += 1 pair = (self.pk, item) for name in self.uniques: self.uniques[name][self.keys[name](pair)] = pair for name in self.indices: self.indices[name].add(pair) If we want to remove a unique key, we first find the item and then update the indices: def remove_unique(self, name, key): item = self.uniques[name][key] for name in self.uniques: del self.uniques[name][self.keys[name](item)] for name in self.indices: self.indices[name].remove(item) Removing a key from a general index is a little more difficult because there may be many matching items. Because we stored tuples in the index, we can bisect to find the range of items with matching keys and then remove them: def remove_index(self, name, key): start = self.indices[name].bisect_left((key,)) end = self.indices[name].bisect_right((key,)) items = map(itemgetter(1), self.indices[name][start:end]) for item in items: self.remove_unique('_pk', item[0]) Finally, David's original api needs only a bit of modification, everything else remains the same: def del_order(order): book.remove_unique('order_id', order.id) def close_account(account_id): book.remove_index('account_id', account_id) Here's the whole recipe on pastebin*: http://pastebin.com/8S71W3Ue And a link to the SortedContainers project: http://www.grantjenks.com/docs/sortedcontainers/ Grant * Beware, I haven't benchmarked or tested this code. On Mon, Dec 1, 2014 at 7:12 AM, SF Markus Elfring < elfring@users.sourceforge.net> wrote:

...

...
...
http://www.boost.org/doc/libs/1_57_0/libs/multi_index/doc/index.html

This came up on #python just last week, and I've personally encountered a desire for something similar a bunch of times.

How challenging is it to map powerful Boost classes to the Python programming language?

...
Probably, though, the ideal interface to something like this would need some time to evolve before it would be useful to include in the stdlib,

I am curious on how this could happen ...

...
along with some motivating use cases

Would you like to look at any more examples I could show?

...
-- at least, it would be nice if there were one or two third party implementations to compare before settling on an interface. I can't think of any existing implementations that operate on purely in-memory structures.

Thanks for your feedback.

...
Boost::multiindex is pretty awesome, but its interface is also somewhat complicated, and a 1:1 translation wouldn't make for a pretty Python experience.

Will this story become another promising software development adventure?

Regards, Markus _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

Wes Turner

3:39 p.m.

Is this what pandas MultiIndex does (in Cython)? * http://pandas.pydata.org/pandas-docs/dev/api.html#reshaping-sorting * http://toolz.readthedocs.org/en/latest/api.html On Mon, Dec 1, 2014 at 8:24 AM, David Wilson <dw+python-ideas@hmmz.org> wrote:

...

On Mon, Dec 01, 2014 at 01:13:41PM +0100, SF Markus Elfring wrote:

...
Hello,

I find a data structure like it is provided by the "Boost Multi-index Containers Library" interesting for efficient data processing. http://www.boost.org/doc/libs/1_57_0/libs/multi_index/doc/index.html

This came up on #python just last week, and I've personally encountered a desire for something similar a bunch of times. Probably, though, the ideal interface to something like this would need some time to evolve before it would be useful to include in the stdlib, along with some motivating use cases -- at least, it would be nice if there were one or two third party implementations to compare before settling on an interface. I can't think of any existing implementations that operate on purely in-memory structures.

Boost::multiindex is pretty awesome, but its interface is also somewhat complicated, and a 1:1 translation wouldn't make for a pretty Python experience.

I thought about this a little bit last week, initially starting with a "MultiSet" type interface, where the "default" interface for the container emulated a regular set object. But that would prevent having sets of things unless they were hashable (instead of, say, merely having properties that were orderable). In applications where I'd like to use a structure like this, often the element type is mutable.

My second idea was for something like an "IndexedList" type, where the default interface mimicked the built-in list. This would allow generic use of the container with code that wasn't expecting it, and avoid the hashability issue from before.

Perhaps something like:

lst = IndexedList() lst.add_unique_index('id', lambda o: o['id']) lst.add_index('age', lambda o: o['age'])

lst.append({'id': 1, 'age': 50}) lst.append({'id': 1, 'age': 50}) # KeyError

lst.append({'id': 2, 'age': 51})

assert lst.indices['id'][1]['age'] == 50

slice = lst.indices['age'][40:50] # No idea what would be returned here, some kind of view perhaps. assert len(slice) == 2

Following on from the blist thread a few months back, this is another example of a container that would be easy to construct if Python had some kind of built-in ordered mapping type.

David

...
How are the chances to integrate a class library with similar

functionality into the Python software infrastructure?

...
https://bugs.python.org/issue22957

Regards, Markus _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

Guido van Rossum

3:40 p.m.

I don't know the first thing about Boost, but this sounds like a database. Perhaps you can use sqlite? On Mon, Dec 1, 2014 at 4:13 AM, SF Markus Elfring < elfring@users.sourceforge.net> wrote:

...

Hello,

I find a data structure like it is provided by the "Boost Multi-index Containers Library" interesting for efficient data processing. http://www.boost.org/doc/libs/1_57_0/libs/multi_index/doc/index.html

How are the chances to integrate a class library with similar functionality into the Python software infrastructure? https://bugs.python.org/issue22957

Regards, Markus _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

-- --Guido van Rossum (python.org/~guido)

SF Markus Elfring

4:50 p.m.

...

I don't know the first thing about Boost, but this sounds like a database.

The summary mentions it also. http://www.boost.org/doc/libs/1_57_0/libs/multi_index/doc/index.html "... The concept of multi-indexing over the same collection of elements is borrowed from relational database terminology and allows for the specification of complex data structures in the spirit of multiply indexed relational tables where simple sets and maps are not enough. ..."

...

Perhaps you can use sqlite?

I am reusing this software already as you can see here: Determination for the number of named function parameters (with SmPL) https://lkml.org/lkml/2014/12/1/184 Regards, Markus

3723

Age (days ago)

3730

Last active (days ago)

List overview

Download

12 comments

9 participants

participants (9)

Chris Angelico
David Townshend
David Wilson
Grant Jenks
Guido van Rossum
Jeff Hardy
SF Markus Elfring
Todd
Wes Turner

Multi-index Containers Library

SF Markus Elfring

SF Markus Elfring

David Townshend

SF Markus Elfring

SF Markus Elfring

tags

participants (9)