Mailman 3 Adding a built-in data structure with binary search tree semantics - Python-ideas

15 Mar 2020

      In other languages, there are built-in data structures with binary search tree semantics, as an example, std::map in C++.

By binary search tree semantics we mean the following:

* We can insert elements into the structure in time faster than O(n), i.e. O(logn) for std::map
* We can iterate over all of the elements of the structure in sorted order in linear time, i.e. O(n) for std::map

Is there currently any data structure in Python with binary search tree semantics?

To illustrate the need, can the following program be written elegantly and efficiently with built-in Python data structures?

Imagine that you are a fund manager at a bank. Bank accounts are added to your fund very frequently. A bank account has an ID and a value. A smaller ID means that the bank account was created earlier than one with a larger ID. Each time that an account is added to your fund, you would like to know the ID number of the 1000th oldest account, for your records.

Here are implementations in Python 3 and C++, for comparison:

```python3
from random import randint

dist = lambda: randint(0, 2147483647)

def main():
    my_map = {}

    # fills a map with 1000000 random mappings of type (int, int)
    for i in range(0, 1000000):
        my_map[dist()] = dist()

    # prints out the 1000th smallest key and its value
    target_key = nth_smallest_key(1000, my_map)
    print("({}: {})".format(target_key, my_map[target_key]))

    # writes a new random mapping to the map
    # then prints out the 1000th smallest key and its value if that key
    # has changed
    # 100000 times
    for i in range(100000):
        my_map[dist()] = dist()

        test_key = nth_smallest_key(1000, my_map)
        if target_key != test_key:
            target_key = test_key
            print("({}: {})".format(target_key, my_map[target_key]))

        # print an indicator every 100 iterations for comparison
        if i % 100 == 0:
            print("iteration: {}".format(i))

def nth_smallest_key(n, m):
    return sorted(m.keys())[n]

if __name__ == "__main__":
    main()
```

```c++
#include <cstdio>
#include <map>
#include <random>
using namespace std;

int nth_smallest_key(int n, map& m);

int main() {
    random_device rd;
    mt19937 psrng(rd());
    uniform_int_distribution<> dist(0, 2147483647);

    map my_map;

    // fills a map with 1000000 random mappings of type (<int>: <int>)
    for (int i = 0; i < 1000000; i++) {
        my_map[dist(psrng)] = dist(psrng);
    }

    // prints out the 1000th smallest key and its value
    int target_key = nth_smallest_key(1000, my_map);
    printf("(%d: %d)\n", target_key, my_map[target_key]);

    // writes a new random mapping to the map
    // then prints out the 1000th smallest key and its value if that key
    // has changed
    // 100000 times
    for (int i = 0; i <= 100000; i++) {
        my_map[dist(psrng)] = dist(psrng);

        int test_key = nth_smallest_key(1000, my_map);
        if (target_key != test_key) {
            target_key = test_key;
            printf("(%d: %d)\n", target_key, my_map[target_key]);
        }
    }

    return 0;
}

int nth_smallest_key(int n, map& m) {
    map::iterator curr = m.begin();
    for (int i = 0; i < n; i++) {
        curr = next(curr);
    }
    return curr->first;
}
```

Makefile:
```make
runcpp: buildcpp
        ./main

buildcpp:
        g++ -o main bst_semantics_cpp.cpp

runpython:
        python3 bst_semantics_python.py
```

The C++ program runs on my machine in approximately 5 seconds
```bash
$ time ./main
(2211193: 2021141747)
(2208771: 1079444337)
(2208700: 1187119077)
(2208378: 1447743503)
...
(1996019: 1378401665)
(1989217: 1457497754)
(1989042: 1336229409)

real    0m4.915s
user    0m4.750s
sys     0m0.094s
$ 
```

The Python program does not reach 1% of completion after 120 seconds
```bash
$ time make runpython
python3 bst_semantics_python.py
(2158070: 1498305903)
iteration: 0
iteration: 100
iteration: 200
^CTraceback (most recent call last):
  File "bst_semantics_python.py", line 36, in <module>
    main()
  File "bst_semantics_python.py", line 23, in main
    test_key = nth_smallest_key(1000, my_map)
  File "bst_semantics_python.py", line 33, in nth_smallest_key
    return sorted(m.keys())[n]
KeyboardInterrupt
Makefile:8: recipe for target 'runpython' failed
make: *** [runpython] Error 1

real    2m2.040s
user    1m59.063s
sys     0m0.375s
$ 
```

I would like to propose adding such a data structure to the Python standard library. It could be called a SortedDict.

Adding a built-in data structure with binary search tree semantics

1njection1＠gmail.com

Dan Sommers

Christopher Barker

Alex Hall

Christopher Barker

Chris Angelico

Christopher Barker

Soni L.

Marco Sulla

Andrew Barnert

Ben Rudiak-Gould

Andrew Barnert

Ben Rudiak-Gould

Christopher Barker

Andrew Barnert

Christopher Barker

Andrew Barnert

Christopher Barker

Marco Sulla

Andrew Barnert

Steven D'Aprano

Marco Sulla

Barry

Andrew Barnert

tags

participants (11)