[Python-Dev] Google Summer of Code proposal: New class for work with binary trees AVL and RB as with the standard dictionary.

Vladimir 'Yu' Stepanov vys at renet.ru
Mon Apr 24 11:51:58 CEST 2006


I would like to participate in Google Summer of Code. The idea consists in
creation of a specialized class for work with binary trees AVL and RB. The
part of ideas is taken from Parrot (Perl6) where for pair values the
specialized type is stipulated. As advantages it is possible to note:
  * High speed. On simple types at quantity of elements up to 10000 speed of
    work concedes to the standard dictionary of all in 2.7 times.
  * Safety of work in a multithread operating mode.
  * The method for frosts of object is stipulated.
  * The data storage is carried out in the sorted kind.
  * A high overall performance `item' and `iteritem' methods as it is not
    necessary to create new objects.
  * At lots of objects in a tree memory is used much less intensively.
  * Absence of collisions. As consequence, it is impossible to generate bad
    data for creation DoS of the attacks influencing the dictionary of
    transferred arguments.
  * For objects existence of a method __ hash __ is not necessary.
  * The same kind of the dictionary by means of the overloaded operations
    probably change of its behaviour so that it supported the 
multivariational
    representation based on stratification given and subsequent recoils.

Additional requirements to key objects:
  * The opportunity of comparison of key objects function cmp is necessary.

There are the reasons, braking development of this project:
  * Compulsion of heading of a module `gc' for each compound object (this
    structure is unessential to some objects, but updating `gc' the module
    is required).
  * Absence of standard object the pair, probably depends on the 
previous item.

Lacks of a binary tree:
  * Average time of search for hash - O (1), and for trees - O (log2 N).
  * A lot of memory under a single element of a tree:
    (3*sizeof (void *) + sizeof (int))*2,
    one element is used rather - the pair, the second site of memory is
    allocated under node of a tree.

In protection of object "pair":
  * The logic of methods of the given object noticeably is easier than 
tuple,
    that as a result can affect speed of work of the program in the best 
party.
  * Alignment on 8 bytes has no big sense at present architecture where in
    cache sample a minimum on 64 bytes is made. Use of type "pair" will give
    an appreciable prize at use of alignment in 4 bytes. Otherwise on 64-bit
    platforms it is much more favourable to use tuple object.

The given project can demand processing of the module `gcmodule.c' and
`tupleobject.c'. It is necessary to reduce the size of static objects, 
for this
purpose the opportunity is necessary is transparent to pass objects not 
having
direct support from the module `gcmodule.c'.

Also it will be partially necessary to process the module `obmalloc.c' 
for more
effective distribution of memory.

I shall be glad to answer questions on this theme.



More information about the Python-Dev mailing list