[Python-ideas] Tree as a data structure (Was: Graph class)

Sun Dec 23 17:21:43 CET 2012

On Wed, Dec 19, 2012 at 7:38 PM, Jim Jewett <jimjjewett at gmail.com> wrote:

> On 12/19/12, anatoly techtonik <techtonik at gmail.com> wrote:
> > On Sun, Dec 16, 2012 at 6:41 PM, Guido van Rossum <guido at python.org>
> wrote:
>
> >> I think of graphs and trees as patterns, not data structures.
>
> > In my world strings, ints and lists are 1D data types, and tree can be a
> > very important 2D data structure.
>
> Yes; the catch is that the details of that data structure will differ
> depending on the problem.  Most problems do not need the fancy
> algorithms -- or the extra overhead that supports them.  Since a
> simple tree (or graph) is easy to write, and the fiddly details are
> often -- but not always -- wasted overhead, it doesn't make sense to
> designate a single physical structure as "the" tree (or graph)
> representation.  So it stays a pattern, rather than a concrete data
> structure.

Right. Creating a tree structure is not the problem. The problem arise when
you have to study the code or work collaboratively with other developers.
It takes time to see an ordinary namedtuple in the magic of some custom
made tuple subclass. But you can easily add a comment that it is a
reimplementation of namedtuple and the code immediately becomes clear.

With trees it is impossible to add such a comment, because there is no
known reference tree type you can refer to.

Making a sum out this to go from patters vs structure. Patterns and data
structures are interconnected. The absence of tree definition makes it
really hard to communicate about the usage, potential and outcomes or
particular approach between developers. What data structure or pattern do
we need for <this certain case> - a tree, but which tree exactly and why?

> > Speaking of tree as a data structure, I assume that it has a very basic
> > definition:
>
> > 1. tree consists of nodes
> > 2. some nodes are containers for other nodes
>
> Are the leaves a different type, or just nodes that happen to have
> zero children at the moment?

For the 'reference tree' I'd choose the most common trees human beings work
daily, can see and as a result - easily imagine.
1. leaves can not mutate into containers
2. container property structure is different from leaves structure, but may
share elements

Spoiler: This is a pattern or data structure of filesystem tree.

I'd call a tree, which leaves can mutate into containers, a 'mutatable
tree', and the one, where leaves are containers with 0 elements, a 'uniform
tree' data structure name. A 'flexible tree` could be the better name, but
it is too generic to draw a clear association to the behavior.

> > 3. every node has properties
>
> What sort of properties?
>

I've meant the user level properties, not internal required for maintaining
tree structure.

> A single value of a given class, plus some binary flags that are
> internal to the graph implementation?
>

I am afraid to become lost in the depths of implementation details, because
it is where 2D concept jumps in.
The 'reference tree' I mentioned above is a 1:1 mapping between the set
of user level properties and a node. This means each container node is
"assigned" one user level set of properties (the given class) and each leaf
node contains another. It is the opposite to the tree, where each node can
have different user class (set of properties) assigned. The 2nd dimension
is the mapping between node types (leaf and container) and user level types.

> A fixed set of values that occur on every node?  (Possibly differing
> between leaves and regular nodes?)
> A fixed value (used for ordering) plus an arbitrary collection that
> can vary by node?
>

For the 'reference tree' every leaf contains the same set of properties,
each property has its own value.
Every container has the different set of properties, each property has its
own value.
I can't say if should be implemented as a class, but I can only propose how
this should behave:

For example, I want to access filesystem with , the syntax is the following:

  for node in container.nodes():
    if node is File:
      print node.name
      print node.hidden
    if node is Directory:
      print node.name + '/'

>From the other side I want to access:

  for file in directory.files():
    print file.name
    print file.hidden

The latter is more intuitive, but only possible if we can map 'files'
accessor name to 'node.type == leaf' query (which is hardcoded for 'generic
tree' implementation).

> More ideas:
>
> >   [ ] every element in a tree can be accessed by its address specificator
> > as 'root/node[3]/last'
>
> That assumes an arbitrary number of children, and that the children
> are ordered.  A sensible choice, but it adds way too much overhead for
> some cases.
>
> (And of course, the same goes for the overhead of balancing, etc.)

Maintaining data structure (order and nesting of elements) is the key
concept for a generic tree, and it also helps in development when you need
an easy way to "run a diff over it". Even for unordered children there
should be some way to sort them out for the comparisons.

One important operation over tree can be "data structure hash", which can
be used to detect if the structure of some tree is equal to the given
structure. For this operation the actual values of the properties are
irrelevant. Only types, positions of the nodes and names of their
properties. For the 'reference tree' we have 1:1 mapping between node type,
and the user level type, so the type of the node is not relevant. If set of
fields is fixed, it is not relevant too, so only the data structure -
nesting and order of elements plays role.

Actually, after rereading this sounds too abstract. When we compare the
filesystem trees for identity, the name of the directory (container) is its
address that participates in the hash, and the order of elements is
irrelevant.

When we compare two data structures that web framework passes to template
engine, we also not interested in the order of first level key:value pairs,
but the names of these keys are important. This is only the first level of
the data structure, though, data structure for the values part can also be
a tree, where the order is important. So, for the most generic comparison
keys there should be a way to present unordered tree in ordered manner for
hash comparison.

== More ideas (feel free to skip the brain dump or split it into different
thread)

For a generic, filesystem-like tree I need to iterate over the lees in
specified container, over containers there and over both leaves and
containers. I want to choose the failure behavior when I iterate over the
non-existing node property. And if given the default choice, I prefer to
avoid exception if possible. If field doesn't exist, return None. If field
doesn't have a value, supply an Empty class. In the data structure the
'None' is not a value, but a fact, that there is no field in a data
structure.

Why avoid exceptions? Exception is like an emergency procedure where you
lose the jet and can non resume the flight from the point you've stopped.
You need to supply the parachute beforehand and make sure it fits in the
structure of your cabin. I mean that it is very hard to resume processing
after the exception if you're interrupted in the middle of a cycle. The
exceptions will occur anyway, but for the first iteration I'd like to see
exception-less data structure handling, using None semantics for absent
properties.

It will also make check for field existence more consistent. Instead of "if
property.__name__ in node.__dict__" or even instead of "if property in
node" use "if node.property != None", because the latter is not easy to
confuse with "if node in container".

Another concept if the set of properties should be fixed or expandable for
a given node instance in a 'reference tree'. For flexibility I like the
latter, but for the static analysis in IDE, it is better to get a warning
early when you assign a value to non-existing tree node property.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121223/fb6c80aa/attachment.html>