Mailman 3 Tree as a data structure (Was: Graph class) - Python-ideas

Jim Jewett

December 2012

4:38 p.m.

New subject: Tree as a data structure (Was: Graph class)

On 12/19/12, anatoly techtonik <techtonik@gmail.com> wrote:

...

On Sun, Dec 16, 2012 at 6:41 PM, Guido van Rossum <guido@python.org> wrote:

...

...
I think of graphs and trees as patterns, not data structures.

...

In my world strings, ints and lists are 1D data types, and tree can be a very important 2D data structure.

Yes; the catch is that the details of that data structure will differ depending on the problem. Most problems do not need the fancy algorithms -- or the extra overhead that supports them. Since a simple tree (or graph) is easy to write, and the fiddly details are often -- but not always -- wasted overhead, it doesn't make sense to designate a single physical structure as "the" tree (or graph) representation. So it stays a pattern, rather than a concrete data structure.
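To illustrate how little a "simple tree" needs, here is a minimal sketch that is not taken from the thread. The `Node` class, its `value` and `children` attributes, and the `walk` method are all hypothetical names chosen for the example. Everything else (ordering guarantees, balancing, parent links) is deliberately left out, which is the trade-off described above.

```python
class Node:
    """A bare-bones tree node: one value and an ordered list of children."""

    def __init__(self, value, children=None):
        self.value = value
        self.children = list(children or [])

    def walk(self):
        """Yield this node's value, then its descendants' values, depth-first."""
        yield self.value
        for child in self.children:
            yield from child.walk()


# A three-level tree built inline.
tree = Node("root", [Node("a", [Node("a1")]), Node("b")])
print(list(tree.walk()))
```

Any problem that needs more (sorted children, a fixed fan-out, per-node metadata) would have to extend this differently, which is one reading of why no single structure is "the" tree.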

...

Speaking of tree as a data structure, I assume that it has a very basic definition:

...

1. tree consists of nodes
2. some nodes are containers for other nodes

Are the leaves a different type, or just nodes that happen to have zero children at the moment?

...

3. every node has properties

What sort of properties? A single value of a given class, plus some binary flags that are internal to the graph implementation? A fixed set of values that occur on every node? (Possibly differing between leaves and regular nodes?) A fixed value (used for ordering) plus an arbitrary collection that can vary by node?

...

More ideas:

...

[ ] every element in a tree can be accessed by its address specificator as 'root/node[3]/last'

That assumes an arbitrary number of children, and that the children are ordered. A sensible choice, but it adds way too much overhead for some cases. (And of course, the same goes for the overhead of balancing, etc.) -jJ
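The address form 'root/node[3]/last' can be sketched as follows to show what it demands of the tree: named nodes with an ordered, arbitrarily long child list. The path grammar here is an assumption made for the example, not something the thread defines: the first segment must match the root's name, `name[i]` picks the i-th (0-based) child called `name`, a bare `name` picks the first such child, and `last` picks the final child. `Node` and `resolve` are hypothetical names.

```python
import re


class Node:
    """A named node with an ordered list of children."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = list(children or [])


# One path segment: a name, optionally followed by a [index] suffix.
_SEGMENT = re.compile(r"^(?P<name>[^\[\]/]+)(?:\[(?P<index>\d+)\])?$")


def resolve(root, path):
    """Return the node addressed by a path such as 'root/node[3]/last'."""
    first, *rest = path.split("/")
    if first != root.name:
        raise LookupError(f"path does not start at {root.name!r}")
    node = root
    for segment in rest:
        if segment == "last":
            # Only meaningful because children are kept in order.
            if not node.children:
                raise LookupError(f"{node.name!r} has no children")
            node = node.children[-1]
            continue
        match = _SEGMENT.match(segment)
        if match is None:
            raise ValueError(f"bad path segment: {segment!r}")
        same_name = [c for c in node.children if c.name == match["name"]]
        index = int(match["index"] or 0)
        try:
            node = same_name[index]
        except IndexError:
            raise LookupError(f"no child matches {segment!r}") from None
    return node


# Four children named 'node'; the fourth has two children of its own.
nodes = [Node("node") for _ in range(4)]
nodes[3].children = [Node("x"), Node("y")]
root = Node("root", nodes)
print(resolve(root, "root/node[3]/last").name)
```

Both `[3]` and `last` rely on the child list being ordered, so a tree that stores children in a set or in fixed left/right slots could not support this addressing without extra bookkeeping.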


anatoly techtonik

4:21 p.m.

New subject: Tree as a data structure (Was: Graph class)

On Wed, Dec 19, 2012 at 7:38 PM, Jim Jewett <jimjjewett@gmail.com> wrote:

...

On 12/19/12, anatoly techtonik <techtonik@gmail.com> wrote:

...
On Sun, Dec 16, 2012 at 6:41 PM, Guido van Rossum <guido@python.org> wrote:

...
...
I think of graphs and trees as patterns, not data structures.

...
In my world strings, ints and lists are 1D data types, and tree can be a very important 2D data structure.

Yes; the catch is that the details of that data structure will differ depending on the problem. Most problems do not need the fancy algorithms -- or the extra overhead that supports them. Since a simple tree (or graph) is easy to write, and the fiddly details are often -- but not always -- wasted overhead, it doesn't make sense to designate a single physical structure as "the" tree (or graph) representation. So it stays a pattern, rather than a concrete data structure.

Right. Creating a tree structure is not the problem. The problem arises when you have to study the code or work collaboratively with other developers. It takes time to recognize an ordinary namedtuple in the magic of some custom-made tuple subclass. But you can easily add a comment saying that it is a reimplementation of namedtuple, and the code immediately becomes clear. With trees it is impossible to add such a comment, because there is no known reference tree type you can refer to.

To sum this up and get back to patterns vs. structures: patterns and data structures are interconnected. The absence of a tree definition makes it really hard for developers to communicate about the usage, potential and outcomes of a particular approach. What data structure or pattern do we need for <this certain case>? A tree, but which tree exactly, and why?

...

...
Speaking of tree as a data structure, I assume that it has a very basic definition:

...
1. tree consists of nodes
2. some nodes are containers for other nodes

Are the leaves a different type, or just nodes that happen to have zero children at the moment?

For the 'reference tree' I'd choose the most common kind of tree that human beings work with daily, can see and, as a result, easily imagine.

1. leaves cannot mutate into containers
2. the container property structure is different from the leaf property structure, but the two may share elements

Spoiler: this is the pattern or data structure of a filesystem tree. I'd call a tree whose leaves can mutate into containers a 'mutatable tree', and one where leaves are containers with 0 elements a 'uniform tree'. A 'flexible tree' could be a better name, but it is too generic to draw a clear association with the behavior.
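The two rules of this 'reference tree' can be sketched with two separate classes. This is only an illustration under assumptions: `File`, `Directory`, the `add` helper and the `size` property are names invented for the example. `name` and `hidden` stand in for the shared elements, while `size` and `children` stand in for the properties that differ.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class File:
    """Leaf: it has no child list at all, so it can never become a container."""
    name: str
    hidden: bool = False
    size: int = 0  # leaf-only property (hypothetical)


@dataclass
class Directory:
    """Container: shares name/hidden with File, but owns a list of children."""
    name: str
    hidden: bool = False
    children: List[Union["Directory", File]] = field(default_factory=list)

    def add(self, node):
        """Append a child and return it, so nested trees are easy to build."""
        self.children.append(node)
        return node


root = Directory("root")
readme = root.add(File("README", size=120))
docs = root.add(Directory("docs"))
docs.add(File(".draft", hidden=True))

print([child.name for child in root.children])
print(hasattr(readme, "children"))
```

A 'uniform tree' in the terminology above would instead use a single class whose child list may simply be empty.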

...

...
3. every node has properties

What sort of properties?

I meant the user-level properties, not the internal ones required for maintaining the tree structure.

...

A single value of a given class, plus some binary flags that are internal to the graph implementation?

I am afraid of getting lost in the depths of implementation details, because that is where the 2D concept comes in. The 'reference tree' I mentioned above has a 1:1 mapping between a set of user-level properties and a node type. This means each container node is "assigned" one user-level set of properties (the given class) and each leaf node is assigned another. It is the opposite of a tree where each node can have a different user class (set of properties) assigned. The 2nd dimension is the mapping between node types (leaf and container) and user-level types.

...

A fixed set of values that occur on every node? (Possibly differing between leaves and regular nodes?) A fixed value (used for ordering) plus an arbitrary collection that can vary by node?

For the 'reference tree' every leaf contains the same set of properties, and each property has its own value. Every container shares another, different set of properties, and each property has its own value. I can't say whether it should be implemented as a class; I can only propose how it should behave. For example, I want to access a filesystem; the syntax is the following:

    for node in container.nodes():
        if node is File:
            print node.name
            print node.hidden
        if node is Directory:
            print node.name + '/'

...

On the other hand, I want to be able to write:

    for file in directory.files():
        print file.name
        print file.hidden

The latter is more intuitive, but only possible if we can map the 'files' accessor name to a 'node.type == leaf' query (which is hardcoded in a 'generic tree' implementation).
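Both access styles can be sketched in current Python as below. This is an illustration, not an implementation from the thread: `File`, `Directory`, `nodes()`, `files()` and `directories()` are hypothetical, and the type test is written with `isinstance` where the message writes `node is File`. The `files()` accessor is simply the "type == leaf" query given a name.

```python
class File:
    """Leaf node carrying the user-level properties name and hidden."""

    def __init__(self, name, hidden=False):
        self.name = name
        self.hidden = hidden


class Directory:
    """Container node; accessors are filtered views over one child list."""

    def __init__(self, name, children=()):
        self.name = name
        self._children = list(children)

    def nodes(self):
        """Iterate over leaves and containers alike."""
        return iter(self._children)

    def files(self):
        """Iterate over leaves only (the 'node.type == leaf' query)."""
        return (n for n in self._children if isinstance(n, File))

    def directories(self):
        """Iterate over containers only."""
        return (n for n in self._children if isinstance(n, Directory))


container = Directory(
    "home",
    [File("notes.txt"), File(".profile", hidden=True), Directory("src")],
)

# First style: iterate over everything and branch on the node type.
for node in container.nodes():
    if isinstance(node, File):
        print(node.name, node.hidden)
    elif isinstance(node, Directory):
        print(node.name + "/")

# Second style: let the accessor do the filtering.
for file in container.files():
    print(file.name, file.hidden)
```

In a generic tree the accessor names would have to be configurable, since 'files' only makes sense for the filesystem case.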

...

More ideas:

...
[ ] every element in a tree can be accessed by its address specificator as 'root/node[3]/last'

That assumes an arbitrary number of children, and that the children are ordered. A sensible choice, but it adds way too much overhead for some cases.

(And of course, the same goes for the overhead of balancing, etc.)

Maintaining the data structure (order and nesting of elements) is the key concept for a generic tree, and it also helps in development when you need an easy way to "run a diff over it". Even for unordered children there should be some way to sort them for comparisons.

One important operation over a tree can be a "data structure hash", which can be used to detect whether the structure of some tree is equal to a given structure. For this operation the actual values of the properties are irrelevant; only the types, the positions of the nodes and the names of their properties matter. For the 'reference tree' we have a 1:1 mapping between the node type and the user-level type, so the type of the node is not relevant. If the set of fields is fixed, it is not relevant either, so only the data structure (nesting and order of elements) plays a role.

Actually, after rereading, this sounds too abstract. When we compare filesystem trees for identity, the name of a directory (container) is its address, which participates in the hash, and the order of elements is irrelevant. When we compare two data structures that a web framework passes to a template engine, we are also not interested in the order of the first-level key:value pairs, but the names of these keys are important. This is only the first level of the data structure, though; the data structure for the values part can also be a tree, where the order is important. So, for the most generic comparison, there should be a way to present an unordered tree in an ordered manner for hash comparison.

== More ideas (feel free to skip the brain dump or split it into a different thread)

For a generic, filesystem-like tree I need to iterate over the leaves in a specified container, over the containers there, and over both leaves and containers.

I want to choose the failure behavior when I iterate over a non-existing node property. Given the default choice, I prefer to avoid exceptions if possible. If a field doesn't exist, return None. If a field doesn't have a value, supply an Empty class.
In the data structure, 'None' is not a value but a fact: there is no such field in the data structure.

Why avoid exceptions? An exception is like an emergency procedure where you lose the jet and cannot resume the flight from the point where you stopped. You need to supply the parachute beforehand and make sure it fits in the structure of your cabin. I mean that it is very hard to resume processing after an exception if you're interrupted in the middle of a loop. Exceptions will occur anyway, but for the first iteration I'd like to see exception-less data structure handling, using None semantics for absent properties. It will also make checks for field existence more consistent. Instead of "if property.__name__ in node.__dict__" or even "if property in node", use "if node.property != None", because the latter is not easy to confuse with "if node in container".

Another question is whether the set of properties should be fixed or expandable for a given node instance in a 'reference tree'. For flexibility I like the latter, but for static analysis in an IDE it is better to get a warning early when you assign a value to a non-existing tree node property.
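The "data structure hash" idea from earlier in this message can be sketched for the template-engine case. The representation is an assumption made for the example: a dict stands for an unordered container whose key names matter, a list or tuple stands for an ordered container, and anything else is a leaf of which only the type name is recorded. `structure_signature` and `structure_hash` are hypothetical names, and SHA-256 over the `repr` is just one convenient way to turn the canonical form into a hash.

```python
import hashlib


def structure_signature(node):
    """Return a canonical, value-free description of a nested structure."""
    if isinstance(node, dict):
        # Unordered container: sort by key name so insertion order is ignored.
        items = sorted(node.items(), key=lambda kv: str(kv[0]))
        return ("dict", tuple((str(k), structure_signature(v)) for k, v in items))
    if isinstance(node, (list, tuple)):
        # Ordered container: position is part of the structure.
        return ("list", tuple(structure_signature(v) for v in node))
    # Leaf: keep the type name, drop the value.
    return ("leaf", type(node).__name__)


def structure_hash(node):
    """Hash the signature so two structures can be compared cheaply."""
    canonical = repr(structure_signature(node)).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


a = {"title": "Home", "items": [1, 2]}
b = {"items": [7, 9], "title": "About"}  # same shape, other values and key order
c = {"title": "Home", "items": [1]}      # one list element fewer

print(structure_hash(a) == structure_hash(b))
print(structure_hash(a) == structure_hash(c))
```

Sorting the keys is the step that presents an unordered level in an ordered manner; the nested list keeps its own order, matching the case where the values part is a tree in which order matters.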


