Tree as a data structure (Was: Graph class)

On Sun, Dec 16, 2012 at 6:41 PM, Guido van Rossum <guido@python.org> wrote:
I think of graphs and trees as patterns, not data structures.
In my world strings, ints and lists are 1D data types, and tree can be a very important 2D data structure. Even if it is a pattern, this pattern is vital for the transformation of structured data, because it allows to represent any data structure in canonical format. Speaking of tree as a data structure, I assume that it has a very basic definition: 1. tree consists of nodes 2. some nodes are containers for other nodes 3. every node has properties 4. every node has 0 or 1 parent 5. every container has 1+ children 6. tree has a single starting root node 7. no child of a parent can be its ancestor (no cyclic dependencies between elements) List of trees is a forest. Every subtree is a complete tree. To see which tree data type would be really useful in Python distribution (e.g. provides a simple, extendable and intuitive interface), I see only one way - is to scratching some itches relevant to Python and then try to scale it to other problems. The outcome should be the answer - what for native tree type is not suitable? More ideas: [ ] Much experience for working with trees can be brought from XML and DOM manipulation practices (jQuery and friends) [ ] every element in a tree can be accessed by its address specificator as 'root/node[3]/last' [ ] but it is also convenient to access tree data using node names as 'mylib.books[:1]' [ ] and of course, you can run queries over trees [ ] Tree is the base for any "Data Transformation Framework" as it allows to jump from "data type conversion" to "data structure conversion and mapping" [ ] Trees can be converted to other trees and to more complicated structures [ ] Conversion can be symmetrical and non-symmetrical [ ] Conversion can be lossy and lossless [ ] Conversion can be lossless and non-symmetrical at the same time Trees can be used, for example, for realtime migration of issues from one tracker to another. For managing changesets with additional meta information. For presenting package dependencies and working with them. For atomic (transactional) file management. For managing operating system capability information. For logging setup. For debugging structures in Python. For working and converting binary file formats. For the common AST transformation and query interface. For the understanding how 2to3 fixers work. For the common ground of visual representation, comparison and transformation of data structures. That's probably enough of my itches.

On 12/19/12, anatoly techtonik <techtonik@gmail.com> wrote:
On Sun, Dec 16, 2012 at 6:41 PM, Guido van Rossum <guido@python.org> wrote:
I think of graphs and trees as patterns, not data structures.
In my world strings, ints and lists are 1D data types, and tree can be a very important 2D data structure.
Yes; the catch is that the details of that data structure will differ depending on the problem. Most problems do not need the fancy algorithms -- or the extra overhead that supports them. Since a simple tree (or graph) is easy to write, and the fiddly details are often -- but not always -- wasted overhead, it doesn't make sense to designate a single physical structure as "the" tree (or graph) representation. So it stays a pattern, rather than a concrete data structure.
Speaking of tree as a data structure, I assume that it has a very basic definition:
1. tree consists of nodes 2. some nodes are containers for other nodes
Are the leaves a different type, or just nodes that happen to have zero children at the moment?
3. every node has properties
What sort of properties? A single value of a given class, plus some binary flags that are internal to the graph implementation? A fixed set of values that occur on every node? (Possibly differing between leaves and regular nodes?) A fixed value (used for ordering) plus an arbitrary collection that can vary by node?
More ideas:
[ ] every element in a tree can be accessed by its address specificator as 'root/node[3]/last'
That assumes an arbitrary number of children, and that the children are ordered. A sensible choice, but it adds way too much overhead for some cases. (And of course, the same goes for the overhead of balancing, etc.) -jJ

Jim Jewett <jimjjewett@...> writes:
Do you care about the overhead of an OrderedDict? As long as you are not manipulating a huge amount of data, a generic tree structure such as provided by e.g. the networkx library is perfectly fine. And if you want to reimplement a more optimized structure, sure, that's fine. But that's not an argument against a generic data structure that would be sufficient for 99.9% of all use cases. Regards Antoine.

On Wed, Dec 19, 2012 at 7:38 PM, Jim Jewett <jimjjewett@gmail.com> wrote:
Right. Creating a tree structure is not the problem. The problem arise when you have to study the code or work collaboratively with other developers. It takes time to see an ordinary namedtuple in the magic of some custom made tuple subclass. But you can easily add a comment that it is a reimplementation of namedtuple and the code immediately becomes clear. With trees it is impossible to add such a comment, because there is no known reference tree type you can refer to. Making a sum out this to go from patters vs structure. Patterns and data structures are interconnected. The absence of tree definition makes it really hard to communicate about the usage, potential and outcomes or particular approach between developers. What data structure or pattern do we need for <this certain case> - a tree, but which tree exactly and why?
For the 'reference tree' I'd choose the most common trees human beings work daily, can see and as a result - easily imagine. 1. leaves can not mutate into containers 2. container property structure is different from leaves structure, but may share elements Spoiler: This is a pattern or data structure of filesystem tree. I'd call a tree, which leaves can mutate into containers, a 'mutatable tree', and the one, where leaves are containers with 0 elements, a 'uniform tree' data structure name. A 'flexible tree` could be the better name, but it is too generic to draw a clear association to the behavior.
3. every node has properties
What sort of properties?
I've meant the user level properties, not internal required for maintaining tree structure.
A single value of a given class, plus some binary flags that are internal to the graph implementation?
I am afraid to become lost in the depths of implementation details, because it is where 2D concept jumps in. The 'reference tree' I mentioned above is a 1:1 mapping between the set of user level properties and a node. This means each container node is "assigned" one user level set of properties (the given class) and each leaf node contains another. It is the opposite to the tree, where each node can have different user class (set of properties) assigned. The 2nd dimension is the mapping between node types (leaf and container) and user level types.
For the 'reference tree' every leaf contains the same set of properties, each property has its own value. Every container has the different set of properties, each property has its own value. I can't say if should be implemented as a class, but I can only propose how this should behave: For example, I want to access filesystem with , the syntax is the following: for node in container.nodes(): if node is File: print node.name print node.hidden if node is Directory: print node.name + '/'
From the other side I want to access:
for file in directory.files(): print file.name print file.hidden The latter is more intuitive, but only possible if we can map 'files' accessor name to 'node.type == leaf' query (which is hardcoded for 'generic tree' implementation).
Maintaining data structure (order and nesting of elements) is the key concept for a generic tree, and it also helps in development when you need an easy way to "run a diff over it". Even for unordered children there should be some way to sort them out for the comparisons. One important operation over tree can be "data structure hash", which can be used to detect if the structure of some tree is equal to the given structure. For this operation the actual values of the properties are irrelevant. Only types, positions of the nodes and names of their properties. For the 'reference tree' we have 1:1 mapping between node type, and the user level type, so the type of the node is not relevant. If set of fields is fixed, it is not relevant too, so only the data structure - nesting and order of elements plays role. Actually, after rereading this sounds too abstract. When we compare the filesystem trees for identity, the name of the directory (container) is its address that participates in the hash, and the order of elements is irrelevant. When we compare two data structures that web framework passes to template engine, we also not interested in the order of first level key:value pairs, but the names of these keys are important. This is only the first level of the data structure, though, data structure for the values part can also be a tree, where the order is important. So, for the most generic comparison keys there should be a way to present unordered tree in ordered manner for hash comparison. == More ideas (feel free to skip the brain dump or split it into different thread) For a generic, filesystem-like tree I need to iterate over the lees in specified container, over containers there and over both leaves and containers. I want to choose the failure behavior when I iterate over the non-existing node property. And if given the default choice, I prefer to avoid exception if possible. If field doesn't exist, return None. If field doesn't have a value, supply an Empty class. In the data structure the 'None' is not a value, but a fact, that there is no field in a data structure. Why avoid exceptions? Exception is like an emergency procedure where you lose the jet and can non resume the flight from the point you've stopped. You need to supply the parachute beforehand and make sure it fits in the structure of your cabin. I mean that it is very hard to resume processing after the exception if you're interrupted in the middle of a cycle. The exceptions will occur anyway, but for the first iteration I'd like to see exception-less data structure handling, using None semantics for absent properties. It will also make check for field existence more consistent. Instead of "if property.__name__ in node.__dict__" or even instead of "if property in node" use "if node.property != None", because the latter is not easy to confuse with "if node in container". Another concept if the set of properties should be fixed or expandable for a given node instance in a 'reference tree'. For flexibility I like the latter, but for the static analysis in IDE, it is better to get a warning early when you assign a value to non-existing tree node property.

On 12/19/12, anatoly techtonik <techtonik@gmail.com> wrote:
On Sun, Dec 16, 2012 at 6:41 PM, Guido van Rossum <guido@python.org> wrote:
I think of graphs and trees as patterns, not data structures.
In my world strings, ints and lists are 1D data types, and tree can be a very important 2D data structure.
Yes; the catch is that the details of that data structure will differ depending on the problem. Most problems do not need the fancy algorithms -- or the extra overhead that supports them. Since a simple tree (or graph) is easy to write, and the fiddly details are often -- but not always -- wasted overhead, it doesn't make sense to designate a single physical structure as "the" tree (or graph) representation. So it stays a pattern, rather than a concrete data structure.
Speaking of tree as a data structure, I assume that it has a very basic definition:
1. tree consists of nodes 2. some nodes are containers for other nodes
Are the leaves a different type, or just nodes that happen to have zero children at the moment?
3. every node has properties
What sort of properties? A single value of a given class, plus some binary flags that are internal to the graph implementation? A fixed set of values that occur on every node? (Possibly differing between leaves and regular nodes?) A fixed value (used for ordering) plus an arbitrary collection that can vary by node?
More ideas:
[ ] every element in a tree can be accessed by its address specificator as 'root/node[3]/last'
That assumes an arbitrary number of children, and that the children are ordered. A sensible choice, but it adds way too much overhead for some cases. (And of course, the same goes for the overhead of balancing, etc.) -jJ

Jim Jewett <jimjjewett@...> writes:
Do you care about the overhead of an OrderedDict? As long as you are not manipulating a huge amount of data, a generic tree structure such as provided by e.g. the networkx library is perfectly fine. And if you want to reimplement a more optimized structure, sure, that's fine. But that's not an argument against a generic data structure that would be sufficient for 99.9% of all use cases. Regards Antoine.

On Wed, Dec 19, 2012 at 7:38 PM, Jim Jewett <jimjjewett@gmail.com> wrote:
Right. Creating a tree structure is not the problem. The problem arise when you have to study the code or work collaboratively with other developers. It takes time to see an ordinary namedtuple in the magic of some custom made tuple subclass. But you can easily add a comment that it is a reimplementation of namedtuple and the code immediately becomes clear. With trees it is impossible to add such a comment, because there is no known reference tree type you can refer to. Making a sum out this to go from patters vs structure. Patterns and data structures are interconnected. The absence of tree definition makes it really hard to communicate about the usage, potential and outcomes or particular approach between developers. What data structure or pattern do we need for <this certain case> - a tree, but which tree exactly and why?
For the 'reference tree' I'd choose the most common trees human beings work daily, can see and as a result - easily imagine. 1. leaves can not mutate into containers 2. container property structure is different from leaves structure, but may share elements Spoiler: This is a pattern or data structure of filesystem tree. I'd call a tree, which leaves can mutate into containers, a 'mutatable tree', and the one, where leaves are containers with 0 elements, a 'uniform tree' data structure name. A 'flexible tree` could be the better name, but it is too generic to draw a clear association to the behavior.
3. every node has properties
What sort of properties?
I've meant the user level properties, not internal required for maintaining tree structure.
A single value of a given class, plus some binary flags that are internal to the graph implementation?
I am afraid to become lost in the depths of implementation details, because it is where 2D concept jumps in. The 'reference tree' I mentioned above is a 1:1 mapping between the set of user level properties and a node. This means each container node is "assigned" one user level set of properties (the given class) and each leaf node contains another. It is the opposite to the tree, where each node can have different user class (set of properties) assigned. The 2nd dimension is the mapping between node types (leaf and container) and user level types.
For the 'reference tree' every leaf contains the same set of properties, each property has its own value. Every container has the different set of properties, each property has its own value. I can't say if should be implemented as a class, but I can only propose how this should behave: For example, I want to access filesystem with , the syntax is the following: for node in container.nodes(): if node is File: print node.name print node.hidden if node is Directory: print node.name + '/'
From the other side I want to access:
for file in directory.files(): print file.name print file.hidden The latter is more intuitive, but only possible if we can map 'files' accessor name to 'node.type == leaf' query (which is hardcoded for 'generic tree' implementation).
Maintaining data structure (order and nesting of elements) is the key concept for a generic tree, and it also helps in development when you need an easy way to "run a diff over it". Even for unordered children there should be some way to sort them out for the comparisons. One important operation over tree can be "data structure hash", which can be used to detect if the structure of some tree is equal to the given structure. For this operation the actual values of the properties are irrelevant. Only types, positions of the nodes and names of their properties. For the 'reference tree' we have 1:1 mapping between node type, and the user level type, so the type of the node is not relevant. If set of fields is fixed, it is not relevant too, so only the data structure - nesting and order of elements plays role. Actually, after rereading this sounds too abstract. When we compare the filesystem trees for identity, the name of the directory (container) is its address that participates in the hash, and the order of elements is irrelevant. When we compare two data structures that web framework passes to template engine, we also not interested in the order of first level key:value pairs, but the names of these keys are important. This is only the first level of the data structure, though, data structure for the values part can also be a tree, where the order is important. So, for the most generic comparison keys there should be a way to present unordered tree in ordered manner for hash comparison. == More ideas (feel free to skip the brain dump or split it into different thread) For a generic, filesystem-like tree I need to iterate over the lees in specified container, over containers there and over both leaves and containers. I want to choose the failure behavior when I iterate over the non-existing node property. And if given the default choice, I prefer to avoid exception if possible. If field doesn't exist, return None. If field doesn't have a value, supply an Empty class. In the data structure the 'None' is not a value, but a fact, that there is no field in a data structure. Why avoid exceptions? Exception is like an emergency procedure where you lose the jet and can non resume the flight from the point you've stopped. You need to supply the parachute beforehand and make sure it fits in the structure of your cabin. I mean that it is very hard to resume processing after the exception if you're interrupted in the middle of a cycle. The exceptions will occur anyway, but for the first iteration I'd like to see exception-less data structure handling, using None semantics for absent properties. It will also make check for field existence more consistent. Instead of "if property.__name__ in node.__dict__" or even instead of "if property in node" use "if node.property != None", because the latter is not easy to confuse with "if node in container". Another concept if the set of properties should be fixed or expandable for a given node instance in a 'reference tree'. For flexibility I like the latter, but for the static analysis in IDE, it is better to get a warning early when you assign a value to non-existing tree node property.
participants (3)
-
anatoly techtonik
-
Antoine Pitrou
-
Jim Jewett