Mailman 3 Standard tool for iterating over recursive data structures? - Python-ideas

newer
A PEP to encourage people to mind...

Standard tool for iterating over recursive data structures?

Steven D'Aprano

1 Jan 2021 1 Jan '21

5:34 a.m.

This isn't so much an idea for Python, as a request for ideas to solve a problem in Python. Back in the early days of Python, printing recursive lists could crash the interpreter: a = [1, 2, 3] a.append(a) print(a) If you try that today, you get a nice display: [1, 2, 3, [...]] It's not just lists that we can have this. Dicts and nearly any object can too: d = {} d[None] = d class Demo: pass x = Demo() x.attr = x With a bit of jiggery-pokery, even tuples can be recursive: t = (1, 2, 3, []) t[3].append(t) The built-ins handle these cases okay. Likewise the copy.deepcopy function takes care to avoid getting stuck in a loop with recursive data structures. On the other hand, when I write code to process nested data structures, I always worry about such recursive patterns, but actually doing something about it is too hard. So I just ignore the risk and feel guilty about it. I'm sure I'm not the only one who feels guilty about ignoring this problem. To say nothing of those who don't even know there's a problem to worry about. Is there something we can do to make it easier for people to deal with recursive data structures, like the list repr and deepcopy code does? Is there a pattern we can abstract out and provide as a tool or recipe for people to use in their own code? Relevant: https://bugs.python.org/issue42801 -- Steve

Show replies by date

Marco Sulla

1 Jan 1 Jan

2:19 p.m.

On Fri, 1 Jan 2021 at 06:38, Steven D'Aprano wrote:

...

Relevant: https://bugs.python.org/issue42801

Can't reproduce on the latest trunk (3.10). I get 1989 as a result.

Stestagg

2:32 p.m.

On Fri, Jan 1, 2021 at 2:21 PM Marco Sulla wrote:

...

On Fri, 1 Jan 2021 at 06:38, Steven D'Aprano wrote:

...
Relevant: https://bugs.python.org/issue42801

Can't reproduce on the latest trunk (3.10). I get 1989 as a result.

There were a spate of bugs about this, and was fixed by https://github.com/python/cpython/pull/23744 quite recently _______________________________________________

...

Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/2FFECD... Code of Conduct: http://python.org/psf/codeofconduct/

2QdxY4RzWzUUiLuE＠potatochowder.com

2:20 p.m.

On 2021-01-01 at 16:34:15 +1100, Steven D'Aprano wrote:

...

This isn't so much an idea for Python, as a request for ideas to solve a problem in Python.

[examples of recurive data snipped]

...

The built-ins handle these cases okay. Likewise the copy.deepcopy function takes care to avoid getting stuck in a loop with recursive data structures.

The POSIX utilities tar and find (and I'm sure there are more) also deal with this.

...

On the other hand, when I write code to process nested data structures, I always worry about such recursive patterns, but actually doing something about it is too hard. So I just ignore the risk and feel guilty about it. I'm sure I'm not the only one who feels guilty about ignoring this problem.

Sorry about the guilt. Know your requirements! Specify your constraints and limitations! Document, document, document! But you know that stuff already.

...

To say nothing of those who don't even know there's a problem to worry about.

*sigh*

...

Is there something we can do to make it easier for people to deal with recursive data structures, like the list repr and deepcopy code does? Is there a pattern we can abstract out and provide as a tool or recipe for people to use in their own code?

My first thought was that all of those things (printing a list, navigating a tree^H^H^H^H cyclic graph, deepcopying data, etc.) are the recipes, and that every use case is different, but that's kind of a cop out. If I had to "abstract out" a pattern or an algorithm, it might look something like this: 1. initialize been_there to an empty set 2. initialize go_here to a collection containing the root of the data (e.g., go_here = [root]) 3. if go_here is empty, then stop 4. remove one element from go_here, assign it to current_node 5. if current_node is in been_there, go back to step 3 6. deal with current_node 7. add current_node to been_there 8. add each of current_node's "next node"s to go_here 9. go back to step 3 IOW, keep track of where you've been, and don't go there again. I probably should say out loud that go_here should contain the id of nodes rather than references to nodes (to avoid having to know too much about how to determine equality amongst nodes), but I was trying to stay away from too many Python- or implementation- specific details. Yes, it takes additional time and space not to dump core. No, you don't have to do this if you're certain that your data isn't recursive (even if your data structure is). But that seems too obvious and (at least at this level) too simple. It's almost right out of any graph (vertexes and edges) library that handles non-acyclic graphs, directed or not. So maybe it doesn't provide for a pretty "..." at the right place in the report. What am I missing?

Jonathan Fine

2:50 p.m.

Hi I'd say the problem isn't recursion. Here I'm using the definitions given in: https://en.wikipedia.org/wiki/Recursion https://www.cs.utah.edu/~germain/PPS/Topics/recursion.html Rather, it's that the data has a loop (or cycle) in it. A simple example of this is >>> x = [] >>> x.append(x) >>> x [[...]] >>> x[0] is x True So far as I know, it's not possible to create such using only immutable objects. And anything created using only immutable objects will have a hash. A beginner can easily by mistake create an object such as x above, and without the short-circuit provided by >>> x [[...]] the result is an unprintable object, that further raises an exception (or worse). That's really bad for the poor beginner. As a first approximation, my opinion is that data having such a loop / cycle in it is at least a code smell, and perhaps worse. However, there may be examples that I've not considered that would make me change my mind. By the way, in the mathematics of the symmetric group (permutations) the two tuples >>> (0, 1, 2, 3) >>> (1, 2, 3, 0) are different representations of the same cycle (where here the word cycle has a different meaning). Also by the way, determining if two vertex-edge graphs are isomorphic is a hard problem. I've been told that the best tool for this is https://www3.cs.stonybrook.edu/~algorith/implement/nauty/implement.shtml To summarise, it's loops / cycles in the representation of the data that's the problem. Without a use case to the contrary, I'd say don't do this. That's my opinion. Yours may be different. Final idea. Perhaps a tool to detect cycles in data might be a good idea. Finally, Happy New Year. -- Jonathan

Richard Damon

4:08 p.m.

On 1/1/21 9:50 AM, Jonathan Fine wrote:

...

Hi

I'd say the problem isn't recursion. Here I'm using the definitions given in: https://en.wikipedia.org/wiki/Recursion https://en.wikipedia.org/wiki/Recursion https://www.cs.utah.edu/~germain/PPS/Topics/recursion.html https://www.cs.utah.edu/~germain/PPS/Topics/recursion.html

Rather, it's that the data has a loop (or cycle) in it. A simple example of this is >>> x = [] >>> x.append(x) >>> x [[...]] >>> x[0] is x True

So far as I know, it's not possible to create such using only immutable objects. And anything created using only immutable objects will have a hash.

A beginner can easily by mistake create an object such as x above, and without the short-circuit provided by >>> x [[...]] the result is an unprintable object, that further raises an exception (or worse). That's really bad for the poor beginner.

As a first approximation, my opinion is that data having such a loop / cycle in it is at least a code smell, and perhaps worse. However, there may be examples that I've not considered that would make me change my mind.

A very simple and in my mind reasonable use for this is to build a representation of a graph, where each node is represented by a list (or some other collection), and the connections are denoted by adding to that collection the nodes that are adjacent (or maybe a tuple of the node and the distance). This naturally creates a recursive structure unless connections are unidirectional and the graph is acyclical (i.e. a tree). For example, a 3 node fully connected graph might be: a = [] b = [] c = [] a.append(b) a.append(c) b.append(a) b.append(c) c.append(a) c.append(b) Maybe more generally the nodes would be dicts of properties, one of which is a connection property with that list of nodes, but you still end up with the same recursion.

...

By the way, in the mathematics of the symmetric group (permutations) the two tuples >>> (0, 1, 2, 3) >>> (1, 2, 3, 0) are different representations of the same cycle (where here the word cycle has a different meaning).

Also by the way, determining if two vertex-edge graphs are isomorphic is a hard problem. I've been told that the best tool for this is https://www3.cs.stonybrook.edu/~algorith/implement/nauty/implement.shtml https://www3.cs.stonybrook.edu/~algorith/implement/nauty/implement.shtml

To summarise, it's loops / cycles in the representation of the data that's the problem. Without a use case to the contrary, I'd say don't do this. That's my opinion. Yours may be different.

Final idea. Perhaps a tool to detect cycles in data might be a good idea.

Finally, Happy New Year.

-- Jonathan

-- Richard Damon

Jonathan Fine

5:12 p.m.

Hi Richard You suggested A very simple and in my mind reasonable use for this is to build a

...

representation of a graph, where each node is represented by a list (or some other collection), and the connections are denoted by adding to that collection the nodes that are adjacent (or maybe a tuple of the node and the distance). This naturally creates a recursive structure unless connections are unidirectional and the graph is acyclical (i.e. a tree).

For example, a 3 node fully connected graph might be:

a = [] b = [] c = []

a.append(b) a.append(c) b.append(a) b.append(c) c.append(a) c.append(b)

According to https://en.wikipedia.org/wiki/Graph_theory#Graph, a graph is an ordered pair

...

...
...
G = (V, E) where V is a set of vertices, and E is a set of edges, where an edge is an unordered pair of vertices.

You've given the complete graph on 3 vertices (although without explicitly stating the vertex set V). https://en.wikipedia.org/wiki/Complete_graph If we use the wikipedia definition (with vertices 'a', 'b' and 'c') we have

...

...
...
V = {'a', 'b', 'c'} E = {{'a', 'b'}, {'a', 'c'}, {'b', 'c'}}

I've shown how to represent graphs, by directly translating the mathematical definition into Python (without introducing any additional classes). You've produced a different way to represent graphs. Mathematically, your representation is this. There is a set V of vertices. Further, for each v in V there is associated a subset V_v of V. Further, if w in V_v then v in V_w. I wouldn't call what you've written down the Python representation of this mathematical definition. To explain this, I'll write down what I think it should be. Suppose G is a graph with vertices {'a', 'b', 'c', 'd'}. Using a dict to represent the map that v goes to V_v, we can write G as { 'a': {...}, 'b': {...}, # etc 'e': {...}, } My initial post suggests that, purely on the basis of the loop in the data structure, there's a code smell in the representation you provided. I'd say that not agreeing with its own mathematical definition is a code smell (or worse). I'll put it another way. Suppose our vertices are called (or represent) Alice, Bob, Carol, Dave and Eve. Now let G be the graph whose edges represent the existence of cryptographically secure communication between X and Y. I claim that your code isn't able to represent such a graph. Thank you Richard for forcing me to think things through and find what I consider to be a flaw in the code you supplied. -- Jonathan

Richard Damon

6:36 p.m.

On 1/1/21 12:12 PM, Jonathan Fine wrote:

...

Hi Richard

You suggested

A very simple and in my mind reasonable use for this is to build a representation of a graph, where each node is represented by a list (or some other collection), and the connections are denoted by adding to that collection the nodes that are adjacent (or maybe a tuple of the node and the distance). This naturally creates a recursive structure unless connections are unidirectional and the graph is acyclical (i.e. a tree).

For example, a 3 node fully connected graph might be:

a = [] b = [] c = []

a.append(b) a.append(c) b.append(a) b.append(c) c.append(a) c.append(b)

According to https://en.wikipedia.org/wiki/Graph_theory#Graph https://en.wikipedia.org/wiki/Graph_theory#Graph, a graph is an ordered pair

...
...
...
G = (V, E) where V is a set of vertices, and E is a set of edges, where an edge is an unordered pair of vertices.

You've given the complete graph on 3 vertices (although without explicitly stating the vertex set V). https://en.wikipedia.org/wiki/Complete_graph https://en.wikipedia.org/wiki/Complete_graph

If we use the wikipedia definition (with vertices 'a', 'b' and 'c') we have

...
...
...
V = {'a', 'b', 'c'} E = {{'a', 'b'}, {'a', 'c'}, {'b', 'c'}}

I've shown how to represent graphs, by directly translating the mathematical definition into Python (without introducing any additional classes). You've produced a different way to represent graphs.

Mathematically, your representation is this. There is a set V of vertices. Further, for each v in V there is associated a subset V_v of V. Further, if w in V_v then v in V_w.

I wouldn't call what you've written down the Python representation of this mathematical definition. To explain this, I'll write down what I think it should be.

Suppose G is a graph with vertices {'a', 'b', 'c', 'd'}. Using a dict to represent the map that v goes to V_v, we can write G as { 'a': {...}, 'b': {...}, # etc 'e': {...}, }

My initial post suggests that, purely on the basis of the loop in the data structure, there's a code smell in the representation you provided. I'd say that not agreeing with its own mathematical definition is a code smell (or worse).

I'll put it another way. Suppose our vertices are called (or represent) Alice, Bob, Carol, Dave and Eve. Now let G be the graph whose edges represent the existence of cryptographically secure communication between X and Y. I claim that your code isn't able to represent such a graph.

Thank you Richard for forcing me to think things through and find what I consider to be a flaw in the code you supplied.

-- Jonathan

This may be the practical difference betwen Theory and Practice (which in Theory doesn't exist). There are times when processing a system that can be represented as a graph, it is cleanest to think first of node, and then by their connections. Your notation makes it inconvenient that given you are at node A to know what nodes it connects to, you need to search through the E structure to find all entries with the starting node listed. You also get the 'name' of the node, not the node itself, so you need another dictionary to map names to the actual nodes them selves (where more information about the node might be stored). Yes, maybe in pure graph theory all this isn't important, but if I am USING a grahph for actual work, it might be. Also, your example assumes that NAMES are the important piece, how would you build your structure if names do not exist or are not unique, Once you make the node themselves important, then making the links by the node itself instead of by some arbitrary name makes sense. Also, if the graph is directed, like Alice has a way to drop a message into Bobs secure storage box, but Alice doesn't have access to one of her own, so Bob can't use that method to send back to Alice, then it can make sense to make the edge list not a total separate structure, but each node includes the edges it can talk to itself. It can be noted that this is very much like the model of the internet as it exists today. Every machine (a node), keeps track of who it can directly talk to, and a list of who it is next to and guess for who it should try to send messages to in order to get a message to somone it is not next to. There is no master list of edges in existance anywhere (I once did see a master listing of the graph of the network in the early days before automatic routing, such a map would be impossible to build today). Once you move the listing of links into the node itself, as opposed to having a separate listing of all the vertices, and make the links to node rather than names (to avoid an extra unneeded lookup) then you get the recursive data structure. For your example you say that can not be represented, let us build a graph G as a dictionary, with the key being the name and the value being the 'Node' for that person. (This assumes names are unique, otherwise it might just need to be a list) That node will be a dictionary, where we can store all sorts of properties, but lets define one with a key 'secure' that has as its value a listing of all the nodes this node knows how to directly send to. Thus we could define the graph as: g = { 'Alice': {'secure': []}, 'Bob': {'secure': []}, 'Carol': {'secure': []}, 'Dave':, {'secure': []}. 'Eve':, {secure': []}, } now we can add connections with something like: g['Alice']['secure'].append(g['Bob']) Note that this structure doesn't really even need g to exist at all, it is only needed to initially find the nodes, the nodes could easily be independent variables, or perhaps different nodes are in different structures if they are obtained from different sources. (Node likely will have a property in them to provide a human recognizable name for them). Yes, perhaps at this point it might make sense to make the nodes an actual class as opposed to just a dict, which sort of breaks the cycle as by default __repr__ and such don't expand through class objects, but that then changes the mindset from a data-driven to an object based design, and depending what else the program needs to interface to one or the other might be more complicated. -- Richard Damon

Jeff Allen

5:14 p.m.

On 01/01/2021 14:50, Jonathan Fine wrote:

...

... To summarise, it's loops / cycles in the representation of the data that's the problem. ... .

Final idea. Perhaps a tool to detect cycles in data might be a good idea.

Detecting cycles will involve a lot of bikeshedding. (Sorry, couldn't resist.) But you're right, that's a better word for discussing the problem. Steven's problem data structures are cyclic graphs. I don't agree that such structures are a sign one is doing something wrong (outside the naive approach to printing them out). If memory serves me right, what we're talking about is finding and walking a "spanning tree" of the graph. There is a twist in our case that we would like to insert a special object where a link was elided to break the cycle: an object that maybe contains the original reference but would reproduce as "..." in print. Since it appears necessary to create a data structure with as many nodes as there are in the tree, just to remember where we've been, we may as well say that the required result takes the form of an iterable of the nodes, which may subsequently be iterated by a consumer. Actually, I can think of two styles of iterator: one where "..." objects are skipped by a "processing" iterator, and one where they are delivered by a "print" iterator. It is valid to do this with objects that can't be hashed, so the structure is perhaps a dict keyed by id. Would this be possible as a general capability? I think only if the references were in the __dict__ of each instance. With a built-in, one is at the mercy of the implementor. But we must have to do something very similar to pickle arbitrary (potentially cyclic) structures, which is likewise dependent on special help from the particular built-in type. Can something be founded on |||__getstate__| ? Happy New Year Jeff Allen

Jonathan Fine

6:08 p.m.

Hi Jeff

...

But you're right, that's a better word for discussing the problem. Steven's problem data structures are cyclic graphs. I don't agree that such structures are a sign one is doing something wrong (outside the naive approach to printing them out).

Thank you for agreeing with me, and nicely disagreeing with me. However, the problem of 'walking a graph' to find a spanning tree is first of all a matter of choosing an algorithm. The algorithm to be used and the representation of the graph are two different matters. Suppose the vertices are well ordered, for example (0, 1, 2, 3, .., n). Write each edge as an ORDERED pair, with the first vertex being less than the second vertex. Now list the edges in lexicographic order. That solves the problem, and gives a unique representation, provided we are given a well ordering of the vertices. Under certain circumstances, we can use the Python id to provide the well ordering of vertices. However, that doesn't always work. Two strings (or large ints, or tuples) can be equal even though they have different ids. However, I think it reasonable to assume that we have a sensible equality relation on the vertices. Now things get interesting. For a very large graph, it would be very helpful if every vertex has a hash. This would make determining equality easier. But if the data structure representing the graph has Python cycles in it (as in Steven's use case) then there won't be a hash available. Do you agree with me, Jeff, that the problem of printing out a vertices-and-edges graph is much easier (or at least quicker) if each vertex has a hash. More exactly, we want a quick and easy way to determine if two vertices are equal. It would be nice to have some further input from Steven, whose original post stated the problem he wished to solve. -- Jonathan

Richard Damon

6:41 p.m.

On 1/1/21 12:14 PM, Jeff Allen wrote:

...

On 01/01/2021 14:50, Jonathan Fine wrote:

...
... To summarise, it's loops / cycles in the representation of the data that's the problem. ... .

Final idea. Perhaps a tool to detect cycles in data might be a good idea.

Detecting cycles will involve a lot of bikeshedding. (Sorry, couldn't resist.)

But you're right, that's a better word for discussing the problem. Steven's problem data structures are cyclic graphs. I don't agree that such structures are a sign one is doing something wrong (outside the naive approach to printing them out).

If memory serves me right, what we're talking about is finding and walking a "spanning tree" of the graph. There is a twist in our case that we would like to insert a special object where a link was elided to break the cycle: an object that maybe contains the original reference but would reproduce as "..." in print.

Since it appears necessary to create a data structure with as many nodes as there are in the tree, just to remember where we've been, we may as well say that the required result takes the form of an iterable of the nodes, which may subsequently be iterated by a consumer. Actually, I can think of two styles of iterator: one where "..." objects are skipped by a "processing" iterator, and one where they are delivered by a "print" iterator. It is valid to do this with objects that can't be hashed, so the structure is perhaps a dict keyed by id.

Would this be possible as a general capability? I think only if the references were in the __dict__ of each instance. With a built-in, one is at the mercy of the implementor. But we must have to do something very similar to pickle arbitrary (potentially cyclic) structures, which is likewise dependent on special help from the particular built-in type. Can something be founded on |||__getstate__| ?

Happy New Year

Jeff Allen

I believe that one solution to detecting the cycles is to create a set of the object IDs you have visited and started to print. If you come across an ID you have already seen, this point is in a cycle. Sets are fairly compact and intentionally fast to search for an item. -- Richard Damon

Jonathan Fine

7 p.m.

Hi Richard You wrote I believe that one solution to detecting the cycles is to create a set

...

of the object IDs you have visited and started to print. If you come across an ID you have already seen, this point is in a cycle. Sets are fairly compact and intentionally fast to search for an item.

Indeed. But I see a flaw. The problem is that we want to know about EQUALITY of nodes, not equality of the ID (memory address in disguise) at which the node is stored. In other words, as stated earlier, things are easier and quicker if the nodes are hashable. Only hashable objects can be stored in a set, and equality of NODE doesn't imply equality of ID. (The converse is trivially true.) Here's an example >>> def f(): return 10**1000 >>> list(map(id, (f(), f()))) [139927013695536, 139927013696008] By the way, this surprised me. Would anyone like to explain this? >>> id(f()), id(f()) (139927013695536, 139927013695536) -- Jonathan

Richard Damon

7:36 p.m.

On 1/1/21 2:00 PM, Jonathan Fine wrote:

...

Hi Richard

You wrote

I believe that one solution to detecting the cycles is to create a set of the object IDs you have visited and started to print. If you come across an ID you have already seen, this point is in a cycle. Sets are fairly compact and intentionally fast to search for an item.

Indeed. But I see a flaw. The problem is that we want to know about EQUALITY of nodes, not equality of the ID (memory address in disguise) at which the node is stored.

In other words, as stated earlier, things are easier and quicker if the nodes are hashable. Only hashable objects can be stored in a set, and equality of NODE doesn't imply equality of ID. (The converse is trivially true.)

Here's an example >>> def f(): return 10**1000 >>> list(map(id, (f(), f()))) [139927013695536, 139927013696008]

By the way, this surprised me. Would anyone like to explain this? >>> id(f()), id(f()) (139927013695536, 139927013695536)

-- Jonathan

Second problem first, Two numbers that are equal may or might not be stored in the same object. It is unspecified. As to the first, just because to points have equal values doesn't mean that we have recursion. That is why I put the id()s in the set, the id()s are by definition hashable, and an object always has the same id() and no two different objects can have the same id(). Simple example: list1 = [ ['one', 'two'], ['one', 'two'] ] The first and second member of the above list are equal, but not the same object. Thus there is no reason to want to try to trim the second from an output. list1 = ['one', 'two'] list2 = [list1, list1] now the first and second member of the list list2 are both equal and the same object. Perhaps we might want to indicate this in a dump, but maybe not. list1 = ['one', 'two'] list1.append(list11) Now the third element of the list list1 IS the same object as list1, so we definitely want to not try and show it when we display the value of list1, but instead indicate that we have recursion. Even further: list1 = ['one', 'two'] list1.append(list1) list2 = ['one', 'two', list1] Now list2 will be equal to list1 but is a distinct object. Now the expansion of list2 will be: ['one', 'two', ['one', 'two', [...]]] We don't want to test for EQUAL nodes, we want to test for SAME nodes (same object, same id()) -- Richard Damon

Jeff Allen

9:16 p.m.

On 01/01/2021 19:36, Richard Damon wrote:

...

On 1/1/21 2:00 PM, Jonathan Fine wrote:

...
Hi Richard

You wrote

I believe that one solution to detecting the cycles is to create a set of the object IDs you have visited and started to print. If you come across an ID you have already seen, this point is in a cycle. Sets are fairly compact and intentionally fast to search for an item.

Indeed. But I see a flaw. The problem is that we want to know about EQUALITY of nodes, not equality of the ID (memory address in disguise) at which the node is stored.

As to the first, just because to points have equal values doesn't mean that we have recursion.

That is why I put the id()s in the set, the id()s are by definition hashable, and an object always has the same id() and no two different objects can have the same id().

Simple example:

list1 = [ ['one', 'two'], ['one', 'two'] ]

This makes me realise that the spanning tree I brought into the discussion is very likely a red herring. The print (or more generally iteration/visitation) of the structure needs to be curtailed not at an object you already visited, but only at an object that is between you and the root. (In the case of printing or serialisation, that is an object you are part-way through processing.) Here, you definitely want to visit a three times when printing b: >>> a = [1, 2, 3] >>> b = [a, a, a] >>> b [[1, 2, 3], [1, 2, 3], [1, 2, 3]] But after: >>> c = [4, a, b] >>> a.append(c) revisiting a and b is to be avoided during the visit to c >>> b [[1, 2, 3, [4, [...], [...]]], [1, 2, 3, [4, [...], [...]]], [1, 2, 3, [4, [...], [...]]]] However, I'm definitely with Richard that it is the identity of the node that matters, not its value (hashable or not). Of course, I'm assuming I know what kind of iteration Steven wanted, based on the reference to printing (repr). The question remains whether this can be done as a general capability, and I still see the problem as the inscrutability of built-in types (and their repr, etc.). Jeff

Eric V. Smith

7:53 p.m.

On 1/1/2021 2:00 PM, Jonathan Fine wrote:

...

By the way, this surprised me. Would anyone like to explain this? >>> id(f()), id(f()) (139927013695536, 139927013695536)

id's are only unique while an object exists. Here the object returned by the first call is discarded after its id is taken. Then the second object is created and reuses the memory address that was used by the first object. Since in CPython id() returns object's memory address, the id's are the same. Eric

Steven D'Aprano

2 Jan 2 Jan

2:01 a.m.

On Fri, Jan 01, 2021 at 02:53:33PM -0500, Eric V. Smith wrote:

...

On 1/1/2021 2:00 PM, Jonathan Fine wrote:

...
By the way, this surprised me. Would anyone like to explain this? >>> id(f()), id(f()) (139927013695536, 139927013695536)

id's are only unique while an object exists. Here the object returned by the first call is discarded after its id is taken. Then the second object is created and reuses the memory address that was used by the first object. Since in CPython id() returns object's memory address, the id's are the same.

What Eric said, except I would put more emphasis on two things: 1. I would emphasize more strongly that IDs are abstract identity numbers, not memory addresses except as an accident of implementation. For example in Jython and IronPython, the IDs are sequential numbers and won't be reused. In general, if the Python implementation is using a compacting garbage collector where objects can move in memory, IDs cannot be memory addresses. 2. And that it's a matter of happenstance, not design, that the same memory address happens to be used. I can't produce a demonstration right now, but I've seen senarios where every *second* temporary object reuses the previous IDs, so printing a sequence of IDs goes something like this: 1012, 1042, 1012, 1042, 1012, 1042 ... Presumably that could happen if the garbage collector takes longer to reclaim each temporary object than it takes the interpreter to generate and process the next one. -- Steve

Jeff Allen

1 Jan 1 Jan

8:43 p.m.

New subject: (Off-topic) Different object, same id

On 01/01/2021 19:00, Jonathan Fine wrote:

...

By the way, this surprised me. Would anyone like to explain this? >>> id(f()), id(f()) (139927013695536, 139927013695536)

That made me think. The result of the first f() has gone out of scope by the time the second call, and so the id (that is, the memory) is available next time CPython needs a place for an int of about the same bit_length.

...

...
...
def f(): return 10**1000

...

...
...
def g(): return 9**1048

...

...
...
id(f()), id(g()) (2149044150320, 2149044150320)

Neat demonstration that the id of an object is only unique in its lifetime. Jeff

Jonathan Fine

8:55 p.m.

New subject: (Off-topic) Different object, same id

Bravo, Jeff. I couldn't have chosen a better example. However, I'd expect large ints to be stored in two parts. A (class, pointer) pair which has fixed size. And memory referenced by the pointer, large enough to store the value of the large int. However, that's part of the language implementation, and not part of the language definition. -- Jonathan

Chris Angelico

11:20 p.m.

New subject: (Off-topic) Different object, same id

On Sat, Jan 2, 2021 at 7:56 AM Jonathan Fine wrote:

...

Bravo, Jeff. I couldn't have chosen a better example.

However, I'd expect large ints to be stored in two parts. A (class, pointer) pair which has fixed size. And memory referenced by the pointer, large enough to store the value of the large int.

However, that's part of the language implementation, and not part of the language definition.

Off-topic for python-ideas, but a bit of explanation... It doesn't matter how they're stored. The fact is that an integer - whether small or large - is an object, and every object has an identity, which can be tested with the "is" operator and the id() function. Python guarantees a few things about object identity: 1) Two distinct and concurrently-existing objects have different IDs. 2) Any given object always has the same ID. 3) Therefore "x is y" implies "id(x) == id(y)" and vice versa - as long as both objects exist simultaneously. There are a few things Python specifically doesn't guarantee: 1) Immutable and indistinguishable objects MAY be interned but may also be distinct. 2) If an object can no longer be found in any way, its ID may be reused. It's not possible to test "x is y" if you can't find x, therefore it's not possible to disprove the "x is y implies id(x) == id(y)" statement. 3) IDs have no particular meaning. They're not addresses, they're not sequential, they're just arbitrary numbers. (Some Python implementations don't even assign ID numbers until they're requested, which is fine as long as the three guarantees above are held.) Any time you save an ID longer than you save the original object, you open yourself up to the possibility of reuse - but depending on exactly what code you run, you might not actually see it. And some Python interpreters will never show that. OTOH, it's definitely possible for integers, strings, tuples, etc to reuse the same object even though they don't seem to need to; but there's absolutely no guarantee: Python 3.10.0a0 (heads/master:497126f7ea, Oct 2 2020, 21:22:13) [GCC 6.3.0 20170516] on linux Type "help", "copyright", "credits" or "license" for more information.

...

...
...
def f1(): return id("Hello") ... def f2(): return id("Hello") ... def f3(): return id("Hello") ... print(f1(), f2(), f3()) 140064512446192 140064512446192 140064512446192 def f1(): return id("Hello world") ... def f2(): return id("Hello world") ... def f3(): return id("Hello world") ... print(f1(), f2(), f3()) 140064512446064 140064512471728 140064512471536

CPython automatically interns *some* string literals, but not others :) A compliant Python implementation is welcome to return the same ID from each of these functions, or to return distinct IDs. ChrisA

Greg Ewing

2 Jan 2 Jan

12:06 a.m.

On 2/01/21 6:14 am, Jeff Allen wrote:

...

we may as well say that the required result takes the form of an iterable of the nodes, which may subsequently be iterated by a consumer.

In general you need more than that, I think. If you're printing a representation of the graph, you want to know about the structure, not just get a flat list of nodes.

...

But we must have to do something very similar to pickle arbitrary (potentially cyclic) structures, which is likewise dependent on special help from the particular built-in type. Can something be founded on |||__getstate__| ?

I don't think so. All the logic for dealing with cycles is buried inside pickle -- __getstate__ just gets info about one object. -- Greg

Samuel Freilich

12:38 a.m.

Interesting to look at the code for some of this. list.repr and dict.repr seem to be taking the "construct a set of seen items" approach. Except it's not using a set of ids, it's doing a linear pass over a list of visited PyObjects each time (which seems somewhat surprising to me, though it's only O(n*d) where d is the nesting depth, not O(n^2)): https://github.com/python/cpython/blob/3bf05327c2b25d42b92795d9d280288c22a09... copy.deepcopy keeps track with a dict: https://github.com/python/cpython/blob/3bf05327c2b25d42b92795d9d280288c22a09... (That uses a sentinel object as the default for the lookup so it can tell if that's explicitly set to None, though that seems convoluted compared to just checking whether the value is None or not. Is that considering that "copy.deepcopy(x) is None and x is not None" might be true?) Peace, -Sam On Fri, Jan 1, 2021 at 7:07 PM Greg Ewing wrote:

...

On 2/01/21 6:14 am, Jeff Allen wrote:

...
we may as well say that the required result takes the form of an iterable of the nodes, which may subsequently be iterated by a consumer.

In general you need more than that, I think. If you're printing a representation of the graph, you want to know about the structure, not just get a flat list of nodes.

...
But we must have to do something very similar to pickle arbitrary (potentially cyclic) structures, which is likewise dependent on special help from the particular built-in type. Can something be founded on |||__getstate__| ?

I don't think so. All the logic for dealing with cycles is buried inside pickle -- __getstate__ just gets info about one object.

-- Greg

_______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/3FMDYC... Code of Conduct: http://python.org/psf/codeofconduct/

Richard Damon

12:54 a.m.

It might not be that surprising to use a list if you are already using that list to keep track of your position in the hierarchy. You can probably assume that most of the time the depth is fairly low so the cost of scanning is low enough that the cost of building and updating the set costs enough that there aren't significant savings. On 1/1/21 7:38 PM, Samuel Freilich via Python-ideas wrote:

...

Interesting to look at the code for some of this.

list.repr and dict.repr seem to be taking the "construct a set of seen items" approach. Except it's not using a set of ids, it's doing a linear pass over a list of visited PyObjects each time (which seems somewhat surprising to me, though it's only O(n*d) where d is the nesting depth, not O(n^2)): https://github.com/python/cpython/blob/3bf05327c2b25d42b92795d9d280288c22a09... https://github.com/python/cpython/blob/3bf05327c2b25d42b92795d9d280288c22a09...

copy.deepcopy keeps track with a dict: https://github.com/python/cpython/blob/3bf05327c2b25d42b92795d9d280288c22a09... https://github.com/python/cpython/blob/3bf05327c2b25d42b92795d9d280288c22a09...

(That uses a sentinel object as the default for the lookup so it can tell if that's explicitly set to None, though that seems convoluted compared to just checking whether the value is None or not. Is that considering that "copy.deepcopy(x) is None and x is not None" might be true?)

Peace, -Sam

On Fri, Jan 1, 2021 at 7:07 PM Greg Ewing mailto:greg.ewing@canterbury.ac.nz> wrote:

On 2/01/21 6:14 am, Jeff Allen wrote: > we may as > well say that the required result takes the form of an iterable of the > nodes, which may subsequently be iterated by a consumer.

In general you need more than that, I think. If you're printing a representation of the graph, you want to know about the structure, not just get a flat list of nodes.

> But we must have to do something > very similar to pickle arbitrary (potentially cyclic) structures, which > is likewise dependent on special help from the particular built-in type. > Can something be founded on |||__getstate__| ?

I don't think so. All the logic for dealing with cycles is buried inside pickle -- __getstate__ just gets info about one object.

-- Greg

-- Richard Damon

Brendan Barnwell

1 Jan 1 Jan

7:42 p.m.

On 2020-12-31 21:34, Steven D'Aprano wrote:

...

Is there something we can do to make it easier for people to deal with recursive data structures, like the list repr and deepcopy code does? Is there a pattern we can abstract out and provide as a tool or recipe for people to use in their own code?

It sounds promising but so far a bit too vague for me to understand what you're really suggesting. What form would this "something" take? Are you envisioning, say, a function safe_recursive_iter so that you could do `for item in safe_recursive_iter(obj)` and it would cleanly raise an exception if it wound up recursing through the same object more than once, rather than getting stuck in an infinite loop? -- Brendan Barnwell "Do not follow where the path may lead. Go, instead, where there is no path, and leave a trail." --author unknown

Steven D'Aprano

2 Jan 2 Jan

1:13 a.m.

On Fri, Jan 01, 2021 at 11:42:12AM -0800, Brendan Barnwell wrote:

...

On 2020-12-31 21:34, Steven D'Aprano wrote:

...
Is there something we can do to make it easier for people to deal with recursive data structures, like the list repr and deepcopy code does? Is there a pattern we can abstract out and provide as a tool or recipe for people to use in their own code?

It sounds promising but so far a bit too vague for me to understand what you're really suggesting.

It is vague to me too, and I don't really have any concrete suggestions. I was hoping to start a discussion that might lead to some concrete suggestions. -- Steve

Cameron Simpson

1 Jan 1 Jan

9:35 p.m.

On 01Jan2021 16:34, Steven D'Aprano wrote:

...

This isn't so much an idea for Python, as a request for ideas to solve a problem in Python.

Back in the early days of Python, printing recursive lists could crash the interpreter:

a = [1, 2, 3] a.append(a) print(a)

If you try that today, you get a nice display:

[1, 2, 3, [...]] [...] Is there something we can do to make it easier for people to deal with recursive data structures, like the list repr and deepcopy code does? Is there a pattern we can abstract out and provide as a tool or recipe for people to use in their own code?

I don't know about a tool - that might be tricky to generalise - but my standard pattern for this is: def recurse(o, seen=None): if seen is None: seen = set() if id(o) in seen: return seen.add(id(o)) walk o and call recurse(sub_object, seen) ... The key to store in seen isn't always id(o), and for some tasks I discard the key on my way back out of the recursion. I can imagine writing a decorator for this. I'd probably like it to look like this: @no_recurse(keyfunc=None,popkey=False): def recurse(o, *, seen): ... recurse(sub_object, seen) maybe (untested): @decorator def no_recurse(func, keyfunc=None, popkey=False): if keyfunc is None: keyfunc = id def no_recurse_wrapper(o, *a, seen=None, **kw): if seen is None: seen = set() k = keyfunc(o) if k in seen: return None seen.add(k) try: return func(o, *a, seen=seen, **kw) finally: if popkey: seen.discard(k) The "@decorator" decorator is one of my own which decorates a decorator which may accept parameter of not, allowing both: @no_recurse def recurse(o, blah..., *, seen, other_kwargs...): and also: @no_recurse(keyfunc=lambda o: o.idkey()) def recurse(o, blah..., *, seen, other_kwargs...): for some nondefault decoration. There are some missing features there: what to return when recusion is encoutered (rather than None above), some mode to make the occurrence of recursion trivially noticeable by the primary function to do something better than "return None", etc etc. Cheers, Cameron Simpson

1209

Age (days ago)

1210

Last active (days ago)

List overview

Download

24 comments

13 participants

participants (13)

2QdxY4RzWzUUiLuE＠potatochowder.com
Brendan Barnwell
Cameron Simpson
Chris Angelico
Eric V. Smith
Greg Ewing
Jeff Allen
Jonathan Fine
Marco Sulla
Richard Damon
Samuel Freilich
Stestagg
Steven D'Aprano

Standard tool for iterating over recursive data structures?

tags

participants (13)