Faster object representation for UIs

Hello, I'd like to bring to your attention https://bugs.python.org/issue41383. The core idea here is, per Elizaveta Shashkova:
"I would like to have a lazy repr evaluation for the objects! Sometimes users have many really large objects, and when the debugger is trying to show them in the Variables View (= show their string representation) it can take a lot of time. We do some tricks, but they do not always work. It would be really cool to have a parameter in repr which defines the max number of symbols we want to evaluate during repr for this object."
Maybe repr is not the best fit here, because its output should be meaningful to the interpreter, and the __str__ method would be a better place for this. Maybe we could pass an optional limit argument to these methods, so that the user can decide what to print depending on how many characters they have left?
Any takes, or better ideas on how we could help with this problem?
Thanks, Bernat

On 2020-07-24 at 15:10:46 -0000, Gábor Bernát <jokerjokerer@gmail.com> wrote:
Any takes, better ideas how we could help this problem?
Use the pretty printer/formatter.¹ Start with a small depth and let the user adjust it. ¹ https://docs.python.org/3/library/pprint.html
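A minimal sketch of this suggestion, using the `depth` parameter of `pprint.pformat`. The `nested` object is made up for illustration. Note that `depth` only limits nesting: a single very long string at the top level would still be rendered in full.

```python
import pprint

# Illustrative data only; not taken from the thread.
nested = {"config": {"db": {"host": "localhost", "port": 5432}}, "ids": [1, [2, [3]]]}

# depth=1 collapses every container below the top level to "...".
shallow = pprint.pformat(nested, depth=1)
print(shallow)

# The user can then ask for more detail by raising the depth.
deeper = pprint.pformat(nested, depth=2)
print(deeper)
```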

That is definitely not language-behavior material, and it should be a concern for the authors of whatever projects have objects that demand so much processing to generate a "repr". It is certainly not a commonly encountered problem: I often have to deal with cumbersome reprs (even in my own projects), but that is due to their size. If an object performs recursive external-resource queries just to print an ordinary repr, to the point that it gets in the way of interactive use, it obviously should not, and it should provide a separate "full_repr(...)" method for that.
Note that you can _already_ accept an optional parameter in the `__repr__` method to behave as you proposed, or your `__repr__` could check a setting somewhere to find out the behavior it should have.
On Fri, 24 Jul 2020 at 12:14, Gábor Bernát <jokerjokerer@gmail.com> wrote:
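As a rough illustration of a `__repr__` that accepts an optional parameter alongside a separate "full" method, here is a sketch. The `Report` class and the `limit` parameter are hypothetical names chosen for this example; only `full_repr` echoes the name used in the message.

```python
class Report:
    """Hypothetical object whose complete representation is large."""

    def __init__(self, rows):
        self.rows = rows

    def __repr__(self, limit=5):
        # repr(obj) always calls this with the default limit;
        # callers that know about the parameter can pass their own.
        shown = ", ".join(repr(row) for row in self.rows[:limit])
        suffix = ", ..." if len(self.rows) > limit else ""
        return f"Report([{shown}{suffix}])"

    def full_repr(self):
        # Explicit opt-in for the expensive, complete representation.
        return self.__repr__(limit=len(self.rows))


r = Report(list(range(1000)))
print(repr(r))
print(r.__repr__(limit=2))
```

The built-in `repr()` never passes the extra argument, so a debugger would have to call `obj.__repr__(limit=...)` directly and cope with classes that do not accept it.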

While I agree that how to represent an object in limited space should be the responsibility of the project that provides the large objects, I think the type of solution we want to propose or recommend is language-behaviour material.

On Fri, 24 Jul 2020 at 16:43, Gábor Bernát <jokerjokerer@gmail.com> wrote:
While I agree the implementation on how to represent in limited space the object should be the responsibility of the project that provides objects of long size, I think it's a language-behaviour material what type of solution we want to propose/recommend.
I still don't see why. Unless you consider "use something application specific" as a language-behaviour type of recommendation. I'm still seeing nothing here that suggests that "modify something provided by the language or stdlib" is the best answer. Maybe the point of the original question is that you're looking for a function that can be used by *any* debugger, rather than a solution for one specific debugger. In that case, the people writing debuggers should agree on a standard protocol (maybe something like debugger_repr(object)) that they will use. But it's still not a language-level question, any more than (for example) the definition of the numpy ufunc mechanism is a language matter... I'm clearly missing the point of your question here. Can you clarify? Paul

On Sat, Jul 25, 2020 at 1:18 AM Gábor Bernát <jokerjokerer@gmail.com> wrote:
I honestly don't think that either __repr__ or __str__ is appropriate for this. You need some sort of hook that has, potentially, a lot of debugger hooks in it. I would say it's best handled by some sort of multiple dispatch within the debugger itself; it can handle core data types (list/tuple, dict) and then provide hooks for custom types to register themselves with it. But one thing that would be kinda nice would be to have a way for a class to say "I'm like a dict, but with extra info". Consider defaultdict and Counter:
Both of them include a dict-like repr in their reprs, and both of them would probably want to have the debugger display them in a dict-like way too. Maybe reprlib would be the place for something like this? ChrisA
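To make the point concrete, the snippet below prints the two reprs and then sketches one way `reprlib` could be taught about them. Subclassing `reprlib.Repr` and adding `repr_<type name>` methods is the extension mechanism described in the `reprlib` documentation; the `DictLikeRepr` class itself is a hypothetical name for this example, not an existing stdlib feature.

```python
import collections
import reprlib

print(repr(collections.defaultdict(list)))  # defaultdict(<class 'list'>, {})
print(repr(collections.Counter("aab")))     # Counter({'a': 2, 'b': 1})


class DictLikeRepr(reprlib.Repr):
    """Hypothetical Repr subclass that treats dict-like types as dicts."""

    def repr_Counter(self, obj, level):
        # Reuse reprlib's size-limited dict formatting for the contents.
        return "Counter(" + self.repr_dict(obj, level) + ")"

    def repr_defaultdict(self, obj, level):
        return "defaultdict(%r, %s)" % (obj.default_factory, self.repr_dict(obj, level))


short = DictLikeRepr()
print(short.repr(collections.Counter(range(100))))
print(short.repr(collections.defaultdict(list, {i: [] for i in range(10)})))
```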

On Fri, 24 Jul 2020 at 16:15, Gábor Bernát <jokerjokerer@gmail.com> wrote:
Why not just use a custom function for this? I don't understand why this has to be coupled to repr, or indeed to anything that's special to the repr. The debugger (presumably a custom application) could call a custom function to generate the string representation, and that function could have any API it wants. The default implementation of the function (functools.singledispatch seems like it would be ideal for this) could just call repr, so that objects that don't need special treatment would use repr. This doesn't seem like it's something that should need language support at all. Paul
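A minimal sketch of that idea with `functools.singledispatch`. The function name `debugger_repr` comes from the earlier message in this thread; the `limit` parameter and the specific handlers are assumptions made for illustration.

```python
from functools import singledispatch


@singledispatch
def debugger_repr(obj, limit=80):
    # Default: objects without special treatment just use repr().
    text = repr(obj)
    return text if len(text) <= limit else text[: limit - 3] + "..."


@debugger_repr.register(str)
def _str_repr(obj, limit=80):
    # Slice before repr() so a huge string is never fully escaped.
    if len(obj) <= limit:
        return repr(obj)
    return repr(obj[:limit]) + f"... ({len(obj)} chars)"


@debugger_repr.register(list)
def _list_repr(obj, limit=80):
    # Show only the first few items, each formatted recursively.
    head = ", ".join(debugger_repr(item, limit) for item in obj[:3])
    more = ", ..." if len(obj) > 3 else ""
    return f"[{head}{more}]"


print(debugger_repr(42))
print(debugger_repr([1, 2, 3, 4]))
print(debugger_repr("a" * 1_000_000, limit=10))
```

Third-party types could register their own handlers with `debugger_repr.register(SomeType)` without any change to the language.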

You could do it with a custom function, however the hope in this e-mail thread was that the language should agree on this function name, and ideally should be __str__/__repr__ with an optional argument. And then we should implement stdlib types to follow this custom logic (think e.g. of repr-ing an array that has lots of values). On Fri, Jul 24, 2020 at 5:59 PM Paul Moore <p.f.moore@gmail.com> wrote:

But adding an optional parameter to an existing dunder is pretty much the worst choice. Every existing method of that name would have to be altered, or you’d end up with horrible code to cope with it — either catching exceptions or introspection. On Fri, Jul 24, 2020 at 10:35 Bernat Gabor <jokerjokerer@gmail.com> wrote:
-- --Guido (mobile)

You may be interested in my library: https://github.com/alexmojaki/cheap_repr . It was created precisely for the purpose of generating many reprs quickly for my debugging libraries. On Fri, Jul 24, 2020 at 5:17 PM Gábor Bernát <jokerjokerer@gmail.com> wrote:

On 24.07.20 18:10, Gábor Bernát wrote:
We need a structural repr protocol, which would represent a complex object as a structure containing items and attributes, so that pprint() would know how to format a multiline text representation, and graphical tools could represent objects as a tree, with deep children and long sequences collapsed by default and expandable interactively. It was discussed in the past, but we still do not have a good specification of such a protocol.
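No such protocol exists today, so the following is only a guess at what a tree-shaped description might look like. `Node`, `describe` and `expand` are hypothetical names, and only a few builtin types are handled.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Any, Iterator, Tuple


@dataclass
class Node:
    label: str  # short summary that is cheap to compute
    value: Any  # underlying object, only inspected when expanded


def describe(obj) -> Node:
    """Build a collapsed node without formatting the object's contents."""
    if isinstance(obj, (list, tuple, dict, set)):
        return Node(f"{type(obj).__name__} with {len(obj)} items", obj)
    if isinstance(obj, str):
        return Node(f"str of length {len(obj)}", obj)
    return Node(type(obj).__name__, obj)


def expand(node: Node, max_items: int = 10) -> Iterator[Tuple[str, Node]]:
    """Yield (name, child node) pairs when the user opens a node."""
    obj = node.value
    if isinstance(obj, dict):
        items = ((repr(key), value) for key, value in obj.items())
    elif isinstance(obj, (list, tuple)):
        items = ((str(index), value) for index, value in enumerate(obj))
    else:
        items = iter(())
    for name, child in islice(items, max_items):
        yield name, describe(child)


root = describe({"a": [1, 2, 3], "b": "x" * 1000})
print(root.label)
for name, child in expand(root):
    print(" ", name, "->", child.label)
```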

Hi! Thanks everyone for the interest and for the suggested options! I would like to add my two cents and clarify some points as the original requester of this feature.
1. We need this lazy `__repr__` calculation inside our debugger, where we work with different users' objects. Usually it isn't some specific type for which you know that it will be big and its `__repr__` calculation will be slow (like, for example, pandas.DataFrame). Sometimes it can be just a composition of builtin types, like in the example below. On the top level it's just a `dict` with 10 elements and you don't expect its `repr()` to be slow, but it takes 13 seconds on my machine to calculate it.
```
import time

def build_data_object():
    data = dict()
    for i in range(10):
        temp_dict = dict()
        for j in range(10):
            temp_dict[str(j)] = "a" * 30000000
        data[str(i)] = temp_dict
    return data

obj = build_data_object()
start = time.time()
repr(obj)
finish = time.time()
print("Time: %.2f" % (finish - start))
```
2. I also agree it isn't the best idea to add additional parameters to `repr` or `str`. Just a function like `lazy_repr` implemented in the stdlib would already be very useful.
3. But I also believe this issue can't be solved without changes in the language or stdlib, because you can't predict the length of `repr` for an object of unknown type without calculating the whole string. But I hope it should be possible to check the current buffer size during `__repr__` generation and interrupt it when it reaches the limit. (Sorry, I'm not a CPython developer and I might be too naive here, so correct me if I'm wrong.)
Elizaveta Shashkova.
On Sat, 25 Jul 2020 at 13:27, Serhiy Storchaka <storchaka@gmail.com> wrote:
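For objects built purely from builtin containers, as in the example above, a budget-limited walk can already be written in pure Python. The sketch below is one possible shape for a `lazy_repr` helper; the name comes from the message, but the signature and behaviour are assumptions. It only understands `str`, `list` and `dict`, and any other type still pays for a full `repr()` call.

```python
class _BudgetExceeded(Exception):
    """Raised internally once the character budget is used up."""


def lazy_repr(obj, limit=100):
    """Return roughly the first `limit` characters of repr(obj)."""
    out = []
    remaining = limit

    def emit(text):
        nonlocal remaining
        if len(text) > remaining:
            out.append(text[:remaining] + "...")
            raise _BudgetExceeded
        out.append(text)
        remaining -= len(text)

    def walk(value):
        if type(value) is str:
            # Slice before repr() so a huge string is never fully escaped.
            emit(repr(value[: remaining + 1]))
        elif type(value) is list:
            emit("[")
            for i, item in enumerate(value):
                if i:
                    emit(", ")
                walk(item)
            emit("]")
        elif type(value) is dict:
            emit("{")
            for i, (key, item) in enumerate(value.items()):
                if i:
                    emit(", ")
                walk(key)
                emit(": ")
                walk(item)
            emit("}")
        else:
            # Unknown types: this sketch cannot avoid the full repr() here.
            emit(repr(value))

    try:
        walk(obj)
    except _BudgetExceeded:
        pass
    return "".join(out)


print(lazy_repr({"a": [1, 2]}))
print(lazy_repr({"0": {"0": "a" * 100_000}}, limit=30))
```

The limitation in the `else` branch is exactly the one raised in point 3: without cooperation from the object, arbitrary `__repr__` methods cannot be interrupted part-way.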

Hello, On Sat, 25 Jul 2020 16:34:16 +0300 Elizabeth Shashkova <elizabeth.shashkova@gmail.com> wrote:
Did you consider that calling __repr__ on an object may start formatting your (or the user's) hard drive (arbitrary code is executed)? Alternatively, it may take a long time (maybe it mines bitcoin on each call), as you discovered. So calling __repr__ arbitrarily doesn't seem to be a good approach for a *generic* Python debugger. Then you are probably not writing a generic Python debugger, but an ad hoc, special-purpose one. And the problems you're facing are with specific object types, so it may be a good idea to contact the projects which supply those types and tell them that your special-purpose debugger has problems with them. I still wonder what the chances are that the answer would be "don't call __repr__ then".
Speaking of the technical side, it's indeed sad at times that Python's __str__/__repr__ require the entire representation to be materialized in memory. A more frugal approach would be:
    def __stream_repr__(self, stream):
        stream.write("<MyObj ")
        # Produce the rest of representation piece-wise, calling
        # stream.write()
        stream.write(">")
This can be a great saving of memory, for implementations which care about that (is that CPython?). It could also provide a means to address the posed problem, as "stream" can be a custom stream object whose .write() method checks e.g. the representation size or a time budget and, if it's exceeded, throws an exception, which should be caught by the code that requested the repr in the first place.
Another alternative could be turning __repr__ into a generator (i.e., introducing __irepr__), but that's quite an expensive solution, given that a generator instance needs to be heap-allocated on each call before it can be iterated over.
-- Best regards, Paul mailto:pmiscml@gmail.com
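A sketch of how the stream idea could be used to enforce a size budget. `__stream_repr__` is the hypothetical method from the message and is not an existing Python protocol; `BudgetStream`, `BudgetExceeded` and `stream_repr` are names invented for this example.

```python
class BudgetExceeded(Exception):
    """Raised by the stream once the size budget is used up."""


class BudgetStream:
    def __init__(self, max_chars):
        self.max_chars = max_chars
        self.parts = []
        self.size = 0

    def write(self, text):
        room = self.max_chars - self.size
        if len(text) > room:
            # Keep what still fits, then abort the producer.
            self.parts.append(text[:room])
            self.size = self.max_chars
            raise BudgetExceeded
        self.parts.append(text)
        self.size += len(text)

    def getvalue(self):
        return "".join(self.parts)


class MyObj:
    def __init__(self, items):
        self.items = items

    def __stream_repr__(self, stream):  # hypothetical protocol method
        stream.write("<MyObj ")
        for item in self.items:
            stream.write(repr(item))
            stream.write(" ")
        stream.write(">")


def stream_repr(obj, max_chars=40):
    stream = BudgetStream(max_chars)
    try:
        obj.__stream_repr__(stream)
    except BudgetExceeded:
        return stream.getvalue() + "..."
    return stream.getvalue()


print(stream_repr(MyObj(range(3))))
print(stream_repr(MyObj(range(10**6)), max_chars=20))
```

Because the producer is interrupted by the exception, the second call stops after a handful of items instead of formatting a million of them.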

Well, about pprint, the solution is not easy... but about debugging, pytest for example does not show the full repr of an object if the object is too large. You have to run pytest with the flag -v or -vv. Maybe that code could be moved into a separate little library. Another simple solution could be to not call repr for nested objects, maybe by introducing a parameter (deep=True?).
I'm thinking about GitHub: when you view a diff, if the file is heavily modified, the diff is not shown, but you can click to see it. In the same way, a debugger could cache the repr of an object (and of its nested objects separately) and check whether it's too big. If so, it could return the generic repr (<__main__.A object at 0x7fb3ce417e50>) and offer a click to see the full contents.
On Sat, 25 Jul 2020 at 16:23, Paul Sokolovsky <pmiscml@gmail.com> wrote:

On 26/07/20 1:34 am, Elizabeth Shashkova wrote:
Seems to me it would be better for the debugger to take a conservative approach here, and assume that the repr *will* be big and/or slow unless it's a type that it *does* know about. All other objects should be displayed using something similar to object.__repr__ unless the user requests otherwise. -- Greg
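One way to approximate this conservative policy with the stdlib is to let `reprlib` handle the builtin containers it knows about and fall back to `object.__repr__` for everything else. `ConservativeRepr` and `SCALARS` are hypothetical names for this sketch. `reprlib` picks its handler by type name, so this is an approximation rather than a strict allowlist.

```python
import reprlib

# Types whose repr() is assumed to be cheap enough to call directly.
SCALARS = (int, float, complex, bool, type(None))


class ConservativeRepr(reprlib.Repr):
    def repr_instance(self, obj, level):
        # reprlib calls this for any type it has no dedicated handler for.
        if type(obj) in SCALARS:
            return repr(obj)
        # Never run user-defined __repr__ code unless explicitly requested.
        return object.__repr__(obj)


class Slow:
    def __repr__(self):
        raise AssertionError("user __repr__ should not be called")


text = ConservativeRepr().repr([1, 2.5, None, Slow()])
print(text)
```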

participants (14)
- 2QdxY4RzWzUUiLuE@potatochowder.com
- Alex Hall
- Bernat Gabor
- Chris Angelico
- Elizabeth Shashkova
- Eric V. Smith
- Greg Ewing
- Guido van Rossum
- Gábor Bernát
- Joao S. O. Bueno
- Marco Sulla
- Paul Moore
- Paul Sokolovsky
- Serhiy Storchaka