[Python-ideas] Re: pickle.reduce and deconstruct

Feb. 9, 2020

      Daniel Spitz wrote:
...
On Fri, Feb 7, 2020 at 1:51 PM Andrew Barnert abarnert@yahoo.com wrote:
...
That being said, if you have a use for the object graph walk, that
shouldn’t be too hard to write. It’s just that there
are nontrivial design
issues to answer. Do we need to expose the graph itself in some form, or
just the memoized walk iterator? Is an Iterator the right way to expose it,
as opposed to something like the Smalltalk-ish walker classes in the AST
module?
(See my comments further down for more on this.) An entire graph structure
isn't necessary for my use case. An iterator over values would be
insufficient, because it wouldn't assist you in building up the replacement
structure. An iterator over node objects could work but it might be weird
to update them as you iterate. I think a walker is probably most
appropriate.
I think the simplest interface might just look like `copy.deepcopy` but with an extra `transformer` argument, which is called on each value. The `ast.NodeTransformer` style doesn't really work, because you have to register transformers for an arbitrary set of types that probably need qualified names (or might not even have public names), rather than for a fixed set of types from the `ast` module. So you're going to need a single `transform` method that explicitly switches on types (or, maybe better, uses `@singledispatch` to do it for you), at which point the class is just extra boilerplate.

The first question is, should this be top-down (like doing `copy.deepcopy`, but calling `transformer` on each original value before recursively deep-copying it) or bottom-up (like doing `pickle.loads(pickle.dumps())`, but calling `transformer` on each reconstructed value)? I think bottom-up is more generally useful, but there are some cases where it seems wrong. For example, if you want to transform every `Spam` into an `Eggs`, and `Spam` isn't reducible, top-down could still easily work, but bottom-up wouldn't. I suppose it could be an option (like `os.walk`), or even give you both (like `ElementTree.iterparse`)?

Also, it seems a bit weird that types that are deepcopyable with `__deepcopy__` but not pickleable—or that are both, but `__deepcopy__` is a major optimization—can't use `__deepcopy__`. But I'm not sure there's any way we could use that protocol. The `__deepcopy__` method just takes `self, memo` and usually works by calling `copy.deepcopy(attr, memo)` on all of its attributes, and there's no obvious clean way to swizzle that into calling our transforming-deepcopy with our transformer instead. If it's important to make that work, then it's worth checking whether I'm right about that rather than just accepting that I didn't see an obvious clean way…

One more question: You don't want to see duplicate values. If a value has been transformed, you want the replacement value each time it shows up again then? Or should a transformed value not go into the memo cache, so if the same value appears again it can be transformed differently? Or does that also need to be an option?

Finally, I think we'd want the `deepcopy` style of memoization, where we just rely on `id` to be persistent through the entire copy (which includes keeping alive any values that might get replaced along the way), not the `pickle` style that has to construct and using persistent IDs. So we don't need to use any of the pickle ID machinery.

Anyway, if you wanted to implement this today, you'd need access to multiple (and undocumented, and periodically changing) things. But I think all we'd need `copy` to expose to make this doable without being fragile is:

    def reconstruct(x, memo, reduction, *, recurse=deepcopy):
        return _reconstruct(x, memo, *reduction, deepcopy=recurse)

    def copier(cls):
        if copier := _deepcopy_dispatch.get(cls):
            return copier
        if issubclass(cls, type):
            return _deepcopy_atomic

And add a `deepcopy=deepcopy` parameter to `_deepcopy_atomic` and `_deepcopy_method`, and the latter would have to use it instead of the global, just like the other copier functions. Or, maybe better, rename it to `recurse` in all of them.

And with those changes to `copy`, plus the `pickle.reduce` function I suggested earlier, I believe you can write `transform` yourself like this (quick&dirty hack based on existing `copy.deepcopy`, which assumes you want bottom-up, and transformed values to be memoized, and no attempt at `__deepcopy__` support):

    def transform(x, memo=None, *, transformer, _nil=[]):
        if memo is None:
            memo = {}

        recurse = functools.partial(transform, transformer=transformer)

        d = id(x)
        y = memo.get(d, _nil)
        if y is not _nil:
            return y

        # y = transform(y) # top-down walk; untested

        cls = type(x)

        if copier := copy.copier(cls):
            y = copier(x, memo, recurse=recurse)
        else:
            rv = pickle.reduce(x)
            # not sure what this is for, but copy.deepcopy does it...
            if isinstance(rv, str):
                y = x
            else:
                y = copy.reconstruct(x, memo, rv, recurse=recurse)

        y = transformer(y) # bottom-up walk

        # If is its own copy, don't memoize.
        if y is not x:
            memo[d] = y
            memo.setdefault(id(memo), []).append(x) # make sure x lives at least as long as d
        return y

I think this is sufficiently non-magical for someone to write as user/third-party-library code. 

And I don't think any of this puts unacceptable constraints on future versions of Python or other implementations.

The one possible issue is the `memo` thing. What if a future version of `copy` wanted to use something different from a plain dict for memoization, or wanted to use the dict differently in a way that our `setdefault` trick would break? I don't think that's an issue, but if it is, one more thing to expose:

    class Memo(dict):
        def keep_alive(self, x):
            self.setdefault(id(self), []).append(x)

And then the user `transform` function has to use `copy.Memo()` in place of `{}`, and call `memo.keep_alive(x)` instead of doing it manually.

Here's an example of using it:

    @dataclass
    class Spam:
        x: int
        y: float
        z: list

    @functools.singledispatch
    def transformator(x):
        return x

    @transformator.register
    def _(x: int):
        return str(x)

    spam = Spam(2, 3.0, z=[1, 2.0, 'abc'])
    print(transform(spam, transformer=transformator))

This should print:

    Spam(x='2', y=3.0, z=['1', 2.0, 'abc'])