[Python-Dev] PEP 454 (tracemalloc): new minimalist version

Charles-François Natali cf.natali at gmail.com
Fri Oct 18 19:56:30 CEST 2013


Hi,

I'm happy to see this move forward!

> API
> ===
>
> Main Functions
> --------------
>
> ``clear_traces()`` function:
>
>     Clear traces and statistics on Python memory allocations, and reset
>     the ``get_traced_memory()`` counter.

That's nitpicking, but how about just ``reset()`` (I'm probably biased
by oprofile's opcontrol --reset)?

> ``get_stats()`` function:
>
>     Get statistics on traced Python memory blocks as a dictionary
>     ``{filename (str): {line_number (int): stats}}`` where *stats* is a
>     ``(size: int, count: int)`` tuple; *filename* and *line_number* can
>     be ``None``.

It's probably obvious, but you might want to say once what *size* and
*count* represent (and the unit for *size*).
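
For instance, it would help to state explicitly that something along
these lines holds (assuming *size* is a total in bytes and *count* the
number of currently allocated blocks):

    from tracemalloc import get_stats  # proposed module, sketch only

    stats = get_stats()
    for filename, line_stats in stats.items():
        for lineno, (size, count) in line_stats.items():
            # size: total size in bytes of the blocks allocated at this
            # line (assumption); count: number of blocks still allocated
            print("%s:%s: %s bytes in %s blocks"
                  % (filename, lineno, size, count))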

> ``get_tracemalloc_memory()`` function:
>
>     Get the memory usage in bytes of the ``tracemalloc`` module as a
>     tuple: ``(size: int, free: int)``.
>
>     * *size*: total size of bytes allocated by the module,
>       including *free* bytes
>     * *free*: number of free bytes available to store data

What's *free* exactly? I assume it's linked to the internal storage
area used by tracemalloc itself, but that's not clear at all.

Also, is the tracemalloc overhead included in the above stats (I'm
mainly thinking about get_stats() and get_traced_memory())?
If yes, I find it somewhat confusing: for example, AFAICT, valgrind's
memcheck doesn't report its own memory overhead, even though it can be
quite large, simply because it's not interesting to the user.

> Trace Functions
> ---------------
>
> ``get_traceback_limit()`` function:
>
>     Get the maximum number of frames stored in the traceback of a trace
>     of a memory block.
>
>     Use the ``set_traceback_limit()`` function to change the limit.

I didn't see the default value for this setting stated anywhere: it
would be nice to document it, and also explain the rationale (memory/CPU
overhead...).
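
Even a one-line usage example in the docs would help, e.g. (the value
here is arbitrary, I'm not claiming anything about the default):

    # trade extra memory/CPU per traced block for more useful tracebacks
    set_traceback_limit(10)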

> ``get_object_address(obj)`` function:
>
>     Get the address of the main memory block of the specified Python object.
>
>     A Python object can be composed of multiple memory blocks; the
>     function only returns the address of the main memory block.

IOW, this should return the same value as id() on CPython? If so, that
would be worth a note.
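
Concretely, I'd expect something like this to hold on CPython
(assumption on my side, since id() returns the object's address there):

    obj = object()
    assert get_object_address(obj) == id(obj)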

> ``get_object_trace(obj)`` function:
>
>     Get the trace of a Python object *obj* as a ``(size: int,
>     traceback)`` tuple where *traceback* is a tuple of ``(filename: str,
>     lineno: int)`` tuples, *filename* and *lineno* can be ``None``.

I find the word "trace" confusing, so it might be worth adding a note
somewhere explaining what it is ("callstack leading to the object
allocation", or whatever).

Also, this function leaves me with mixed feelings: it's called
get_object_trace(), but you also return the object size - well, a
vague estimate thereof. I wonder if the size really belongs here,
especially if the information returned isn't really accurate: it will
be accurate for an integer, but not for e.g. a list, right? How about
just using sys.getsizeof(), which would give a more accurate result?
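
Something like this is what I have in mind (hypothetical, assuming the
reported size only covers the main memory block):

    import sys

    data = [1, 2, 3] * 1000
    # presumably only the list header block, not the separately
    # allocated pointer array
    size, traceback = get_object_trace(data)
    # sys.getsizeof() accounts for the header plus the item buffer
    # (still not the items themselves, but at least well defined)
    print(size, sys.getsizeof(data))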

> ``get_trace(address)`` function:
>
>     Get the trace of a memory block as a ``(size: int, traceback)``
>     tuple where *traceback* is a tuple of ``(filename: str, lineno:
>     int)`` tuples, *filename* and *lineno* can be ``None``.
>
>     Return ``None`` if the ``tracemalloc`` module did not trace the
>     allocation of the memory block.
>
>     See also ``get_object_trace()``, ``get_stats()`` and
>     ``get_traces()`` functions.

Do you have example use cases where you want to work with raw addresses?
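
The only use I can think of is something like the following, which
seems already covered by get_object_trace():

    trace = get_trace(get_object_address(some_object))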

> Filter
> ------
>
> ``Filter(include: bool, pattern: str, lineno: int=None, traceback:
> bool=False)`` class:
>
>     Filter to select which memory allocations are traced. Filters can be
>     used to reduce the memory usage of the ``tracemalloc`` module, which
>     can be read using the ``get_tracemalloc_memory()`` function.
>
> ``match(filename: str, lineno: int)`` method:
>
>     Return ``True`` if the filter matches the filename and line number,
>     ``False`` otherwise.
>
> ``match_filename(filename: str)`` method:
>
>     Return ``True`` if the filter matches the filename, ``False`` otherwise.
>
> ``match_lineno(lineno: int)`` method:
>
>     Return ``True`` if the filter matches the line number, ``False``
>     otherwise.
>
> ``match_traceback(traceback)`` method:
>
>     Return ``True`` if the filter matches the *traceback*, ``False``
>     otherwise.
>
>     *traceback* is a tuple of ``(filename: str, lineno: int)`` tuples.

Are those ``match`` methods really necessary for the end user, i.e.
are they worth being exposed as part of the public API?
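
I'd expect end-user code to only ever construct filters and hand them
to the module, never to call the match*() methods directly, e.g.
(assuming some add_filter()-like registration function exists):

    # only trace allocations made by my own package, and never trace
    # tracemalloc itself
    add_filter(Filter(True, "myproject/*"))
    add_filter(Filter(False, "*/tracemalloc.py"))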

> StatsDiff
> ---------
>
> ``StatsDiff(differences, old_stats, new_stats)`` class:
>
>     Differences between two ``GroupedStats`` instances.
>
>     The ``GroupedStats.compare_to()`` method creates a ``StatsDiff``
>     instance.
>
> ``sort()`` method:
>
>     Sort the ``differences`` list from the biggest difference to the
>     smallest difference. Sort by ``abs(size_diff)``, *size*,
>     ``abs(count_diff)``, *count* and then by *key*.
>
> ``differences`` attribute:
>
>     Differences between ``old_stats`` and ``new_stats`` as a list of
>     ``(size_diff, size, count_diff, count, key)`` tuples. *size_diff*,
>     *size*, *count_diff* and *count* are ``int``. The key type depends
>     on the ``GroupedStats.group_by`` attribute of ``new_stats``: see the
>     ``Snapshot.top_by()`` method.
>
> ``old_stats`` attribute:
>
>     Old ``GroupedStats`` instance, can be ``None``.
>
> ``new_stats`` attribute:
>
>     New ``GroupedStats`` instance.

Why keep references to ``old_stats`` and ``new_stats``?
datetime.timedelta doesn't keep references to the date objects it was
computed from.

Also, if you sort the differences by default (which is a sensible
choice), then StatsDiff becomes pretty much useless, since you would
just keep its ``differences`` attribute (sorted).
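
I.e. I'd expect most code using it to boil down to something like this
(the argument order of compare_to() is a guess on my part):

    diff = new_stats.compare_to(old_stats)
    diff.sort()
    for size_diff, size, count_diff, count, key in diff.differences[:10]:
        print(key, size_diff, count_diff)

at which point a plain, already-sorted list of tuples would serve just
as well.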

> Snapshot
> --------
>
> ``Snapshot(timestamp: datetime.datetime, traces: dict=None, stats:
> dict=None)`` class:
>
>     Snapshot of traces and statistics on memory blocks allocated by Python.


I'm confused.
Why are get_trace(), get_object_trace(), get_stats() etc. not methods
of a Snapshot object?
Is it because you don't store all the necessary information in a
snapshot, or are they just some sort of shorthand, like:

    stats = get_stats()

vs

    snapshot = Snapshot.create()
    stats = snapshot.stats

> ``write(filename)`` method:
>
>     Write the snapshot into a file.

I assume it's in a serialized form, only readable by Snapshot.load()?
BTW, it's a nitpick and debatable, but write()/read() or load()/dump()
would be more consistent (see e.g. pickle's load/dump).
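
I.e., as a suggestion only:

    snapshot.dump(filename)             # instead of write()
    snapshot = Snapshot.load(filename)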

> Metric
> ------
>
> ``Metric(name: str, value: int, format: str)`` class:
>
>     Value of a metric when a snapshot is created.

Alright, what's a metric again ;-) ?

I don't know if it's customary, but having short examples would IMO be nice.
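
E.g. something along these lines would already clarify a lot (I'm only
guessing at the intended usage; add_metric(), read_process_rss() and
the "size" format are all assumptions on my part):

    snapshot = Snapshot.create()
    # record an arbitrary named value alongside the snapshot, e.g. the
    # process RSS, so it can be compared between two snapshots later
    snapshot.add_metric("my_rss", read_process_rss(), "size")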

cf

