[Guppy-pe-list] An iteration idiom (Was: Re: loading files containing multiple dumps)

Sverker Nilsson sverker.nilsson at sncs.se
Tue Sep 8 12:53:24 EDT 2009


On Mon, 2009-09-07 at 16:53 +0100, Chris Withers wrote:
> Sverker Nilsson wrote:
> > I hope the new loadall method as I wrote about before will resolve this.
> > 
> > def loadall(self,f):
> >     ''' Generates all objects from an open file f or a file named f'''
> >     if isinstance(f,basestring):
> >         f=open(f)
> >     while True:
> >         yield self.load(f)
> 
> It would be great if load either returned just one result ever, or 
> properly implemented the iterator protocol, rather than half 
> implementing it...
> 
Agreed, this is arguably a bug, or at least a misfeature. As Raymond
Hettinger also remarked, it is not normal for an ordinary function to
raise StopIteration.

But I don't think I would want to risk breaking someone's code just for
this when we could just add a new method.

> > Should we call it loadall? It is a generator so it doesn't really load
> > all immediately, just lazily. Maybe call it iload? Or redefine load,
> > but that might break existing code so would not be good.
> 
> loadall works for me, iload doesn't.
> 

Or we could have an option to hpy() to redefine load() as loadall(), but
I think it is cleaner (and easier) to just define a new method...

Settled then? :-)
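For concreteness, here is a sketch of how the new method would be used,
assuming it ends up as loadall (the file name is made up). Note that the
generator simply ends when the underlying load() reaches end of file:

>>> from guppy import hpy
>>> h = hpy()
>>> for stat in h.loadall('dumps.hpy'):  # hypothetical file of dumps
...     print stat                       # one loaded Stat per iteration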

> >> Minor rant, why do I have to instantiate a
> >> <class 'guppy.heapy.Use._GLUECLAMP_'>
> >> to do anything with heapy?
> >> Why doesn't heapy just expose load, dump, etc?
> > 
> > Basically, the need for the h=hpy() idiom is to avoid any global
> > variables. 
> 
> Eh? What's h then? (And h will reference whatever globals you were 
> worried about, surely?)

h is whatever you make it in the context where you create it: a global
variable, a local variable, or an object attribute.

Interactively, I guess one tends to have it as a global variable, yes.
But it is a global variable you created and are responsible for
yourself, and there are no other global variables behind the scenes
beyond the ones you create yourself (plus possibly the results of
heap() etc., if you store them in your environment).

If writing test programs, I would not use global variables, but would
tend to have h as a class attribute in a test class, e.g. a
unittest.TestCase. It could also be a local variable in a test
function.
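For example, a minimal sketch (class and test names made up):

import unittest
from guppy import hpy

class HeapUsageTest(unittest.TestCase):
    h = hpy()  # class attribute: freed together with the class

    def test_something(self):
        print self.h.heap()

if __name__ == '__main__':
    unittest.main()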

As the enclosing class or frame is deallocated, so is its attribute h
itself. There should be nothing that stays allocated in other modules
after one test (class) is done (other than some loaded modules
themselves, but I am talking about the heavier data, which can amount
to hundreds of megabytes or more).

> > Heapy uses some rather big internal data structures, to cache
> > such things as dict ownership. I didn't want to have all those things in
> > global variables. 
> 
> What about attributes of a class instance of some sort then?

They are already attributes of an instance: hpy() is a convenience
factory method that creates a top level instance for this purpose.

> > the other objects you created. Also, it allows for several parallel
> > invocations of Heapy.
> 
> When is that helpful?

For example, the setref() method sets a reference point somewhere in h.
Further calls to heap() would report only objects allocated after that
call. But you could use a new hpy() instance to see all objects again.
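Sketched in a session (the allocation is of course made up):

>>> from guppy import hpy
>>> h = hpy()
>>> h.setref()                   # set the reference point
>>> x = [[] for i in range(1000)]
>>> h.heap()                     # only objects allocated after setref()
>>> hpy().heap()                 # a fresh instance still sees everything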

Multiple threads come to mind, where each thread would have its own
hpy() object. (Thread safety may still be a problem but at least it
should be improved by not sharing the hpy() structures.)
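For instance, a hypothetical helper giving each thread its own
instance, assuming nothing about Heapy beyond hpy() itself:

import threading
from guppy import hpy

_local = threading.local()

def thread_hpy():
    # create one hpy() instance per thread, on first use
    if not hasattr(_local, 'h'):
        _local.h = hpy()
    return _local.h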

Even in the absence of multiple threads, you might have an outer
invocation of hpy() that is used for global analysis, with its specific
options, setref()'s etc., and inner invocations that do some local
analysis, perhaps within a single method.

> > However, I am aware of the extra initial overhead to do h=hpy(). I
> > discussed this in my thesis. "Section 4.7.8 Why not importing Use
> > directly?" page 36, 
> > 
> > http://guppy-pe.sourceforge.net/heapy-thesis.pdf
> 
> I'm afraid, while I'd love to, I don't have the time to read a thesis...

But it is (an important) part of the documentation. For example, it
contains the rationale and an introduction to the main categories, such
as Sets, Kinds and EquivalenceRelations, and some use cases, for
example how to plug a memory leak in a windowing program.

I'm afraid, while I'd love to, I don't have the time to duplicate the
thesis here...;-)

> > Try sunglasses:) (Well, I am aware of this, it was a
> > research/experimental system and could have some refactoring :-)
> 
> I would suggest creating a minimal system that allows you to do heap() 
> and then let other people build what they need from there. Simple is 
> *always* better...

Do you mean we should actually _remove_ features to create a new
standalone system?

I don't think that would be meaningful: you don't need to use anything
other than heap() if you don't want to.

You are free to wrap functions as you find suitable; a minimal wrapper
module could be just like this:

# Module heapyheap
from guppy import hpy
h = hpy()
heap = h.heap  # the bound method, callable as heap()
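Client code could then do just:

>>> from heapyheap import heap
>>> heap()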

Should we add some such module? In the thesis I discussed this already
and argued it was not worth the trouble. And I think it may be
confusing; as in Python, I think it is good that 'there is only one way
to do it'.

> >> Less minor rant: this applies to most things to do with heapy... Having 
> >> __repr__ return the same as __str__ and having that be a long lump of 
> >> text is rather annoying. If you really must, make __str__ return the big 
> >> lump of text but have __repr__ return a simple, short, item containing 
> >> the class, the id, and maybe the number of contained objects...
> > 
> > I thought it was cool to not have to use print but get the result
> > directly at the prompt.
> 
> That's fine, that's what __str__ is for. __repr__ should be short.

No, it's the other way around: __repr__ is used when evaluating directly
at the prompt.
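A minimal illustration:

>>> class C:
...     def __repr__(self): return 'via __repr__'
...     def __str__(self): return 'via __str__'
...
>>> c = C()
>>> c             # the interactive prompt uses __repr__
via __repr__
>>> print c       # print uses __str__
via __str__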

> >> Hmmm, I'm sure there's a good reason why an item in a set has the exact 
> >> same class and interface as a whole set?
> > 
> > Um, perhaps no very good reason but... a subset of a set is still a set,
> > isn't it?
> 
> Yeah, but an item in a set is not a set. __getitem__ should return an 
> item, not a subset...

Usually I think it is called an 'element' of a set rather than an
'item'. Python's builtin sets don't support indexing at all; I think it
was felt that since the result of indexing to get at individual
elements would be ill-defined (depending on hashing and implementation
details), it should not be supported at all.

Likewise, Heapy IdentitySet objects don't support indexing to get at
the elements directly. Since the index (__getitem__) method was
available, I used it to take the subset at the i'th row of the
partition defined by the set's equivalence order.

To get at a specific element, you either have to somehow arrive at a
subset of length 1 (e.g. via the .byid equivalence relation) and then
use .theone, or make a list of the .nodes attribute; both methods give
somewhat ill-defined results.
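In a session, that looks something like this (which element you get is,
again, somewhat ill-defined):

>>> strs = h.heap()[0]       # [] gives a subset: the objects in row 0
>>> one = strs.byid[0]       # by identity, each row is a single object
>>> one.theone               # the element itself
>>> list(strs.nodes)[:3]     # or: a plain list of the elements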

The subset indexing, being the more well-defined operation, and also
IMHO the more generally useful one, thus got the honor of the []
syntax.

> I really think that, by the sounds of it, what is currently implemented 
> as __getitem__ should be a `filter` or `subset` method on IdentitySets 
> instead...

It would just be another syntax. I don't see the conceptual problem
since e.g. indexing works just fine like this with strings.
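Compare:

>>> s = 'spam'
>>> s[0]        # indexing gives a string of length 1, not a char type
's'
>>> s[0][0]     # so the result can be indexed again, just like a subset
's'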

> 
> > objects. Each row is still an IdentitySet, and has the same attributes.
> 
> Why? It's semantically different. 

No, it's semantically identical. :-)

Each row is an IdentitySet just like the top level set, but one which
happens to contain elements of one particular kind, as defined by the
equivalence relation in use; so it has only one row. The equivalence
relation can be changed by creating a new set via one of the .byxxx
attributes: the set can then contain many kinds of objects again,
getting more rows, although the objects themselves don't change.

>>> from guppy import hpy
>>> h=hpy()
>>> h.heap()
Partition of a set of 51045 objects. Total size = 3740412 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  25732  50  1694156  45   1694156  45 str
     1  11709  23   450980  12   2145136  57 tuple
...
>>> _[0]
Partition of a set of 25732 objects. Total size = 1694156 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  25732 100  1694156 100   1694156 100 str
>>> _.bysize
Partition of a set of 25732 objects. Total size = 1694156 bytes.
 Index  Count   %     Size   % Cumulative  % Individual Size
     0   4704  18   150528   9    150528   9        32
     1   3633  14   130788   8    281316  17        36
 ...

> .load() returns a set of measurements, 
> each measurement contains a set of something else, but I don't know what...
> 
For Stat objects, in analogy with IdentitySet, each row represents (a
statistical summary of) a subset: a block in the partition defined by
the classifying equivalence relation. The only special thing about such
a sub-Stat object is that it happens to represent objects of only one
kind, as defined by the equivalence relation used when dump()ing it. So
it has only one subset in its own partition, i.e. one row in its
representation, and indexing it (with [0]) returns itself.
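Sketched in a session (the dump file name is made up):

>>> st = h.load('dump.hpy')   # a Stat loaded from a dump file
>>> row = st[0]               # still a Stat, representing one kind only
>>> row[0]                    # indexing a one-row Stat returns itself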

Why would this warrant a new type?

> > This is also like Python strings work, there is no special character
> > type, a character is just a string of length 1.
> 
> Strings are *way* more simple in terms of what they are though...

I don't see why this matters.

Cheers,

Sverker

-- 
Expertise in Linux, embedded systems, image processing, C, Python...
        http://sncs.se
