weirdness with list()

Sun Feb 28 04:51:56 EST 2021

On 28/02/2021 01:17, Cameron Simpson wrote:
> I just ran into a surprising (to me) issue with list() on an iterable
> object.
> 
> My object represents an MDAT box in an MP4 file: it is the ludicrously
> large data box containing the raw audio/video data; for a TV episode it
> is often about 2GB and a movie is often 4GB to 6GB. For obvious reasons,
> I do not always want to load that into memory, or even read the data
> part at all when scanning an MP4 file, for example to recite its
> metadata.
> 
> So my parser has a "skip" mode where it seeks straight past the data,
> but makes a note of its length in bytes. All good.
> 
> That length is presented via the object's __len__ method, because I want
> to know that length later and this is a subclass of a suite of things
> which return their length in bytes this way.
> 
> So, to my problem:
> 
> I've got a walk method which traverses the hierarchy of boxes in the MP4
> file. Until some minutes ago, it looked like this:
> 
>    def walk(self):
>      subboxes = list(self)
>      yield self, subboxes
>      for subbox in subboxes:
>        if isinstance(subbox, Box):
>          yield from subbox.walk()
> 
> somewhat like os.walk does for a file tree.
> 
> I noticed that it was stalling, and investigation revealed it was
> stalling at this line:
> 
>      subboxes = list(self)
> 
> when doing the MDAT box. That box (a) has no subboxes at all and (b) has
> a very large __len__ value.
> 
> BUT... It also has an __iter__ method, which, as for any Box, iterates over
> the subboxes. For MDAT that is implemented like this:
> 
>      def __iter__(self):
>          yield from ()
> 
> What I was expecting was pretty much instant construction of an empty
> list. What I was getting was a very time consuming (10 seconds or more)
> construction of an empty list.
> 
> I believe that this is because list() tries to preallocate storage. I
> _infer_ from the docs that this is done maybe using
> operator.length_hint, which in turn consults "the actual length of the
> object" (meaning __len__ for me?), then __length_hint__, then defaults
> to 0.
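
That reading matches the documented behaviour of operator.length_hint(): it returns len(obj) if the object has a __len__, otherwise the result of __length_hint__, otherwise the default. Below is a minimal sketch of that lookup order. The three classes are made up for illustration, and whether list() goes through exactly this path is, as far as I know, a CPython implementation detail rather than a language guarantee.

```python
import operator


class HasLen:
    """Hypothetical class defining both methods; __len__ is consulted first."""

    def __len__(self):
        return 1000

    def __length_hint__(self):
        return 0


class HintOnly:
    """Hypothetical class with no __len__, only a hint."""

    def __length_hint__(self):
        return 5


class Neither:
    """Hypothetical class with neither method."""


hint_has_len = operator.length_hint(HasLen())       # taken from __len__
hint_hint_only = operator.length_hint(HintOnly())   # taken from __length_hint__
hint_neither = operator.length_hint(Neither())      # falls back to the default, 0
hint_neither_8 = operator.length_hint(Neither(), 8) # explicit default
```

So a __length_hint__ of 0 is never looked at while __len__ exists.
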
> 
> I've changed my walk function like so:
> 
>    def walk(self):
>      subboxes = []
>      for subbox in self:
>        subboxes.append(subbox)
>      ##subboxes = list(self)

list(iter(self))

should work, too, because list() then sees only the iterator, which has
no __len__. It may be faster than the explicit loop, but it also defeats
the list preallocation optimization.
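
Here is a small sketch of why the extra iter() helps. MDATLike is a made-up stand-in for your MDAT box, not your actual class, and the size is just an example value. The generator returned by __iter__ has neither __len__ nor __length_hint__, so the large length is no longer visible to whoever consumes it.

```python
import operator


class MDATLike:
    """Hypothetical stand-in: large __len__ (bytes), but no subboxes."""

    def __init__(self, nbytes):
        self.nbytes = nbytes

    def __len__(self):
        # Length of the binary form, not the number of subboxes.
        return self.nbytes

    def __iter__(self):
        # No subboxes to iterate over.
        yield from ()


box = MDATLike(2_000_000_000)

# Asking the box itself reports the byte length.
hint_direct = operator.length_hint(box, 0)

# Asking the generator gives only the default: it carries no length info.
hint_via_iter = operator.length_hint(iter(box), 0)

# Built from the generator, so the byte length is never consulted.
subboxes = list(iter(box))
```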

> and commented out the former list(self) incantation. This is very fast,
> because it makes an empty list and then appends nothing to it. And for
> your typical movie file this is fine, because there are never _very_
> many immediate subboxes anyway.
> 
> But is there a cleaner way to do this?
> 
> I'd like to go back to my former list(self) incantation, and modify the
> MDAT box class to arrange something efficient. Setting __length_hint__
> didn't help: returning NotImplemented or 0 had no effect, because
> presumably __len__ was consulted first.
> 
> Any suggestions? My current approach feels rather hacky.
> 
> I'm already leaning towards making __len__ return the number of subboxes
> to match the iterator, especially as on reflection not all my subclasses
> are consistent about __len__ meaning the length of their binary form;
> I'm probably going to have to fix that - some subclasses are actually
> namedtuples where __len__ would be the field count. Ugh.
> 
> Still, thoughts? I'm interested in any approaches that would have let me
> make list() fast while keeping __len__==binary_length.
> 
> I'm accepting that __len__ != len(__iter__) is a bad idea now, though.

Indeed. I see how that train wreck happened -- but the weirdness is not 
the list behavior.

Maybe you can capture the intended behavior of your class with two 
classes, a MyIterable without length that can be converted into MyList 
as needed.
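
A rough sketch of what I mean, with invented names throughout (binary_length and to_list() are placeholders, not anything from your code). The iterable keeps the byte count as a plain attribute instead of __len__, so list() has nothing misleading to preallocate from, and the list class has the usual len() == number of items.

```python
class MyList(list):
    """Hypothetical list of subboxes; len() is the number of subboxes."""


class MyIterable:
    """Hypothetical box: iterable over subboxes, deliberately without __len__."""

    def __init__(self, subboxes=(), binary_length=0):
        self._subboxes = tuple(subboxes)
        # Size of the binary form, kept out of __len__ on purpose.
        self.binary_length = binary_length

    def __iter__(self):
        return iter(self._subboxes)

    def to_list(self):
        # Convert to a real list only when one is needed.
        return MyList(self)


mdat = MyIterable((), binary_length=2_000_000_000)
as_list = mdat.to_list()
plain = list(mdat)
```

With this split, walk() can go back to list(self), and code that needs the byte count asks for binary_length explicitly.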