weirdness with list()
Peter Otten
__peter__ at web.de
Sun Feb 28 04:51:56 EST 2021
On 28/02/2021 01:17, Cameron Simpson wrote:
> I just ran into a surprising (to me) issue with list() on an iterable
> object.
>
> My object represents an MDAT box in an MP4 file: it is the ludicrously
> large data box containing the raw audio/video data; for a TV episode it
> is often about 2GB and a movie is often 4GB to 6GB. For obvious reasons,
> I do not always want to load that into memory, or even read the data
> part at all when scanning an MP4 file, for example to recite its
> metadata.
>
> So my parser has a "skip" mode where it seeks straight past the data,
> but makes a note of its length in bytes. All good.
>
> That length is presented via the object's __len__ method, because I want
> to know that length later and this is a subclass of a suite of things
> which return their length in bytes this way.
>
> So, to my problem:
>
> I've got a walk method which traverses the hierarchy of boxes in the MP4
> file. Until some minutes ago, it looked like this:
>
>     def walk(self):
>         subboxes = list(self)
>         yield self, subboxes
>         for subbox in subboxes:
>             if isinstance(subbox, Box):
>                 yield from subbox.walk()
>
> somewhat like os.walk does for a file tree.
>
> I noticed that it was stalling, and investigation revealed it was
> stalling at this line:
>
>     subboxes = list(self)
>
> when doing the MDAT box. That box (a) has no subboxes at all and (b) has
> a very large __len__ value.
>
> BUT... It also has an __iter__ method, which like any Box iterates over
> the subboxes. For MDAT that is implemented like this:
>
>     def __iter__(self):
>         yield from ()
>
> What I was expecting was pretty much instant construction of an empty
> list. What I was getting was a very time consuming (10 seconds or more)
> construction of an empty list.
>
> I believe that this is because list() tries to preallocate storage. I
> _infer_ from the docs that this is done maybe using
> operator.length_hint, which in turn consults "the actual length of the
> object" (meaning __len__ for me?), then __length_hint__, then defaults
> to 0.
>
> I've changed my walk function like so:
>
>     def walk(self):
>         subboxes = []
>         for subbox in self:
>             subboxes.append(subbox)
>         ##subboxes = list(self)

list(iter(self))

should work, too. It may be faster than the explicit loop, but it also
bypasses the preallocation in list(), because the generator returned by
iter() has neither __len__ nor __length_hint__.
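
A minimal sketch of the difference, under assumptions: the class name Mdat
and the 1 GB length are made up for illustration, and the hint lookup shown
is what operator.length_hint() reports, which is only a stand-in for what
list() does internally in CPython. Note that list(box) is deliberately not
called here.

```python
import operator


class Mdat:
    """Hypothetical stand-in for the MDAT box: huge __len__, no subboxes."""

    def __len__(self):
        # Length of the binary form in bytes, not the number of subboxes.
        return 10 ** 9

    def __iter__(self):
        # A generator function that yields nothing.
        yield from ()


box = Mdat()

# The object itself advertises its __len__ as the size hint.
hint_for_box = operator.length_hint(box)

# The generator has neither __len__ nor __length_hint__,
# so length_hint() falls back to the default passed in (0 here).
hint_for_iter = operator.length_hint(iter(box), 0)

# Going through iter() first gives list() no large hint to act on.
subboxes = list(iter(box))
```
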
> and commented out the former list(self) incantation. This is very fast,
> because it makes an empty list and then appends nothing to it. And for
> your typical movie file this is fine, because there are never _very_
> many immediate subboxes anyway.
>
> But is there a cleaner way to do this?
>
> I'd like to go back to my former list(self) incantation, and modify the
> MDAT box class to arrange something efficient. Setting __length_hint__
> didn't help: returning NotImplemented or 0 had no effect, because
> presumably __len__ was consulted first.
>
> Any suggestions? My current approach feels rather hacky.
>
> I'm already leaning towards making __len__ return the number of subboxes
> to match the iterator, especially as on reflection not all my subclasses
> are consistent about __len__ meaning the length of their binary form;
> I'm probably going to have to fix that - some subclasses are actually
> namedtuples where __len__ would be the field count. Ugh.
>
> Still, thoughts? I'm interested in any approaches that would have let me
> make list() fast while keeping __len__==binary_length.
>
> I'm accepting that __len__ != len(__iter__) is a bad idea now, though.

Indeed. I see how that train wreck happened, but the weirdness is not
the list behavior.

Maybe you can capture the intended behavior with two classes: a
MyIterable without a length that can be converted into a MyList as
needed.
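
A rough sketch of that two-class idea, not a definitive design. Only the
names MyIterable and MyList come from the suggestion above; the
binary_length attribute, the to_list() method and the constructor arguments
are hypothetical.

```python
class MyList(list):
    """A plain list of subboxes; len() is the number of items."""


class MyIterable:
    """Iterates over subboxes; deliberately defines no __len__."""

    def __init__(self, subboxes=(), binary_length=0):
        self._subboxes = tuple(subboxes)
        # Size of the binary form in bytes, kept out of __len__
        # (hypothetical attribute name).
        self.binary_length = binary_length

    def __iter__(self):
        yield from self._subboxes

    def to_list(self):
        # No __len__ or __length_hint__ here, so list() gets no huge hint.
        return MyList(self)


# An MDAT-like box: large binary form, no subboxes.
mdat = MyIterable(binary_length=2 * 1024 ** 3)
subboxes = mdat.to_list()
```

With this split, the byte length stays available as mdat.binary_length,
while len(subboxes) agrees with what iteration produces.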