Finding duplicate file names and modifying them based on elements of the path

Larry.Martell@gmail.com larry.martell at gmail.com
Fri Jul 20 03:01:36 CEST 2012


On Jul 19, 1:43 pm, Paul Rubin <no.em... at nospam.invalid> wrote:
> "Larry.Mart... at gmail.com" <larry.mart... at gmail.com> writes:
> > Thanks for the reply Paul. I had not heard of itertools. It sounds
> > like just what I need for this. But I am having 1 issue - how do you
> > know how many items are in each group?
>
> Simplest is:
>
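>   from itertools import groupby
>
>   # xs is assumed to be sorted by this key already (groupby only merges adjacent items)
>   for key, group in groupby(xs, lambda x: (x[-1], x[4], x[5])):
>       gs = list(group)  # convert iterator to a list
>       n = len(gs)       # this is the number of elements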
>
> there is some theoretical inelegance in that it requires each group to
> fit in memory, but you weren't really going to have billions of files
> with the same basename.
>
> If you're not used to iterators and itertools, note there are some
> subtleties to using groupby to iterate over files, because an iterator
> actually has state.  It bumps a pointer and maybe consumes some input
> every time you advance it.  In a situation like the above, you've got
> some nested iterators (the groupby iterator generating groups, and the
> individual group iterators that come out of the groupby) that wrap the
> same file handle, so bad confusion can result if you advance both
> iterators without being careful (one can consume file input that you
> thought would go to another).

It seems that once you do list(group) you have consumed the group
iterator, so you can't iterate over group a second time. This screwed
me up for a while, and seems very counter-intuitive.
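
For anyone else who hits this, here is a minimal sketch of the behavior.
The sample tuples and the key function are made up for illustration (they
are not my real file list): after list(group) has run, a second
list(group) on the same group comes back empty, so keep the list and work
with that instead.

```python
from itertools import groupby

# Hypothetical sample data: (basename, directory) pairs, already sorted by basename.
files = [("a.txt", "d1"), ("a.txt", "d2"), ("b.txt", "d1")]

counts = {}
leftovers = {}
for key, group in groupby(files, key=lambda f: f[0]):
    gs = list(group)              # first pass: drains the group iterator into a list
    counts[key] = len(gs)         # the list can be measured and reused freely
    leftovers[key] = list(group)  # second pass: the iterator is already exhausted

print(counts)
print(leftovers)
```

The point is to hold on to gs and never go back to group once it has been
read.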

> This isn't as bad as it sounds once you get used to it, but it can be
> a source of frustration at first.
>
> BTW, if you just want to count the elements of an iterator (while
> consuming it),
>
>      n = sum(1 for x in xs)
>
> counts the elements of xs without having to expand it into an in-memory
> list.
>
> Itertools really makes Python feel a lot more expressive and clean,
> despite little kinks like the above.
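
A small sketch of the counting idiom above applied per group, assuming a
made-up, already-sorted list of names (not the real data). Each group is
counted as it is consumed, so no per-group list is built, and the last
few lines show that the counted iterator has nothing left afterwards.

```python
from itertools import groupby

# Hypothetical, already-sorted basenames.
names = ["a.txt", "a.txt", "b.txt", "c.txt", "c.txt", "c.txt"]

# Count each group while consuming it; no intermediate list is created.
sizes = {key: sum(1 for _ in group) for key, group in groupby(names)}

# Counting an iterator uses it up: nothing remains to iterate afterwards.
it = iter(names)
n = sum(1 for _ in it)
remaining = list(it)

print(sizes, n, remaining)
```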



