Finding duplicate file names and modifying them based on elements of the path

Larry.Martell@gmail.com larry.martell at gmail.com
Thu Jul 19 21:00:46 CEST 2012


On Jul 18, 4:49 pm, Paul Rubin <no.em... at nospam.invalid> wrote:
> "Larry.Mart... at gmail.com" <larry.mart... at gmail.com> writes:
> > I have an interesting problem I'm trying to solve. I have a solution
> > almost working, but it's super ugly, and know there has to be a
> > better, cleaner way to do it. ...
>
> > My solution involves multiple maps and multiple iterations through the
> > data. How would you folks do this?
>
> You could post your code and ask for suggestions how to improve it.
> There are a lot of not-so-natural constraints in that problem, so it
> stands to reason that the code will be a bit messy.  The whole
> specification seems like an antipattern though.  You should just give a
> sensible encoding for the filename regardless of whether other fields
> are duplicated or not.  You also don't seem to address the case where
> basename, dir4, and dir5 are all duplicated.
>
> The approach I'd take for the spec as you wrote it is:
>
> 1. Sort the list on the (basename, dir4, dir5) triple, saving original
>    location (numeric index) of each item
> 2. Use itertools.groupby to group together duplicate basenames.
> 3. Within the groups, use groupby again to gather duplicate dir4's,
> 4. Within -those- groups, group by dir5 and assign sequence numbers in
>    groups where there's more than one file
> 5. Unsort to get the rewritten items back into the original order.
>
> Actual code is left as an exercise.

I replied to this before, but I don't see it, so if this is a duplicate,
sorry.

Thanks for the reply, Paul. I had not heard of itertools. It sounds
like just what I need for this. But I am having one issue: how do you
know how many items are in each group? Without knowing that, I have to
either make two passes through the data or else work on the previous
item (once I'm past the first item of a group, I know I have
dups). But trying to keep track of the previous values gets crazy
very quickly.
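
One possible way around the counting problem, as a rough sketch: each
group that itertools.groupby yields is an iterator, so it can be copied
into a list and measured with len() before deciding what to do with it.
The sketch below is simplified. It groups once on the whole
(basename, dir4, dir5) triple instead of nesting three groupby calls, and
it only attaches a sequence number rather than rewriting any file name.
The function name number_duplicates, the tuple layout of the records, and
the use of None for "no number needed" are assumptions made for
illustration; they are not part of the original problem statement.

```python
from itertools import groupby


def number_duplicates(records):
    """Return records with a sequence number appended to each one.

    `records` is assumed to be a list of (basename, dir4, dir5) tuples
    (a hypothetical layout). Records whose triple is unique get None;
    records in a duplicated group get 1, 2, 3, ...
    """
    # Sort the *indices* by key so the original positions are remembered.
    order = sorted(range(len(records)), key=lambda i: records[i])
    result = [None] * len(records)

    for key, group in groupby(order, key=lambda i: records[i]):
        # A groupby group is a one-shot iterator; copying it into a list
        # makes the group size available without a second pass.
        members = list(group)
        if len(members) == 1:
            result[members[0]] = key + (None,)
        else:
            for seq, i in enumerate(members, start=1):
                # Writing back by original index restores the input order.
                result[i] = key + (seq,)
    return result


if __name__ == "__main__":
    sample = [
        ("a.txt", "x", "y"),
        ("b.txt", "x", "y"),
        ("a.txt", "x", "y"),
        ("a.txt", "x", "z"),
    ]
    for row in number_duplicates(sample):
        print(row)
```

Because sorted() is stable, duplicates keep their original relative
order, so the sequence numbers follow the order of the input list. The
same list(group) step should work at each level if the groupby calls
are nested as in the quoted outline.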


