Finding duplicate file names and modifying them based on elements of the path
Larry.Martell@gmail.com
larry.martell at gmail.com
Thu Jul 19 14:52:09 EDT 2012
On Jul 18, 4:49 pm, Paul Rubin <no.em... at nospam.invalid> wrote:
> "Larry.Mart... at gmail.com" <larry.mart... at gmail.com> writes:
> > I have an interesting problem I'm trying to solve. I have a solution
> > almost working, but it's super ugly, and know there has to be a
> > better, cleaner way to do it. ...
>
> > My solution involves multiple maps and multiple iterations through the
> > data. How would you folks do this?
>
> You could post your code and ask for suggestions how to improve it.
> There are a lot of not-so-natural constraints in that problem, so it
> stands to reason that the code will be a bit messy. The whole
> specification seems like an antipattern though. You should just give a
> sensible encoding for the filename regardless of whether other fields
> are duplicated or not. You also don't seem to address the case where
> basename, dir4, and dir5 are all duplicated.
>
> The approach I'd take for the spec as you wrote it is:
>
> 1. Sort the list on the (basename, dir4, dir5) triple, saving original
> location (numeric index) of each item
> 2. Use itertools.groupby to group together duplicate basenames.
> 3. Within the groups, use groupby again to gather duplicate dir4's,
> 4. Within -those- groups, group by dir5 and assign sequence numbers in
> groups where there's more than one file
> 5. Unsort to get the rewritten items back into the original order.
>
> Actual code is left as an exercise.
Thanks very much for the reply Paul. I did not know about itertools.
This seems like it will be perfect for me. But I'm having 1 issue, how
do I know how many of a given basename (and similarly how many
basename/dir4s) there are? I don't know that I have to modify a file
until I've passed it, so I have to do all kinds of contortions to save
the previous one, and deal with the last one after I fall out of the
loop, and it's getting very nasty.
reports_list is the list sorted on basename, dir4, dir5 (tool is dir4,
file_date is dir5):
for file, file_group in groupby(reports_list, lambda x: x[0]):
# if file is unique in file_group do nothing, but how can I tell
if file is unique?
for tool, tool_group in groupby(file_group, lambda x: x[1]):
# if tool is unique for file, change file to tool_file, but
how can I tell if tool is unique for file?
for file_date, file_date_group in groupby(tool_group, lambda
x: x[2]):
You can't do a len on the iterator that is returned from groupby, and
I've tried to do something with imap or defaultdict, but I'm not
getting anywhere. I guess I can just make 2 passes through the data,
the first time getting counts. Or am I missing something about how
groupby works?
Thanks!
-larry
More information about the Python-list
mailing list