[Tutor] Is there a simpler way to remove from a set?

Leam Hall leamhall at gmail.com
Fri May 7 08:04:37 EDT 2021


On 5/7/21 6:57 AM, Alan Gauld via Tutor wrote:
> On 07/05/2021 01:52, Leam Hall wrote:
> 
>> def build_dir_set(my_list, exclude_dirs):
>>     my_set  = set()
>>     for item in my_list:
>>       path = Path(item)
>>       parent = str(path.parent)
>>       my_set.add(parent)
>>     my_new_set = set()
>>     for exclude_dir in exclude_dirs:
>>       for parent in my_set:
>>         if re.match(exclude_dir, parent):
>>           my_new_set.add(parent)
>>     return my_set - my_new_set
>> ###
>>
>> I couldn't remove from the set during iteration, python complained.
> 
> Can you expand on that, show us the code?
> It should be possible without building a second set.
> But it may be because you are trying to delete something
> from the set you are iterating over?
> 
>        for parent in my_set:
>          if re.match(dir, parent):
>            my_new_set.add(parent)
> 
> Couldn't this be a set generator?
> 
> parent_set = set(parent for parent in my_set if re.match(dir, parent))
> 
> Which should be slightly faster...
> 
> And couldn't you use  str.startswith() instead of re.match()
> Which should also be slightly faster
> ie
> 
> exclude_set = set()
> for dir in exclude_dirs:
>     matches = set(parent for parent in parent_set
>                          if parent.startswith(dir))
>     exclude_set.update(matches)
> return parent_set-exclude_set
> 
> Which is slightly shorter if nothing else! :-)
> 

Well, Alan, you did ask!   :)
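
For reference, removing from a set inside a for loop over that same set 
normally ends in "RuntimeError: Set changed size during iteration". A 
minimal sketch of one way around it, iterating over a copy; the name 
remove_matching is made up for this example and is not in my code:

```python
def remove_matching(my_set, prefix):
    '''
    Remove every item that starts with prefix, in place.
    Hypothetical helper, for illustration only.
    '''
    # set(my_set) is a throwaway copy, so my_set itself can shrink
    # while we loop.
    for item in set(my_set):
        if item.startswith(prefix):
            my_set.discard(item)
    return my_set
```

Building a new set and subtracting it, as in the original function, 
avoids the problem in the same way: nothing is removed from the set 
being looped over.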

The process I'm working on is documented:

https://github.com/LeamHall/admin_tools/blob/master/docs/find_copies.txt

I started playing with your suggestions. One big change is pulling the 
logic for getting the parent directory into its own function. That means 
I can use the "full_set - exclude_set" for more than just directories. I 
also added a "match" parameter. When matching files, the match must be 
exact, not just a prefix match.

The "seed.list" is made by using a "locate" on a few files. The code 
will eventually go through the seed.list, get the directories, and then 
walk down those directories to add any sub-directories. It will also 
search for files in those directories, and get their size. The 
master_data dict will be keyed by file name. Each value will be a dict 
keyed by file size, and each of those values will be a set of the 
directories where a file with that name and size resides.
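
A rough sketch of that walk, not yet in the listing below. The helper 
name add_files_from_dir is an assumption, as is the choice of os.walk 
and a nested defaultdict; a plain dict with setdefault would do the 
same job.

```python
import os
from collections import defaultdict
from pathlib import Path

def add_files_from_dir(master_data, top_dir):
    '''
    Walk top_dir and record each file as
      master_data[file_name][file_size] -> set of directories.
    Hypothetical helper, for illustration only.
    '''
    for dirpath, dirnames, filenames in os.walk(top_dir):
        for name in filenames:
            full_path = Path(dirpath) / name
            if not full_path.is_file():
                continue  # skip broken symlinks and similar
            size = full_path.stat().st_size
            master_data[name][size].add(dirpath)
    return master_data

# file name -> file size -> set of directories
master_data = defaultdict(lambda: defaultdict(set))
```

Calling add_files_from_dir(master_data, some_dir) once per search 
directory would fill in the structure described above.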

When it comes time to use the --purge option, the file names and paths 
will be put back together and the files removed. And I will have a lot 
more space on my hard drives.
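
Putting the names and paths back together could look something like 
the sketch below. The function name rebuild_paths is hypothetical, and 
it only lists the candidates; how to pick which copy to keep is not 
decided yet, so nothing is deleted here.

```python
from pathlib import Path

def rebuild_paths(master_data):
    '''
    Yield (name, size, [full paths]) for every file name and size
      that shows up in more than one directory.
    Hypothetical helper; choosing which copy to keep is left out.
    '''
    for name, sizes in master_data.items():
        for size, dirs in sizes.items():
            if len(dirs) > 1:
                yield name, size, sorted(Path(d) / name for d in dirs)
```

Once a path has been chosen for removal, Path.unlink() would do the 
actual delete.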

The actual, current, full code listing:

###
#!/usr/bin/env python3

# name:       find_copies.py
# version:    0.0.1
# date:       20210506
# desc:       Tool to help clean up old versions of files.

from pathlib import Path

def build_set_from_file(file):
   '''
   Given a file that has been verified to exist, make a set
   from the lines in the file.

   '''
   my_set = set()
   if Path(file).is_file():
     # Use a context manager so the file is closed after reading.
     with open(file, 'r') as readfile:
       for line in readfile:
         line = line.strip()
         my_set.add(line)
   return my_set

def build_dir_set(dirs):
   '''
   Given a list of full file paths, make a set of the parent directory
     names as strings.
   '''
   dir_set = set()
   for item in dirs:
     path    = Path(item)
     parent  = str(path.parent)
     dir_set.add(parent)
   return dir_set

def build_clean_set(full_set, exclude_list, match = 'full'):
   '''
   Given a set (or list), and a set/list of items to be excluded,
     return a set of the original set/list minus the excluded items.
   Optionally, match the exclusion on the entire string, or the
     beginning. If there's a use for it, then endswith, too.
   '''

   exclude_set = set()
   for exclude in exclude_list:
     for item in full_set:
       if ( ( match == 'starts' and item.startswith(exclude) ) or
           ( match == 'full' and item == exclude ) ):
         exclude_set.add(item)
   return full_set - exclude_set


if __name__ == '__main__':

   master_data = dict()

   # Need to verify if these files exist.
   exclude_files_file  = 'data/exclude_files.list'
   exclude_dirs_file   = 'data/excluded_dirs.list'
   seed_file_file      = 'data/seed.list'

   exclude_files       = build_set_from_file(exclude_files_file)
   exclude_dirs        = build_set_from_file(exclude_dirs_file)
   seed_files          = build_set_from_file(seed_file_file)
   seed_dirs           = build_dir_set(seed_files)
   search_dirs         = build_clean_set(seed_dirs, exclude_dirs, 'starts')

   print("There are {} files in the exclude set.".format(len(exclude_files)))
   print("There are {} dirs in the exclude set.".format(len(exclude_dirs)))
   print("There are {} dirs in the seed set.".format(len(seed_dirs)))
   for item in sorted(search_dirs):
     print(item)

###
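
On the original question of a simpler way: a possible shorter take on 
build_clean_set, as a sketch only. It leans on str.startswith accepting 
a tuple of prefixes, which removes the nested loop. The name 
build_clean_set_compact is made up so it does not get confused with the 
version in the listing above.

```python
def build_clean_set_compact(full_set, exclude_list, match='full'):
    '''
    Same idea as build_clean_set, written with a set comprehension.
    Hypothetical alternative, for illustration only.
    '''
    full_set = set(full_set)
    if match == 'starts':
        # startswith() takes a tuple and is True if any prefix matches.
        prefixes = tuple(exclude_list)
        return {item for item in full_set if not item.startswith(prefixes)}
    if match == 'full':
        return full_set - set(exclude_list)
    # Any other match value excludes nothing, as in the listing above.
    return full_set
```

As with the listing, an empty string in the exclude list (say, from a 
blank line in the file) would match everything in 'starts' mode.
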
-- 
Site Reliability Engineer  (reuel.net/resume)
Scribe: The Domici War     (domiciwar.net)
General Ne'er-do-well      (github.com/LeamHall)

