[Tutor] Is there a simpler way to remove from a set?
Leam Hall
leamhall at gmail.com
Fri May 7 08:04:37 EDT 2021
On 5/7/21 6:57 AM, Alan Gauld via Tutor wrote:
> On 07/05/2021 01:52, Leam Hall wrote:
>
>> def build_dir_set(my_list, exclude_dirs):
>>     my_set = set()
>>     for item in my_list:
>>         path = Path(item)
>>         parent = str(path.parent)
>>         my_set.add(parent)
>>     my_new_set = set()
>>     for exclude_dir in exclude_dirs:
>>         for parent in my_set:
>>             if re.match(exclude_dir, parent):
>>                 my_new_set.add(parent)
>>     return my_set - my_new_set
>> ###
>>
>> I couldn't remove from the set during iteration, python complained.
>
> Can you expand on that, show us the code?
> It should be possible without building a second set.
> But it may be because you are trying to delete something
> from the set you are iterating over?
>
>     for parent in my_set:
>         if re.match(dir, parent):
>             my_new_set.add(parent)
>
> Couldn't this be a set generator?
>
> parent_set = set(parent for parent in my_set if re.match(dir, parent))
>
> Which should be slightly faster...
>
> And couldn't you use str.startswith() instead of re.match()
> Which should also be slightly faster
> ie
>
>     exclude_set = set()
>     for dir in exclude_dirs:
>         matches = set(parent for parent in parent_set
>                       if parent.startswith(dir))
>         exclude_set.update(matches)
>     return parent_set - exclude_set
>
> Which is slightly shorter if nothing else! :-)
>
Well, Alan, you did ask! :)
The process I'm working on is documented:
https://github.com/LeamHall/admin_tools/blob/master/docs/find_copies.txt
I started playing with your suggestions. One big change is pulling the
logic for getting the parent directory into its own function. That means
I can use the "full_set - exclude_set" step for more than just
directories. I also added a "match" parameter. When matching files, the
match must be exact, not just a matching prefix.
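For what it's worth, the exclusion step could probably be collapsed into
a single set comprehension. This is only a sketch, not the code in the
repo: the name clean_set and the sample paths are made up for
illustration. It relies on str.startswith() accepting a tuple of
prefixes, and it drops empty strings from the exclude list, since every
string starts with ''.

```python
def clean_set(full_set, exclude_list, match='full'):
    """Hypothetical one-pass variant of the exclusion step."""
    # Skip empty entries: every string starts with '', so a blank line
    # in an exclude file would otherwise exclude everything.
    excludes = tuple(e for e in exclude_list if e)
    if match == 'starts':
        # str.startswith() accepts a tuple and tries each prefix in turn.
        return {item for item in full_set if not item.startswith(excludes)}
    # Anything else is treated as an exact match in this sketch,
    # so a plain set difference is enough.
    return set(full_set) - set(excludes)


dirs = {'/home/leam/docs', '/home/leam/tmp/cache', '/usr/share/doc'}
print(sorted(clean_set(dirs, ['/usr', '/home/leam/tmp'], match='starts')))
```

The nested loops go away, and the input set is never modified, so
there is nothing to remove during iteration.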
The "seed.list" is made by using a "locate" on a few files. The code
will eventually go through the seed.list, get the directories, and then
walk down those directories to add any sub-directories. It will also
search for files in those directories, and get their size. The
master_data dict will be keyed by file name. Each value will be a dict
keyed by file size, and each of those values will be a set of the
directories that hold a file with that name and size.
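To make that shape concrete, here is a rough sketch of how master_data
might be filled in. None of this is in find_copies.py yet: add_file is a
hypothetical helper, and the names, sizes and directories are invented
sample data.

```python
from collections import defaultdict

# file name -> file size -> set of directories holding that name and size
master_data = defaultdict(lambda: defaultdict(set))


def add_file(data, directory, name, size):
    """Hypothetical helper: record one file found while walking a directory."""
    data[name][size].add(directory)


add_file(master_data, '/home/leam/docs', 'notes.txt', 1024)
add_file(master_data, '/home/leam/backup', 'notes.txt', 1024)
add_file(master_data, '/home/leam/old', 'notes.txt', 2048)

# A (name, size) pair seen in more than one directory is a likely copy.
for name, sizes in master_data.items():
    for size, dirs in sizes.items():
        if len(dirs) > 1:
            print(name, size, sorted(dirs))
```

Matching on name and size alone is only a hint that two files are
copies; it does not compare their contents.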
When it comes time to use the --purge option, the file names and paths
will be put back together and the files removed. And I will have a lot
more space on my hard drives.
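Putting the names and paths back together could look something like the
sketch below. It only builds the candidate paths and deletes nothing.
The name purge_candidates is hypothetical, as is the keep_dir argument
naming the directory whose copy survives; the eventual --purge option
may well work differently.

```python
from pathlib import Path


def purge_candidates(master_data, keep_dir):
    """Hypothetical: list full paths of copies that live outside keep_dir."""
    candidates = []
    for name, sizes in master_data.items():
        for dirs in sizes.values():
            # Only offer copies for removal when keep_dir holds one too.
            if keep_dir not in dirs:
                continue
            for directory in sorted(dirs):
                if directory != keep_dir:
                    candidates.append(Path(directory) / name)
    return candidates


sample = {'notes.txt': {1024: {'/home/leam/docs', '/home/leam/backup'}}}
for path in purge_candidates(sample, '/home/leam/docs'):
    print("would remove", path)
```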
The actual, current, full code listing:
###
#!/usr/bin/env python3

# name: find_copies.py
# version: 0.0.1
# date: 20210506
# desc: Tool to help clean up old versions of files.

from pathlib import Path


def build_set_from_file(file):
    '''
    Given a file that has been verified to exist, make a set
    from the lines in the file.
    '''
    my_set = set()
    if Path(file).is_file():
        # The with block closes the file even if reading fails.
        with open(file, 'r') as readfile:
            for line in readfile:
                my_set.add(line.strip())
    return my_set


def build_dir_set(dirs):
    '''
    Given a list of full file paths, make a set of the parent directory
    names as strings.
    '''
    dir_set = set()
    for item in dirs:
        path = Path(item)
        parent = str(path.parent)
        dir_set.add(parent)
    return dir_set


def build_clean_set(full_set, exclude_list, match='full'):
    '''
    Given a set (or list), and a set/list of items to be excluded,
    return a set of the original set/list minus the excluded items.
    Optionally, match the exclusion on the entire string, or the
    beginning. If there's a use for it, then endswith, too.
    '''
    exclude_set = set()
    for exclude in exclude_list:
        for item in full_set:
            if ((match == 'starts' and item.startswith(exclude)) or
                    (match == 'full' and item == exclude)):
                exclude_set.add(item)
    # set() lets callers pass a list, as the docstring promises.
    return set(full_set) - exclude_set


if __name__ == '__main__':
    master_data = dict()

    # Need to verify if these files exist.
    exclude_files_file = 'data/exclude_files.list'
    exclude_dirs_file = 'data/excluded_dirs.list'
    seed_file_file = 'data/seed.list'

    exclude_files = build_set_from_file(exclude_files_file)
    exclude_dirs = build_set_from_file(exclude_dirs_file)
    seed_files = build_set_from_file(seed_file_file)
    seed_dirs = build_dir_set(seed_files)
    search_dirs = build_clean_set(seed_dirs, exclude_dirs, 'starts')

    print("There are {} files in the exclude set.".format(len(exclude_files)))
    print("There are {} dirs in the exclude set.".format(len(exclude_dirs)))
    print("There are {} dirs in the seed set.".format(len(seed_dirs)))

    for item in sorted(search_dirs):
        print(item)
###
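On the original complaint about removing during iteration: CPython
raises RuntimeError ("Set changed size during iteration") when a set is
modified inside a loop over that same set. A common workaround, sketched
here with made-up paths and hypothetical function names, is to loop over
a snapshot and remove from the original.

```python
def remove_in_loop(items, prefix):
    """Deliberately wrong: mutates the set it is iterating over."""
    for item in items:
        if item.startswith(prefix):
            items.remove(item)


def remove_via_copy(items, prefix):
    """Iterate over a snapshot so the original can be changed safely."""
    for item in list(items):
        if item.startswith(prefix):
            items.discard(item)


dirs = {'/home/leam/docs', '/home/leam/tmp', '/usr/share/doc'}

try:
    # Pass a throwaway copy so the failed attempt leaves dirs untouched.
    remove_in_loop(set(dirs), '/usr')
except RuntimeError as err:
    print("python complained:", err)

remove_via_copy(dirs, '/usr')
print(sorted(dirs))
```

Building a second set and subtracting it, as the listing above does,
avoids the problem just as well and leaves the input unchanged.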
--
Site Reliability Engineer (reuel.net/resume)
Scribe: The Domici War (domiciwar.net)
General Ne'er-do-well (github.com/LeamHall)