[Tutor] Is there a simpler way to remove from a set?

Sat May 8 18:21:36 EDT 2021

On 08/05/2021 00.04, Leam Hall wrote:
> On 5/7/21 6:57 AM, Alan Gauld via Tutor wrote:
>> On 07/05/2021 01:52, Leam Hall wrote:
>>
>>> def build_dir_set(my_list, exclude_dirs):
>>>     my_set  = set()
>>>     for item in my_list:
>>>       path = Path(item)
>>>       parent = str(path.parent)
>>>       my_set.add(parent)
>>>     my_new_set = set()
>>>     for exclude_dir in exclude_dirs:
>>>       for parent in my_set:
>>>         if re.match(exclude_dir, parent):
>>>           my_new_set.add(parent)
>>>     return my_set - my_new_set

...

>>
>> Couldn't this be a set generator?

This Socratic question offers solid advice. Please research "generators"
(if you haven't already) ...
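
As a rough sketch of what that hint might look like (one reading of it,
assuming the same "re" and "Path" imports the quoted function relies
on), the two loops collapse into a single set comprehension fed by a
generator expression:

```python
import re
from pathlib import Path


def build_dir_set(my_list, exclude_dirs):
    """Return the parent dirs of my_list, minus any matching exclude_dirs."""
    return {
        parent
        for parent in (str(Path(item).parent) for item in my_list)
        # re.match anchors at the start of the string, as in the quoted loop
        if not any(re.match(pattern, parent) for pattern in exclude_dirs)
    }
```

The result should be the same set as "my_set - my_new_set", without
building either intermediate set.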

> The process I'm working on is documented:
> https://github.com/LeamHall/admin_tools/blob/master/docs/find_copies.txt
> 
> I started playing with your suggestions. One big change is pulling the
> logic for getting the parent directory into its own method. That means I
> can use the "full_set - exclude_set" for more than just directories. I
> also added a "match" parameter. When matching files, the match must be
> exact, not just close.
> 
> The "seed.list" is made by using a "locate" on a few files. The code
> will eventually go through the seed.list, get the directories, and then
> walk down those directories to add any sub-directories. It will also
> search for files in those directories, and get their size. The
> master_data dict will have keys of file names. Each of those will be a
> dict of file sizes, and each of those will be a set of directories where
> that file name of that file size resides.
> 
> When it comes time to use the --purge option, the file names and paths
> will be put back together and the files removed. And I will have a lot
> more space on my hard drives.
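
For reference, the nested master_data structure described in the quoted
paragraph could be sketched roughly as below. The record() helper, the
"copies" name and the sample paths are made up for illustration; they
are not taken from the linked spec.

```python
from collections import defaultdict

# master_data[file_name][file_size] -> set of directories holding such a file
master_data = defaultdict(lambda: defaultdict(set))


def record(file_name, file_size, directory):
    """Note that a file of this name and size lives in directory."""
    master_data[file_name][file_size].add(directory)


# Made-up sample entries
record("notes.txt", 1024, "/home/leam/docs")
record("notes.txt", 1024, "/backup/docs")
record("notes.txt", 2048, "/tmp")

# Any (name, size) pair seen in more than one directory is a candidate copy.
copies = {
    (name, size): dirs
    for name, sizes in master_data.items()
    for size, dirs in sizes.items()
    if len(dirs) > 1
}
```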

Are you allowing yourself to be 'railroaded' or 'blinkered' by the spec?

What I mean by this, is that the specifications have been written in a
particular way, with the idea of presenting the coder with as clear a
picture as possible - covering 'all the bases' in a step-by-step,
topic-by-topic fashion.

However, the spec does not set out to be, and should not be seen as, a
'map' of code-structure - nor as some sort of top-level description of
the solution-code. That is not to say that it shouldn't be included
in/with the code, but that once the coder has understood the problem to
be solved, the spec becomes a template for acceptance testing rather
than code-writing!

If my quick scan (mea culpa!) is correct, the required action comes down
to "purge unwanted files/directories". NB the word "purge" should be
understood as wider than a simple 'del filename', including logging, etc.

In the 'good, old, days' our training was to start with the "output", use
that to map the "process", and finally to work backwards to the "input"
required to achieve the objective. NB it's not so 'true' these days with
real-time, screen interactions, etc; but it will work well in the design
of this application.

So (at the risk of sounding like kindergarten), the first code to write
(preceded, if following TDD, by a test to verify it) is Python code
which, given a filename, deletes same from the file-system/device. QED!
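
A minimal sketch of that first step, test included. The names purge()
and test_purge_removes_file() are placeholders of mine, and the real
"purge" would presumably grow logging, etc, later:

```python
import tempfile
from pathlib import Path


def purge(filename):
    """Delete filename from the file-system (hypothetical helper name)."""
    Path(filename).unlink()


def test_purge_removes_file():
    # Create a throw-away file, purge it, and verify that it is gone.
    with tempfile.TemporaryDirectory() as tmp_dir:
        victim = Path(tmp_dir) / "victim.txt"
        victim.write_text("delete me")
        purge(victim)
        assert not victim.exists()
```

The test only ever touches a temporary directory, so it is safe to run
anywhere.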

Now (working backwards) the question becomes, before purging, 'which
filename(s) should be purged?'.

Let me restate this as pseudo-code:

    give me the next filename (to be purged)
    purge filename

Here's where generators come into their own. Instead of producing sets
(or any Python collection) and then refining same, before
working-through the actual function(s) of the program, a generator will
respond with the next 'subject', until there are no more - and then
tidily conclude any looping mechanism.
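
In Python the two-line pseudo-code might come out something like this.
It is only a sketch: files_to_purge(), its "keep" parameter and the
sample names are invented here, and purge() is a stand-in which prints
rather than deletes.

```python
def files_to_purge(candidates, keep):
    """Yield, one at a time, each candidate filename that is not in keep."""
    for filename in candidates:
        if filename not in keep:
            yield filename


def purge(filename):
    # Stand-in for the real deletion: only reports what would be removed.
    print(f"would purge {filename}")


# The 'mainline': ask for the next filename, purge it, repeat until done.
for filename in files_to_purge(["a.txt", "b.txt", "c.txt"], keep={"b.txt"}):
    purge(filename)
```

The for-loop stops by itself once the generator has nothing more to
yield; no collection of 'files to purge' is ever built.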

At some stage we can add reporting. Does this require an extra piece of
data? Remember that generators can "yield" a data-structure not merely a
single value, eg the ComSc example of Fibonacci Numbers. Thus a "switch"
can be yielded alongside the filename, and used to inform the reporting.
In this manner we can amend the above pseudo-code to:

    give me the next filename and switch
    log filename (labelled according to switch)
    purge filename according to switch

NB you may prefer to 'act' first, and 'log' second.
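
One possible shape for that, assuming the "switch" is a simple boolean
and that reporting goes through the logging module. The names
purge_decisions() and process() are hypothetical, as is passing the
purge function in as a parameter:

```python
import logging

logger = logging.getLogger("purge")


def purge_decisions(candidates, keep):
    """Yield (filename, should_purge) pairs; the second item is the 'switch'."""
    for filename in candidates:
        yield filename, filename not in keep


def process(candidates, keep, purge):
    """Log every filename, labelled by its switch, and purge where it is set."""
    purged = []
    for filename, should_purge in purge_decisions(candidates, keep):
        logger.info("%s %s", "PURGE" if should_purge else "KEEP", filename)
        if should_purge:
            purge(filename)
            purged.append(filename)
    return purged
```

Passing purge in as an argument makes it easy to substitute a harmless
function (eg a list's append method) while testing.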

Jumping ahead a few steps, notice that the PSL's directory-traversal
facilities all are/have the appearance of a generator - giving you one
directory or file per loop!
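
For example, os.walk() can be wrapped in a generator of our own, so
that the caller sees one file-path per loop. The function name
files_under() is merely my choice:

```python
import os


def files_under(top):
    """Yield the full path of every file below top, one per iteration."""
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            yield os.path.join(dirpath, name)
```

pathlib's Path.rglob() behaves similarly, if you would rather stay with
Path objects.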

The 'mainline' has become quite "clean" (and thus 'readable'), and the
'dirty details' can indeed be "put under the rug" of well-named
functions, each performing a unit of work, and contributing to the
generator's results...

YMMV!
-- 
Regards,
=dn