Changing strings in files
Manfred Lotz
ml_news at posteo.de
Tue Nov 10 13:48:54 EST 2020
On Tue, 10 Nov 2020 22:08:54 +1100
Cameron Simpson <cs at cskk.id.au> wrote:
> On 10Nov2020 10:07, Manfred Lotz <ml_news at posteo.de> wrote:
> >On Tue, 10 Nov 2020 18:37:54 +1100
> >Cameron Simpson <cs at cskk.id.au> wrote:
> >> Use os.walk for trees. scandir does a single directory.
> >
> >Perhaps better. I like to use os.scandir this way
> >
> >import os
> >from typing import Iterator
> >
> >def scantree(path: str) -> Iterator[os.DirEntry[str]]:
> >    """Recursively yield DirEntry objects (no directories)
> >    for a given directory.
> >    """
> >    for entry in os.scandir(path):
> >        if entry.is_dir(follow_symlinks=False):
> >            # descend instead of yielding the directory itself
> >            yield from scantree(entry.path)
> >        else:
> >            yield entry
> >
> >Worked fine so far. I think I coded it this way because I wanted the
> >full path of the file the easy way.
>
> Yes, that's fine and easy to read. Note that this is effectively a
> recursive call though, with the associated costs:
>
> - a scandir (or listdir, whatever) has the directory open, and holds
> it open while you scan the subdirectories; by contrast os.walk only
> opens one directory at a time
>
> - likewise, if you're maintaining data during a scan, that is held
> while you process the subdirectories; with an os.walk you tend to do
> that and release the memory before the next iteration of the main
> loop (obviously, depending exactly what you're doing)
>
> However, directory trees tend not to be particularly deep, and the
> depth governs the excess state you're keeping around.
>
Very interesting information. Thanks a lot for this. I will take a
closer look at os.walk.
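
For comparison, a rough sketch of how the same "full path of every file"
result could be had from os.walk. The function name walk_files is made up
for illustration, and this has only been tried on a small test tree:

```python
import os
from typing import Iterator


def walk_files(top: str) -> Iterator[str]:
    """Yield the full path of every file that os.walk reports below top."""
    # os.walk visits one directory per iteration and, by default,
    # does not descend into symlinked directories.
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            # dirpath already includes top, so this is the full path
            yield os.path.join(dirpath, name)
```

Unlike the scandir variant this yields plain path strings rather than
DirEntry objects, so any stat information has to be fetched separately.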
> >> > - check if a file is a text file
> >>
> >> This requires reading the entire file. You want to check that it
> >> consists entirely of lines of text in your expected text encoding.
> >> These days UTF-8 is the common default, but getting this correct
> >> is essential if you want to recognise text. So as a first cut,
> >> totally untested:
> >>
> >> ...
> >
> >The reason I want to check if a file is a text file is that I don't
> >want to try replacing patterns in binary files (executable binaries,
> >archives, audio files and so on).
>
> Exactly, which is why you should not trust, say, the "file" utility.
> It scans only the opening part of the file. Great for rejecting
> files, but not reliable for being _sure_ about the whole file being
> text when it doesn't reject.
>
> >Of course, to make this work nicely some heuristic check would be the
> >right thing (this is what the file command does). I am aware that a
> >heuristic check is not 100% reliable but I think it is good enough.
>
> Shrug. That is a risk you must evaluate yourself. I'm quite paranoid
> about data loss, myself. If you've got backups or are working on
> copies the risks are mitigated.
>
> You could perhaps take a more targeted approach: do your target files
> have distinctive file extensions (for example, all the .py files in a
> source tree)?
>
There are some distinctive file extensions. The reason I am satisfied
with heuristics is that the string to change is pretty long, so there
is no real danger if I try to change it in a binary file: that string
will not be found in binary files.
The idea of skipping binary files was simply to save time.
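
A rough sketch of how a cheap heuristic check and the replacement could
fit together. The names looks_binary and replace_in_file, the NUL-byte
test, the 8192-byte probe size and the UTF-8 default are all assumptions
made for illustration, not something settled in this thread:

```python
from pathlib import Path


def looks_binary(path: Path, probe_size: int = 8192) -> bool:
    """Heuristic only: call a file binary if its first block has a NUL byte."""
    with path.open("rb") as f:
        return b"\0" in f.read(probe_size)


def replace_in_file(path: Path, old: str, new: str,
                    encoding: str = "utf-8") -> bool:
    """Replace old with new in path; return True if the file was rewritten."""
    if looks_binary(path):
        # Cheap early exit; this only saves time, it proves nothing.
        return False
    data = path.read_bytes()
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        # Not valid text in the assumed encoding: leave the file alone.
        return False
    if old not in text:
        return False
    # Work on bytes so that existing line endings are preserved as-is.
    path.write_bytes(text.replace(old, new).encode(encoding))
    return True
```

The decode step reads the whole file, so a file that passes the quick
probe but is not valid text in the chosen encoding is still skipped.
Working on copies or having backups remains advisable, since the file is
rewritten in place.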
--
Manfred