[Python-ideas] find-like functionality in pathlib

Tue Dec 29 12:38:19 EST 2015

On Mon, Dec 28, 2015 at 2:43 PM, Andrew Barnert <abarnert at yahoo.com> wrote:

> sure, but having to write "little" wrappers for common needs is
> unfortunate...
>
>
> You're replying to me, not Guido, here...
>

I was intending to reply to the list :-)

> Anyway, if the only thing anyone will ever need is a handful of simple
> one-liners that even a novice could write, maybe it's reasonable to just
> add one to the docs to show how to do it, instead of adding them to the
> stdlib.
>

well, it's a four liner, yes? but I'm not sure I agree -- the simple things
should be simple. even if you can find the couple-liner in the docs, you've
still got a lot more overhead than calling a ready-to-go function.

and it's not like it'd be a heavy maintenance burden....

>> The problem isn't designing a nice walk API; it's integrating it with
>> pathlib.*
>
>
indeed -- I'd really like to see a *walk in pathlib itself.

But first you have to solve the problem that paragraph was all about: a
general-purpose walk API shouldn't be throwing away all that stat
information it wasted time fetching, but the pathlib module is designed
around Path objects that are always live, not snapshots. If Path.walk
yields something that isn't a Path, what's the point?

OK -- you've gotten out of my technical depth now.....so I'll just shut up.

But at the end of the day, if you've got the few-liner in the docs that
works, maybe it's OK that it's not optimized.....
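
For what it's worth, the stat point above is what os.scandir (new in Python
3.5) addresses outside of pathlib. A rough sketch, where walk_entries is a
made-up name and not a stdlib function: it yields os.DirEntry objects rather
than Path objects, which is exactly the mismatch being discussed.

```python
import os

def walk_entries(top):
    """Recursively yield an os.DirEntry for every non-directory under `top`.

    Hypothetical helper for illustration; not part of the stdlib.
    """
    for entry in os.scandir(top):
        # is_dir() can usually be answered from the directory listing
        # itself, without an extra stat call.
        if entry.is_dir(follow_symlinks=False):
            yield from walk_entries(entry.path)
        else:
            # entry.stat() caches its result on the entry, so callers
            # asking for size or mtime more than once do not pay again.
            yield entry
```

Callers get entry.path as a plain string and can call entry.stat() on the
same object, but they are no longer holding a Path.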

I've been trying to use pathlib whenever I need, well, a path, but then I
find I almost immediately need to step out and use an os.path function, and
have to string-fy it anyway -- makes me wonder what the point is..

> I have the same impression as you, but, as Guido says, let's give it time
> before judging...

time good -- but also maybe some more work to make it easy to use with the
rest of the stdlib. I will say that one thing that bugs me about the "old
style" os.path functions is that I find myself stringing them together, and
that gets really ugly fast:

my_path = os.path.join(os.path.split(something)[0], something_else)

here's where an OO interface is much nicer.
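
To make the comparison concrete, here is a minimal sketch of the same
operation both ways. The values of something and something_else are made up
for illustration, and PurePath is used so nothing touches the filesystem.

```python
import os.path
from pathlib import PurePath

# Hypothetical inputs, for illustration only.
something = "data/run1/output.txt"
something_else = "summary.txt"

# Old style: nested function calls, read from the inside out.
old_style = os.path.join(os.path.split(something)[0], something_else)

# pathlib: the same steps, read left to right.
new_style = PurePath(something).parent / something_else
```

Both name the same location; the pathlib version just avoids the nesting and
the [0] index.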

>> And honestly, if open, os.walk, etc. aren't going to work with Path
>> objects,
>
>
but they should -- of course they should.....

So far things have gone the opposite direction: open requires strings, but
there's a Path.open method;

This sure feels to me like the wrong way to go -- too OO -heavy:

create a Path object, then use it to open a file. which is why we still
have the regular old open() that takes strings.

I just finished teaching an intro to Python class, using py3 for the first
time -- I found myself pointing students to pathlib, but then never using
it in any examples, etc. That may be my old habits, but I really think we
do have an ugly mix of APIs here.

> walk requires strings, but people are proposing a Path.walk method; etc.

well, walk "feels" to me like a path-y operation. whereas open() does not.

I'm not sure how that's supposed to extend to things like json.load or
NamedTemporaryFile.name.

exactly -- that's why open() doesn't feel path-y to me. you have all sorts
of places where you might want to open a file, and you want to open other
things as well. And I like APIs that let you pass in either an open
file-like object, OR a path -- so it seems allowing either a Path object or
a path-in-a-string would be good.

so my "proposal" is to go through the stdlib and add the ability to accept
a Path object everywhere a string path is accepted.

(hmm -- could you simply wrap str() around the input?)
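
A minimal sketch of that str() idea, for a function that takes either an
open file-like object or a path. The name read_text_from is hypothetical,
not anything in the stdlib.

```python
def read_text_from(source):
    """Return the text of `source`: a file-like object, a str path, or a Path.

    Hypothetical helper for illustration only.
    """
    # Anything with a read() method is treated as an already-open file.
    if hasattr(source, "read"):
        return source.read()
    # str() leaves a string path unchanged and turns a Path into its
    # string form, so open() only ever sees a str.
    with open(str(source)) as f:
        return f.read()
```

With this, read_text_from('notes.txt') and read_text_from(Path('notes.txt'))
behave the same. The catch is that str() accepts anything, so passing the
wrong type tends to show up as a confusing "file not found" error rather
than a TypeError.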

My example: one of our sysadmins wanted a little script to go through an
entire drive (Windows), and check if any paths were longer than 256
characters (Windows, remember..)

I came up with this:

import os

def get_all_paths(start_dir='/'):
    for dirpath, dirnames, filenames in os.walk(start_dir):
        for filename in filenames:
            yield os.path.join(dirpath, filename)

too_long = []
for p in get_all_paths('/'):
    print("checking:", p)
    if len(p) > 255:
        too_long.append(p)
        print("Path too long!")

> Do you really want it to print out "Path too long!" hundreds of times?

well, not in production, no, but was nice to test -- also, in theory, there
shouldn't be many!

> If not, this is a lot more concise, and I think readable, with
> comprehensions:
>
> walk = os.walk(start_dir)
> files = (os.path.join(root, file)
>          for root, dirs, files in walk for file in files)
> too_long = (file for file in files if len(file) > 255)

thanks -- should have thought of that -- though that was to pass off to a
sysadmin that doesn't know much python -- harder for him to read??

> And now you've got a lazy Iterator over your too-long files.
> (If you need a list, just use a listcomp instead of a genexpr in the
> last step.)

yup -- probably I'd write it out to a file in the real use case. or stdout.

way too wordy!

I started with pathlib, but that just made it worse.

> If we had a Path.walk, I don't think it could be that much better than the
> original version,

sure -- the wordiness comes from the fact that you have to deal with dirs
and files separately.

> since the only thing Path can help with is making that join a bit
> shorter--and at the cost of having to convert to str to check len():

maybe another argument for why Path doesn't buy much over string paths...

> walk = start_path.walk()
> files = (root / file for root, dirs, files in walk for file in files)
> too_long = (file for file in files if len(str(file)) > 255)

what I really want here is:

too_long = (filepath for filepath in Path(root) if len(filepath) > 255 )

I know python isn't a shell scripting language but it is a one liner in
powershell or bash, or....
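
The closest thing pathlib already offers is probably Path.rglob. A sketch,
assuming a hypothetical helper name too_long_paths and that the length of
str(path) is a good enough measure:

```python
from pathlib import Path

def too_long_paths(root, limit=255):
    """Lazily yield files under `root` whose string form exceeds `limit`.

    Hypothetical helper for illustration only.
    """
    # rglob('*') recurses through all subdirectories and yields
    # directories as well as files, so keep only regular files.
    return (p for p in Path(root).rglob('*')
            if p.is_file() and len(str(p)) > limit)
```

That gets it down to one expression, though it still needs the str() call
before len() will work.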

> As a side note, there's no Windows restriction to 255 _characters_, it's to
> 255 UTF-16 code points,

IIUC, neither Windows itself nor NTFS has this restriction, but some older
utilities do -- really pathetic. And I asked our sysadmin about the unicode
issue, and he had no idea.

> just under 64K UTF-16 code points,

how is a codepoint different than a character???? I was wondering if it was
a bytes restriction or codepoint restriction?

> or 255 codepage bytes, depending on which API you use.

this is where it gets ugly -- who knows what API some utility is using???

So you really want something like len(file.encode('utf-16-le')) // 2 > 255.

but can't some characters use more than 2 bytes in utf-16? or is that what
you're trying to catch here?
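
On the UTF-16 question: yes, characters outside the Basic Multilingual Plane
take two 16-bit code units (a surrogate pair), so counting code units and
counting code points can give different answers. A small sketch, with
utf16_units as a made-up helper name; it uses 'utf-16-le' because plain
'utf-16' prepends a 2-byte byte order mark that would throw the count off.

```python
def utf16_units(text):
    """Number of 16-bit UTF-16 code units needed to encode `text`.

    Hypothetical helper for illustration only.
    """
    # Each code unit is 2 bytes; 'utf-16-le' adds no byte order mark.
    return len(text.encode('utf-16-le')) // 2

plain = "report.txt"
emoji = "\U0001F600.txt"   # one character outside the BMP, plus ".txt"

# For plain ASCII the two counts agree; for the second name the
# code-unit count is one higher because of the surrogate pair.
print(len(plain), utf16_units(plain))
print(len(emoji), utf16_units(emoji))
```

So a limit expressed in UTF-16 code units can be hit a little earlier than
len() on the Python string would suggest.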

> Also, I suspect you want either the bare filename or the abspath, not the
> path from the start dir (especially since a path rooted at the default '/'
> is two characters shorter than one rooted at 'C:\',

well, the startdir would be C:\  and now I'm confused about whether the
"C:\" is part of the 255-something restriction!

anyway, WAY OT -- and if this is used it will be mainly to flag potential
problems, not really a robust test.

Thanks,

-CHB

-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov