On Mon, 20 Jun 2022 at 05:02, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:
On Sun, Jun 19, 2022 at 2:24 PM Chris Angelico <rosuav@gmail.com> wrote:
def frobnicate(data, verbose=os.environ.get('LEVEL') == loglevel.DEBUG): ...
Is there any value in not putting that into a global constant?
Probably not. I was just inventing an ad hoc example to show what I meant. I didn't search any actual repos I work on for real-life examples.
Ah okay. Well, if that WERE a real example, I would recommend giving it a name. (Also, it's probably going to end up using >= rather than ==, so that the verbosity of any function can be set to a minimum level, so there'd be more complexity, thus making it even more useful to make it some sort of constant.)
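Concretely, the refactor might look something like this (a sketch only: the constant name is invented, and treating LEVEL as a level name like 'DEBUG' is my assumption - Chris's >= variant would compare numeric levels instead, so anything at least as verbose as DEBUG turns the flag on):

import os

# Compute the flag once, at module level, under a descriptive name,
# instead of burying the expression in the signature.
VERBOSE_BY_DEFAULT = os.environ.get('LEVEL', '') == 'DEBUG'

def frobnicate(data, verbose=VERBOSE_BY_DEFAULT):
    ...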
Regardless, the @ operator is now available *everywhere* in Python. Does it quadratically increase cognitive load?
Yeah, probably about that much. Other than NumPy and closely related array libraries, I don't know of many other uses. I think I saw something on PyPI that used it for an email-related purpose, where the @ symbol obviously has some familiarity. But in that case, the lines it occurs on probably contain no more than one or two other sigils.
In the numeric stuff, if I have:
newarray = (A @ B) | (C / D) + (E - F)
That's @, |, /, +, and -. So 5 operators, and (squaring, per the quadratic claim) 25 "complexity points". If I added one more operator, 36 "complexity points" seems reasonable. And if I removed one of those operators, 16 "complexity points" feels about right.
For my part, I would say that it's quite the opposite. This is three parenthesized tokens, each of which contains two things combined in a particular way. That's six 'things' combined in particular ways. Cognitive load is very close to this version:

newarray = (A * B) + (C * D) + (E * F)

even though this uses a mere two operators. It's slightly more, but not multiplicatively so. (The exact number of "complexity points" will depend on what A through F represent, but the difference between "all multiplying and adding" and "five distinct operators" is only about three points.) So unless you have a study showing this, I would say we each have a single data point - ourselves - and it's basically useless data.
In a function signature "def bisect(stuff, lo=0, hi=None)", you don't know what the hi value actually defaults to. Even if it's obvious that it is late-bound, the signature doesn't tell you what it will be bound to.
Sure, knowing what `hi` defaults to *could be useful*. I'm sure if I used that function I would often want to know... and also often just assume the default is "something sensible." I just don't think that "could be useful" benefit comes close to outweighing the cost of a new sigil and new semantics added to the cognitive load of Python.
Yes, but "something sensible" could be "len(stuff)", "len(stuff)-1", or various other things. Knowing exactly which of those will tell you exactly how to use the function. Would you say that knowing that lo defaults to 0 is useful information? You could just have a function signature that merely says which arguments are mandatory and which are optional, and force people to use the documentation to determine the behaviour of omitted arguments. If you accept that showing "lo=0" gives useful information beyond simply that lo is optional, then is it so hard to accept that "hi=>len(stuff)" is also immensely valuable?
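Side by side, the two spellings look like this (the bisect body is a sketch; PEP 671's => form is proposed syntax and does not run on any released Python):

# Today: the signature says only that hi is optional; the real
# default hides behind a sentinel in the body.
def bisect(stuff, lo=0, hi=None):
    if hi is None:
        hi = len(stuff)  # could just as well be len(stuff)-1, etc.
    ...

# Under PEP 671 (proposed syntax only), the same default would be
# visible right in the signature:
# def bisect(stuff, lo=0, hi=>len(stuff)): ...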
For example, it also "could be useful" to have syntax that indicated the (expected) big-O complexity of that function. But whatever that syntax was, I really doubt it would be worth the extra complexity in the language vs. just putting that info in the docstring.
That's true; there's always far more that could go into a function's docstring than can fit into its signature. Perhaps, if it's of value to your project, you could use a function decorator that redefines the return value annotation to (also or instead) carry the complexity. But for information about a single argument, the only useful place to put it is on the argument itself - either in the signature, or in a duplicated block in the docstring. And function defaults are far broader in value than algorithmic complexity, which is irrelevant to a huge number of functions.
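A rough sketch of that decorator idea (the decorator name and the tuple convention are invented here for illustration, not an established API):

def complexity(big_o):
    def decorate(func):
        # "Also or instead": fold the note into the return annotation,
        # keeping any existing annotation alongside it.
        func.__annotations__['return'] = (
            func.__annotations__.get('return'), big_o)
        return func
    return decorate

@complexity("O(log n)")
def bisect(stuff, lo=0, hi=None) -> int:
    ...

# The note now travels with the function object:
# bisect.__annotations__['return'] == (int, "O(log n)")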
Let's look at a function that has a lot of late-bound default arguments:
pd.read_csv(
    filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]',
    sep=<no_default>,
    delimiter=None,
    header='infer',
    names=<no_default>,
    index_col=None,
    usecols=None,
    squeeze=None,
    prefix=<no_default>,
    mangle_dupe_cols=True,
    dtype: 'DtypeArg | None' = None,
    engine: 'CSVEngine | None' = None,
    converters=None,
    true_values=None,
    false_values=None,
    skipinitialspace=False,
    skiprows=None,
    skipfooter=0,
    nrows=None,
    na_values=None,
    keep_default_na=True,
    na_filter=True,
    verbose=False,
    skip_blank_lines=True,
    parse_dates=None,
    infer_datetime_format=False,
    keep_date_col=False,
    date_parser=None,
    dayfirst=False,
    cache_dates=True,
    iterator=False,
    chunksize=None,
    compression: 'CompressionOptions' = 'infer',
    thousands=None,
    decimal: 'str' = '.',
    lineterminator=None,
    quotechar='"',
    quoting=0,
    doublequote=True,
    escapechar=None,
    comment=None,
    encoding=None,
    encoding_errors: 'str | None' = 'strict',
    dialect=None,
    error_bad_lines=None,
    warn_bad_lines=None,
    on_bad_lines=None,
    delim_whitespace=False,
    low_memory=True,
    memory_map=False,
    float_precision=None,
    storage_options: 'StorageOptions' = None,
)
I'd have to look through the implementation, but my guess is that quite a few of the 25 late-bound defaults require more than one line of code to compute. I really don't WANT to know more than "this parameter is calculated according to some logic, perhaps complex logic" ... well, unless I think it pertains to something I genuinely want to configure, in which case I'll read the docs.
Actually, I would guess that most of these default to something that's set elsewhere. Judging only by the documentation, not actually reading the source, here's what I can say:

delimiter=>sep,  # It's an alias for sep
engine=>???,  # seems the default is set elsewhere
na_values=>_DEFAULT_NA_VALUES,  # there is a default in the docs
on_bad_lines='error'  # seems this has a simple default

For the rest, though, these _do not have_ defaults. Not default values, not default expressions. There is no code that could be placed at the top of the function to assign behaviour to them. The None default value actually means something different from passing in some other value - for instance, "callable or None" means it actually won't be calling any function if None is provided.

This function isn't a good showcase of PEP 671 - neither its strengths nor its weaknesses - because it simply doesn't work with argument defaults in that way. It might be able to take advantage of it for a couple of them, but it's certainly not going to change the sheer number of None-default arguments that it has. Maybe I'm wrong on that, and maybe you could show the lines of code at the top of the function that could potentially be converted into argument defaults, but otherwise, this is simply a function that potentially does a lot of stuff, and only does the stuff for the arguments you pass in. (It could potentially benefit from a way to know whether the argument was passed or not, but since None is a fine sentinel for all of these args, there wouldn't be much to gain.)

ChrisA
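To make that distinction concrete, here is a sketch with two hypothetical functions (invented for illustration, not pandas internals):

# None standing in for a real, computable default - the kind of
# parameter PEP 671 could move into the signature:
def head(stuff, hi=None):
    if hi is None:
        hi = len(stuff)
    return stuff[:hi]

# None as a behaviour switch - there is no default expression at all;
# None means "skip this step entirely" ("callable or None"):
def read_rows(rows, converter=None):
    for row in rows:
        if converter is not None:
            row = converter(row)  # only called if one was supplied
        yield row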