On Tue, May 1, 2018 at 1:29 AM, Nathaniel Smith wrote:
On Mon, Apr 30, 2018 at 8:46 PM, Matt Arcidy wrote:
On Mon, Apr 30, 2018 at 5:42 PM, Steven D'Aprano wrote:
(If we know that, let's say, really_long_descriptive_identifier_names hurt readability, how does that help us judge whether adding a new kind of expression will hurt or help readability?)
A new feature can remove symbols or add them. It can increase the density of a line or reduce it. It can be a policy about variable naming, or it can explicitly note that variable naming has no bearing on the feature. The application isn't limited; it's just scoring. When anyone complains about readability, break out the scoring criteria and assess how strong the _comparative_ readability claim is: 2 vs 10? 4 vs 5? The arguments will no longer be singularly about "readability," nor will they be about a single score for a specific statement. Applying the same function to two inputs and comparing the scores gives a relative difference. This is what measures do in the mathematical sense.
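To make that concrete, here is a minimal sketch of what I mean by scoring; the particular function and weights are invented for illustration, not a proposal:

    def readability_score(line):
        # Toy criteria: length plus a weighted count of non-alphanumeric
        # symbols. The absolute value is meaningless on its own.
        symbols = sum(not c.isalnum() and not c.isspace() for c in line)
        return len(line) + 5 * symbols

    old = "y = f(x) if x else y"
    new = "y = (f(x) if x else y)"
    # Only the comparison carries information: the same function applied
    # to two inputs yields a relative difference.
    print(readability_score(old), readability_score(new))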
Unfortunately, the kind of study they did here can't support this kind of argument at all; it's the wrong kind of design. (I'm totally in favor of making language-design decisions more evidence-based, but interpreting evidence is tricky!) Technically speaking, the issue is that this is an observational/correlational study, so you can't use it to infer causality. Or put another way: just because they found that unreadable code tended to have a high maximum variable length doesn't mean that taking those variables and making them shorter would make the code more readable.
I think you are right about the study, but that's tangential to what I am trying to say. I am not inferring causality when creating a measure. In the most tangible example, there is no inference that the Euclidean measure _creates_ a distance, or that _anything_ creates a distance at all; it merely generates a number from coordinates in space. That generation has specific properties which make it a measure, or a metric, what have you. The average/mean is another such object: a measure of central tendency or location. It does not infer causality; it is merely an algorithm by which things can be compared. Even misapplied, it provides a consistent ranking of one mean higher than another in an objective sense. Even if not a single person agrees that line length is a correct measure for an application, it is still a measure. I can feed two lines into "len" and get consistent results out. The result will be the same value for all strings of length n, and for a string of length m > n, the measure will always report a higher value for the string of length m than for the string of length n. This is straight out of measure theory: the result is a distance between the two objects, not a reason why.
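As a trivial sketch of that consistency (the strings are arbitrary):

    s, t = "short", "a much longer line"
    # len() induces a consistent ordering: every call agrees on the rank,
    # and no causal claim about readability is being made.
    assert len(t) > len(s)
    # The comparison yields a distance-like quantity between the two objects:
    print(abs(len(t) - len(s)))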
The same goes for unique symbols. I can count the unique symbols in two lines and state which count is higher. This does not infer causality, nor does it matter _which_ symbols appear in this example, only that I can count them, and that if count_1 == count_2 the ranks are equal (no distance between them), and if count_1 > count_2, line 1 is ranked higher. The cause of complexity can be any number of things, but stating a set of criteria to measure is not about inference. Measuring the temperature of a steak doesn't explain why people like it medium rare. It just quantifies it.
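Again as a sketch (the example lines are arbitrary):

    def unique_symbols(line):
        return len(set(line))  # count of distinct characters in the line

    l1 = "x = a + b"
    l2 = "x = {k: v**2 for k, v in d.items() if v}"
    c1, c2 = unique_symbols(l1), unique_symbols(l2)
    # Equal counts rank equal; a higher count ranks higher. Which symbols
    # they are, and why the line is complex, never enters into it.
    print(c1, c2, c2 > c1)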
This sounds like a finicky technical complaint, but it's actually a *huge* issue in this kind of study. Maybe the reason long variable length was correlated with unreadability was that there was one project in their sample that had terrible style *and* super long variable names, so the two were correlated even though they might not otherwise be related. Maybe if you looked at Perl, then the worst coders would tend to be the ones who never ever used long variable names. Maybe long lines on their own are actually fine, but in this sample, the only people who used long lines were ones who didn't read the style guide, so their code is also less readable in other ways. (In fact they note that their features are highly correlated, so they can't tell which ones are driving the effect.) We just don't know.
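To illustrate with purely made-up numbers (this is a toy simulation, not their data):

    import statistics

    # Hypothetical: project A has short names and readable code, project B
    # has long names and unreadable code. Within each project, name length
    # has zero correlation with the ratings; pooled, it looks "predictive".
    name_len    = [8, 9, 10, 9,   30, 31, 32, 31]   # project A, then B
    readability = [7, 6, 7, 6,    2, 3, 2, 3]       # reader ratings, 1-10
    print(statistics.correlation(name_len, readability))  # strongly negative

    # ...yet renaming project B's variables wouldn't import project A's
    # style along with them.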
Your points here are dead on. It's not that a single metric will be the deciding factor, nor will a single rank end all disagreements; it's a tool. Consider the line-length limit of 79: that's an explicit statement about readability, "hard coded" into PEP 8. Disagreement with the value 79, or even with line length as a metric, doesn't mean it's not a measure. Length is the Euclidean measure in one dimension. The eventual measure will be a set of filters and metrics that combine into a value, or set of values, in a reliable way. It's not about any sense of correctness or even of being better; that is, at a minimum, an interpretation.
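A sketch of what such a combined measure might look like; the three criteria here are placeholders I made up, not a proposal:

    def measure(line):
        # Each axis is an independent criterion; PEP 8's 79 is a cutoff
        # on the first one. The tuple is the "set of values".
        return (
            len(line),                          # line length
            len(set(line)),                     # unique symbols
            line.count("(") + line.count("["),  # rough nesting proxy
        )

    a = "total = sum(xs)"
    b = "total = functools.reduce(operator.add, xs, 0)"
    # Comparing two lines gives a reliable per-axis difference:
    print([mb - ma for ma, mb in zip(measure(a), measure(b))])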
And yeah, it doesn't help that they're only looking at 3-line blocks of code and asking random students to judge readability – hard to say how that generalizes to real code being read by working developers.
Respectfully, this is a practical application, not a PhD defense, so it will be generated by practical coding. People can argue about the chosen metrics, but that is a more informative debate than the bare label "readability". If 10 people state that a change badly violates one criterion, perhaps that can be easily addressed. If many people make multiple claims based on many criteria, there is a real readability problem (assuming the metric survived SOME vetting, of course).
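As a toy sketch of the tallying I have in mind (the criteria names and reports are invented):

    from collections import Counter

    # Hypothetical reviewer reports: which criteria does the change violate?
    reports = [
        {"line_length"},
        {"line_length"},
        {"line_length", "unique_symbols", "nesting_depth"},
    ]
    tally = Counter(criterion for report in reports for criterion in report)
    # One criterion flagged by everyone may be easy to address; many flags
    # spread across many criteria signal a real readability problem.
    print(tally)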
-n
-- Nathaniel J. Smith -- https://vorpus.org