[Python-ideas] Objectively Quantifying Readability

Matt Arcidy marcidy at gmail.com
Tue May 1 05:55:05 EDT 2018

On Tue, May 1, 2018 at 1:29 AM, Nathaniel Smith <njs at pobox.com> wrote:
> On Mon, Apr 30, 2018 at 8:46 PM, Matt Arcidy <marcidy at gmail.com> wrote:
>> On Mon, Apr 30, 2018 at 5:42 PM, Steven D'Aprano <steve at pearwood.info> wrote:
>>> (If we know that, let's say, really_long_descriptive_identifier_names
>>> hurt readability, how does that help us judge whether adding a new kind
>>> of expression will hurt or help readability?)
>> A new feature can remove symbols or add them.  It can increase density
>> on a line, or remove it.  It can be a policy of variable naming, or it
>> can specifically note that variable naming has no bearing on a new
>> feature.  This is not limited in application.  It's just scoring.
>> When anyone complains about readability, break out the scoring
>> criteria and assess how good the _comparative_ readability claim is:
>> 2 vs 10?  4 vs 5?  The arguments will no longer be singularly about
>> "readability," nor will they be about the question of a single score for
>> a specific statement.  The comparative scores of applying the same
>> function over two inputs gives a relative difference.  This is what
>> measures do in the mathematical sense.
> Unfortunately, the kind of study they did here can't support this
> kind of argument at all; it's the wrong kind of design. (I'm totally
> in favor of more evidence-based decisions about language design,
> but interpreting evidence is tricky!) Technically speaking, the issue
> is that this is an observational/correlational study, so you can't use
> it to infer causality. Or put another way: just because they found
> that unreadable code tended to have a high max variable length,
> doesn't mean that taking those variables and making them shorter would
> make the code more readable.
I think you are right about the study, but that is tangential to what
I am trying to say.

I am not inferring causality when creating a measure.  In the most
tangible example, there is no inference that the Euclidean measure
_creates_ a distance, or that _anything_ creates a distance at all; it
merely generates a number based on coordinates in space.  That
generation has specific properties which make it a measure, or a
metric, what have you.

The average/mean is another such object: a measure of central tendency
or location.  It does not imply causality; it is merely an algorithm
by which things can be compared.  Even misapplied, it provides a
consistent ranking of one mean higher than another in an objective

Even if not a single person agrees that line length is a correct
measure for an application, it is a measure.  I can feed two lines
into "len" and get consistent results out.  The result will be the
same value for all strings of length n, and for a string of length
m > n, the measure will always report a higher value for the string
of length m than for the string of length n.  This is straight out
of measure theory: the results give a distance between the two
objects, not a reason why.
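
A minimal sketch of that point, assuming nothing beyond the built-in
"len".  The helper name line_length and the two sample lines are
hypothetical, picked only for illustration:

```python
def line_length(line: str) -> int:
    """Score a line by its character count; no claim about why it matters."""
    return len(line)


# Two hypothetical lines, chosen only for illustration.
short_line = "x = f(a)"
long_line = "result = compute(alpha, beta)"

# The same input always yields the same score, and the longer line
# always scores higher than the shorter one.
short_score = line_length(short_line)
long_score = line_length(long_line)

# The difference is a distance between the two lines, not an explanation.
distance = abs(long_score - short_score)
print(short_score, long_score, distance)
```

The score says nothing about whether either line is readable; it only
orders the two lines consistently.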

The same goes for unique symbols.  I can count the unique symbols in
two lines, and state which is higher.  This does not imply
causality, nor does it matter _which_ symbols are counted in this
example, only that I can count them, and that if count_1 == count_2,
the ranks are equal, aka no distance between them, and if
count_1 > count_2, count_1 is ranked higher.
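
As a rough sketch of the counting and ranking described above: what
counts as a "symbol" is my assumption here (any character that is
neither alphanumeric nor whitespace), and the names unique_symbols and
rank, along with the sample lines, are hypothetical:

```python
def unique_symbols(line: str) -> int:
    """Count distinct characters that are neither alphanumeric nor whitespace.

    This definition of "symbol" is an assumption made for the sketch.
    """
    return len({ch for ch in line if not ch.isalnum() and not ch.isspace()})


def rank(count_1: int, count_2: int) -> int:
    """Return 0 if the ranks are equal, 1 if count_1 ranks higher, else -1."""
    return (count_1 > count_2) - (count_1 < count_2)


# Two hypothetical lines, chosen only for illustration.
line_1 = "total = sum(x * 2 for x in xs)"
line_2 = "total = a + b"

count_1 = unique_symbols(line_1)
count_2 = unique_symbols(line_2)
print(count_1, count_2, rank(count_1, count_2))
```

Swapping in a different definition of "symbol" changes the counts, but
the ranking logic stays the same.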

The cause of complexity can be a number of things, but stating a bunch
of criteria to measure is not about inference.  Measuring the
temperature of a steak doesn't tell you why people like it medium rare.
It just quantifies it.

> This sounds like a finicky technical complaint, but it's actually a
> *huge* issue in this kind of study. Maybe the reason long variable
> length was correlated with unreadability was that there was one
> project in their sample that had terrible style *and* super long
> variable names, so the two were correlated even though they might not
> otherwise be related. Maybe if you looked at Perl, then the worst
> coders would tend to be the ones who never ever used long variable
> names. Maybe long lines on their own are actually fine, but in this
> sample, the only people who used long lines were ones who didn't read
> the style guide, so their code is also less readable in other ways.
> (In fact they note that their features are highly correlated, so they
> can't tell which ones are driving the effect.) We just don't know.

Your points here are dead on.  It's not like a single metric will be
the deciding factor.  Nor will a single rank end all disagreements.
It's a tool.  Consider line length 79: that's an explicit statement
about readability, and it's "hard coded" in the style guide (PEP 8).
Disagreement with the value 79, or even with the metric line-length,
doesn't mean it's not a measure.  Length is the Euclidean measure in
one dimension.

The measure will be a set of filters and metrics that combine to a
value or set of values in a reliable way.  It's not about any sense of
correctness or even being better, that is, at a minimum, an

> And yeah, it doesn't help that they're only looking at 3 line blocks
> of code and asking random students to judge readability – hard to say
> how that generalizes to real code being read by working developers.

Respectfully, this is practical application and not a PhD defense, so
the measure will be generated by practical coding.  People can argue
about the chosen metrics, but that is a more informative debate than
one over the bare label "readability".  If 10 people state that a
change badly violates one criterion, perhaps that can be easily
addressed.  If many people make multiple claims based on many
criteria, there is a real readability problem (assuming the metric
survived SOME vetting, of course).
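
To make the "set of criteria" idea concrete, here is a hedged sketch
of scoring two statements against the same criteria and reporting the
per-criterion difference.  The three metrics, the names METRICS, score
and compare, and the sample statements are all hypothetical
placeholders, not a proposed standard:

```python
import re
from typing import Callable, Dict

Metric = Callable[[str], int]

# Hypothetical criteria; any vetted set of metrics could be substituted.
METRICS: Dict[str, Metric] = {
    "line_length": len,
    "unique_symbols": lambda s: len(
        {c for c in s if not c.isalnum() and not c.isspace()}
    ),
    "max_identifier_length": lambda s: max(
        (len(name) for name in re.findall(r"[A-Za-z_]\w*", s)), default=0
    ),
}


def score(statement: str) -> Dict[str, int]:
    """Apply every metric to one statement."""
    return {name: metric(statement) for name, metric in METRICS.items()}


def compare(old: str, new: str) -> Dict[str, int]:
    """Per-criterion difference (new minus old) between two statements."""
    old_scores, new_scores = score(old), score(new)
    return {name: new_scores[name] - old_scores[name] for name in METRICS}


# Two hypothetical statements, chosen only for illustration.
old = "y = f(x)"
new = "y = g(f(x), h(x))"
print(score(old))
print(score(new))
print(compare(old, new))
```

The output is a set of numbers to argue over, one per criterion, rather
than a single verdict on "readability".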

> -n
> --
> Nathaniel J. Smith -- https://vorpus.org
