[Python-ideas] Objectively Quantifying Readability

Matt Arcidy marcidy at gmail.com
Mon Apr 30 23:46:42 EDT 2018


On Mon, Apr 30, 2018 at 5:42 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> On Mon, Apr 30, 2018 at 11:28:17AM -0700, Matt Arcidy wrote:
>
>> A study has been done regarding readability in code which may serve as
>> insight into this issue. Please see page 8, fig 9 for a nice chart of
>> the results, note the negative/positive coloring of the correlations,
>> grey/black respectively.
>
> Indeed. It seems that nearly nothing is positively correlated to
> increased readability, aside from comments, blank lines, and (very
> weakly) arithmetic operators. Everything else hurts readability.
>
> The conclusion here is that if you want readable source code, you should
> remove the source code. *wink*
>
>
>> https://web.eecs.umich.edu/~weimerw/p/weimer-tse2010-readability-preprint.pdf
>>
>> The criteria in the paper can be applied to assess an increase or
>> decrease in readability between current and proposed changes.  Perhaps
>> even an automated tool could be implemented based on agreed upon
>> criteria.
>
>
> That's a really nice study, and thank you for posting it. There are some
> interesting observations here, e.g.:
>
> - line length is negatively correlated with readability;
>
>   (a point against those who insist that 79 character line
>   limits are irrelevant since we have wide screens now)
>
> - conventional measures of complexity do not correlate well
>   with readability;
>
> - length of identifiers was strongly negatively correlated
>   with readability: long, descriptive identifier names hurt
>   readability while short variable names appeared to make
>   no difference;
>
>   (going against the common wisdom that one character names
>   hurt readability -- maybe mathematicians got it right
>   after all?)
>
> - people are not good judges of readability;
>
> but I think the practical relevance here is very slim. Aside from
> questions about the validity of the study (it is only one study, can the
> results be replicated, do they generalise beyond the narrowly self-
> selected set of university students they tested?) I don't think that it
> gives us much guidance here. For example:

I don't propose to replicate the correlations.  I don't see these
"standard" conclusions as foregone when looking at the idea as a
whole, as opposed to the paper itself, where they may be.  The
authors crafted a method and used that method to do a study; I like
the method.  I think I can agree with your point about the study
without validating or invalidating the method.

>
> 1. The study is based on Java, not Python.

An objective measure can be created, whether or not it is based on
the paper's parameters, but it clearly would need to be adjusted to a
specific language; good point.

Here "objective" does not mean "with absolute correctness" but
"applied the same way such that a 5 is always a 5, and a 5 is always
greater than 4."  I think I unfortunately presented the paper as "The
Answer" in my initial email, but I didn't intend to say "each detail
must be implemented as is" but more like "this is a thing which can be
done."  Poor job on my part.

>
> 2. It looks at a set of pre-existing source code features.
>
> 3. It gives us little or no help in deciding whether new syntax will or
> won't affect readability: the problem of *extrapolation* remains.
>
> (If we know that, let's say, really_long_descriptive_identifier_names
> hurt readability, how does that help us judge whether adding a new kind
> of expression will hurt or help readability?)

A new feature can remove symbols or add them.  It can increase density
on a line, or reduce it.  It can be a policy of variable naming, or it
can specifically note that variable naming has no bearing on a new
feature.  This is not limited in application.  It's just scoring.
When anyone complains about readability, break out the scoring
criteria and assess how strong the _comparative_ readability claim is:
2 vs 10?  4 vs 5?  The arguments will no longer be singularly about
"readability," nor will they be about a single score for a specific
statement.  Applying the same function to two inputs and comparing
the scores gives a relative difference.  This is what measures do in
the mathematical sense.
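
To make that concrete, here's a minimal sketch of the kind of scoring
function I mean.  The criteria and weights are invented for
illustration (they are not the paper's model), and the "proposed"
syntax is made up:

import re

# Purely illustrative criteria and weights -- not the paper's model.
# Each criterion maps a line of source to a number; weights are invented.
CRITERIA = {
    "line_length": (lambda line: len(line), -0.1),
    "identifiers": (lambda line: len(re.findall(r"[A-Za-z_]\w*", line)), -0.5),
    "operators":   (lambda line: len(re.findall(r"[+\-*/%<>=!|]+", line)), -0.3),
    "blank_lines": (lambda line: 1 if not line.strip() else 0, +1.0),
}

def readability(snippet):
    """Apply the same criteria the same way every time: same input, same score."""
    score = 0.0
    for line in snippet.splitlines():
        for feature, weight in CRITERIA.values():
            score += weight * feature(line)
    return score

# Comparative use: the absolute numbers mean little, the difference is the point.
current  = "result = [x * x for x in data if x > 0]"
proposed = "result = data |> keep(positive) |> map(square)"  # hypothetical syntax
print(readability(current), readability(proposed))

Arguing about the weights is exactly the argument I'd rather have than
arguing about "readability" in the abstract.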

Maybe the "readability" debate then shifts to arguing criteria: "79?
Too long in your opinion!"  A measure will at least break
"readability" up and give some structure to that argument.  Right now
"readability" comes up and starts a semi-polite flame war.  Creating
_any_ criteria will help narrow the scope of the argument.

Even when someone writes perfectly logical statements about it, the
statements can always be dismantled because they are grounded in
opinion.  Creating a measure forces objectivity.  While each criterion
is more or less subjective, the measure will be applied objectively to
each instance, the same way, to get a score.

>
> 4. The authors themselves warn that it is descriptive, not prescriptive,
> for example replacing long identifier names with randomly selected two
> character names is unlikely to be helpful.

Of course, which is why it's a score, not a single criterion.  For
example, if you hit the Shannon limit, no one will be able to read it
anyway.  "Shorter is better" doesn't mean "shortest is best".

>
> 5. The unfamiliarity affect: any unfamiliar syntax is going to be less
> readable than a corresponding familiar syntax.

Definitely.  Let me respond specifically, but as an example of how to
apply a measure flexibly.
A criterion can be turned on or off based on the target audience of
the new feature.  Do you want beginners to understand this?  Is this
for core developers?
If one measure exists, another can be created by adding or
subtracting criteria.  I'm not saying do it, I'm saying it can be
done.  It's a matter of conditioning, like taking a marginal
distribution.  Core developers seem fairly indifferent to symbolic
density on a line, but many are concerned about beginners.  Heck, run
both measures and see how dramatically the numbers change.
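
As a sketch of that conditioning, with invented criteria, weights, and
feature values (none of these numbers mean anything beyond showing the
same feature vector scoring differently for different audiences):

# Hypothetical per-audience weights over the same criteria; numbers invented.
WEIGHTS = {
    "beginner": {"symbol_density": -2.0, "identifier_length": -0.2, "new_keyword": -1.5},
    "core_dev": {"symbol_density": -0.3, "identifier_length": -0.2, "new_keyword": -0.2},
}

def score(features, audience):
    """Weighted sum of criterion values, conditioned on one target audience."""
    weights = WEIGHTS[audience]
    return sum(weights[name] * value for name, value in features.items())

# Feature values some tool extracted from a proposed piece of syntax (made up).
proposed = {"symbol_density": 4, "identifier_length": 6, "new_keyword": 1}

for audience in WEIGHTS:
    print(audience, score(proposed, audience))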

>
>
> It's a great start to the scientific study of readability, but I don't
> think it gives us any guidance with respect to adding new features.
>
>
>> Opinions about readability can be shifted from:
>>  - "Is it more or less readable?"
>> to
>>  - "This change exceeds a tolerance for levels of readability given
>> the scope of the change."
>
> One unreplicated(?) study for readability of Java snippets does not give
> us a metric for predicting the readability of new Python syntax. While
> it would certainly be useful to study the possible impact of adding new
> features to a language, the authors themselves state that this study is
> just "a framework for conducting such experiments".

It's an example of a measure.  I presented it poorly, but even poor
presentation should not prevent acknowledging that objective measures
exist today.  "Written in English" is a good criterion for me, for
sure; I'm barely monolingual.

Perhaps agreement on the criteria will be a war of attrition, perhaps
impossible, but the "pure opinion" argument is definitely not true.
This should be clearly noted, specifically because there is _so much
upon which we already agree_, which makes the current situation a real
tragedy.  Even though information theory is useful in this pursuit, it
cannot be applied to the limit, or we'd be trying to read and write
bzipped hex files.
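
A toy sketch of that last point, just so it's concrete (the function
being compressed is arbitrary):

import bz2

source = b"def mean(values):\n    return sum(values) / len(values)\n"
packed = bz2.compress(source)

# Compression pushes the text toward its information-theoretic limit; the
# result is unreadable, which is the point: maximum density is not the goal.
print(source.decode())
print(packed.hex())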

I think what you have mentioned reinforces the point that rules
exist, and your points can be formalized as rules and then
incorporated into a model.

Using your names example again, single-letter names are not very
meaningful, and 5 random alphanumerics are no better, perhaps even
worse if 'e' was an Exception and now it's 'et82c'.  However, 5
letters that trigger a spell checker to propose the correct concept
pointed to by the name clearly have _value_ over 5 random
alphanumerics; a Hamming-distance-style measure would capture this
improvement perfectly.
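
Here's a sketch of that kind of check.  Hamming distance only applies
to equal-length strings, so a real tool would probably use an edit
distance; the vocabulary and names are invented:

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))

# Distance from a candidate identifier to the nearest "real" word it might be
# abbreviating; a small minimum distance suggests the name still points at a
# recognisable concept, while random alphanumerics land far from everything.
VOCAB = ["error", "count", "index", "total"]

for name in ("et82c", "errr0"):
    closest = min(VOCAB, key=lambda word: hamming(name, word))
    print(name, "->", closest, hamming(name, closest))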

As for predictability, every possible statement has a score that just
needs to be computed by the measure; measures are not predictive.
Running a string of symbols through it will produce a number based on
patterns, as it's not a semantic analysis.  If the scoring is garbage,
the measure is garbage and needs to be redone, but not because it
fails at predicting.  Each symbol string has a score, precisely as two
points have a distance under the Euclidean metric.

In lieu of a wide statistical study, assumptions will be made, argued,
set, used, argued again, etc.  This is life.  But the rhetoric of
"readability, therefore my opinion is right or your statement is just
an opinion" will be tempered.  A criterion can be theorized or
accepted as a de facto tool.  Or not.  But it can exist.

>
> Despite the limitations of the study, it was an interesting read, thank
> you for posting it.

I think I presented the paper as "The Answer" as opposed to "this is
an approach."  I agree that some, perhaps many, of the paper's
specifics are wholly irrelevant.  Crafting a meaningful measure of
readability is doable, however.  Obtaining agreement is still hard,
and perhaps unfortunately impossible, but a measure can exist.

I suppose I'll build something and see where it goes; oddly enough, I
am very enamored with my own idea!  I appreciate your feedback and
will incorporate it, and if you have any more, I am interested to
hear it.


>
>
> --
> Steve

