[Python-ideas] Objectively Quantifying Readability

Nathaniel Smith njs at pobox.com
Tue May 1 14:30:39 EDT 2018

On Tue, May 1, 2018, 02:55 Matt Arcidy <marcidy at gmail.com> wrote:

> I am not inferring causality when creating a measure.

No, but when you assume that you can use that measure to *make* code more
readable, then you're assuming causality.

> Measuring the temperature of a steak doesn't infer why people like it
> medium rare.
> It just quantifies it.

Imagine aliens who have no idea how cooking works decide to do a study of
steak rareness. They go to lots of restaurants, order steak, ask people to
judge how rare it was, and then look for features that could predict these
judgments.

They publish a paper with an interesting finding: it turns out that
restaurant decor is highly correlated with steak rareness. Places with
expensive leather seats and chandeliers tend to serve steak rare, while
cheap diners with sticky table tops tend to serve it well done.

(I haven't done this study, but I bet if you did then you would find this
correlation is actually true in real life!)

Now, should we conclude based on this that if we want to get rare steak,
the key is to *redecorate the dining room*? Of course not, because we
happen to know that the key thing that changes the rareness of steak is how
it's exposed to heat.

But for code readability, we don't have this background knowledge; we're
like the aliens. Maybe the readability metric in this study is like
quantifying temperature; maybe it's like quantifying how expensive the
decor is. We don't know.
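
To make the steak scenario concrete, here is a minimal simulation sketch.
Everything in it is made up for illustration: the hidden "upscale" factor,
the cooking times, and the rareness scale are assumptions, not data. It
shows decor predicting rareness in observational data, and then the
prediction vanishing once decor is assigned by coin flip (i.e., once
someone actually redecorates).

```python
import random
from statistics import mean

def simulate_restaurant(rng, randomize_decor=False):
    # Hypothetical hidden cause: upscale places have fancy decor AND cook
    # steak for less time. Decor never touches the steak.
    upscale = rng.random() < 0.5
    if randomize_decor:
        fancy_decor = rng.random() < 0.5   # intervention: redecorate by coin flip
    else:
        fancy_decor = upscale              # observation: decor follows price tier
    cook_minutes = rng.gauss(4 if upscale else 9, 1)
    rareness = max(0.0, 10 - cook_minutes)  # made-up scale, higher means rarer
    return fancy_decor, rareness

def rareness_gap(restaurants):
    # Mean rareness at fancy places minus mean rareness at plain ones.
    fancy = [r for decor, r in restaurants if decor]
    plain = [r for decor, r in restaurants if not decor]
    return mean(fancy) - mean(plain)

rng = random.Random(0)
observed = [simulate_restaurant(rng) for _ in range(2000)]
redecorated = [simulate_restaurant(rng, randomize_decor=True) for _ in range(2000)]

observed_gap = rareness_gap(observed)
intervention_gap = rareness_gap(redecorated)
print(f"gap when we only observe decor: {observed_gap:.2f}")
print(f"gap when we assign decor:       {intervention_gap:.2f}")
```

In this toy world the first gap is large and the second is close to zero.
A readability metric built from correlations could be in either situation,
and the observational data alone can't tell you which.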

(This stuff is extremely non-obvious; that's why we force
scientists-in-training to take graduate courses on statistics and
experimental design, and it still doesn't always take.)

> > And yeah, it doesn't help that they're only looking at 3 line blocks
> > of code and asking random students to judge readability – hard to say
> > how that generalizes to real code being read by working developers.
> Respectfully, this is practical application and not a PhD defense, so
> it will be generated by practical coding.

Well, that's the problem. In a PhD defense, you can get away with this kind
of stuff; but in a practical application it has to actually work :-). And
generalizability is a huge issue.

People without statistical training tend to look at studies and worry about
how big the sample size is, but that's usually not the biggest concern; we
have ways to quantify how big your sample needs to be. The bigger problem
is whether your sample is *representative*. If you're trying to guess who
will become governor of California, then if you had some way to pick voters
totally uniformly at random, you'd only need to ask 50 or 100 of them how
they're voting to get an actually pretty good idea of what all the millions
of real votes will do. But if you only talk to Republicans, it doesn't
matter how many you talk to, you'll get a totally useless answer. Same if
you only talk to people of the same age, or who all live in the same town,
or who all have land-line phones, or... Getting a representative sample is
what makes political polling difficult.
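
A rough sketch of the polling point, using an invented electorate (the 40%
party share and the per-party support rates are assumptions chosen only
for illustration). It compares a uniform random sample of 100 voters
against a sample of 20,000 drawn from a single party.

```python
import random

rng = random.Random(42)

def make_voter(rng):
    # Hypothetical electorate: 40% belong to party R, and support for
    # candidate X differs sharply between the two groups.
    in_party_r = rng.random() < 0.4
    votes_x = rng.random() < (0.10 if in_party_r else 0.75)
    return in_party_r, votes_x

def share_for_x(voters):
    # Fraction of the given voters who support candidate X.
    return sum(votes_x for _, votes_x in voters) / len(voters)

electorate = [make_voter(rng) for _ in range(200_000)]
true_share = share_for_x(electorate)

# Small but representative: 100 voters picked uniformly at random.
random_estimate = share_for_x(rng.sample(electorate, 100))

# Large but unrepresentative: 20,000 voters, all from party R.
party_r_only = [voter for voter in electorate if voter[0]]
biased_estimate = share_for_x(rng.sample(party_r_only, 20_000))

print(f"true share for X:            {true_share:.3f}")
print(f"random sample of 100:        {random_estimate:.3f}")
print(f"20,000 party R members only: {biased_estimate:.3f}")
```

The small random sample is noisy, with a standard error of about five
percentage points at n=100, but it is centered on the truth. The huge
one-party sample is very precise about the wrong number.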

Similarly, if we only look at out-of-context Java read by students, that
may or may not "vote the same way" as in-context Python read by the average
user. Science is hard :-(.

