> I am not inferring causality when creating a measure.
No, but when you assume that you can use that measure to *make* code more readable, then you're assuming causality.
> Measuring the temperature of a steak doesn't infer why people like it
> medium rare. It just quantifies it.
Imagine aliens who have no idea how cooking works decide to do a study of steak rareness. They go to lots of restaurants, order steak, ask people to judge how rare it was, and then look for features that could predict these judgements.
They publish a paper with an interesting finding: it turns out that restaurant decor is highly correlated with steak rareness. Places with expensive leather seats and chandeliers tend to serve steak rare, while cheap diners with sticky table tops tend to serve it well done.
(I haven't done this study, but I bet if you did then you would find this correlation is actually true in real life!)
Now, should we conclude based on this that if we want to get rare steak, the key is to *redecorate the dining room*? Of course not, because we happen to know that the key thing that changes the rareness of steak is how it's exposed to heat.
But for code readability, we don't have this background knowledge; we're like the aliens. Maybe the readability metric in this study is like quantifying temperature; maybe it's like quantifying how expensive the decor is. We don't know.
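To make the decor thing concrete, here's a toy simulation (made-up numbers, obviously, not from any real study): a hidden "restaurant quality" factor drives both the decor and the rareness, and decor never touches the steak. The correlation comes out strong anyway.

```python
import random

random.seed(0)

# Toy model: a hidden "restaurant quality" factor drives BOTH decor and
# rareness. Decor has no causal effect on the steak at all.
def make_restaurant():
    quality = random.random()                  # hidden confounder
    decor = quality + random.gauss(0, 0.1)     # fancy places are high quality
    rareness = quality + random.gauss(0, 0.1)  # high quality places serve rarer
    return decor, rareness

data = [make_restaurant() for _ in range(1000)]
decor, rareness = zip(*data)

def corr(xs, ys):
    # Plain Pearson correlation, written out by hand.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

print(corr(decor, rareness))  # strongly positive, around 0.9
```

The aliens' regression happily finds the ~0.9 correlation, but "redecorating" (intervening on decor alone) does nothing, because rareness was generated without ever looking at decor. The worry is that a readability metric could be exactly this kind of variable.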
(This stuff is extremely non-obvious; that's why we force scientists-in-training to take graduate courses on statistics and experimental design, and it still doesn't always take.)
> And yeah, it doesn't help that they're only looking at 3 line blocks
> of code and asking random students to judge readability – hard to say
> how that generalizes to real code being read by working developers.
> Respectfully, this is practical application and not a PhD defense, so
> it will be generated by practical coding.
Well, that's the problem. In a PhD defense, you can get away with this kind of stuff; but in a practical application it has to actually work :-). And generalizability is a huge issue.
People without statistical training tend to look at studies and worry about how big the sample size is, but that's usually not the biggest concern; we have ways to quantify how big your sample needs to be. The bigger problem is whether your sample is *representative*.

If you're trying to guess who will become governor of California, and you had some way to pick voters totally uniformly at random, you'd only need to ask 50 or 100 of them how they're voting to get a pretty good idea of what all the millions of real votes will do. But if you only talk to Republicans, it doesn't matter how many you talk to; you'll get a totally useless answer. Same if you only talk to people of the same age, or who all live in the same town, or who all have land-line phones, or... This is what makes political polling difficult: getting a representative sample.
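Again a toy sketch, with invented numbers: say 60% of the electorate actually votes A. A hundred uniformly random respondents land near the truth; hundreds of thousands of respondents from a biased channel (imagine 90% of B voters answer the phone but only 20% of A voters) land nowhere near it.

```python
import random

random.seed(1)

# Hypothetical electorate: 60% vote A, 40% vote B (made-up numbers).
population = ["A"] * 600_000 + ["B"] * 400_000
random.shuffle(population)

def support_for_A(sample):
    return sum(v == "A" for v in sample) / len(sample)

# A small but uniformly random sample gets close to the truth...
small_random = random.sample(population, 100)
print(support_for_A(small_random))   # roughly 0.6

# ...while a huge but biased sample is hopeless: suppose 90% of B voters
# answer the phone but only 20% of A voters do.
biased = [v for v in population
          if random.random() < (0.9 if v == "B" else 0.2)]
print(len(biased))                   # hundreds of thousands of respondents
print(support_for_A(biased))         # about 0.25, nowhere near 0.6
```

No amount of extra biased data fixes it; doubling the biased sample just gives you a more precise estimate of the wrong number. That's the worry with "students reading 3-line snippets" as a stand-in for "working developers reading real code".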
Similarly, if we only look at out-of-context Java read by students, that may or may not "vote the same way" as in-context Python read by the average user. Science is hard :-(.
-n