Objectively Quantifying Readability

The number and type of arguments about readability, whether readability as a justification, as an opinion, or as an opinion about an opinion, seems counter-productive to reaching conclusions efficiently. I think these arguments are very important either way, but the justifications used are not rich enough in information to be very useful.

A study has been done regarding readability in code which may serve as insight into this issue. Please see page 8, fig 9 for a nice chart of the results; note the negative/positive coloring of the correlations, grey/black respectively. https://web.eecs.umich.edu/~weimerw/p/weimer-tse2010-readability-preprint.pd...

The criteria in the paper can be applied to assess an increase or decrease in readability between current and proposed changes. Perhaps even an automated tool could be implemented based on agreed-upon criteria. Opinions about readability can then be shifted from:

- "Is it more or less readable?" to
- "This change exceeds a tolerance for levels of readability given the scope of the change."

We still need to argue "exceeds ... given" and "tolerance", but at least the readability score exists, and perhaps over time there will be consensus.

Note this is an attempt to improve the rhetoric of PEP (or other) discussions, not to support a particular PEP. Please consider this food for thought to increase the efficacy and efficiency of PEP discussions, not as commentary on any specific current discussion, which of course is the motivating factor of sending this email today.

I think using Python implicitly accepts that readability is partially measurable, even if the resolution of the current measure is too low to capture the changes currently being discussed. Perhaps using these criteria can increase that resolution.

Thank you,
- Matt

On Mon, Apr 30, 2018 at 11:28:17AM -0700, Matt Arcidy wrote:
A study has been done regarding readability in code which may serve as insight into this issue. Please see page 8, fig 9 for a nice chart of the results, note the negative/positive coloring of the correlations, grey/black respectively.
Indeed. It seems that nearly nothing is positively correlated to increased readability, aside from comments, blank lines, and (very weakly) arithmetic operators. Everything else hurts readability. The conclusion here is that if you want readable source code, you should remove the source code. *wink*
https://web.eecs.umich.edu/~weimerw/p/weimer-tse2010-readability-preprint.pd...
The criteria in the paper can be applied to assess an increase or decrease in readability between current and proposed changes. Perhaps even an automated tool could be implemented based on agreed upon criteria.
That's a really nice study, and thank you for posting it. There are some interesting observations here, e.g.:

- line length is negatively correlated with readability; (a point against those who insist that 79-character line limits are irrelevant since we have wide screens now)

- conventional measures of complexity do not correlate well with readability;

- length of identifiers was strongly negatively correlated with readability: long, descriptive identifier names hurt readability, while short variable names appeared to make no difference; (going against the common wisdom that one-character names hurt readability -- maybe mathematicians got it right after all?)

- people are not good judges of readability;

but I think the practical relevance here is very slim. Aside from questions about the validity of the study (it is only one study, can the results be replicated, do they generalise beyond the narrowly self-selected set of university students they tested?), I don't think that it gives us much guidance here. For example:

1. The study is based on Java, not Python.

2. It looks at a set of pre-existing source code features.

3. It gives us little or no help in deciding whether new syntax will or won't affect readability: the problem of *extrapolation* remains. (If we know that, let's say, really_long_descriptive_identifier_names hurt readability, how does that help us judge whether adding a new kind of expression will hurt or help readability?)

4. The authors themselves warn that it is descriptive, not prescriptive; for example, replacing long identifier names with randomly selected two-character names is unlikely to be helpful.

5. The unfamiliarity effect: any unfamiliar syntax is going to be less readable than a corresponding familiar syntax.

It's a great start to the scientific study of readability, but I don't think it gives us any guidance with respect to adding new features.
Opinions about readability can be shifted from:
- "Is it more or less readable?" to
- "This change exceeds a tolerance for levels of readability given the scope of the change."
One unreplicated(?) study of the readability of Java snippets does not give us a metric for predicting the readability of new Python syntax. While it would certainly be useful to study the possible impact of adding new features to a language, the authors themselves state that this study is just "a framework for conducting such experiments". Despite the limitations of the study, it was an interesting read; thank you for posting it. -- Steve

On Tue, May 1, 2018 at 10:42 AM, Steven D'Aprano <steve@pearwood.info> wrote:
The conclusion here is that if you want readable source code, you should remove the source code. *wink*
That's more true than your winky implies. Which is more readable: a Python function, or the disassembly of its corresponding byte-code? Which is more readable: a "for item in items:" loop, or one that iterates up to the length of the list and subscripts it each time? The less code it takes to express the same concept, the easier it is to read - and to debug. So yes, if you want readable source code, you should have less source code. ChrisA
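To make Chris's comparison concrete, here is a small illustrative pair of loops (my example, not from the original mail); both print every item, but the second forces the reader to verify the index bookkeeping as well as the task itself.

    items = ["spam", "eggs", "ham"]

    # Idiomatic: states the intent ("do something with each item") directly.
    for item in items:
        print(item)

    # Index-based: same behaviour, but the reader must also check the range
    # bounds and the subscripting, which are not part of the actual task.
    for i in range(len(items)):
        print(items[i])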

On Mon, Apr 30, 2018 at 5:42 PM, Steven D'Aprano <steve@pearwood.info> wrote:
On Mon, Apr 30, 2018 at 11:28:17AM -0700, Matt Arcidy wrote:
A study has been done regarding readability in code which may serve as insight into this issue. Please see page 8, fig 9 for a nice chart of the results, note the negative/positive coloring of the correlations, grey/black respectively.
Indeed. It seems that nearly nothing is positively correlated to increased readability, aside from comments, blank lines, and (very weakly) arithmetic operators. Everything else hurts readability.
The conclusion here is that if you want readable source code, you should remove the source code. *wink*
https://web.eecs.umich.edu/~weimerw/p/weimer-tse2010-readability-preprint.pd...
The criteria in the paper can be applied to assess an increase or decrease in readability between current and proposed changes. Perhaps even an automated tool could be implemented based on agreed upon criteria.
That's a really nice study, and thank you for posting it. There are some interesting observations here, e.g.:
- line length is negatively correlated with readability;
(a point against those who insist that 79 character line limits are irrelevant since we have wide screens now)
- conventional measures of complexity do not correlate well with readability;
- length of identifiers was strongly negatively correlated with readability: long, descriptive identifier names hurt readability while short variable names appeared to make no difference;
(going against the common wisdom that one character names hurt readability -- maybe mathematicians got it right after all?)
- people are not good judges of readability;
but I think the practical relevance here is very slim. Aside from questions about the validity of the study (it is only one study, can the results be replicated, do they generalise beyond the narrowly self-selected set of university students they tested?) I don't think that it gives us much guidance here. For example:
I don't propose to replicate the correlations. I don't see these "standard" terminal conclusions as foregone when looking at the idea as a whole, as opposed to the paper itself, where they may be. The authors crafted a method and used that method to do a study; I like the method. I think I can agree with your point about the study without validating or invalidating the method.
1. The study is based on Java, not Python.
An objective measure can be created, based or not on the paper's parameters, but it clearly would need to be adjusted to a specific language; good point. Here "objective" does not mean "with absolute correctness" but "applied the same way, such that a 5 is always a 5, and a 5 is always greater than a 4." I think I unfortunately presented the paper as "The Answer" in my initial email, but I didn't intend to say "each detail must be implemented as is"; I meant something more like "this is a thing which can be done." Poor job on my part.
2. It looks at a set of pre-existing source code features.
3. It gives us little or no help in deciding whether new syntax will or won't affect readability: the problem of *extrapolation* remains.
(If we know that, let's say, really_long_descriptive_identifier_names hurt readability, how does that help us judge whether adding a new kind of expression will hurt or help readability?)
A new feature can remove symbols or add them. It can increase density on a line, or reduce it. It can be a policy of variable naming, or it can specifically note that variable naming has no bearing on a new feature. This is not limited in application. It's just scoring.

When anyone complains about readability, break out the scoring criteria and assess how strong the _comparative_ readability claim is: 2 vs 10? 4 vs 5? The arguments will no longer be singularly about "readability," nor will they be about the question of a single score for a specific statement. The comparative scores of applying the same function over two inputs give a relative difference. This is what measures do in the mathematical sense.

Maybe the "readability" debate then shifts to arguing criteria: "79? Too long in your opinion!" A measure will at least break "readability" up and give some structure to that argument. Right now "readability" comes up and starts a semi-polite flame war. Creating _any_ criteria will help narrow the scope of the argument. Even when someone writes perfectly logical statements about it, the statements can always be dismantled because they are based in opinion. By creating a measure, objectivity is forced. While each criterion is more or less subjective, the measure will be applied objectively to each instance, the same way, to get a score.
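As a rough sketch of what such comparative scoring might look like (illustrative only; the criteria below are invented for the example and are not taken from the paper), one could tokenize two versions of a snippet and report per-criterion differences rather than a bare "less readable" verdict:

    import io
    import keyword
    import tokenize

    def features(code):
        """Count a few surface features of a snippet (criteria chosen for illustration)."""
        lines = [line for line in code.splitlines() if line.strip()]
        identifiers = set()
        operators = 0
        for tok in tokenize.generate_tokens(io.StringIO(code).readline):
            if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
                identifiers.add(tok.string)
            elif tok.type == tokenize.OP:
                operators += 1
        return {
            "max_line_length": max(len(line) for line in lines),
            "unique_identifiers": len(identifiers),
            "operators": operators,
        }

    def compare(old, new):
        """Per-criterion difference (new minus old); positive means 'more of it'."""
        f_old, f_new = features(old), features(new)
        return {name: f_new[name] - f_old[name] for name in f_old}

    old = "result = []\nfor v in values:\n    result.append(v * v)\n"
    new = "result = [v * v for v in values]\n"
    print(compare(old, new))

The point is not that these particular counts are the right ones, only that the same function applied to both versions yields a comparable, repeatable number for each criterion.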
4. The authors themselves warn that it is descriptive, not prescriptive, for example replacing long identifier names with randomly selected two character names is unlikely to be helpful.
Of course, which is why it's a score, not a single criterion. For example, if you hit the Shannon limit, no one will be able to read it anyways. "shorter is better" doesn't mean "shortest is best".
5. The unfamiliarity effect: any unfamiliar syntax is going to be less readable than a corresponding familiar syntax.
Definitely, let me respond specifically, but as an example of how to apply a measure flexibly. A criterion can be turned on/off based on the target of the new feature. Do you want beginners to understand this? Is this for core developers? If there exists one measure, another can be created by adding/subtracting criteria. I'm not saying do it, I'm saying it can be done. It's a matter of conditioning, like a marginal distribution. Core developers seem fairly indifferent to symbolic density on a line, but many are concerned about beginners. Heck, run both measures and see how dramatically the numbers change.
It's a great start to the scientific study of readability, but I don't think it gives us any guidance with respect to adding new features.
Opinions about readability can be shifted from:
- "Is it more or less readable?" to
- "This change exceeds a tolerance for levels of readability given the scope of the change."
One unreplicated(?) study of the readability of Java snippets does not give us a metric for predicting the readability of new Python syntax. While it would certainly be useful to study the possible impact of adding new features to a language, the authors themselves state that this study is just "a framework for conducting such experiments".
It's an example of a measure. I presented it poorly, but even poor presentation should not prevent acknowledging that objective measures exist today. "In English" is a good one for me for sure; I'm barely monolingual. Perhaps agreement on the criteria will come only by attrition, perhaps it will prove impossible, but the "pure opinion" argument is definitely not true. This should be clearly noted, specifically because there is _so much upon which is already agreed_, which is a real tragedy here.

Even as information theory is useful in this pursuit, it cannot be applied in the limit, or we'd be trying to read and write bzipped hex files. I think what you have mentioned enhances the point that rules exist, and your points can be formalized into rules and then incorporated into a model. Using your names example again, single-letter names are not very meaningful, and 5 random alphanumerics are no better, perhaps even worse if 'e' is Exception and now it's 'et82c'. However, 5 letters that trigger a spell checker to propose the correct concept pointed to by the name clearly have _value_ over the random 5 alphanumerics; i.e. a Hamming-distance-type measure would capture this improvement perfectly.

As for predictability, every possible statement has a score that just needs to be computed by the measure; measures are not predictive. Running a string of symbols through it will produce a number based on patterns, as it's not a semantic analysis. If the scoring is garbage, the measure is garbage and needs to be redone, but not because it fails at predicting. Each symbol string has a score, precisely as two points have a distance under the Euclidean measure.

In lieu of a wide statistical study, assumptions will be made, argued, set, used, argued again, etc. This is life. But the rhetoric of "readability, therefore my opinion is right or your statement is just an opinion" will be tempered. A set of criteria can be theorized or accepted as a de-facto tool. Or not. But it can exist.
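A rough sketch of that name-distance idea (my illustration; it uses Levenshtein edit distance rather than a strict Hamming distance, since the names differ in length, and the word list is a made-up stand-in for a real dictionary of recognisable concepts):

    def edit_distance(a, b):
        """Levenshtein distance between two strings (standard dynamic programming)."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    # Hypothetical list of concepts a reader might recognise.
    concepts = ["error", "exception", "count", "total"]

    def name_distance(name):
        """Smaller is better: distance from the closest recognisable concept."""
        return min(edit_distance(name.lower(), word) for word in concepts)

    print(name_distance("err"))    # 2 -- close to "error"
    print(name_distance("et82c"))  # 4 -- close to nothing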
Despite the limitations of the study, it was an interesting read, thank you for posting it.
I think I presented the paper as "The Answer" as opposed to "this is an approach." I agree that some, perhaps many, of the paper's specifics are wholly irrelevant. Crafting a meaningful measure of readability is do-able, however. Obtaining agreement is still hard, and perhaps unfortunately impossible, but a measure exists. I suppose I'll build something and see where it goes; oddly enough, I am very enamored with my own idea! I appreciate your feedback and will incorporate it, and if you have any more, I am interested to hear it.
-- Steve

On Mon, Apr 30, 2018 at 8:46 PM, Matt Arcidy <marcidy@gmail.com> wrote:
On Mon, Apr 30, 2018 at 5:42 PM, Steven D'Aprano <steve@pearwood.info> wrote:
(If we know that, let's say, really_long_descriptive_identifier_names hurt readability, how does that help us judge whether adding a new kind of expression will hurt or help readability?)
A new feature can remove symbols or add them. It can increase density on a line, or remove it. It can be a policy of variable naming, or it can specifically note that variable naming has no bearing on a new feature. This is not limited in application. It's just scoring. When anyone complains about readability, break out the scoring criteria and assess how strong the _comparative_ readability claim is: 2 vs 10? 4 vs 5? The arguments will no longer be singularly about "readability," nor will they be about the question of a single score for a specific statement. The comparative scores of applying the same function over two inputs give a relative difference. This is what measures do in the mathematical sense.
Unfortunately, the kind of study they did here can't support this kind of argument at all; it's the wrong kind of design. (I'm totally in favor of making more evidence-based decisions about language design, but interpreting evidence is tricky!)

Technically speaking, the issue is that this is an observational/correlational study, so you can't use it to infer causality. Or put another way: just because they found that unreadable code tended to have a high max variable length, doesn't mean that taking those variables and making them shorter would make the code more readable.

This sounds like a finicky technical complaint, but it's actually a *huge* issue in this kind of study. Maybe the reason long variable length was correlated with unreadability was that there was one project in their sample that had terrible style *and* super long variable names, so the two were correlated even though they might not otherwise be related. Maybe if you looked at Perl, then the worst coders would tend to be the ones who never ever used long variable names. Maybe long lines on their own are actually fine, but in this sample, the only people who used long lines were ones who didn't read the style guide, so their code is also less readable in other ways. (In fact they note that their features are highly correlated, so they can't tell which ones are driving the effect.) We just don't know.

And yeah, it doesn't help that they're only looking at 3-line blocks of code and asking random students to judge readability -- hard to say how that generalizes to real code being read by working developers.

-n

-- Nathaniel J. Smith -- https://vorpus.org

On Tue, May 1, 2018 at 1:29 AM, Nathaniel Smith <njs@pobox.com> wrote:
On Mon, Apr 30, 2018 at 8:46 PM, Matt Arcidy <marcidy@gmail.com> wrote:
On Mon, Apr 30, 2018 at 5:42 PM, Steven D'Aprano <steve@pearwood.info> wrote:
(If we know that, let's say, really_long_descriptive_identifier_names hurt readability, how does that help us judge whether adding a new kind of expression will hurt or help readability?)
A new feature can remove symbols or add them. It can increase density on a line, or remove it. It can be a policy of variable naming, or it can specifically note that variable naming has no bearing on a new feature. This is not limited in application. It's just scoring. When anyone complains about readability, break out the scoring criteria and assess how strong the _comparative_ readability claim is: 2 vs 10? 4 vs 5? The arguments will no longer be singularly about "readability," nor will they be about the question of a single score for a specific statement. The comparative scores of applying the same function over two inputs give a relative difference. This is what measures do in the mathematical sense.
Unfortunately, the kind of study they did here can't support this kind of argument at all; it's the wrong kind of design. (I'm totally in favor of making more evidence-based decisions about language design, but interpreting evidence is tricky!) Technically speaking, the issue is that this is an observational/correlational study, so you can't use it to infer causality. Or put another way: just because they found that unreadable code tended to have a high max variable length, doesn't mean that taking those variables and making them shorter would make the code more readable.
I think you are right about the study, but that is tangential to what I am trying to say. I am not inferring causality when creating a measure. In the most tangible example, there is no inference that the Euclidean measure _creates_ a distance, or that _anything_ creates a distance at all; it merely generates a number based on coordinates in space. That generation has specific properties which make it a measure, or a metric, what have you. The average/mean is another such object: a measure of central tendency or location. It does not infer causality; it is merely an algorithm by which things can be compared. Even misapplied, it provides a consistent ranking of one mean higher than another in an objective sense.

Even if not a single person agrees that line length is a correct measure for an application, it is a measure. I can feed two lines into "len" and get consistent results out. This result will be the same value for all strings of length n, and for a string of length m > n, the measure will always report a higher measured value for the string of length m than for the string of length n. This is straight out of measure theory: the result is a distance between the two objects, not a reason why.
The same goes for unique symbols. I can count the unique symbols in two lines and state which is higher. This does not infer a causality, nor does it matter _which_ symbols they are in this example; only that I can count them, and that if count_1 == count_2 the ranks are equal (no distance between them), and if count_1 > count_2, count 1 is ranked higher. The cause of complexity can be a number of things, but stating a bunch of criteria to measure is not about inference. Measuring the temperature of a steak doesn't infer why people like it medium rare. It just quantifies it.
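For instance, a purely ordinal comparison of that sort might look like this (my sketch; the two criteria are only examples): the same inputs always produce the same ordering, and nothing about the result claims to explain *why* one line is harder to read.

    def length_difference(line_a, line_b):
        """Positive: a is longer; zero: equal; negative: b is longer."""
        return len(line_a) - len(line_b)

    def unique_symbol_count(line):
        """Count distinct non-alphanumeric, non-space characters in a line."""
        return len({ch for ch in line if not ch.isalnum() and not ch.isspace()})

    a = "result = [f(x) for x in xs if x]"
    b = "result = list(map(f, filter(None, xs)))"
    print(length_difference(a, b))
    print(unique_symbol_count(a), unique_symbol_count(b))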
This sounds like a finicky technical complaint, but it's actually a *huge* issue in this kind of study. Maybe the reason long variable length was correlated with unreadability was that there was one project in their sample that had terrible style *and* super long variable names, so the two were correlated even though they might not otherwise be related. Maybe if you looked at Perl, then the worst coders would tend to be the ones who never ever used long variables names. Maybe long lines on their own are actually fine, but in this sample, the only people who used long lines were ones who didn't read the style guide, so their code is also less readable in other ways. (In fact they note that their features are highly correlated, so they can't tell which ones are driving the effect.) We just don't know.
Your points here are dead on. It's not like a single metric will be the deciding factor. Nor will a single rank end all disagreements. It's a tool. Consider line length 79, that's an explicit statement about readability, it's "hard coded" in the language. Disagreement with the value 79 or even the metric line-length doesn't mean it's not a measure. Length is the euclidean measure in one dimension. The measure will be a set of filters and metrics that combine to a value or set of values in a reliable way. It's not about any sense of correctness or even being better, that is, at a minimum, an interpretation.
And yeah, it doesn't help that they're only looking at 3 line blocks of code and asking random students to judge readability – hard to say how that generalizes to real code being read by working developers.
Respectfully, this is practical application and not a PhD defense, so it will be driven by practical coding. People can argue about the chosen metrics, but it is a more informative debate than just the label "readability". If 10 people state that a change badly violates one criterion, perhaps that can be easily addressed. If many people make multiple claims based on many criteria, there is a real readability problem (assuming the metric survived SOME vetting, of course).
-n
-- Nathaniel J. Smith -- https://vorpus.org

On Tue, May 1, 2018, 02:55 Matt Arcidy <marcidy@gmail.com> wrote:
I am not inferring causality when creating a measure.
No, but when you assume that you can use that measure to *make* code more readable, then you're assuming causality.

Measuring the temperature of a steak doesn't infer why people like it medium rare. It just quantifies it.
Imagine aliens who have no idea how cooking works decide to do a study of steak rareness. They go to lots of restaurants, order steak, ask people to judge how rare it was, and then look for features that could predict these judgements. They publish a paper with an interesting finding: it turns out that restaurant decor is highly correlated with steak rareness. Places with expensive leather seats and chandeliers tend to serve steak rare, while cheap diners with sticky table tops tend to serve it well done. (I haven't done this study, but I bet if you did then you would find this correlation is actually true in real life!)

Now, should we conclude based on this that if we want to get rare steak, the key is to *redecorate the dining room*? Of course not, because we happen to know that the key thing that changes the rareness of steak is how it's exposed to heat. But for code readability, we don't have this background knowledge; we're like the aliens. Maybe the readability metric in this study is like quantifying temperature; maybe it's like quantifying how expensive the decor is. We don't know.

(This stuff is extremely non-obvious; that's why we force scientists-in-training to take graduate courses on statistics and experimental design, and it still doesn't always take.)
And yeah, it doesn't help that they're only looking at 3 line blocks of code and asking random students to judge readability – hard to say how that generalizes to real code being read by working developers.
Respectfully, this is practical application and not a PhD defense, so it will be generated by practical coding.
Well, that's the problem. In a PhD defense, you can get away with this kind of stuff; but in a practical application it has to actually work :-).

And generalizability is a huge issue. People without statistical training tend to look at studies and worry about how big the sample size is, but that's usually not the biggest concern; we have ways to quantify how big your sample needs to be. The bigger problem is whether your sample is *representative*.

If you're trying to guess who will become governor of California, then if you had some way to pick voters totally uniformly at random, you'd only need to ask 50 or 100 of them how they're voting to get an actually pretty good idea of what all the millions of real votes will do. But if you only talk to Republicans, it doesn't matter how many you talk to, you'll get a totally useless answer. Same if you only talk to people of the same age, or who all live in the same town, or who all have land-line phones, or... This is what makes political polling difficult: getting a representative sample.

Similarly, if we only look at out-of-context Java read by students, that may or may not "vote the same way" as in-context Python read by the average user. Science is hard :-(.

-n

Objectively quantifying is easy. For example,

    def objective_readability_score(text):
        "Return the readability of `text`, a float in 0.0 .. 1.0"
        return 2.0 * text.count(":=") / len(text)

Then
objective_readability_score("if value:") 0.0 objective_readability_score("if value := f():") 0.125 objective_readability_score(":=:=:=") 1.0
QED ;-)

Tim Peters wrote:
    def objective_readability_score(text):
        "Return the readability of `text`, a float in 0.0 .. 1.0"
        return 2.0 * text.count(":=") / len(text)
A useful-looking piece of code, but it could be more readable. It only gives itself a readability score of 0.0136986301369863. -- Greg

On Tue, 01 May 2018 10:42:53 +1000, Steven D'Aprano wrote:
- people are not good judges of readability;
WTF? By definition, people are the *only* judge of readability.¹ I happen to be an excellent judge of whether a given block of code is readable to me. OTOH, if you mean is that I'm a bad judge of what makes code readable to you, and that you're a bad judge of what makes code readable to me, then I agree. :-) Dan ¹ Well, okay, compilers will tell you that your code is unreadable, but they're known to be fairly pedantic.

I must say my gut agrees that really_long_identifier_names_with_a_full_description don't look readable to me. Perhaps it's my exposure to (py)Qt, but I really like my classes like ThisName and my methods like thisOne. I also tend to keep them to three words max (real code from yesterday: getActiveOutputs(), or at most setAllDigitalOutputs()). I also really dislike more than 3 or 4 arguments.

A question for another type of science would be: do I agree with this study because it agrees with me?

It should be noted that the snippets used were short and small. This might cause a bias towards short identifiers; after all, if you only have 3 of them to keep track of, they're more likely to be distinct enough, compared to when you have 20. I couldn't give a source, but IIRC people can hold up to around 5 to 7 concepts in their head at one time, which means that if you have fewer identifiers than that, you don't remember the names, but their concepts. (Further reading shows this is supported by their strongest negative correlation: # of identifiers strongly decreases readability.)

Compare it to RAM: it's only big enough for 5 to 7 identifiers, and after that you have to switch them out to the hard disk. *Nobody* wants code that does this switching, and our brains don't like running it either. I think this is one of the main reasons list/generator comprehensions increase readability so much. You can get rid of 1 or 2 variable names.
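A quick illustration of that last point (my own example, not Jacco's): in the loop version the reader has to keep the accumulator and the loop variable alive in their head, while the comprehension keeps the loop variable out of the surrounding scope entirely.

    values = [3, 1, 4, 1, 5, 9]

    # Loop version: 'squares' is built up step by step, and both 'squares'
    # and 'v' stay live in the enclosing scope while you read on.
    squares = []
    for v in values:
        if v > 2:
            squares.append(v * v)

    # Comprehension: same result, and the loop variable is scoped to the
    # comprehension itself, so there is one less name to keep track of.
    squares = [v * v for v in values if v > 2]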

Yes, it seems that this study has many limitations which don't make its results very interesting for our community. I think the original point was that readability *can* be studied rationally and scientifically, though. Regards, Antoine. On Tue, 1 May 2018 09:00:44 +0200 Jacco van Dorp <j.van.dorp@deonet.nl> wrote:
I must say my gut agrees that really_long_identifier_names_with_a_full_description don't look readable to me. Perhaps it's my exposure to (py)Qt, but I really like my classes like ThisName and my methods like thisOne. I also tend to keep them to three words max (real code from yesterday: getActiveOutputs(), or at most setAllDigitalOutputs()). I also really dislike more than 3 or 4 arguments.
A question for another type of science would be, do I agree with this study because it agrees with me ?
It should be noted that the snippets used were short and small. This might cause a bias towards short identifiers; after all, if you only have 3 of them to keep track of, they're more likely to be distinct enough, compared to when you have 20. I couldn't give a source, but IIRC people can hold up to around 5 to 7 concepts in their head at one time, which means that if you have fewer identifiers than that, you don't remember the names, but their concepts. (Further reading shows this is supported by their strongest negative correlation: # of identifiers strongly decreases readability.) Compare it to RAM: it's only big enough for 5 to 7 identifiers, and after that you have to switch them out to the hard disk. *Nobody* wants code that does this switching, and our brains don't like running it either. I think this is one of the main reasons list/generator comprehensions increase readability so much. You can get rid of 1 or 2 variable names.

On Tue, May 01, 2018 at 04:50:05AM +0000, Dan Sommers wrote:
On Tue, 01 May 2018 10:42:53 +1000, Steven D'Aprano wrote:
- people are not good judges of readability;
WTF? By definition, people are the *only* judge of readability.¹
We're discussing an actual study that attempted, with some reasonable amount of success, to objectively measure readability without human judgement (aside from the initial judgement on which factors to measure and which to ignore). So given the fact that there exists at least one non-human objective measurement of readability which corresponds reasonably well with human judgement, how do you reach the conclusion that only humans can be the judge of readability? Besides, even if we agreed that only people can do something, that doesn't mean that they are necessarily *good at it*.
I happen to be an excellent judge of whether a given block of code is readable to me.
In the same way that 93% of people say that they are an above-average driver, I'm sure that most people think that they are an excellent judge of readability. Including myself in *both* of those. (My wife thinks I'm a crappy driver, but her standard of "just barely adequate" is close to professional racecar drivers, so I don't let that worry me.)

https://en.wikipedia.org/wiki/Illusory_superiority

There are at least three ways that I can justify my statement, one objective, and two anecdotal:

1. The correlation between judgements of readability from different people is not very good: in the study discussed, it was about 0.6 or so. A correlation of 0.5 is equivalent to having everyone agree on half the samples and then rate the other half at random. The paper states: "This analysis seems to confirm the widely-held belief that humans agree significantly on what readable code looks like, but not to an overwhelming extent."

2. Anecdotally, we all know that many programmers are just awful. https://thedailywtf.com/ And presumably most of them think they are writing readable code. (There may be a small minority who are deliberately writing obfuscated code, but I doubt that's a significant demographic.) Most of us have had the experience of making code choices that are inexplicable and unreadable when we come back to it. http://www.threepanelsoul.com/comic/on-perl

3. Anecdotally, I have first-hand experience with many people, including programmers, making dramatically sub-optimal choices while declaring that it is the most effective choice. To pick one example that applies to coders, I have known many people who swear black and blue that they work best with their editor configured to show code in a tiny, 8pt font, and I've watched them peering closely at the screen, struggling to read the text and making typo after typo which they failed to notice.

In other words, people are often not even a great judge of what is readable to *themselves*.

-- Steve

On Tue, 01 May 2018 22:37:11 +1000, Steven D'Aprano wrote:
On Tue, May 01, 2018 at 04:50:05AM +0000, Dan Sommers wrote:
On Tue, 01 May 2018 10:42:53 +1000, Steven D'Aprano wrote:
- people are not good judges of readability;
I happen to be an excellent judge of whether a given block of code is readable to me.
In the same way that 93% of people say that they are an above-average driver, I'm sure that most people think that they are an excellent judge of readability. Including myself in *both* of those.
Are you claiming that I'm not an excellent judge of whether a given block of code is readable to me?
2. Anecdotally, we all know that many programmers are just awful.
No argument here! :-)
Readability is only one criterion by which to judge code. Most code on thedailywtf is bad due to varieties of bugs, inefficiencies, misunderstandings, or unnecessary complexities, regardless of how readable it is.
And presumably most of them think they are writing readable code ...
Why would you presume that? I have worked with plenty of programmers who didn't consider readability. If it passes the tests, then it's good code. If I have difficulty reading it, then that's my problem. Also, when I write code, I put down my text editor to review it myself before I submit it for formal review. My authoring criteria for good code are different from my reviewing criteria for good code; the latter include more readability than the former.
... I have known many people who swear black and blue that they work best with their editor configured to show code in a tiny, 8pt font, and I've watched them peering closely at the screen struggling to read the text and making typo after typo which they failed to notice.
In other words, people are often not even a great judge of what is readable to *themselves*.
On that level, aesthetics definitely count (typographers have known this for centuries), but in an entirely different way. Is their code better when their editor shows it at 14.4pt? At 17.3pt? When we first started with color displays and syntax coloring editors, it was popular to make a language's keywords really stand out. But the keywords usually aren't the important part of the code (have you ever programmed in lisp?), and I find it easier to read algol-like code when the keywords are lowlighted rather than highlighted. In Python, for example, the word "import" is far less important than the name of the module being imported, especially when all the imports are grouped together near the top of the source file. Dan

On Tue, May 01, 2018 at 03:02:27PM +0000, Dan Sommers wrote:
I happen to be an excellent judge of whether a given block of code is readable to me.
In the same way that 93% of people say that they are an above-average driver, I'm sure that most people think that they are an excellent judge of readability. Including myself in *both* of those.
Are you claiming that I'm not an excellent judge of whether a given block of code is readable to me?
Of course not. I don't know you. I wouldn't dream of making *specific* claims about you. I speak only in broad generalities which apply to people in general.

I'm reminded that in the 1990s, during the UI wars between Apple and Microsoft, people had *really strong* opinions about the usability of the two OSes' GUIs. Macs required the user to use the mouse to navigate menus, while Windows also allowed the user to navigate them using the Alt key and arrow keys. Not surprisingly, *both* Mac users and Windows users were absolutely convinced that they were much more efficient using the method they were familiar with, and could justify their judgement. For example, Windows users typically said that having to move their hand from the keyboard to grab the mouse was slow and inefficient, and using the Alt key and arrows was much faster.

But when researchers observed users in action, and timed how long it took them to perform simple tasks requiring navigating the menus, they found that using the mouse was significantly faster for *both* groups of users, both Windows and Mac users. The difference was that when Windows users used the mouse, even though they were *objectively* faster to complete the task compared to using the arrow keys, subjectively they swore that they were slower, and were *very confident* about their subjective experience.

This is a good example of the overconfidence effect: https://en.wikipedia.org/wiki/Overconfidence_effect

This shouldn't be read as a tale about Mac users being superior. One of the two methods had to be faster, and it happened to be Macs. My point is not about Macs versus Windows, but that people in general are not good at this sort of self-reflection.

Another example of this is the way that the best professional athletes no longer rely on their own self-judgement about the best training methods to use, because the training techniques that athletes think are effective, and those which actually are effective, are not strongly correlated. Athletes are not great judges of what training works for themselves.

The psychological processes that lead to these cognitive biases apply to us all, to some degree or another. Aside from you and me, of course.

-- Steve

On Wed, 02 May 2018 05:08:41 +1000, Steven D'Aprano wrote:
The difference was that when Windows users used the mouse, even though they were *objectively* faster to complete the task compared to using the arrow keys, subjectively they swore that they were slower, and were *very confident* about their subjective experience.
Another driving analogy: when I get stuck at a stoplight, sometimes I take advantage of turn-on-red or a protected turn, even though I know that it's going to take longer to get where I'm going. But I feel better because I'm not just sitting there at the stoplight. Call it cognitive dissonance, I guess.

Some of my coworkers claim that using vi is objectively faster or requires fewer keystrokes than using emacs. I counter that I've been using emacs since before they were born, and that I now do so with the reptilian part of my brain, which means that I can keep thinking about the problem at hand rather than about editing the source code.

Who remembers the One True Brace Style holy wars? If we agreed on anything, it was to conform to existing code rather than to write new code in a different style. Reading a mixture of styles was harder, no matter which particular style you thought was better or why you thought it was better.
Athletes are not great judges of what training works for themselves.
Wax on, wax off? ;-) Dan

On 01/05/18 01:42, Steven D'Aprano wrote:
That's a really nice study, and thank you for posting it.
Seconded!
There are some interested observations here, e.g.:
- line length is negatively correlated with readability;
(a point against those who insist that 79 character line limits are irrelevant since we have wide screens now)
There are physiological studies of how we read that support this. I don't have any references to hand (I'd have to go hunting, and no promises because the ones I'm faintly aware of happened at least 40 years ago!), but the gist is that we don't sweep our eyes continuously along a line of text as you might assume, we jerk our attention across in steps.
- length of identifiers was strongly negatively correlated with readability: long, descriptive identifier names hurt readability while short variable names appeared to make no difference;
I'd be interested to know if there is a readability difference between really_long_descriptive_identifier_name and ReallyLongDescriptiveIdentifierNames. I rather suspect there is :-) and the study may be biased by picking on Java as its example language here. -- Rhodri James *-* Kynesim Ltd

Rhodri James wrote:
I'd be interested to know if there is a readability difference between really_long_descriptive_identifier_name and ReallyLongDescriptiveIdentifierNames.
As one data point on that, jerking my eyes quickly across that line I found it much easier to pick out the component words in the one with underscores. -- Greg

2018-05-01 14:54 GMT+02:00 Greg Ewing <greg.ewing@canterbury.ac.nz>:
Rhodri James wrote:
I'd be interested to know if there is a readability difference between really_long_descriptive_identifier_name and ReallyLongDescriptiveIdentifierNames.
As one data point on that, jerking my eyes quickly across that line I found it much easier to pick out the component words in the one with underscores.
-- Greg
Which is funny, because I had the exact opposite. Might it be that we've had different conditioning ? Jacco

On Tue, May 1, 2018 at 8:04 AM, Jacco van Dorp <j.van.dorp@deonet.nl> wrote:
2018-05-01 14:54 GMT+02:00 Greg Ewing <greg.ewing@canterbury.ac.nz>:
Rhodri James wrote:
I'd be interested to know if there is a readability difference between really_long_descriptive_identifier_name and ReallyLongDescriptiveIdentifierNames.
As one data point on that, jerking my eyes quickly across that line I found it much easier to pick out the component words in the one with underscores.
Which is funny, because I had the exact opposite.
Might it be that we've had different conditioning ?
Almost certainly. I started using CamelCase in the mid-'80s and it seems very natural to me, since we still use it for (as you mention) GUI packages derived from C extension modules with that convention. On the other hand, I've also written a lot of snake_form identifiers in non-GUI Python, so that seems fairly natural to me, too.

On Tue, May 1, 2018 at 6:04 PM, Jacco van Dorp <j.van.dorp@deonet.nl> wrote:
2018-05-01 14:54 GMT+02:00 Greg Ewing <greg.ewing@canterbury.ac.nz>:
Rhodri James wrote:
I'd be interested to know if there is a readability difference between really_long_descriptive_identifier_name and ReallyLongDescriptiveIdentifierNames.
As one data point on that, jerking my eyes quickly across that line I found it much easier to pick out the component words in the one with underscores.
-- Greg
Which is funny, because I had the exact opposite.
Might it be that we've had different conditioning ?
Jacco
The one with underscores reads somewhat better though. Might it be that it really does read better? Ok, that's not scientific enough. But the score is 2:1 so far ;-)

To be pedantic: ReallyLongDescriptiveIdentifierNames also has an issue with "I", which might confuse because it looks the same as a little L. Just to illustrate that the choice of comparison samples is a very sensitive thing. In such a way, an experienced guy could even scam the experimental subjects by making samples which will show what he wants in the result.

Mikhail

On Tue, May 1, 2018 at 5:35 PM, Mikhail V <mikhailwas@gmail.com> wrote:
To be pedantic: ReallyLongDescriptiveIdentifierNames also has an issue with "I", which might confuse because it looks the same as a little L. Just to illustrate that the choice of comparison samples is a very sensitive thing. In such a way, an experienced guy could even scam the experimental subjects by making samples which will show what he wants in the result.
I love this discussion, but I think anything that isn't included in a .py file would have to be outside the scope, at least of the alpha version :). I am really interested in these factors in general, however. Now I'm surprised no one has asked which font everyone else is using when determining readability. "serif? are you mad? no wonder!" "+1 on PEP conditional on mandatory yellow (#FFEF00) keyword syntax highlighting in vim" -Matt
Mikhail

Matt, you took the words right out of my mouth! The fonts that are being used will make a big difference in readability, as will font size, foreground and background coloring, etc.

It would be interesting to see if anyone has done a serious study of this type though, especially if they studied it over the course of several hours. (I'm getting older, and I've noticed that after about 8-10 hours of coding it doesn't matter what I'm looking at, I can't focus enough to read it. But I don't know when I start to degrade, nor do I know whether different fonts would help me degrade more slowly.)

Thanks, Cem Karan

On Tue, May 1, 2018 at 9:03 PM, Matt Arcidy <marcidy@gmail.com> wrote:
On Tue, May 1, 2018 at 5:35 PM, Mikhail V <mikhailwas@gmail.com> wrote:
To be pedantic: ReallyLongDescriptiveIdentifierNames also has an issue with "I", which might confuse because it looks the same as a little L. Just to illustrate that the choice of comparison samples is a very sensitive thing. In such a way, an experienced guy could even scam the experimental subjects by making samples which will show what he wants in the result.
I love this discussion, but I think anything that isn't included in a .py file would have to be outside the scope, at least of the alpha version :). I am really interested in these factors in general, however. Now I'm surprised no one asks which font each other are using when determining readability.
"serif? are you mad? no wonder!" "+1 on PEP conditional on mandatory yellow (#FFEF00) keyword syntax highlighting in vim"
-Matt
Mikhail

On Wed, May 2, 2018 at 4:03 AM, Matt Arcidy <marcidy@gmail.com> wrote:
On Tue, May 1, 2018 at 5:35 PM, Mikhail V <mikhailwas@gmail.com> wrote:
To be pedantic: ReallyLongDescriptiveIdentifierNames also has an issue with "I", which might confuse because it looks the same as a little L. Just to illustrate that the choice of comparison samples is a very sensitive thing. In such a way, an experienced guy could even scam the experimental subjects by making samples which will show what he wants in the result.
I love this discussion, but I think anything that isn't included in a .py file would have to be outside the scope, at least of the alpha version :). I am really interested in these factors in general, however. Now I'm surprised no one asks which font each other are using when determining readability.
"serif? are you mad? no wonder!" "+1 on PEP conditional on mandatory yellow (#FFEF00) keyword syntax highlighting in vim"
Well, I am asking. Looking at the online PEPs, I am under the impression everyone should use huge-sized Consolas and no syntax highlighting at all. Just as with "=" and "==": making samples without highlighting will show a similarity issue; making them with different highlighting/font styles will show that there is no issue. Or, ":=" looks ok with Times New Roman, but with Consolas it looks like Dr. Zoidberg's face. Mikhail
participants (13): Antoine Pitrou, CFK, Chris Angelico, Dan Sommers, Eric Fahlgren, Greg Ewing, Jacco van Dorp, Matt Arcidy, Mikhail V, Nathaniel Smith, Rhodri James, Steven D'Aprano, Tim Peters