Docstrings considered too complicated

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Thu Mar 4 21:55:46 EST 2010


On Thu, 04 Mar 2010 23:38:31 +1300, Gregory Ewing wrote:

> Steven D'Aprano wrote:
> 
>> True, but one can look at "best practice", or even "standard practice".
>> For Python coders, using docstrings is standard practice if not best
>> practice. Using strings as comments is not.
> 
> In that particular case, yes, it would be possible to objectively
> examine the code and determine whether docstrings were being used as
> opposed to above-the-function comments.
> 
> However, that's only a very small part of what goes to make good code.
> Much more important are questions like: Are the comments meaningful and
> helpful? Is the code reasonably self-explanatory outside of the
> comments? Is it well modularised, and common functionality factored out
> where appropriate? Are couplings between different parts minimised? Does
> it make good use of library code instead of re-inventing things? Is it
> free of obvious security flaws?
> 
> You can't *measure* these things. You can't objectively boil them down
> to a number and say things like "This code is 78.3% good; the customer
> requires it to be at least 75% good, so it meets the requirements in
> that area."
> 
> That's the way in which I believe that software engineering is
> fundamentally different from hardware engineering.


You are conflating two independent questions:

(a) Can we objectively judge the goodness of code, or is it subjective?

(b) Is goodness of code quantitative, or is it qualitative?

You can turn any qualitative measurement into a quantitative one by 
mapping it onto a numeric scale. Instead of "best/good/average/poor/
crap", just rate it from 1 through 5, and now you have a measurement 
that can be averaged and compared with other measurements.
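
A minimal sketch of that mapping (the labels and the 1-5 scale are 
purely illustrative):

# Illustrative only: map qualitative labels onto a 1-5 numeric scale
# so that ratings can be averaged and compared.
SCALE = {"crap": 1, "poor": 2, "average": 3, "good": 4, "best": 5}

def mean_rating(labels):
    """Average a batch of qualitative labels as numbers."""
    scores = [SCALE[label] for label in labels]
    return sum(scores) / len(scores)

print(mean_rating(["good", "average", "best", "good"]))  # 4.0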

The hard part is turning subjective judgements ("are these comments 
useful, or are they pointless or misleading?") into objective judgements. 
It may be that there is no entirely objective way to measure such things. 
But we can make quasi-objective judgements, by averaging out all the 
individual quirks of subjective judgement:

(1) Take 10 independent judges who are all recognised as good Python 
coders by their peers, and ask them to give a score of 1-5 for the 
quality of the comments, where 1 means "really bad" and 5 means "really 
good". If the average score is 4 or higher, gain a point. If the average 
score is 3 or lower, lose a point.

(2) Take 10 independent judges, as above, and have them rate the code on 
how self-explanatory it is. An average score of 3 or higher gains a 
point; an average of under 2 loses a point.

(Note that I'm more forgiving of non-self-explanatory code than I am of 
bad comments. Better to have no comments than bad ones!)

And so on, through all the various metrics you want to measure.

If the total number of points exceeds some threshold, the software 
passes, otherwise it fails, and you have a nice list of all the weak 
areas that need improvement.

You can do fancy things too, like discarding the highest and lowest 
score from the ten judges (to keep an unusually strict, or unusually 
slack, judge from distorting the result).
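
A rough sketch of how the whole scheme might be wired up (the judges' 
scores and the pass mark are invented for illustration; the per-metric 
thresholds are the ones given above):

def trimmed_mean(scores):
    """Average the judges' scores after discarding the single highest
    and single lowest, to blunt the effect of an unusually strict or
    unusually slack judge."""
    trimmed = sorted(scores)[1:-1]
    return sum(trimmed) / len(trimmed)

def points_for(metric, average):
    """Turn an average 1-5 score into +1 / 0 / -1 points, using the
    thresholds above (stricter for comments than for how
    self-explanatory the code is)."""
    if metric == "comments":
        if average >= 4:
            return 1
        if average <= 3:
            return -1
    elif metric == "self_explanatory":
        if average >= 3:
            return 1
        if average < 2:
            return -1
    return 0

# Hypothetical scores from ten judges, for two of the metrics.
review = {
    "comments": [4, 5, 4, 3, 4, 5, 4, 4, 2, 5],
    "self_explanatory": [3, 3, 4, 2, 3, 4, 3, 3, 1, 4],
}

PASS_MARK = 1  # entirely arbitrary threshold
total = sum(points_for(m, trimmed_mean(s)) for m, s in review.items())
print("PASS" if total >= PASS_MARK else "FAIL", total)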

If this all seems too expensive, then you can save money by having fewer 
judges, perhaps even as few as a single code reviewer who is trusted to 
meet whatever standards you are hoping to apply. Or have the judges rate 
randomly selected parts of the code rather than all of it. This will 
severely penalise code that isn't self-explanatory and modular, as the 
judges will not be able to understand it and consequently give it a low 
score.

Of course, there is still a subjective component to this. But it is a 
replicable metric: any group of ten judges should give quite similar 
scores, up to whatever level of confidence you want, and one can perform 
all sorts of objective statistical tests on them to determine whether 
deviations are due to chance or not.

To do all this on the cheap, you could even pass it through something 
like PyLint, which gives you an objective (but not very complete) 
measurement of code quality.
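
For instance, a quick-and-dirty version might just run pylint and pull 
out its overall score; this sketch assumes pylint is installed and that 
its summary line looks like "Your code has been rated at 7.50/10":

# Cheap, automated check: run pylint on a module and extract its score.
# The regex is an assumption about pylint's summary-line format.
import re
import subprocess

def pylint_score(path):
    result = subprocess.run(
        ["pylint", path], capture_output=True, text=True
    )
    match = re.search(r"rated at ([-\d.]+)/10", result.stdout)
    return float(match.group(1)) if match else None

print(pylint_score("mymodule.py"))  # hypothetical module name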

The real problem isn't that defining code quality can't be done, but that 
it can't be done *cheaply*. There are cheap methods, but they aren't very 
good, and there are good methods, but they're very expensive.


-- 
Steven


