[Python-Dev] Catching "return" and "return expr" at compile time

Sun, 12 Sep 1999 05:18:16 -0400

[MarkH]
> ...
> I based my assessment simply on my perception of what is likely to
> happen, not my opinion of what _should_ happen.

I based mine on what Guido was waiting for someone to say <wink>.

We worry too much about disagreeing here; different opinions are great!
Guido will squash the ones he can't stand anyway.

[about Aaron's pylint's lack of 1.5.2 smarts]
> ...
> Aaron agrees that a parser module based one would be better.

You can't beat a real parse, no argument there.  Luckily, the compiler
parses too.

[Guido]
> What stands in the way?
>
> (a) There's so much else to do...

How did Perl manage to attract 150 people with nothing to do except hack on
Perl internals?  "Wow, that code's such a mess I bet even *I* could get
something into it" <0.6 wink>.

> (b) *Someone* needs to design a good framework for spitting out
> warnings, and for giving the programmer a way to control which
> warnings to ignore. I've seen plenty of good suggestions here; now
> somebody should simply go off and come up with a proposal too good to
> refuse.

The response has been ... absent.  Anyone doing this?  I liked JimF's push
to make cmd-line options available to Python programs too.  Somehow they
seem related to me.

> (c) I have a different agenda with CP4E -- I think it would be great
> if the editor could do syntax checking and beyond *while you type*,
> like the spell checker in MS Word.  (I *like* Word's spell checker,
> even though I hate the style checker [too stupid], and I gladly put up
> with the spell checker's spurious complaints -- it's easy to turn off,
> easy to ignore, and it finds lots of good stuff.)
>
> Because the editor has to deal with incomplete and sometimes
> ungrammatical things, and because it has to work in real time (staying
> out of the way when you're frantically typing, catching up when your
> fingers take a rest), it needs a different kind of parser.

Different from what?  Python's own parser for sure.  IDLE has at least two
distinct parsers of its own that have nothing in common with Python's parser
and little in common with each other.  Using the horrid tricks in
PyParse.py, it may even be possible to write the kind of parser you need in
Python and still have it run fast enough.

For parsing-on-the-fly from random positions, take my word for it and
Barry's as insurance <wink>:  the single most frequent question you need to
have a fast and reliable answer for is "is this character in a string?".
Unfortunately, turns out that's the hardest question to answer too.  The
next one is "am I on a continuation line, and if so where's the start?".
Given rapid & bulletproof ways to answer those, the rest is pretty easy.
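To make the first question concrete, here is a minimal sketch of the slow
but dependable way to answer it: scan from the top of the text and track
quote state.  The name in_string is made up for illustration, and the sketch
assumes plain string literals (no nested f-string expressions).  A real
editor needs something much faster than a rescan from offset 0, which is
exactly why the question is hard.

```python
def in_string(source, offset):
    """Return True if source[offset] lies inside a string literal.

    Hypothetical helper: rescans from the start on every call, so it is
    reliable but far too slow for use on every keystroke.
    """
    i, n = 0, len(source)
    while i < n and i <= offset:
        ch = source[i]
        if ch == "#":
            # Comment: quotes inside it do not open strings.
            nl = source.find("\n", i)
            i = n if nl < 0 else nl + 1
        elif ch in "'\"":
            triple = source[i:i + 3] in ("'''", '"""')
            quote = source[i:i + 3] if triple else ch
            start = i
            i += len(quote)
            while i < n:
                if source[i] == "\\":
                    i += 2              # skip the escaped character
                elif source.startswith(quote, i):
                    i += len(quote)     # consume the closing quote
                    break
                elif source[i] == "\n" and not triple:
                    break               # unterminated one-line string
                else:
                    i += 1
            if start <= offset < i:
                return True
        else:
            i += 1
    return False
```

The opening and closing quote characters count as "in the string" here;
whether that's the right call depends on what the editor wants to do with
the answer.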

> But that's another project, and once the Python core has a warning
> framework in place, I'm sure we'll find more things that are worth
> warning about.

That was frequently predicted for various pylint projects too <wink>.

> I'm not always in agreement with Tim Peters when he says that Python
> is so dynamic that it's impossible to check for certain errors.  It
> may be impossible to say *for sure* that something is an error, but
> there sure are lots of situations where you're doing something that's
> *likely* to be an error.

We have no disagreement there.  What a compiler does must match the
advertised semantics of the language-- or its own advertised deviations from
those --without compromise.  A warning system has no such constraint; to the
contrary, in the case of a semantic mess like Perl, most of its value is in
pointing out *legal* constructs that are unlikely to work the way you
intended.

> E.g. if the compiler sees len(1), and there's no local or global
> variable named len, it *could* be the case that the user has set up a
> parallel universe where the len() built-in accepts an integer
> argument, but more likely it's an honest mistake, and I would like to
> get a warning about it.

Me too.  More:  I'd also like to get a warning for *having* a local or
global variable named len!  Masking the builtin names is simply bad
practice, and is also usually an honest mistake.
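A rough sketch of the len(1) check, assuming a Python that ships the
standard ast module (an assumption; nothing like it is in the core being
discussed).  The function name suspicious_len_calls is invented for
illustration.  It stays quiet whenever the source binds "len" itself, which
is the cheap stand-in for "no local or global variable named len"; it does
not look at star-imports or other modules.

```python
import ast


def suspicious_len_calls(source):
    """Return line numbers where len() is applied to a numeric literal."""
    nodes = list(ast.walk(ast.parse(source)))

    # If the source rebinds "len" anywhere, assume the author set up the
    # parallel universe on purpose and report nothing.
    for node in nodes:
        if (isinstance(node, ast.Name) and node.id == "len"
                and isinstance(node.ctx, ast.Store)):
            return []
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)) and node.name == "len":
            return []

    lines = []
    for node in nodes:
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "len"
                and len(node.args) == 1
                and isinstance(node.args[0], ast.Constant)
                and isinstance(node.args[0].value, (int, float, complex))):
            lines.append(node.lineno)
    return sorted(lines)
```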

BTW, I was surprised that the most frequent gotcha among new Python users at
Dragon turned out to be exactly that:  dropping a "len" or a "str" or
whatever (I believe len, str and list were most common) into their
previously working code-- because they just learned about that builtin --and
getting runtime errors as a result.  That is, they already had a local var
of that name, and forgot.  Then they were irked that Python didn't nag them
from the start (with a msg they understood, of course).
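The nag those users wanted can be sketched in a few lines, again assuming
the standard ast and builtins modules are available.  The helper name
shadowed_builtins is hypothetical, and the sketch only looks at
assignments, loop targets, def/class names and function parameters.

```python
import ast
import builtins


def shadowed_builtins(source):
    """Return sorted (line, name) pairs where a builtin name is rebound."""
    builtin_names = set(dir(builtins))
    hits = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            # Plain assignment or loop target, e.g. "len = 0" or "for str in ...".
            if node.id in builtin_names:
                hits.add((node.lineno, node.id))
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            if node.name in builtin_names:
                hits.add((node.lineno, node.name))
        elif isinstance(node, ast.arg) and node.arg in builtin_names:
            # Function parameter named after a builtin.
            hits.add((node.lineno, node.arg))
    return sorted(hits)
```

Run on a function that uses "len" as a counter and "str" as a loop
variable, it reports each rebinding with its line number, which is about
all the message needs to say.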

> The hard part here is to issue this warning for len(x) where x is some
> variable or expression that is likely to have a non-sequence value
> (barring alternate universes); this might require global analysis
> that's hard or expensive enough that we can't put it in the core
> (yet).  This may be seen as an argument for a separate lint...

Curiously, Haskell is statically type-safe but doesn't require declarations
of any kind -- it does global analysis, and has a 100% reliable type
inference engine (the language was, of course, designed to make this true).
Yet I don't think I've ever seen a Haskell program on the web that didn't
explicitly declare the type of every global anyway.  I think this is
natural, too:  while it's a PITA to declare the type of every stinking local
that lives for two lines and then vanishes, the types of non-local names
aren't self-evident:  type decls really help for them.

So if anyone is thinking of doing the kind of global analysis Guido mentions
here, and is capable of doing it <wink>, I'd much rather they put their
effort into optional static type decls for Python2.  Many of the same
questions need to be answered either way (like "what's a type?", and "how do
we spell a type?" -- the global analysis warnings won't do any good if you
can't communicate the substance of an error <wink>), and optional decls are
likely to have bigger bang for the buck.

[Skip Montanaro]
> ...
> Perl's experience with -w seems to suggest that it's best to always
> enable whatever warnings you can as well.

While that's my position, I don't want to oversell the Perl experience.
That language allows so many goofy constructs, and makes so many wild guesses
at runtime, that Perl is flatly unusable without -w for non-trivial
programs.  Not true of Python, although the kinds of warnings people have
suggested so far certainly do seem worthy of loud complaint by default.

> (More and more I see people using gcc's -Wall flag as well.)

If you have to write portable C++ code, and don't enable every warning you
can get on every compiler you have, and don't also turn on "treat warnings
as errors", non-portable code will sneak into the project rapidly.  That's
my experience, over & over.  gcc catches stuff MS doesn't, and vice versa,
and MetroWerks yet another blob, and platform-specific cruft *still* gets
in.  It's irksome.

> Now, my return consistency stuff was easy enough to write in C for two
> reasons.  One, I'm fairly comfortable with the compile.c code.

I don't anticipate dozens of people submitting new warning code.  It would
be unprecedented if even two of us decided this was our thing.  It would be
almost unprecedented if even one of us followed up on it <0.6 wink>.

> Two, adding my checks required no extra memory management overhead.

Really good global analysis likely requires again as much C code as already
exists.  Luckily, I don't think putting in some warnings requires that all
conceivable warnings be implemented at once <wink>.  For stuff that complex,
I'd rather make it optional and write it in Python; I don't believe any law
prevents the compiler from running Python code.

> Consider a few other checks you might conceivably add to the byte code
> compiler:
>
>     * tab nanny stuff (already enabled with -t, right?)

Very timidly, yes <wink>.  Doesn't complain by default, and you need -tt to
make it an error.  Only catches 4 vs 8 tab size ambiguities, but that's good
enough for almost everyone almost all the time.

>     * variables set but not used
>     * variables used before set

These would be wonderful.  The Perl/pylint "gripe about names unique in a
module" is a cheap approximation that gets a surprising percentage of the
benefit for the cost of a dict and an exception list.
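A minimal sketch of that cheap approximation, assuming the standard ast
module: count every identifier in the module, and gripe about the ones seen
exactly once.  The function name unique_names is made up, and using the
builtin names as the default exception list is my choice for the sketch,
not something any existing tool is known to do.

```python
import ast
import builtins
from collections import Counter


def unique_names(source, exceptions=frozenset(dir(builtins))):
    """Return sorted (line, name) pairs for names that occur only once."""
    counts = Counter()   # the dict
    where = {}           # line of the first sighting of each name
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            name = node.id
        elif isinstance(node, ast.arg):
            name = node.arg
        else:
            continue
        counts[name] += 1
        where.setdefault(name, node.lineno)
    return sorted((where[name], name)
                  for name, count in counts.items()
                  if count == 1 and name not in exceptions)
```

A misspelled "totl = total + item" shows up at once, because "totl" is
bound and then never mentioned again.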

> If all of this sort of stuff is added to the compiler proper, I predict a
> couple major problems will surface:
>
>     * The complexity of the code will increase significantly, making it
>       harder to maintain and enhance

The complexity of the existing code should be almost entirely unaffected,
because non-trivial semantic analysis belongs in a new subsystem with its
own code.

>     * Fewer and fewer people will be able to approach the code, making it
>       less likely that new checks are added

As opposed to what?  The status quo, with no checks at all?  Somehow, facing
the prospect of *some* checks doesn't frighten me away <wink>.  Besides, I
don't buy the premise:  if someone takes this on as their project, worrying
that they'll decline to add new valuable checks is like MarkH worrying that
I wouldn't finish adding full support for stinking tabs to the common
IDLE/PythonWin editing components.  People take pride in their hackery.

>     * Future extensions like pluggable virtual machines will be harder
>       to add because their byte code compilers will be harder to integrate
>       into the core

If you're picturing adding this stuff sprayed throughout the guts of the
existing com_xxx routines, we've got different pictures in mind.

Semantic analysis is usually a pass between parsing and code generation,
transforming the parse tree and complaining about source it thinks is fishy.
If done in any conventional way, it has no crosstalk at all with either the
parsing work that precedes it or the code generation that follows it.  It's
a pipe stage between them, whose output is of the same type as its input.
That is, it's a "pluggable component" in its own right, and doesn't even
need to be run.  So potential for interference just isn't there.
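Here is a sketch of that pipe-stage picture, assuming a Python where the
standard ast module can both produce a tree and hand it to compile().  The
names compile_with_passes, check_returns and returns_in are invented for
illustration.  The one pass shown is a version of the "return" vs "return
expr" consistency check; it takes a tree, appends to a list of warnings,
and returns a tree, so it can be dropped from the list without the parser
or the code generator noticing.

```python
import ast


def returns_in(func):
    """Yield Return nodes that belong to func itself, not to nested scopes."""
    stack = list(func.body)
    while stack:
        node = stack.pop()
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                             ast.Lambda, ast.ClassDef)):
            continue  # a nested scope has its own returns
        if isinstance(node, ast.Return):
            yield node
        stack.extend(ast.iter_child_nodes(node))


def check_returns(tree, warnings):
    """Semantic pass: gripe when a function mixes 'return' and 'return expr'."""
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            kinds = {ret.value is None for ret in returns_in(node)}
            if kinds == {True, False}:
                warnings.append(
                    f"{node.name} (line {node.lineno}): "
                    "mixes 'return' and 'return expr'")
    return tree  # output has the same type as the input


def compile_with_passes(source, passes, filename="<sketch>"):
    """Parse, run each optional tree -> tree pass, then generate code."""
    warnings = []
    tree = ast.parse(source, filename)
    for semantic_pass in passes:
        tree = semantic_pass(tree, warnings)
    return compile(tree, filename, "exec"), warnings
```

Calling compile_with_passes(source, []) skips the analysis entirely and
still yields the same code object, which is the whole point.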

At present, Python is very unusual both in:

1) Having no identifiable semantic pass at all, parsing directly to byte
code, and enforcing its few semantic constraints (like "'continue' not
properly in loop") lumped in with both of those.

and

2) Having one trivial optimization pass-- 76 lines of code instead of the
usual 76,000 <wink> --running after the byte code has been generated.
However, the sole transformation made here (distinguishing local from
non-local names) is much more properly viewed as being a part of semantic
analysis than as being "an optimization".  It's deducing trivial info about
what names *mean* (i.e., basic semantics), and is called optimization here
only because Python didn't do it at first.
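To show what "deducing trivial info about what names mean" amounts to,
here is a small sketch that classifies the names in a function as local or
not.  It assumes a Python with the standard symtable module, which is an
assumption on my part and not the 76-line optimize() routine itself; the
helper name classify_names is hypothetical.

```python
import symtable


def classify_names(source, func_name):
    """Map each name used in func_name to 'local' or 'non-local'."""
    top = symtable.symtable(source, "<sketch>", "exec")
    for child in top.get_children():
        if child.get_name() == func_name:
            return {sym.get_name(): ("local" if sym.is_local() else "non-local")
                    for sym in child.get_symbols()}
    raise LookupError(func_name)
```

For "def f(a): b = a + c; return len(b)", the parameter a and the assigned
name b come out local, while c and len do not.  That is purely a statement
about meaning; the speedup is a side effect.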

So relating this to a traditional compiler, I'd say that "optimize()" is
truly Python's semantic analysis pass, and all that comes before it is the
parsing pass -- a parsing pass with output in a form that's unfortunately
clumsy for further semantic analysis, but so it goes.  The byte code is such
a direct reflection of the parse tree that there's really little fundamental
difference between them.

So for minimal disruption, I'd move "optimize" into a new module and call it
the semantic analysis pass, and it would work with the byte code.  Just as
now, you wouldn't *need* to call it at all.  Unlike now, the parsing pass
probably needs to save away some more info (e.g., I don't *think* it keeps
track of what all is in a module now in any usable way).

For Python2, I hope Guido adopts a more traditional structure (i.e., parsing
produces a parse tree, codegen produces bytecode from a parse tree, and
other tree->tree transformers can be plugged in between them).  Almost all
compilers follow this structure, and not because compiler writers are
unimaginative droids <wink>.  Compile-time for Python isn't anywhere near
being a problem now, even on my creaky old 166MHz machine; I suspect the
current structure reflects worry about that on much older & slower machines.

Some of the most useful Perl msgs need to be in com_xxx, though, or even
earlier.  The most glaring example is runaway triple-quoted strings.
Python's "invalid token" pointing at the end of the file is maddeningly
unhelpful; Perl says it looks like you have a runaway string, and gives the
line number it thinks it may have started on.  That guess is usually
correct, or points you to what you *thought* was the end of a different
string.  Either way your recovery work is slashed.  (Of course IDLE is even
better:  the whole tail of the file changes to "string color", and you just
need to look up until the color changes!)
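A minimal sketch of the Perl-style guess: scan the text, and if a
triple-quoted string is still open at end of file, report the line it
opened on.  The function name runaway_string_start is made up, and the
sketch assumes plain string literals and ignores everything except quotes,
backslashes and comments.

```python
def runaway_string_start(source):
    """Return the 1-based line where an unterminated triple-quoted string
    opens, or None if every triple-quoted string is closed."""
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch == "#":
            # Quotes inside a comment do not open strings.
            nl = source.find("\n", i)
            i = n if nl < 0 else nl
        elif ch in "'\"":
            triple = source[i:i + 3] in ("'''", '"""')
            quote = source[i:i + 3] if triple else ch
            start = i
            i += len(quote)
            closed = False
            while i < n:
                if source[i] == "\\":
                    i += 2              # skip the escaped character
                elif source.startswith(quote, i):
                    i += len(quote)
                    closed = True
                    break
                elif source[i] == "\n" and not triple:
                    closed = True       # one-line string: not our problem
                    break
                else:
                    i += 1
            if triple and not closed:
                return source.count("\n", 0, start) + 1
        else:
            i += 1
    return None
```

As with Perl's message, the line reported may really be where you
*thought* a different string ended, but it beats a pointer at the end of
the file.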

> In addition, more global checks probably won't be possible (reasoning
> about code across module boundaries for instance) because the compiler's
> view of the world is fairly narrow.

As above, I don't think there's enough now even to analyze one module in
isolation.

> I think lint-like tools should be implemented in Python (possibly with the
> support of an extension module for performance-critical sections) which is
> then called from the compiler proper under appropriate circumstances
> (warnings enabled, all necessary modules importable, etc).

I have no objection to that.  I do object to the barely conceivable getting
in the way of the plainly achievable, though -- the pylint tools out there
now, just like your return consistency checker, do a real service already
without any global analysis.  Moving that much into the core (implemented in
Python if possible, recoded in C if not) gets a large chunk of the potential
benefit for maybe 5% of the eventual work.

It's nice that Guido is suddenly keen on global analysis too, but I don't
see him volunteering to do any work either <wink>.

> I believe the code would be much more easily maintained and extended.

If it's written in Python, of course.

> You'd be able to swap in a new byte code compiler without risking the
> loss of your checking code.

I never understood this one; even if there *were* a competing byte code
compiler out there <0.1 wink>, keeping as much as possible out of com_xxx
should render it a non-issue.  If I'm missing your point and there is some
fundamental conflict here, fine, then it's another basis on which bytecode
compilers will compete.

more-concerned-about-things-that-exist-than-things-that-don't-ly y'rs  - tim