[Types-sig] recursive types, type safety, and flow analysis

Tue, 21 Dec 1999 16:33:24 -0800 (PST)

On Tue, 21 Dec 1999, Guido van Rossum wrote:
> [Greg Stein]
> > I should have clarified that I don't think his particular example would
> > work because of the compile- / definition- time recursion of the names.
> > Runtime? Sure. It's fine.
> 
> Hm...  Since type checking is essentially a compile time activity, I
> think it would be better if the run time order of events didn't
> matter.

But runtime order does matter.

   a = 1
   func_taking_int(a)
   a = "foo"
   func_taking_string(a)

I can come up with all kinds of variants, but that's the basic pattern.
The code is perfectly type-safe, and depends on the order of events. If we
waited until the end, then "a" will either have type "String," or type
"Int or String." Either way, it produces false errors.

> Yes, in Pascal or C you need to declare everything before
> it's used.  This is a compromise because of old-fashioned one pass
> compiler technology.  I don't see a reason why we should adopt this
> rule in Python.  Note that Java doesn't have this either -- you can
> declare your methods in any order you like and the compiler will
> figure it out.

My new position (after your prodding my brain :-) is that each suite is
evaluated in order. The global suite first, then each of the function
bodies (in arbitrary order).

This basically gives us a multiple pass, and allows functions, variables,
classes, etc to be defined in any order.

But: it still doesn't allow for the recursive type declarators. To be
clear, it allows:

def f() -> String:
  return g()
def g() -> String:
  return "abc"

But it does not allow:

def f(x: Foo):
  ...
class Foo:
  ...

I believe the compiler should be recording information about the function
arguments' typedecls. Unless the compiler is going to have multiple passes
then the name should be defined before usage.

Or rather, let's assume that the function argument information is
constructed and recorded at runtime (as part of the standard function
object construction at runtime). Then you really have to ensure that name
is available, so the appropriate value can be stored into the function
object.

(this is, of course, predicated on recording signatures in the function
 object for use at runtime; I feel strongly that we should do this, as it 
 will dramatically assist some runtime tools/apps such as IDLE)

> But, you may say, in Python we have a certain run time order of events
> that defines the validity of names.  Names must be defined before they
> are used, and a later redefinition overrides an earlier one.
> 
> Of course.  (Although I wouldn't mind getting at least a compile time
> warning when I define two classes or methods with the same name; it
> can be frustrating when you're editing the first definition but your
> program keeps using the second one! :-)

We can surely issue a warning for a redefinition that changes the type.

> But checking that names are defined by the time they are used at run
> time is a different kind of check.  (Java does this too, to a decent
> extent.)  I personally find this a very useful check.  But it doesn't
> necessarily affect the compile time static typechecking.

But we have runtime information to record, and I also believe that we have
some runtime type checks to perform. In the above example, I think we
should be implicitly inserting code like so:

def f(x: Foo):
  x ! Foo
  ...

While we can do a lot of useful compile-time checks, I think we still have
runtime considerations that impose ordering.

> > Because Y is not defined. This is analogous to the following code:
> > 
> > class Foo:
> >   def build_it(self, x, y, cls=Bar):
> >     return cls(x, y)
> > 
> > class Bar:
> >   ...
> > 
> > The above code breaks. I am positing that if you put "Y" into a
> > declarator, then it should exist at that point in time. Where "time" is
> > specified as following the flow of execution as the functions/classes are
> > defined.
> 
> I disagree.  See above -- there's no reason to burden the compile time
> type checker with run time ordering.

I've loosened up to "globals first, then function bodies." That provides a
lot of relaxation of the requirements. (but it still does not allow for
recursive declarators)

To resolve the recursive declarator problem, I think we'd simply want a
notion of an undefined interface (much like an undefined struct). I'm not
sure of the mechanics for runtime resolution, but this would allow us to
do something like:

decl class Foo

class Bar:
  def method(self, x: Foo)->None:
    ...
class Foo:
  def method(self, x: Bar)->None:
    ...

We still have a runtime issue, however :-(

> > I don't think we can detect that call-before-definition, though.
> 
> But I think you can, in by far the most cases.  There may be a few

You can within a single block of code. It is very difficult across code
bodies, and requires an entirely different kind of analysis.

My point was about different code bodies.

> borderline cases where it's impossible to tell, and I don't mind
> requiring a little help to make the type checker happy.  For example,
> in C code I frequently add an initialization of a local variable to 0
> which isn't really necessary because it is initialized in a for loop,
> but the compiler isn't smart to figure out that the for loop will
> execute at least once.  Gcc -Wall complains about such cases, and
> shutting it up completely every once in a while is sufficiently
> satisfying that I'll add redundant code.  Of course, I'd be happier if
> gcc was smarter, and I hope that Python's type checker will usually be
> smarter -- and then in the remaining cases I think it's okay to help
> it.

Sure. This is all within a single code body. I agree that we can provide
use-before-definition errors.

Going back to the original problem:

def f():
  g()
def g():
  ...

There isn't a way to easily know that g is defined at the time f is
called. We don't even record that g is used by f!

The best that we can do is note that g is available in the global
namespace and that it has a proper type. But we can't determine whether it
might be Undefined or not.

Specifically, consider the two cases:

1)  del g
    f()

2)  f()
    del g

Unless the compiler knows that f() is going to use g, it can't do anything
here. It has to do some *serious* control flow analysis and record a lot
of information about f's requirements.

We might be able to go one step beyond and say (during f's analysis) that
g is possibly undefined, but we would get a lot of those warnings. That's
because we don't really have/record information that distinguishes these
patterns:

1) f()
   def g():
     ...

2) def g():
     ...
   f()

3) def g():
     ...
   del g
   f()

4) def g():
     ...
   f()
   del g

In case 1, you could say that g is "Func or Undefined" simply by stating a
policy that *any* global is "... or Undefined". The analysis of f() would
then raise an appropriate flag.

In case 2, any assumption of "or Undefined" is invalid. The code is fine.

In cases 3 and 4, maybe we're assuming that not all globals get an "or
Undefined" unless we see a "del" statement. That's fine, but we may have a
false warning because we can't different cases 3 and 4.

> > I think your point can be restated as: Can we type-check the following
> > code?
> > 
> > def f() -> String:
> >   return g()
> > 
> > def g() -> String:
> >   ...
> > 
> > I haven't thought about this particular scenario or the resulting impact
> > on the inferencer. We probably require some kind of a two-pass analysis as
> > John points out. Maybe it is as simple as deferring analysis of function
> > bodies until the global "body" is analyzed. Actually, I think the deferral
> > mechanism is sufficient, as that mirrors the execution environment: at the
> > time the function body is executed, the globals are defined.
> > [ with the caveat of call-before-define, but we can't determine that ]
> 
> I don't see a big problem here for the type checker.  Assuming that
> there's only one definition of g, and that we disallow changes to g
> from outside the module (and from exec statements), the type checker
> will have no trouble discovering that g is a global function
> definition, and it can collect its type info to help checking f.
> There may even be arbitrary cross references; the solution (from the
> type checking point of view) is to iterate until all definitions are
> found.

I agree. We can do this. This was my track about deferral.

> Again, checking that g is actually defined by the time f is called is
> a separate thing; but again in most cases this will be easy, since
> there is usually no executable code between the definitions of f and g
> (except perhaps other function definitions).  It's a simple flow check.

I disagree that it will be easy or that it is "a simple flow check."
Checking for undefined names is really only easy within a single code
body. As I outline further above, I don't think you can tell whether g is
really defined by the time f is called.

> > Hrm. The whole call-before-define problem might actually axe Paul's desire
> > for type safety. I don't think we can *ever* guarantee that a name will
> > exist at the time it is used. For example:
> > 
> > value = 1
> > def typesafe f():
> >   somefunc(value)
> > 
> > del value
> > f()
> > 
> > If you start doing *control* flow analysis, then you might be able to
> > definitely state the above code is in error. But then, I'll just throw
> > this wrench at it:
> > 
> > if sometest():
> >   del value
> > f()
> > 
> > Now what?
> 
> Simple.  After the if statement has executed, value is "possibly
> undefined".  This warrants a warning.

Right. But we don't cross-reference the fact that at after the if
statement is one of the times that we call f() and that f() happens to
need "value."

Recording this information about "when" is control flow. I think we just
want to record possible types and feed that into the type-checker. With
the presence of "del" in the above code, maybe we can record "or
Undefined", but that isn't really going to do what we'd like.

> > The type inferencing that I believe we can/should be using is based on
> > some basic data flow, without regard to definitively determining whether a
> > particular branch is reached or not. If a possibility exists, then the
> > possible types are unioned in. If type-safety is defined as "no
> > NameErrors", then we have a problem, as control flow is required.
> 
> I don't see the problem.  I claim that the examples you give are
> accidents waiting to happen, so it's only helpful if the type checker
> complains about them!

I agree, and it would be nice to have a warning. I don't think it is
possible (given the scope of analysis that I'm thinking of). You would
need a LOT more analysis to determine "undefined." And it would probably
have to be global (cross-module).

> > Note that I believe the following can be handled:
> > 
> > value = 1
> > def typesafe f()
> >   func_taking_int(value)
> > 
> > value = "foo"
> > f()
> > 
> > In this case, the global variable "value" has a typedecl of: Int or
> > String. This would fail the func_taking_int() function call.
> 
> Yes.  In my view, "possibly undefined" is no different than "int or
> string".

Agreed. See above.

> > Back to the other point: I do believe that you should not use a name in a
> > type-declarator if it isn't defined.
> 
> I like the idea better (I think proposed by Tim Peters) that the names
> used for type declarations live in a separate compile time namespace
> where different rules apply.  (Even though there are obvious
> correspondences, e.g. the names of defined or imported classes should
> probably be available both at compile time and at run time.)

I think we're going to want those names at runtime, which means they
should be defined at the time of their usage.

If we have a separate namespace, then I think the output of the compiler
will need a bit more magic because it would need to build that namespace
right at the beginning, for use by the code later on.  i.e. some prologue
which sets up the typedecl namespace.  That just doesn't strike me as a
good thing.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/