[Python-Dev] PEP 338 issue finalisation (was Re: 2.5 PEP)

Guido van Rossum guido at python.org
Thu Feb 16 16:07:41 CET 2006

On 2/16/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
> Guido van Rossum wrote:
> > Do you have unit tests for everything? I believe I fixed a bug in the
> > code that reads a bytecode file (it wasn't skipping the timestamp).

[Hey, I thought I sent that just to you. Is python-dev really
interested in this?]

> I haven't worked the filesystem-based tests into the unit tests yet, and even
> the manual tests I was using managed to leave out compiled bytecode files (as
> you noticed). I'll fix that.
> Given that I do my testing on Linux, I probably still would have forgotten the
> 'rb' mode definitions on the relevant calls to open() though...

But running the unit tests on Windows would have revealed the problem.

> > +++ pep-0338.txt      (working copy)
> > -    The optional argument ``init_globals`` may be used to pre-populate
> > +    The optional argument ``init_globals`` may be a dictionary used to pre-populate
> >      the globals dictionary before the code is executed. The supplied
> >      dictionary will not be modified.
> I just realised that anything that's a legal argument to "dict.update" will
> work. I'll fix the function description in the PEP (and the docs patch as well).

I'm not sure that's a good idea -- you'll never be able to switch to a
different implementation then.

> > --- runpy.py  Wed Feb 15 15:56:07 2006
> >           def get_data(self, pathname):
> > !             # XXX Unfortunately PEP 302 assumes text data :-(
> > !             return open(pathname).read()
> Hmm.
> The PEP itself requests that a string be returned from get_data(), but doesn't
> require that the file be opened in text mode. Perhaps the PEP 302 emulation
> should use binary mode here? Otherwise there could be strange data corruption
> bugs on Windows.

But PEP 302 shows as its only example reading from a file with a .txt
extension. Adding spurious \r characters is also data corruption. We
should probably post to python-dev a request for clarification of PEP
302, but in the meantime I vote for text mode.
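The corruption risk being debated is easy to make concrete. A small demonstration under modern Python, where text mode applies universal-newline translation on every platform (in the Python of this thread the effect was Windows-specific); the payload bytes here are arbitrary:

```python
import os
import tempfile

# Write a payload containing bytes that text mode will translate.
payload = b"magic\r\n\x00\x01binary-tail"
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(payload)

with open(path, "rb") as f:   # binary mode: the bytes come back intact
    raw = f.read()
assert raw == payload

with open(path, "r") as f:    # text mode: newline translation mangles them
    text = f.read()
assert "\r" not in text       # the \r has silently disappeared

os.remove(path)
```

Either choice corrupts *something*: binary mode preserves the `\r` that a text consumer would not expect, text mode drops bytes a binary consumer needs.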

> > --- 337,349 ----
> >
> >   # This helper is needed as both the PEP 302 emulation and the
> >   # main file execution functions want to read compiled files
> > + # XXX marshal can also raise EOFError; perhaps that should be
> > + # turned into ValueError?  Some callers expect ValueError.
> >   def _read_compiled_file(compiled_file):
> >       magic = compiled_file.read(4)
> >       if magic != imp.get_magic():
> >           raise ValueError("File not compiled for this Python version")
> > +     compiled_file.read(4) # Throw away timestamp
> >       return marshal.load(compiled_file)
> I'm happy to convert EOFError to ValueError here if you'd prefer (using the
> string representation of the EOFError as the message in the ValueError).
> Or did you mean changing the behaviour in marshal itself?

No -- the alternative is to catch EOFError in _read_compiled_file()'s
caller, but that seems worse. You should check marshal.c to see whether
it can raise any *other* errors (perhaps OverflowError?).

Also, *perhaps* it makes more sense to return None instead of raising
ValueError? Since you're always catching it? (Or are you?)

> > --- 392,407 ----
> >       loader = _get_loader(mod_name)
> >       if loader is None:
> >           raise ImportError("No module named " + mod_name)
> > +     # XXX get_code() is an *optional* loader feature. Is that okay?
> >       code = loader.get_code(mod_name)
> If the loader doesn't provide access to the code object or the source code,
> then runpy can't really do anything useful with that module (e.g. if it's a C
> builtin module). Given that PEP 302 states that if you provide get_source()
> you should also provide get_code(), this check should be sufficient to let
> runpy.run_module get to everything it can handle.

OK. But a loader could return None from get_code() -- do you check for
that? (I don't have the source handy here.)

> A case could be made for converting the attribute error to an ImportError, I
> guess. . .

I'm generally not keen on that; leave it.

> >       filename = _get_filename(loader, mod_name)
> >       if run_name is None:
> >           run_name = mod_name
> > +     # else:
> > +         # XXX Should we also set sys.modules[run_name] = sys.modules[mod_name]?
> > +         #     I know of code that does "import __main__".  It should probably
> > +         #     get the substitute __main__ rather than the original __main__,
> > +         #     if run_name != mod_name
> >       return run_module_code(code, init_globals, run_name,
> >                              filename, loader, as_script)
> Hmm, good point. How about a different solution, though: in run_module_code, I
> could create a new module object and put it temporarily in sys.modules, and
> then remove it when done (restoring the original module if necessary).

That might work too.

What happens when you execute "foo.py" as __main__ and then (perhaps
indirectly) something does "import foo"? Does a second copy of foo.py
get loaded by the regular loader?

> That would mean any module with code that looks up "sys.modules[__name__]"
> would still work when run via runpy.run_module or runpy.run_file.

Yeah, except if they do that, they're likely to also *assign* to that.
Well, maybe that would just work, too...

> I also realised that sys.argv[0] should be restored to its original value, too.


> I'd then change the "as_script" flag to "alter_sys" and have it affect both of
> the above operations (and grab the import lock to avoid other import or
> run_module_code operations seeing the altered version of sys.modules).

Makes sense.

I do wonder if runpy.py isn't getting a bit over-engineered -- it
seems a lot of the functionality isn't actually necessary to implement
-m foo.bar, and the usefulness in other circumstances is as yet
unproven. What do you think of taking a dose of YAGNI here?
(Especially since I notice that most of the public APIs are very thin
layers over exec or execfile -- people can just use those directly.)

> > --- 439,457 ----
> >
> >          Returns the resulting top level namespace dictionary
> >          First tries to run as a compiled file, then as a source file
> > +        XXX That's not the same algorithm as used by regular import;
> > +            if the timestamp in the compiled file is not equal to the
> > +            source file's mtime, the compiled file is ignored
> > +            (unless there is no source file -- then the timestamp
> > +            is ignored)
> They're doing different things though - the import process uses that algorithm
> to decide which filename to use (.pyo, .pyc or .py). This code in run_file is
> trying to decide whether the supplied filename points to a compiled file or a
> source file without a tight coupling to the specific file extension used (e.g.
> so it works for Unix Python scripts that rely on the shebang line to identify
> which interpreter to use to run them).
> I'll add a comment to that effect.

Ah, good point. So you never go from foo.py to foo.pyc, right?

> Another problem that occurred to me is that the module isn't thread-safe at
> the moment. The PEP 302 emulation isn't protected by the import lock, and the
> changes to sys.argv in run_module_code will be visible across threads (and may
> clobber each other or the original if multiple threads invoke the function).

Another reason to consider cutting it down to only what's needed by
-m; -m doesn't need thread-safety (I think).

> On that front, I'll change _get_path_loader to acquire and release the import
> lock, and the same for run_module_code when "alter_sys" is set to True.

OK, just be very, very careful. The import lock is not a regular mutex,
and if you don't release it you're stuck forever. Just use try/finally.

--Guido van Rossum (home page: http://www.python.org/~guido/)
