PEP 432: Simplifying the CPython startup sequence - Python-ideas

Dec. 27, 2012

After helping Brett with the migration to importlib in 3.3, and
looking at some of the ideas kicking around for additional CPython
features that would affect the startup sequence, I've come to the
conclusion that what we have now simply isn't sustainable long term.
It's already the case that if you use certain options (specifically -W
or -X), the interpreter will start accessing the C API before it has
called Py_Initialize(). It's not cool when other people do that (we'd
never accept code that behaved that way as a valid reproducer for a
bug report), and it's *definitely* not cool that we're doing it (even
though we seem to be getting away with it for the moment, and have
been for a long time).

The attached PEP is a first attempt at a plan for doing something
about it. (My notes at
http://wiki.python.org/moin/CPythonInterpreterInitialization provide
additional context - let me know if you think there's more material on
that page that should be in the PEP itself.)

The PEP is also available online at http://www.python.org/dev/peps/pep-0432/

Cheers,
Nick.

PEP: 432
Title: Simplifying the CPython startup sequence
Version: $Revision$
Last-Modified: $Date$
Author: Nick Coghlan <ncoghlan@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 28-Dec-2012
Python-Version: 3.4
Post-History: 28-Dec-2012

Abstract
========

This PEP proposes a mechanism for simplifying the startup sequence for
CPython, making it easier to modify the initialisation behaviour of the
reference interpreter executable, as well as making it easier to control
CPython's startup behaviour when creating an alternate executable or
embedding it as a Python execution engine inside a larger application.

Proposal Summary
================

This PEP proposes that CPython move to an explicit 2-phase initialisation
process, where a preliminary interpreter is put in place with limited OS
interaction capabilities early in the startup sequence. This essential core
remains in place while all of the configuration settings are determined,
until a final configuration call takes those settings and finishes
bootstrapping the interpreter immediately before executing the main module.

As a concrete use case to help guide any design changes, and to solve a known
problem where the appropriate defaults for system utilities differ from those
for running user scripts, this PEP also proposes the creation and
distribution of a separate system Python (``spython``) executable which, by
default, ignores user site directories and environment variables, and does
not implicitly set ``sys.path[0]`` based on the current directory or the
script being executed.

Background
==========

Over time, CPython's initialisation sequence has become progressively more
complicated, offering more options, as well as performing more complex tasks
(such as configuring the Unicode settings for OS interfaces in Python 3 as
well as bootstrapping a pure Python implementation of the import system).

Much of this complexity is accessible only through the ``Py_Main`` and
``Py_Initialize`` APIs, offering embedding applications little opportunity
for customisation. This creeping complexity also makes life difficult for
maintainers, as much of the configuration needs to take place prior to the
``Py_Initialize`` call, meaning much of the Python C API cannot be used
safely.

A number of proposals are on the table for even *more* sophisticated
startup behaviour, such as better control over ``sys.path`` initialisation
(easily adding additional directories on the command line in a cross-platform
fashion, as well as controlling the configuration of ``sys.path[0]``), easier
configuration of utilities like coverage tracing when launching Python
subprocesses, and easier control of the encoding used for the standard IO
streams when embedding CPython in a larger application.

Rather than attempting to bolt such behaviour onto an already complicated
system, this PEP proposes to instead simplify the status quo *first*, with
the aim of making these further feature requests easier to implement.

Key Concerns
============

There are a couple of key concerns that any change to the startup sequence
needs to take into account.

Maintainability
---------------

The current CPython startup sequence is difficult to understand, and even
more difficult to modify. It is not clear what state the interpreter is in
while much of the initialisation code executes, leading to behaviour such
as lists, dictionaries and Unicode values being created prior to the call
to ``Py_Initialize`` when the ``-X`` or ``-W`` options are used [1_].

By moving to a 2-phase startup sequence, developers should only need to
understand which features are not available in the core bootstrapping state,
as the vast majority of the configuration process will now take place in
that state.

By basing the new design on a combination of C structures and Python
dictionaries, it should also be easier to modify the system in the
future to add new configuration options.

Performance
-----------

CPython is used heavily to run short scripts where the runtime is dominated
by the interpreter initialisation time. Any changes to the startup sequence
should minimise their impact on the startup overhead. (Given that the
overhead is dominated by IO operations, this is not currently expected to
cause any significant problems).

The Status Quo
==============

Much of the configuration of CPython is currently handled through C level
global variables::

    Py_IgnoreEnvironmentFlag
    Py_HashRandomizationFlag
    _Py_HashSecretInitialized
    _Py_HashSecret
    Py_BytesWarningFlag
    Py_DebugFlag
    Py_InspectFlag
    Py_InteractiveFlag
    Py_OptimizeFlag
    Py_DontWriteBytecodeFlag
    Py_NoUserSiteDirectory
    Py_NoSiteFlag
    Py_UnbufferedStdioFlag
    Py_VerboseFlag

For the above variables, the conversion of command line options and
environment variables to C global variables is handled by ``Py_Main``,
so an embedding application that does not call ``Py_Main`` must set them
directly in order to change them from their defaults.

Some configuration can only be provided as OS level environment variables::

    PYTHONHASHSEED
    PYTHONSTARTUP
    PYTHONPATH
    PYTHONHOME
    PYTHONCASEOK
    PYTHONIOENCODING

Additional configuration is handled via separate API calls::

    Py_SetProgramName() (call before Py_Initialize())
    Py_SetPath() (optional, call before Py_Initialize())
    Py_SetPythonHome() (optional, call before Py_Initialize()???)
    Py_SetArgv[Ex]() (call after Py_Initialize())

The ``Py_InitializeEx()`` API also accepts a boolean flag to indicate
whether or not CPython's signal handlers should be installed.

Finally, some interactive behaviour (such as printing the introductory
banner) is triggered only when standard input is reported as a terminal
connection by the operating system.

Also see more detailed notes at [1_]

Proposal
========

(Note: details here are still very much in flux, but preliminary feedback
is appreciated anyway)

Core Interpreter Initialisation
-------------------------------

The only configuration that currently absolutely needs to be in place
before even the interpreter core can be initialised is the seed for the
randomised hash algorithm. However, there are a couple of settings needed
there: whether or not hash randomisation is enabled at all, and if it's
enabled, whether or not to use a specific seed value.

The proposed API for this step in the startup sequence is::

    void Py_BeginInitialization(Py_CoreConfig *config);

Like Py_Initialize, this part of the new API treats initialisation failures
as fatal errors. While that's still not particularly embedding friendly,
the operations in this step *really* shouldn't be failing, and changing them
to return error codes instead of aborting would be an even larger task than
the one already being proposed.

The new Py_CoreConfig struct holds the settings required for preliminary
configuration::

    typedef struct {
        int use_hash_seed;
        size_t hash_seed;
    } Py_CoreConfig;

To "disable" hash randomisation, set ``use_hash_seed`` and pass a hash seed of
zero. (This seems reasonable to me, but there may be security implications
I'm overlooking. If so, adding a separate flag or switching to a 3-valued
"no randomisation", "fixed hash seed" and "randomised hash" option is easy.)

The core configuration settings pointer may be NULL, in which case the
default behaviour of randomised hashes with a random seed will be used.
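
The three cases described above (NULL config, seed of zero, non-zero seed)
can be sketched as a small decision function. This is only an illustration
of the proposed semantics, not an implementation: ``sketch_core_config``
mirrors the fields of the proposed ``Py_CoreConfig``, while
``sketch_hash_mode`` and ``sketch_resolve_hash_mode`` are hypothetical names
invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in mirroring the proposed Py_CoreConfig fields. */
typedef struct {
    int use_hash_seed;
    size_t hash_seed;
} sketch_core_config;

/* Hypothetical names for the three behaviours discussed in the text. */
typedef enum {
    HASH_RANDOM_SEED,  /* randomised hashing, seed chosen at startup */
    HASH_FIXED_SEED,   /* randomised hashing, caller-supplied seed */
    HASH_DISABLED      /* hash randomisation switched off */
} sketch_hash_mode;

/* Map a (possibly NULL) config onto one of the three behaviours.
 * *seed_out receives the fixed seed, or 0 when no fixed seed applies. */
sketch_hash_mode
sketch_resolve_hash_mode(const sketch_core_config *config, size_t *seed_out)
{
    assert(seed_out != NULL);
    *seed_out = 0;
    if (config == NULL || !config->use_hash_seed) {
        return HASH_RANDOM_SEED;   /* default: random seed */
    }
    if (config->hash_seed == 0) {
        return HASH_DISABLED;      /* use_hash_seed set, seed of zero */
    }
    *seed_out = config->hash_seed;
    return HASH_FIXED_SEED;
}
```

If the 3-valued option mentioned above were adopted instead, the enum in
this sketch would take the place of the ``use_hash_seed`` flag.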

A new query API will allow code to determine if the interpreter is in the
bootstrapping state between core initialisation and the completion of the
initialisation process::

    int Py_IsInitializing();

While in the initialising state, the interpreter should be fully functional
except that:

* Compilation is not allowed (as the parser is not yet configured properly)
* The following attributes in the ``sys`` module are all either missing or
  ``None``:

  * ``sys.path``
  * ``sys.argv``
  * ``sys.executable``
  * ``sys.base_exec_prefix``
  * ``sys.base_prefix``
  * ``sys.exec_prefix``
  * ``sys.prefix``
  * ``sys.warnoptions``
  * ``sys.flags``
  * ``sys.dont_write_bytecode``
  * ``sys.stdin``
  * ``sys.stdout``

* The filesystem encoding is not yet defined
* The IO encoding is not yet defined
* CPython signal handlers are not yet installed
* Only builtin and frozen modules may be imported (due to above limitations)
* ``sys.stderr`` is set to a temporary IO object using unbuffered binary
  mode
* The ``warnings`` module is not yet initialised
* The ``__main__`` module does not yet exist

<TBD: identify any other notable missing functionality>

The main things made available by this step will be the core Python
datatypes, in particular dictionaries, lists and strings. This allows them
to be used safely for all of the remaining configuration steps (unlike the
status quo).

In addition, the current thread will possess a valid Python thread state,
allowing any further configuration data to be stored on the interpreter
object rather than in C process globals.

Any call to Py_BeginInitialization() must have a matching call to
Py_Finalize(). It is acceptable to skip calling Py_EndInitialization() in
between (e.g. if attempting to read the configuration settings fails).

Determining the remaining configuration settings
------------------------------------------------

The next step in the initialisation sequence is to determine the full
settings needed to complete the process. No changes are made to the
interpreter state at this point. The core API for this step is::

    int Py_ReadConfiguration(PyObject *config);

The config argument should be a pointer to a Python dictionary. For any
supported configuration setting already in the dictionary, CPython will
sanity check the supplied value, but otherwise accept it as correct.

Unlike Py_Initialize and Py_BeginInitialization, this call will raise an
exception and report an error return rather than exhibiting fatal errors if
a problem is found with the config data.

Any supported configuration setting which is not already set will be
populated appropriately. The default configuration can be overridden
entirely by setting the value *before* calling Py_ReadConfiguration. The
provided value will then also be used in calculating any settings derived
from that value.

Alternatively, settings may be overridden *after* the Py_ReadConfiguration
call (this can be useful if an embedding application wants to adjust
a setting rather than replace it completely, such as removing
``sys.path[0]``).
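
The "preset values win, missing values are populated, derived values follow
the preset" behaviour can be illustrated with a toy model. Everything here is
an assumption made for the example: the proposal uses a Python dictionary
rather than a C struct, and ``sketch_config``, ``sketch_read_configuration``,
the ``"python"`` default and the ``/toy/bin/`` derivation rule are all
hypothetical. Only the relationship "``executable`` is derived from
``program_name``" comes from the settings list in this PEP.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX 64

/* Toy stand-in for the config dictionary; "" means "not set". */
typedef struct {
    char program_name[SKETCH_MAX];
    char executable[SKETCH_MAX];
} sketch_config;

/* Fill in whatever the caller left unset. Returns 0 on success and -1 on
 * error (an assumed convention; the PEP does not fix the return values). */
int
sketch_read_configuration(sketch_config *cfg)
{
    assert(cfg != NULL);
    if (cfg->program_name[0] == '\0') {
        strcpy(cfg->program_name, "python");  /* toy default */
    }
    if (cfg->executable[0] == '\0') {
        /* Derived setting: computed from program_name, preset or not. */
        int n = snprintf(cfg->executable, SKETCH_MAX, "/toy/bin/%s",
                         cfg->program_name);
        if (n < 0 || n >= SKETCH_MAX) {
            return -1;  /* derived value would not fit */
        }
    }
    return 0;
}
```

In this model, presetting ``program_name`` to ``"spython"`` before the call
yields an ``executable`` of ``/toy/bin/spython``, while presetting
``executable`` leaves it untouched. Adjusting a value after the call
corresponds to editing the populated struct before it is passed on.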

Supported configuration settings
--------------------------------

At least the following configuration settings will be supported::

    raw_argv (list of str, default = retrieved from OS APIs)

    argv (list of str, default = derived from raw_argv)
    warnoptions (list of str, default = derived from raw_argv and environment)
    xoptions (list of str, default = derived from raw_argv and environment)

    program_name (str, default = retrieved from OS APIs)
    executable (str, default = derived from program_name)
    home (str, default = complicated!)
    prefix (str, default = complicated!)
    exec_prefix (str, default = complicated!)
    base_prefix (str, default = complicated!)
    base_exec_prefix (str, default = complicated!)
    path (list of str, default = complicated!)

    io_encoding (str, default = derived from environment or OS APIs)
    fs_encoding (str, default = derived from OS APIs)

    skip_signal_handlers (boolean, default = derived from environment or False)
    ignore_environment (boolean, default = derived from environment or False)
    dont_write_bytecode (boolean, default = derived from environment or False)
    no_site (boolean, default = derived from environment or False)
    no_user_site (boolean, default = derived from environment or False)
    <TBD: at least more from sys.flags need to go here>

Completing the interpreter initialisation
-----------------------------------------

The final step in the process is to actually put the configuration settings
into effect and finish bootstrapping the interpreter up to full operation::

    int Py_EndInitialization(PyObject *config);

Like Py_ReadConfiguration, this call will raise an exception and report an
error return rather than exhibiting fatal errors if a problem is found with
the config data.

After a successful call, Py_IsInitializing() will be false, while
Py_IsInitialized() will become true. The caveats described above for the
interpreter during the initialisation phase will no longer hold.

Stable ABI
----------

All of the APIs proposed in this PEP are excluded from the stable ABI, as
embedding a Python interpreter involves a much higher degree of coupling
than merely writing an extension.

Backwards Compatibility
-----------------------

Backwards compatibility will be preserved primarily by ensuring that
Py_ReadConfiguration() interrogates all the previously defined configuration
settings stored in global variables and environment variables.

One acknowledged incompatibility is that some environment variables which
are currently read lazily may instead be read once during interpreter
initialisation. As the PEP matures, these will be discussed in more detail
on a case by case basis.

The Py_Initialize() style of initialisation will continue to be supported.
It will use the new API internally, but will continue to exhibit the same
behaviour as it does today, ensuring that sys.argv is not set until a
subsequent PySys_SetArgv call.

A System Python Executable
==========================

When executing system utilities with administrative access to a system, many
of the default behaviours of CPython are undesirable, as they may allow
untrusted code to execute with elevated privileges. The most problematic
aspects are the fact that user site directories are enabled,
environment variables are trusted and that the directory containing the
executed file is placed at the beginning of the import path.

Currently, providing a separate executable with different default behaviour
would be prohibitively hard to maintain. One of the goals of this PEP is to
make it possible to replace much of the hard to maintain bootstrapping code
with more normal CPython code, as well as making it easier for a separate
application to make use of key components of ``Py_Main``. Including this
change in the PEP is designed to help avoid acceptance of a design that
sounds good in theory but proves to be problematic in practice.

One final aspect not addressed by the general embedding changes above is
the current inaccessibility of the core logic for deciding between the
different execution modes supported by CPython:

* script execution
* directory/zipfile execution
* command execution ("-c" switch)
* module or package execution ("-m" switch)
* execution from stdin (non-interactive)
* interactive stdin
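
As a rough illustration of what exposing this decision might look like, the
modes listed above can be written as an enum with a selection function. This
is a simplified sketch and not CPython's actual logic: ``sketch_exec_mode``,
``sketch_launch_request`` and ``sketch_select_mode`` are hypothetical names,
and the caller is assumed to have already parsed the command line, checked
whether the path is a directory or zipfile, and queried whether stdin is a
terminal.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    MODE_SCRIPT,       /* script execution */
    MODE_DIR_OR_ZIP,   /* directory/zipfile execution */
    MODE_COMMAND,      /* "-c" switch */
    MODE_MODULE,       /* "-m" switch */
    MODE_STDIN,        /* execution from stdin (non-interactive) */
    MODE_INTERACTIVE   /* interactive stdin */
} sketch_exec_mode;

/* Pre-digested launch information (all fields are assumptions). */
typedef struct {
    const char *command;         /* argument of -c, or NULL */
    const char *module;          /* argument of -m, or NULL */
    const char *filename;        /* positional path, "-" or NULL for stdin */
    int filename_is_dir_or_zip;  /* result of a filesystem check */
    int stdin_is_tty;            /* result of a terminal check */
} sketch_launch_request;

sketch_exec_mode
sketch_select_mode(const sketch_launch_request *req)
{
    assert(req != NULL);
    /* -c and -m each end option processing, so at most one is set. */
    assert(!(req->command != NULL && req->module != NULL));
    if (req->command != NULL) {
        return MODE_COMMAND;
    }
    if (req->module != NULL) {
        return MODE_MODULE;
    }
    if (req->filename != NULL && strcmp(req->filename, "-") != 0) {
        return req->filename_is_dir_or_zip ? MODE_DIR_OR_ZIP : MODE_SCRIPT;
    }
    return req->stdin_is_tty ? MODE_INTERACTIVE : MODE_STDIN;
}
```

Separating the classification from the side effects (filesystem and terminal
checks) is a design choice made for this sketch so that the decision can be
exercised without touching the OS.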

<TBD: concrete proposal for better exposing the __main__ execution step>

Implementation
==============

None as yet. Once I have a reasonably solid plan of attack, I intend to work
on a reference implementation as a feature branch in my BitBucket sandbox [2_].

References
==========

.. [1] CPython interpreter initialization notes
   (http://wiki.python.org/moin/CPythonInterpreterInitialization)

.. [2] BitBucket Sandbox
   (https://bitbucket.org/ncoghlan/cpython_sandbox)

Copyright
=========

This document has been placed in the public domain.

-- 
Nick Coghlan   |   ncoghlan@gmail.com   |   Brisbane, Australia
