[Python-Dev] New Python Initialization API

Fri Apr 5 12:12:50 EDT 2019

About PyPreConfig and encodings.

> The appendix is excellent, by the way. Very useful detail to have
> written down.

Thanks. The appendix is based on the comments in Include/cpython/coreconfig.h,
which is now my reference documentation. By the way, this header file
contains more information about PyConfig fields than PEP 587. For
example, the comment on filesystem_encoding and filesystem_errors lists
every single case and exception (it describes the implementation).

> > ``PyPreConfig`` structure is used to pre-initialize Python:
> >
> > * Set the memory allocator
> > * Configure the LC_CTYPE locale
> > * Set the UTF-8 mode
>
> I think we should have the isolated flag in here - oh wait, we do - I
> think we should have the isolated/use_environment options listed in this
> list :)

My introduction paragraph only explains the changes made by
Py_PreInitialize(): calling Py_PreInitialize() doesn't "isolate"
Python.

PyPreConfig.isolated is used to decide if Python reads environment
variables or not. Examples: PYTHONMALLOC, PYTHONUTF8, PYTHONDEVMODE (which
has an impact on PyPreConfig.allocator), PYTHONCOERCECLOCALE, etc.

That's why isolated and use_environment are present in PyPreConfig and
PyConfig. In practice, values should be equal in both structures.
Moreover, if PyConfig.isolated is equal to 1, Py_InitializeFromConfig()
updates _PyRuntime.preconfig.isolated to 1 ;-)
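
To see the effect of "isolated" on environment variables without writing
any C, here is a minimal sketch using the command line. It assumes that
the -I option is the command-line counterpart of the isolated field, and
it uses PYTHONDEVMODE as the example variable. The helper name
dev_mode_in_child() is made up for the illustration.

```python
import os
import subprocess
import sys


def dev_mode_in_child(extra_args):
    """Run a child interpreter with PYTHONDEVMODE=1 in its environment
    and return the printed value of sys.flags.dev_mode."""
    env = dict(os.environ, PYTHONDEVMODE="1")
    result = subprocess.run(
        [sys.executable, *extra_args, "-c",
         "import sys; print(sys.flags.dev_mode)"],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Default: the child reads PYTHONDEVMODE from its environment.
honours_env = dev_mode_in_child([])
# Isolated mode (-I): PYTHON* environment variables are ignored.
isolated = dev_mode_in_child(["-I"])
print(honours_env, isolated)
```

The same variable is set in both runs; only the isolated run ignores it.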

> > * ``PyInitError Py_PreInitialize(const PyPreConfig *config)``
> > * ``PyInitError Py_PreInitializeFromArgs( const PyPreConfig *config,
> >   int argc, char **argv)``
> > * ``PyInitError Py_PreInitializeFromWideArgs( const PyPreConfig
> >   *config, int argc, wchar_t **argv)``
>
> I hope to one day be able to support multiple runtimes per process - can
> we have an opaque PyRuntime object exposed publicly now and passed into
> these functions?

I hesitated to include a "_PyRuntimeState*" parameter somewhere, but I
chose to not do so.

Currently, there is a single global variable _PyRuntime which has the type
_PyRuntimeState. The _PyRuntime_Initialize() API is designed around this
global variable. For example, _PyRuntimeState contains the registry of
interpreters: you don't want to have multiple registries :-)

I understood that we should only have a single instance of
_PyRuntimeState. So IMHO it's fine to keep it private at this point.
There is no need to expose it in the API.

> (FWIW, I think we're a long way from being able to support multiple
> runtimes *simultaneously*, so the initial implementation here would be
> to have a PyRuntime_Create() that returns our global one once and then
> errors until it's finalised. The migration path is probably to enable
> switching of the current runtime via a dedicated function (i.e. one
> active at a time, probably with thread local storage), since we have no
> "context" parameter for C API functions, and obviously there are still
> complexities such as poorly written extension modules that nonetheless
> can be dealt with in embedding scenarios by simply not using them. This
> doesn't seem like an unrealistic future, *unless* we add a whole lot of
> new APIs now that can't allow it :) )

FYI I tried to design an internal API with a "context" to pass
_PyRuntimeState, PyPreConfig, _PyConfig, the current interpreter, etc.

=> https://bugs.python.org/issue35265

My first need was to pass a memory allocator to Py_DecodeLocale().

There are 2 possible implementations:

* Modify *all* functions to add a new "context" parameter and modify *all*
  functions to pass this parameter to sub-functions.
* Store the current "context" as a thread local variable or something like
  that.

I wrote a proof-of-concept of the first option. The implementation was
very painful to write: a lot of changes which look useless, and a lot
of new private functions which exist only to pass the argument. I had to
modify way too much code. I gave up.

For the second option: well, there is no API change needed!
It can be done later.
Moreover, we already have such an API! PyThreadState_Get() gets the Python
thread state of the current thread: the current interpreter can be
accessed from there.
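
The difference between the two options can be sketched in a few lines. This
is only a Python analogy of the C design question, not the CPython
implementation: Context, set_current_context() and the decode_* functions
are hypothetical names invented for the illustration.

```python
import threading


class Context:
    """Hypothetical bundle of runtime state (here: just an allocator name)."""

    def __init__(self, allocator):
        self.allocator = allocator


# Option 1: every function takes the context and forwards it explicitly.
def decode_explicit(ctx, data):
    return _decode_helper_explicit(ctx, data)


def _decode_helper_explicit(ctx, data):
    return f"{ctx.allocator}:{data.decode('utf-8')}"


# Option 2: the context lives in thread-local storage, so function
# signatures do not change and nothing has to be passed down.
_tls = threading.local()


def set_current_context(ctx):
    _tls.ctx = ctx


def get_current_context():
    return _tls.ctx


def decode_implicit(data):
    return _decode_helper_implicit(data)


def _decode_helper_implicit(data):
    ctx = get_current_context()
    return f"{ctx.allocator}:{data.decode('utf-8')}"


ctx = Context("malloc")
explicit_result = decode_explicit(ctx, b"abc")
set_current_context(ctx)
implicit_result = decode_implicit(b"abc")
```

With option 1, every intermediate function grows a parameter. With
option 2, only the leaf function that needs the context has to look it up.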

> > ``PyPreConfig`` fields:
> >
> > * ``coerce_c_locale_warn``: if non-zero, emit a warning if the C locale
> >   is coerced.
> > * ``coerce_c_locale``: if equals to 2, coerce the C locale; if equals to
> >   1, read the LC_CTYPE to decide if it should be coerced.
>
> Can we use another value for coerce_c_locale to determine whether to
> warn or not? Save a field.

coerce_c_locale is already complex: it can have 4 values (-1, 0, 1 and 2).
I prefer to keep a separate field.

Moreover, I understood that you might want to coerce the C locale *and*
get the warning, or get the warning but *not* coerce the locale.

> > * ``legacy_windows_fs_encoding`` (Windows only): if non-zero, set the
> >   Python filesystem encoding to ``"mbcs"``.
> > * ``utf8_mode``: if non-zero, enable the UTF-8 mode
>
> Why not just set the encodings here?

For various technical reasons, you simply cannot specify an encoding
name. You can only pass options to tell Python that you have some
preferences (PyPreConfig and PyConfig fields).

Python doesn't support arbitrary combinations of encoding and error
handler. In practice, it only supports a narrow set of choices. The main
implementation is the Py_EncodeLocale() and Py_DecodeLocale() functions,
which use the C codec of the current locale encoding to implement the
filesystem encoding before the codec implemented in Python can be used.

Basically, only the current locale encoding or UTF-8 are supported.
If you want UTF-8, enable the UTF-8 Mode.
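
As a quick check from the command line side, this sketch starts a child
interpreter with -X utf8 and reads back the resulting settings. It assumes
Python 3.7 or newer; the helper name fs_settings() is made up for the
illustration.

```python
import subprocess
import sys


def fs_settings(extra_args):
    """Return [utf8_mode, filesystem_encoding] as reported by a child
    interpreter started with the given extra command-line arguments."""
    code = "import sys; print(sys.flags.utf8_mode, sys.getfilesystemencoding())"
    result = subprocess.run(
        [sys.executable, *extra_args, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()


# Enable the UTF-8 Mode explicitly, whatever the current locale is.
utf8_mode, fs_encoding = fs_settings(["-X", "utf8"])
print(utf8_mode, fs_encoding)
```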

To load the Python codec, you need importlib. importlib needs to access
the filesystem which requires a codec to encode/decode file names
(PyConfig.module_search_paths uses Unicode wchar_t* strings, but the C API
only supports bytes char* strings).

Py_PreInitialize() doesn't set the filesystem encoding. It initializes the
LC_CTYPE locale and Python global configuration variables (Py_UTF8Mode and
Py_LegacyWindowsFSEncodingFlag).

> Obviously we are not ready to import most encodings after pre
> initialization, but I think that's okay. Embedders who set something
> outside the range of what can be used without importing encodings will
> get an error to that effect if we try.

You need a C implementation of the Python filesystem encoding very early
in Python initialization. You cannot start with one encoding and "later"
switch the encoding. I tried multiple times over the last 10 years and I always
failed to do that. All attempts failed with mojibake at different
levels.

Unix pays the price of its history. Windows is a very different story:
there are APIs to access the filesystem with Unicode strings, so
there is no such "bootstrap problem" for importlib.

> In fact, I'd be totally okay with letting embedders specify their own
> function pointer here to do encoding/decoding between Unicode and the OS
> preferred encoding.

In my experience, when someone wants a specific encoding, they
only want UTF-8. There is now the UTF-8 Mode, which ignores the locale
and forces the usage of UTF-8.

I'm not sure that there is a need to have a custom codec. Moreover, if
there is an API to pass a codec in C, you will need to expose it somehow
at the Python level for os.fsencode() and os.fsdecode().

Currently, Python ensures at an early stage of startup that
codecs.lookup(sys.getfilesystemencoding()) works: there is an existing
Python codec for the requested filesystem encoding.
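
At the Python level, that guarantee can be observed with a minimal sketch
like this one. The sample bytes are an arbitrary choice (valid UTF-8, so
the round trip is expected to work with either a locale encoding or UTF-8).

```python
import codecs
import os
import sys

# The lookup must succeed: startup already checked that the codec exists.
fs_encoding = sys.getfilesystemencoding()
codec_info = codecs.lookup(fs_encoding)

# os.fsdecode() and os.fsencode() use the filesystem encoding and its
# error handler, so a file name given as bytes survives a round trip.
raw_name = b"caf\xc3\xa9.txt"
round_trip = os.fsencode(os.fsdecode(raw_name))
print(fs_encoding, codec_info.name, round_trip == raw_name)
```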

Victor