[Python-Dev] New Python Initialization API

Tue Apr 9 16:39:59 EDT 2019

On 05Apr2019 0912, Victor Stinner wrote:
> About PyPreConfig and encodings.
> [...]
>>> * ``PyInitError Py_PreInitialize(const PyPreConfig *config)``
>>> * ``PyInitError Py_PreInitializeFromArgs(const PyPreConfig *config,
>>>   int argc, char **argv)``
>>> * ``PyInitError Py_PreInitializeFromWideArgs(const PyPreConfig
>>>   *config, int argc, wchar_t **argv)``
>>
>> I hope to one day be able to support multiple runtimes per process - can
>> we have an opaque PyRuntime object exposed publicly now and passed into
>> these functions?
> 
> I hesitated to include a "_PyRuntimeState*" parameter somewhere, but I
> chose to not do so.
> 
> Currently, there is a single global variable _PyRuntime which has the type
> _PyRuntimeState. The _PyRuntime_Initialize() API is designed around this
> global variable. For example, _PyRuntimeState contains the registry of
> interpreters: you don't want to have multiple registries :-)
> 
> I understood that we should only have a single instance of
> _PyRuntimeState. So IMHO it's fine to keep it private at this point.
> There is no need to expose it in the API.

I didn't want to expose that particular object right now, just some
sort of "void*" parameter in the new APIs (requiring that either NULL
or a known value be passed). That gives us the freedom to enable
multiple runtimes in the future without having to change the API shape.
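
To make the shape concrete, here is a minimal sketch of what I mean. Every
name in it (ExamplePreConfig, ExampleStatus, example_preinitialize) is made
up for illustration and is not part of CPython or of the proposed API:

```c
#include <stddef.h>

/* Hypothetical stand-ins; none of these names exist in CPython. */
typedef struct { int utf8_mode; } ExamplePreConfig;
typedef enum { EXAMPLE_OK = 0, EXAMPLE_ERR_BAD_RUNTIME = 1 } ExampleStatus;

/* The first parameter is reserved for a future per-runtime handle.
   Today only NULL is accepted, meaning "the single process-wide runtime". */
ExampleStatus example_preinitialize(void *runtime, const ExamplePreConfig *config)
{
    if (runtime != NULL) {
        /* Reject unknown values now so they can be given a meaning later. */
        return EXAMPLE_ERR_BAD_RUNTIME;
    }
    (void)config;  /* a real implementation would apply the configuration here */
    return EXAMPLE_OK;
}
```

Callers pass NULL today, and a later release could start accepting a real
handle without changing the signature.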

> FYI I tried to design an internal API with a "context" to pass
> _PyRuntimeState, PyPreConfig, _PyConfig, the current interpreter, etc.
> [...]
> There are 2 possible implementations:
> 
> * Modify *all* functions to add a new "context" parameter and modify *all*
>    functions to pass this parameter to sub-functions.
> * Store the current "context" as a thread local variable or something like
>    that.
> [...]
> For the second option: well, there is no API change needed!
> It can be done later.
> Moreover, we already have such API! PyThreadState_Get() gets the Python
> thread state of the current thread: the current interpreter can be
> accessed from there.

Yes, this is what I had in mind as a transition. I think eventually it 
would be best to have the context parameter, as thread-local variables 
have overhead and add significant complexity (particularly when 
debugging crashes), but making that change is huge.
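
For anyone following along, the two options can be sketched side by side
roughly as below. ExampleContext and the example_* functions are hypothetical
and only stand in for the real internal state; the thread-local variant
assumes a C11 compiler with _Thread_local:

```c
#include <stddef.h>

/* Hypothetical context type, not the real _PyRuntimeState or PyThreadState. */
typedef struct { int interpreter_id; } ExampleContext;

/* Option 1: every function receives the context explicitly and must
   forward it to any sub-function that needs it. */
int example_get_id_explicit(const ExampleContext *ctx)
{
    return ctx->interpreter_id;
}

/* Option 2: the current context lives in thread-local storage (C11),
   so existing function signatures do not change. */
static _Thread_local ExampleContext *example_current_ctx = NULL;

void example_set_current(ExampleContext *ctx)
{
    example_current_ctx = ctx;
}

/* Returns -1 when no context has been set on the calling thread. */
int example_get_id_implicit(void)
{
    return example_current_ctx != NULL ? example_current_ctx->interpreter_id : -1;
}
```

The implicit variant is the easier migration, but the state a function
depends on is no longer visible in its signature.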

>>> ``PyPreConfig`` fields:
>>>
>>> * ``coerce_c_locale_warn``: if non-zero, emit a warning if the C locale
>>>    is coerced.
>>> * ``coerce_c_locale``: if equal to 2, coerce the C locale; if equal to
>>>    1, read the LC_CTYPE to decide if it should be coerced.
>>
>> Can we use another value for coerce_c_locale to determine whether to
>> warn or not? Save a field.
> 
> coerce_c_locale is already complex, it can have 4 values: -1, 0, 1 and 2.
> I prefer to keep a separate field.
> 
> Moreover, I understood that you might want to coerce the C locale *and*
> get the warning, or get the warning but *not* coerce the locale.

If we define meaningful constants, then it doesn't matter how many 
values it has. We could have PY_COERCE_LOCALE_AND_WARN, 
PY_COERCE_LOCALE_SILENTLY, PY_WARN_WITHOUT_COERCE etc. to represent the 
states. These actually make things simpler than trying to reason about 
how two similar parameters interact.
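
As a rough sketch only: the three constant names are the ones suggested
above, while the numeric values, the enum type name and the two helper
functions are assumptions made up for this example. The remaining states are
left out on purpose:

```c
#include <stdbool.h>

/* Names taken from the suggestion above; the numeric values and the
   helpers below are assumptions made for this sketch. */
typedef enum {
    PY_COERCE_LOCALE_AND_WARN = 1,
    PY_COERCE_LOCALE_SILENTLY = 2,
    PY_WARN_WITHOUT_COERCE = 3
} ExampleLocaleCoercion;

/* True for the states in which the C locale is coerced. */
bool example_should_coerce(ExampleLocaleCoercion mode)
{
    return mode == PY_COERCE_LOCALE_AND_WARN || mode == PY_COERCE_LOCALE_SILENTLY;
}

/* True for the states in which a warning is emitted. */
bool example_should_warn(ExampleLocaleCoercion mode)
{
    return mode == PY_COERCE_LOCALE_AND_WARN || mode == PY_WARN_WITHOUT_COERCE;
}
```

One field then carries both decisions, and each state has a name that can
be read without consulting a table of integers.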

>>> * ``legacy_windows_fs_encoding`` (Windows only): if non-zero, set the
>>>    Python filesystem encoding to ``"mbcs"``.
>>> * ``utf8_mode``: if non-zero, enable the UTF-8 mode
>>
>> Why not just set the encodings here?
> 
> For various technical reasons, you simply cannot specify an encoding
> name. You can also pass options to tell Python that you have some
> preferences (PyPreConfig and PyConfig fields).
> 
> Python doesn't support arbitrary combinations of encodings and error
> handlers. In practice, it only supports a narrow set of choices. The main
> implementations are the Py_EncodeLocale() and Py_DecodeLocale() functions,
> which use the C codec of the current locale encoding to implement the
> filesystem encoding before the codec implemented in Python can be used.
> 
> Basically, only the current locale encoding or UTF-8 are supported.
> If you want UTF-8, enable the UTF-8 Mode.

If we already had a trivial way to specify the default encodings as a 
string before any initialization has occurred, I think we would have 
made UTF-8 mode enabled by setting them to "utf-8" rather than a brand 
new flag.

Again, we either have a huge set of flags to infer certain values at 
certain times, or we can just make them directly settable. If we make 
them settable, it's much easier for users to reason about what is going 
to happen.

> To load the Python codec, you need importlib. importlib needs to access
> the filesystem which requires a codec to encode/decode file names
> (PyConfig.module_search_paths uses Unicode wchar_t* strings, but the C API
> only supports bytes char* strings).

Right, and in the few places where we need an encoding *before* we can
load any arbitrary ones, we can easily compare the strings and fail if
someone's trying to do something unusual (or if the platform can do the
lookup itself, it could succeed). If we say "passing NULL means use the
default" then we have that handled, and the actual encoding just gets
set to the real default once we figure out what that is.
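
A minimal sketch of that early check, assuming only NULL and the exact
string "utf-8" are recognised before the codecs machinery is available. The
enum and function names are hypothetical, and name normalisation (case,
"utf8" aliases) is left out:

```c
#include <stddef.h>
#include <string.h>

typedef enum {
    EXAMPLE_CODEC_LOCALE,      /* NULL: use the C codec of the current locale */
    EXAMPLE_CODEC_UTF8,        /* "utf-8": built-in UTF-8 codec */
    EXAMPLE_CODEC_UNSUPPORTED  /* anything else needs the Python-level codecs */
} ExampleEarlyCodec;

/* Decide which built-in codec, if any, can serve the requested name
   before any Python-level codec can be imported. */
ExampleEarlyCodec example_resolve_early_encoding(const char *name)
{
    if (name == NULL) {
        return EXAMPLE_CODEC_LOCALE;
    }
    if (strcmp(name, "utf-8") == 0) {
        return EXAMPLE_CODEC_UTF8;
    }
    return EXAMPLE_CODEC_UNSUPPORTED;
}
```

An embedder asking for anything else would get EXAMPLE_CODEC_UNSUPPORTED
and an error at that point, rather than mojibake later.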

> Py_PreInitialize() doesn't set the filesystem encoding. It initializes the
> LC_CTYPE locale and Python global configuration variables (Py_UTF8Mode and
> Py_LegacyWindowsFSEncodingFlag).

Right, I'm proposing a simplification here where it *does* set the 
filesystem encoding (even though it doesn't get used until 
Py_Initialize() is called). That way we can use the filesystem encoding 
to access the filesystem during initialization, provided it's one of the 
built-in supported ones (e.g. NULL, which means the C locale, or "utf-8" 
which means UTF-8) rather than relying on the tables in the standard 
library.

Oh look, I said all this in my original email:

>> Obviously we are not ready to import most encodings after pre
>> initialization, but I think that's okay. Embedders who set something
>> outside the range of what can be used without importing encodings will
>> get an error to that effect if we try.
> 
> You need a C implementation of the Python filesystem encoding very early
> in Python initialization. You cannot start with one encoding and "later"
> switch the encoding. I tried multiple times over the last 10 years and I
> always failed to do that. All attempts failed with mojibake at different
> levels.

Again, this is for embedders. Regular Python users will only ever 
request "NULL" or "utf-8", depending on the UTF-8 mode flag. And 
embedders have to make sure they get what they ask for and also can't 
change it later.

The problems you've hit in the past have always been to do with trying 
to infer or guess the actual encoding, rather than simply letting 
someone tell you what it is (via config) and letting them deal with the 
failure.

>> In fact, I'd be totally okay with letting embedders specify their own
>> function pointer here to do encoding/decoding between Unicode and the OS
>> preferred encoding.
> 
> In my experience, when someone wants to get a specific encoding: they
> only want UTF-8. There is now the UTF-8 Mode which ignores the locale
> and forces the usage of UTF-8.

Your experience here sounds like it's limited to POSIX systems. I've
wanted UTF-16 before, and would have been able to provide it (if Python
had allowed me to provide a callback to encode/decode).

And again, all this is about "why do we need to define a boolean that 
determines what the encoding is when we can just let people tell us what 
encoding they want". There's a good chance that an embedded Python isn't 
going to touch the real filesystem anyway.

> I'm not sure that there is a need to have a custom codec. Moreover, if
> there is an API to pass a codec in C, you will need to expose it somehow
> at the Python level for os.fsencode() and os.fsdecode().

We need to expose those operations anyway, and os.fsencode/fsdecode have 
their own issues (particularly since there *are* ways to change 
filesystem encoding while running). Turning them into actual native 
functions that might call out to a host-provided callback would not be 
difficult.
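
Something along these lines is what I have in mind for the encode half. This
is a sketch under assumptions, not a proposed API: the callback type, the
registration function and the ASCII-only sample encoder are all hypothetical,
and the fallback simply uses the C library's wcstombs():

```c
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical callback type: encode src into dst (dst_size bytes available,
   including the terminating NUL). Returns the number of bytes written,
   excluding the NUL, or (size_t)-1 on failure. */
typedef size_t (*example_fsencode_fn)(const wchar_t *src, char *dst, size_t dst_size);

static example_fsencode_fn example_host_encoder = NULL;

/* Called by the embedding host before initialization. */
void example_set_fsencoder(example_fsencode_fn fn)
{
    example_host_encoder = fn;
}

/* Native entry point: prefer the host callback, else use the C locale codec. */
size_t example_fsencode(const wchar_t *src, char *dst, size_t dst_size)
{
    if (example_host_encoder != NULL) {
        return example_host_encoder(src, dst, dst_size);
    }
    size_t n = wcstombs(dst, src, dst_size);
    if (n == (size_t)-1 || n >= dst_size) {
        return (size_t)-1;  /* invalid character, or no room for the NUL */
    }
    return n;
}

/* Sample host callback: accepts ASCII only and fails on anything else. */
size_t example_ascii_encoder(const wchar_t *src, char *dst, size_t dst_size)
{
    size_t i = 0;
    for (; src[i] != L'\0'; i++) {
        if ((unsigned long)src[i] > 0x7f || i + 1 >= dst_size) {
            return (size_t)-1;
        }
        dst[i] = (char)src[i];
    }
    if (dst_size == 0) {
        return (size_t)-1;
    }
    dst[i] = '\0';
    return i;
}
```

A Python-level os.fsencode() could then be a thin wrapper over the same
native entry point, so both layers always agree on the encoding.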

> Currently, Python ensures during an early stage of startup that
> codecs.lookup(sys.getfilesystemencoding()) works: there is an existing
> Python codec for the requested filesystem encoding.

Right, it's a validation step. But we can also make 
codecs.lookup("whatever the file system encoding is") return something 
based on os.fsencode() and os.fsdecode(). We're not actually beholden to 
the current implementations here - we are allowed to change them! ;)