New Python Initialization API

Hi,

I would like to add a new C API to initialize Python. I would like your opinion on the whole API before making it public. The code is already implemented. Doc of the new API:

https://pythondev.readthedocs.io/init_config.html

To make the API public, _PyWstrList, _PyInitError, _PyPreConfig, _PyCoreConfig and related functions should be made public. By the way, I would suggest renaming "_PyCoreConfig" to just "PyConfig" :-) I don't think that "core init" vs "main init" is really relevant: more about that below.

Let's start with two examples using the new API.

Example of simple initialization to enable isolated mode:

    _PyCoreConfig config = _PyCoreConfig_INIT;
    config.isolated = 1;

    _PyInitError err = _Py_InitializeFromConfig(&config);
    if (_Py_INIT_FAILED(err)) {
        _Py_ExitInitError(err);
    }
    /* ... use Python API here ... */
    Py_Finalize();

Example using the pre-initialization to enable the UTF-8 Mode (and use the "legacy" Py_Initialize() function):

    _PyPreConfig preconfig = _PyPreConfig_INIT;
    preconfig.utf8_mode = 1;

    _PyInitError err = _Py_PreInitialize(&preconfig);
    if (_Py_INIT_FAILED(err)) {
        _Py_ExitInitError(err);
    }

    /* at this point, Python will only speak UTF-8 */

    Py_Initialize();
    /* ... use Python API here ... */
    Py_Finalize();

Since November 2017, I have been refactoring the Python Initialization code to clean up the code and prepare a new ("better") API to configure Python Initialization. I just fixed the last issues that Nick Coghlan asked me to fix (add a pre-initialization step: done, fix mojibake: done). My work is inspired by Nick Coghlan's PEP 432, but it is not implementing it directly. I had other motivations than Nick, even if we are somehow going in the same direction.

Nick wants to get a half-initialized Python ("core init"), configure Python using the Python API and Python objects, and then finish the implementation ("main init"). I chose a different approach: put *everything* into a single C structure (_PyCoreConfig) using C types.
Using the structure, you should be able to do what Nick wanted to do, but with C rather than Python. Nick: please tell me if I'm wrong :-)

This work is also connected to Eric Snow's work on sub-interpreters (PEP 554) and moving global variables into structures. For example, I'm using his _PyRuntime structure to store a new "preconfig" state (pre-initialization configuration, more about that below).

In November 2017, when I started to work on the Python Initialization (bpo-32030), I identified the following problems:

* Many parts of the code were interdependent.
* Code executed early in Py_Main() used the Python API before the Python API was fully initialized. For example, the code parsing the -W command line option used PyUnicode_FromWideChar() and PyList_Append().
* Error handling used Py_FatalError(), which didn't let the caller decide how to handle the error. Moreover, exit() was used to exit Python, whereas libpython shouldn't do that: a library should not exit the whole process! (Imagine when Python is embedded inside an application.)

One year and a half later, I implemented the following solutions:

* Py_Main() and Py_Initialize() code has been reorganized to respect priorities between global configuration variables (ex: Py_IgnoreEnvironmentFlag), environment variables (ex: PYTHONPATH), command line arguments (ex: -X utf8), configuration files (ex: pyvenv.cfg), and the new _PyPreConfig and _PyCoreConfig structures which store the whole configuration.
* Python Initialization no longer uses the Python API but only C types like wchar_t* strings, a new _PyWstrList structure and the PyMem_RawMalloc() memory allocator (PyMem_Malloc() is no longer used during init).
* The code has been modified to use a new _PyInitError structure. The caller of the top function gets control to clean up everything before handling the error (display a fatal error message or simply exit Python).

The new _PyCoreConfig structure has the top priority and provides a single structure for all configuration parameters.

It becomes possible to override the code computing the "path configuration" like sys.path to fully control where Python looks to import modules. It becomes possible to use an empty list of paths to only allow builtin modules.

A new "pre-initialization" step is responsible for configuring the bare minimum before the Python initialization: memory allocators and encodings (LC_CTYPE locale and the UTF-8 mode). To prevent mojibake, the LC_CTYPE locale is no longer coerced and the UTF-8 Mode is no longer enabled automatically depending on the user configuration. Previously, calling Py_DecodeLocale() to get a Unicode wchar_t* string from a bytes char* string created mojibake when called before Py_Initialize() if the LC_CTYPE locale was coerced and/or if the UTF-8 Mode was enabled.

The pre-initialization step ensures that the encodings and memory allocators are well defined *before* Py_Initialize() is called.

Since the new API is currently private, I didn't document it in Python. Moreover, the code changed a lot last year :-) But it should now be way more stable. I started to document it in a separate web page:

https://pythondev.readthedocs.io/init_config.html

The plan is to put it in the Python documentation once it becomes public.

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.
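
[Editor's sketch] The error-handling idea described above (return a status to the caller instead of calling exit() inside the library) can be illustrated without the Python headers. The following is only a minimal model under stated assumptions: `InitStatus`, `DemoConfig`, `demo_initialize()` and the helper names are hypothetical and are not the private `_PyInitError` API itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an init-error structure returned by value. */
typedef struct {
    const char *err_msg;  /* NULL when there is no error message */
    int exitcode;         /* -1 means "no exit requested" */
} InitStatus;

static InitStatus status_ok(void)
{
    InitStatus s = {NULL, -1};
    return s;
}

static InitStatus status_error(const char *msg)
{
    InitStatus s = {msg, -1};
    return s;
}

static InitStatus status_exit(int code)
{
    InitStatus s = {NULL, code};
    return s;
}

/* True if the caller must stop: either an error or an exit request. */
static int status_failed(InitStatus s)
{
    return s.err_msg != NULL || s.exitcode >= 0;
}

/* Hypothetical configuration with two flags, for illustration only. */
typedef struct {
    int isolated;
    int utf8_mode;
} DemoConfig;

/* The library never exits the process: it reports and lets the caller decide. */
static InitStatus demo_initialize(const DemoConfig *config)
{
    if (config == NULL) {
        return status_error("config is NULL");
    }
    if (config->utf8_mode < 0 || config->utf8_mode > 1) {
        return status_error("utf8_mode must be 0 or 1");
    }
    return status_ok();
}
```

An embedding application would test `status_failed()` and then clean up its own resources before deciding whether to print the message, exit, or continue without Python.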

On Wed, 27 Mar 2019 at 19:35, Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
Would you consider making _Py_UnixMain public as well?
It is useful for high level embedding and not trivial for 3rd parties to reimplement.
Right, Py_Main() is causing a lot of practical issues, especially mojibake because of the C locale coercion (PEP 538) and UTF-8 Mode (PEP 540): both added in Python 3.7. I added that to the Rationale of my PEP 587.

I just fixed the mojibake issue in Python 3.8 by disabling C locale coercion and UTF-8 Mode by default. I'm not sure whether or how Python 3.7 should be fixed in a minor 3.7.x release.

Making _Py_UnixMain() public has already been discussed here:

https://discuss.python.org/t/adding-char-based-apis-for-unix/916

My PEP 587 allows passing command line arguments as bytes (char*) or Unicode (wchar_t*).

Ok, I just added Py_UnixMain() to the PEP (just make it part of the public API).

Victor

Victor Stinner writes:
I just fixed the mojibake issue in Python 3.8 by disabling C locale coercion and UTF-8 Mode by default. I'm not sure if nor how Python 3.7 should be fixed in a minor 3.7.x release.
That sounds like a potential regression. Those two features were added *and turned on by default* (which really means "if you detect LC_CTYPE=C, coerce") to relieve previously existing mojibake/UnicodeError issues due to ASCII-only environments that are difficult to configure (such as containers). Turning them on by default was the controversial part -- it was known that on or off, some environments would have problems, and that's why they needed PEPs. Do those issues return now? If so, where is the PEP rationale for defaulting to "on" faulty?

On Thu, 28 Mar 2019 at 05:27, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Victor Stinner writes:
I just fixed the mojibake issue in Python 3.8 by disabling C locale coercion and UTF-8 Mode by default. I'm not sure if nor how Python 3.7 should be fixed in a minor 3.7.x release.
That sounds like a potential regression. Those two features were added *and turned on by default* (which really means "if you detect LC_CTYPE=C, coerce") to relieve previously existing mojibake/UnicodeError issues due to ASCII-only environments that are difficult to configure (such as containers). Turning them on by default was the controversial part -- it was known that on or off, some environments would have problems, and that's why they needed PEPs. Do those issues return now? If so, where is the PEP rationale for defaulting to "on" faulty?
If you use "python3.8", there is no change. I'm only talking about the specific case of Python embedded in an application: when you use the C API.

Victor

On 27Mar2019 1048, Victor Stinner wrote:
Since November 2017, I'm refactoring the Python Initialization code to cleanup the code and prepare a new ("better") API to configure Python Initialization. I just fixed the last issues that Nick Coghlan asked me to fix (add a pre-initialization step: done, fix mojibake: done). My work is inspired by Nick Coghlan's PEP 432, but it is not implementing it directly. I had other motivations than Nick even if we are somehow going towards the same direction.
I think this should be its own PEP, since as you say it is not implementing the only PEP we have (or alternatively, maybe you should collaborate to update PEP 432 so that it reflects what you think we ought to be implementing).

Having formal writeups of both ideas is important to help decide between the two. It's not good to overrule a PEP by pretending that your change isn't big enough to need its own.

(Not trying to devalue the work you've been doing so far, since it's great! But internal changes are one thing, while updating the public, documented interfaces deserves a more thorough process.)

Cheers, Steve

On Wed, Mar 27, 2019 at 12:39 PM Steve Dower <steve.dower@python.org> wrote:
On 27Mar2019 1048, Victor Stinner wrote:
Since November 2017, I'm refactoring the Python Initialization code to cleanup the code and prepare a new ("better") API to configure Python Initialization. I just fixed the last issues that Nick Coghlan asked me to fix (add a pre-initialization step: done, fix mojibake: done). My work is inspired by Nick Coghlan's PEP 432, but it is not implementing it directly. I had other motivations than Nick even if we are somehow going towards the same direction.
I think this should be its own PEP, since as you say it is not implementing the only PEP we have (or alternatively, maybe you should collaborate to update PEP 432 so that it reflects what you think we ought to be implementing).
I agree that if this isn't doing what PEP 432 set out to do but is going its own way, we should probably discuss it with regard to PEP 432. -Brett
Having formal writeups of both ideas is important to help decide between the two. It's not good to overrule a PEP by pretending that your change isn't big enough to need its own.
(Not trying to devalue the work you've been doing so far, since it's great! But internal changes are one thing, while updating the public, documented interfaces deserves a more thorough process.)
Cheers, Steve

On Wed, 27 Mar 2019 at 21:26, Brett Cannon <brett@python.org> wrote:
On Wed, Mar 27, 2019 at 12:39 PM Steve Dower <steve.dower@python.org> wrote:
I think this should be its own PEP, since as you say it is not implementing the only PEP we have (or alternatively, maybe you should collaborate to update PEP 432 so that it reflects what you think we ought to be implementing).
I agree that if this isn't doing what PEP 432 set out but going its own way we should probably discuss in regards to 432.
I'm sorry, I was in a hurry when I wrote the new PEP 587 and it seems like I created some confusion. My PEP 587 is very similar to PEP 432, because it is basically an implementation of the PEP 432 design. I have been collaborating closely with Nick Coghlan and Eric Snow on the Python Initialization for a year and a half, and I just continued the work they started.

PEP 432 was written in 2012 and has a "broader scope". Since the PEP was written, the code has been modified slowly towards the PEP 432 design, but not "exactly" the expected design, because of concrete practical issues of the implementation.

PEP 587 is the updated *implementation* of PEP 432, which can be seen as the overall *design*.

PEP 587 is only a subset of PEP 432: a C API to initialize Python, whereas PEP 432 goes further by introducing the concepts of "core" and "main" initialization. The "core initialization" is a bare minimum working Python with only builtin types, a partial sys module and no importlib. "Main initialization" is a fully working Python. This part is out of the scope of PEP 587, but PEP 587 should be flexible enough to allow implementing it later.

In fact, there is already a PyConfig._init_config flag (currently named _PyCoreConfig._init_main) which only initializes Python up to the "core initialization" if set to 0. This parameter is private since it's unclear to me what the exact scope of "core" vs "main" init should be.

I wrote a PR to clarify the relationship between PEP 587 and PEP 432:

https://github.com/python/peps/pull/955/files

Victor
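
[Editor's sketch] The "core" versus "main" split controlled by a flag, as described above, can be modelled as a small state machine. This is a hedged illustration only: `PhaseConfig`, `PhasedRuntime` and the function names are hypothetical, and the comments about what each phase provides simply restate the message above.

```c
#include <assert.h>

/* Hypothetical phases, after the "core init" / "main init" wording above. */
typedef enum { PHASE_NONE, PHASE_CORE, PHASE_MAIN } InitPhase;

/* Hypothetical config: init_main == 0 stops after the core phase. */
typedef struct {
    int init_main;
} PhaseConfig;

typedef struct {
    InitPhase phase;
} PhasedRuntime;

/* Always performs the core phase; continues to main only if requested. */
static void phased_init(PhasedRuntime *rt, const PhaseConfig *config)
{
    rt->phase = PHASE_CORE;      /* builtin types, partial sys, no importlib */
    if (config->init_main) {
        rt->phase = PHASE_MAIN;  /* fully working interpreter */
    }
}

/* Finishes initialization later; fails unless the runtime is in the core phase. */
static int phased_finish_main(PhasedRuntime *rt)
{
    if (rt->phase != PHASE_CORE) {
        return -1;
    }
    rt->phase = PHASE_MAIN;
    return 0;
}
```

The open question raised in the thread, namely what exactly belongs to each phase, is not answered by such a model; it only shows where a caller could hook in between the two steps.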

On 28Mar2019 0703, Victor Stinner wrote:
In fact, there is already a PyConfig._init_config flag (currently named _PyCoreConfig._init_main) which only initializes Python up to the "core initialization" if set to 0. This parameter is private since it's unclear to me what should be the exact scope of "core" vs "main" init.
We tried to set up a video call between the interested people (Eric, Nick, myself, yourself, a couple of others) to clarify this point, and you refused to join ;)

That said, the call never happened (honestly, there's not a lot of point in doing it without you being part of it), so we still don't have a clear idea of where the line should be drawn. But there are enough of us with fuzzy but valid ideas in our heads that we really need that brainstorming session to mix them together and find something feasible. Maybe we're best to put it off until PyCon at this point.

Cheers, Steve

The purpose of PEP 587 is to have a working document so everyone can look at the proposed API (and stay focused on the API rather than bothering with the implementation). IMHO it's now time to get more people looking at the Python Initialization.
But there are enough of us with fuzzy but valid ideas in our heads that we really need that brainstorming session to mix them together and find something feasible. Maybe we're best to put it off until PyCon at this point.
The Python 3.8 feature freeze is scheduled for the end of May, less than one month after PyCon. It might be a little bit too late, no?

Would you mind elaborating on these ideas?

Victor

On 29Mar.2019 1830, Victor Stinner wrote:
The purpose of the PEP 587 is to have a working document so everyone can look at the proposed API (stay focused to the API rather than bothering with the implementation). IMHO it's now time to get more people looking at the Python Initialization.
But there are enough of us with fuzzy but valid ideas in our heads that we really need that brainstorming session to mix them together and find something feasible. Maybe we're best to put it off until PyCon at this point.
Python 3.8 feature freeze is scheduled at the end of May, less than one month after the PyCon. It might be a little bit too late, no?
I don't think we want to rush this in for 3.8 at this point anyway. The design of how Python is embedded is one of those things that could drastically affect the scenarios it gets used for in the future (probably half of my tasks at work right now involve embedding CPython), so I'd like to get it right.
Would you mind to elaborate these ideas?
I'd love to, but I don't have them all straight right now, and one of the problems with putting them in writing is I don't get immediate feedback when I'm not being clear enough or if there is missing context.

I know you personally have seen most of my ideas, because I keep pinging you on them ;) My big one is what I posted on capi-sig about being able to classify our APIs better and define scenarios where they are ready for use, as well as breaking up unnecessary dependencies so that embedders have more flexibility (the rings and layers post).

I posted a few examples of how initialization "could" be on various bugs I've had to deal with relating to it, and obviously I've been pushing the embeddable distro for Windows for a while (which is surprisingly popular with a very specific subset of users), as well as using it myself, so there are things that just annoy me enough about what we currently have.

But I really do think this should start as a high bandwidth, in-person brainstorm session to get through the first few big scenarios. Then it'll be easy to open those up to review and let anyone submit their needs for hosting Python. And once we've collated a good set of "needs" we'll have a chance of designing the configuration and initialization APIs that will satisfy most/all of them. Maybe in time for 3.9 (or 3.10, if our RM gets the accelerated release cycle he wants ;) ).

I personally think being able to embed Python easily and safely in other applications will be a powerful feature that will allow many non-developers to write code to get their work done, as we already see with Jupyter (and family). More are coming, but the responsibility is on us to make it successful. I want to get it right.

Cheers, Steve

On Sat, 30 Mar 2019 at 12:45, Steve Dower <steve.dower@python.org> wrote:
On 29Mar.2019 1830, Victor Stinner wrote:
The purpose of the PEP 587 is to have a working document so everyone can look at the proposed API (stay focused to the API rather than bothering with the implementation). IMHO it's now time to get more people looking at the Python Initialization.
But there are enough of us with fuzzy but valid ideas in our heads that we really need that brainstorming session to mix them together and find something feasible. Maybe we're best to put it off until PyCon at this point.
Python 3.8 feature freeze is scheduled at the end of May, less than one month after the PyCon. It might be a little bit too late, no?
I don't think we want to rush this in for 3.8 at this point anyway. The design of how Python is embedded is one of those things that could drastically affect the scenarios it gets used for in the future (probably half of my tasks at work right now involve embedding CPython), so I'd like to get it right.
Victor and I chatted about this, and I think it would be good to get something in to Python 3.8 that gives applications embedding CPython access to the same level of control that we have from the native CPython CLI application - the long and the short of the *nix embedding bug reports that have come in since Python 3.7 was released is that locale coercion and UTF-8 mode don't quite work properly when an application is relying solely on the Py_Initialize() and Py_Main() APIs and doesn't have access to the extra preconfiguration steps that have been added to get everything to work nicely together and avoid mojibake in the native CPython CLI app.

Victor's gone to great lengths to try to make them work, but the unfortunate fact is that by the time they're called, too many other things have often happened in the embedding application for CPython to be able to get all the data completely self-consistent.

Thus the two changes go hand in hand: reverting the old initialization APIs back to their Python 3.6 behaviour to fix the embedding regressions our two PEPs inadvertently introduced for some applications when running in the POSIX locale, while also exposing new initialization APIs so embedding apps can explicitly opt in to behaving the same way as the CPython CLI does. Affected apps would then switch to Python 3.8 at the earliest opportunity, and stop supporting Python 3.7 as the embedded Python version.

The absolute bare minimum version of PEP 587 that we need for that purpose is to expose the PreInitialize API, as that's the one that allows the active text encoding to be set early enough to avoid mojibake:

https://www.python.org/dev/peps/pep-0587/#pre-initialization-with-pypreconfi...

The rest of the proposal in PEP 587 then comes from wanting to publish an API that matches the one we're now using ourselves, rather than coming up with something more speculative.

However, I was also going to suggest that you (Steve) might make a good BDFL-Delegate for these PEPs - there aren't that many of us familiar with this part of the code base, and Victor, Eric, and I are all way too close to the concrete API design to judge it objectively, while you not only maintain the embeddable CPython bundle for Windows, you also have access to users of that bundle that might be able to provide you with additional feedback :)

Cheers, Nick.

P.S. I've also posted a draft update to PEP 432 that modifies it to reflect Victor's extraction of the part we already have as a private API to PEP 587: https://github.com/python/peps/pull/965

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
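
[Editor's sketch] Why the active text encoding has to be fixed before any bytes are decoded can be seen in a tiny experiment: the same two bytes decode to one character under UTF-8 and to two different characters under Latin-1. The decoders below are hypothetical helpers written for this illustration; the UTF-8 one is deliberately minimal (1- and 2-byte sequences only, no overlong-form checks) and is not how CPython decodes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Latin-1: every byte maps to the code point with the same value. */
static size_t decode_latin1(const unsigned char *in, size_t n, uint32_t *out)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i];
    }
    return n;
}

/* Minimal UTF-8 decoder: handles 1- and 2-byte sequences only.
   Returns the number of code points, or (size_t)-1 on anything else. */
static size_t decode_utf8_basic(const unsigned char *in, size_t n, uint32_t *out)
{
    size_t count = 0;
    size_t i = 0;
    while (i < n) {
        unsigned char b = in[i];
        if (b < 0x80) {
            out[count++] = b;
            i += 1;
        }
        else if ((b & 0xE0) == 0xC0 && i + 1 < n && (in[i + 1] & 0xC0) == 0x80) {
            out[count++] = ((uint32_t)(b & 0x1F) << 6) | (uint32_t)(in[i + 1] & 0x3F);
            i += 2;
        }
        else {
            return (size_t)-1;
        }
    }
    return count;
}
```

Decoding the bytes 0xC3 0xA9 with the first function yields two code points (U+00C3, U+00A9), while the second yields the single code point U+00E9. If an application decodes its arguments with one assumption and the runtime later switches to the other, the strings no longer agree, which is the mojibake described above.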

Here is my first review of https://www.python.org/dev/peps/pep-0587/ and in general I think it's very good. There are some things I'd like to consider changing before we embed them permanently in a public API, and as I'm still keen to have the ability to write Python code for configuration (to replace getpath.c, for example) I have a bias towards making that work more smoothly with these changes when we get to it.

I think my biggest point (about halfway down) is that I'd rather use argv/environ/etc. to *initialize* PyConfig and then only use the config for initializing the runtime. That way it's more transparent for users and more difficult for us to add options that embedders can't access.

The appendix is excellent, by the way. Very useful detail to have written down.

Cheers, Steve
``PyWideCharList`` is a list of ``wchar_t*`` strings.
I always forget whether "const" is valid in C99, but if it is, can we make this a list of const strings? I also prefer a name like ``PyWideStringList``, since that's what it is (the other places we use WideChar in the C API refer only to a single string, as far as I'm aware).
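
[Editor's sketch] On the question above: the `const` qualifier has been part of standard C since C89, so it is available in C99. A list of wide strings that owns its copies but hands out read-only pointers could look like the following. The type and function names (`WideStringList`, `wslist_*`) are hypothetical and chosen for this illustration only; they are not the PEP's API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical list of wide strings; the list owns a copy of each item. */
typedef struct {
    size_t length;
    wchar_t **items;
} WideStringList;

/* Appends a copy of item. Returns 0 on success, -1 on allocation failure. */
static int wslist_append(WideStringList *list, const wchar_t *item)
{
    size_t len = wcslen(item) + 1;  /* include the terminating null */
    wchar_t *copy = malloc(len * sizeof(wchar_t));
    if (copy == NULL) {
        return -1;
    }
    memcpy(copy, item, len * sizeof(wchar_t));

    wchar_t **grown = realloc(list->items, (list->length + 1) * sizeof(wchar_t *));
    if (grown == NULL) {
        free(copy);
        return -1;
    }
    grown[list->length] = copy;
    list->items = grown;
    list->length++;
    return 0;
}

/* Read-only access: callers get a pointer to const characters. */
static const wchar_t *wslist_get(const WideStringList *list, size_t index)
{
    return index < list->length ? list->items[index] : NULL;
}

/* Releases every copy and resets the list to the empty state. */
static void wslist_clear(WideStringList *list)
{
    for (size_t i = 0; i < list->length; i++) {
        free(list->items[i]);
    }
    free(list->items);
    list->items = NULL;
    list->length = 0;
}
```

Keeping the stored pointers non-const internally lets the list free its own copies, while the accessor exposes `const wchar_t *` so that callers cannot modify the strings through it.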
``PyInitError`` is a structure to store an error message or an exit code for the Python Initialization.
I love this struct! Currently it's private, but I wonder whether it's worth making it public as PyError (or PyErrorInfo)? We obviously can't replace all uses of int as a return value throughout the API, but I think it has uses elsewhere and we may as well protect against having to rename in the future.
* ``exitcode`` (``int``): if greater or equal to zero, argument passed to ``exit()``
Windows is likely to need/want negative exit codes, as many system errors are represented as 0x80000000|(source of error)|(specific code).
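
[Editor's sketch] The concern above can be checked with plain integer arithmetic: a code built as 0x80000000|(source)|(specific code) has its top bit set, so it is negative once stored in a signed 32-bit integer, and a rule such as "exit only if exitcode >= 0" would never treat it as an exit code. The bit layout and the function names below are simplified assumptions taken from the sentence above, not the exact Windows definition.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Builds 0x80000000 | (source << 16) | code and reinterprets it as signed.
   The field widths are an assumption made for this illustration. */
static int32_t make_system_error(uint32_t source, uint32_t code)
{
    uint32_t bits = 0x80000000u | ((source & 0x7FFFu) << 16) | (code & 0xFFFFu);
    int32_t value;
    memcpy(&value, &bits, sizeof value);  /* well-defined reinterpretation */
    return value;
}

/* The rule quoted from the draft: non-negative values are exit codes. */
static int is_exit_request(int exitcode)
{
    return exitcode >= 0;
}
```

With such codes, a separate "has exit code" flag or a different sentinel would be needed if negative values are to be passed through to `exit()`.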
* ``user_err`` (int): if non-zero, the error is caused by the user configuration, otherwise it's an internal Python error.
Maybe we could just encode this as "positive exitcode is user error, negative is internal error"? I'm pretty sure struct return values are passed by reference in most C calling conventions, so the size of the struct isn't a big deal, but without a proper bool type it may look like this is a second error code (like errno/winerror in a few places).
``PyPreConfig`` structure is used to pre-initialize Python:
* Set the memory allocator
* Configure the LC_CTYPE locale
* Set the UTF-8 mode
I think we should have the isolated flag in here - oh wait, we do - I think we should have the isolated/use_environment options listed in this list :)
Functions to pre-initialize Python:
* ``PyInitError Py_PreInitialize(const PyPreConfig *config)``
* ``PyInitError Py_PreInitializeFromArgs(const PyPreConfig *config, int argc, char **argv)``
* ``PyInitError Py_PreInitializeFromWideArgs(const PyPreConfig *config, int argc, wchar_t **argv)``
I hope to one day be able to support multiple runtimes per process - can we have an opaque PyRuntime object exposed publicly now and passed into these functions? (FWIW, I think we're a long way from being able to support multiple runtimes *simultaneously*, so the initial implementation here would be to have a PyRuntime_Create() that returns our global one once and then errors until it's finalised. The migration path is probably to enable switching of the current runtime via a dedicated function (i.e. one active at a time, probably with thread local storage), since we have no "context" parameter for C API functions, and obviously there are still complexities such as poorly written extension modules that nonetheless can be dealt with in embedding scenarios by simply not using them. This doesn't seem like an unrealistic future, *unless* we add a whole lot of new APIs now that can't allow it :) )
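
[Editor's sketch] The "create returns our global one once and then errors until it's finalised" behaviour suggested above can be modelled in a few lines. Everything here is hypothetical (`DemoRuntime`, `demo_runtime_create()`, and so on); it only illustrates the shape of an opaque handle passed into the initialization functions, not a proposed implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Callers would only see this forward declaration (opaque handle). */
typedef struct DemoRuntime DemoRuntime;

/* Private definition: a single global runtime for now. */
struct DemoRuntime {
    int in_use;
};

static DemoRuntime global_runtime = {0};

/* Returns the global runtime once; fails until it has been finalized. */
static DemoRuntime *demo_runtime_create(void)
{
    if (global_runtime.in_use) {
        return NULL;
    }
    global_runtime.in_use = 1;
    return &global_runtime;
}

/* Makes the global runtime available again. */
static void demo_runtime_finalize(DemoRuntime *runtime)
{
    if (runtime != NULL) {
        runtime->in_use = 0;
    }
}

/* Initialization functions take the handle explicitly. */
static int demo_preinitialize(DemoRuntime *runtime)
{
    return (runtime != NULL && runtime->in_use) ? 0 : -1;
}
```

Because every entry point already takes the handle, a later version could return distinct runtimes from the create function without changing any signatures.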
``PyPreConfig`` fields:
* ``coerce_c_locale_warn``: if non-zero, emit a warning if the C locale is coerced.
* ``coerce_c_locale``: if equals to 2, coerce the C locale; if equals to 1, read the LC_CTYPE to decide if it should be coerced.
Can we use another value for coerce_c_locale to determine whether to warn or not? Save a field.
* ``legacy_windows_fs_encoding`` (Windows only): if non-zero, set the Python filesystem encoding to ``"mbcs"``.
* ``utf8_mode``: if non-zero, enable the UTF-8 mode
Why not just set the encodings here? The "PreInitializeFromArgs" functions can override it based on the other variables we have, and embedders have a more obvious question to answer than "do I want legacy behaviour in my app".

Obviously we are not ready to import most encodings after pre initialization, but I think that's okay. Embedders who set something outside the range of what can be used without importing encodings will get an error to that effect if we try.

In fact, I'd be totally okay with letting embedders specify their own function pointer here to do encoding/decoding between Unicode and the OS preferred encoding. That would let them use any approach they like. Similarly for otherwise-unprintable messages. (On an earlier project when I was embedding Python into Windows Store apps - back when their API was more heavily restricted - this would have been very helpful.)
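
[Editor's sketch] The function-pointer idea above might look like the following: the pre-configuration carries an optional decode hook, and a built-in ASCII-only decoder is used when the embedder supplies none. The struct, the hook signature and all names are hypothetical and chosen for this illustration; nothing like this is part of the PEP as quoted.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical hook: decodes a null-terminated byte string into out.
   Returns the number of wide characters written (excluding the null),
   or (size_t)-1 on failure. */
typedef size_t (*decode_hook)(const char *in, wchar_t *out, size_t out_len);

/* Hypothetical pre-configuration carrying an optional hook. */
typedef struct {
    decode_hook decode;  /* NULL selects the built-in ASCII decoder */
} DemoPreConfig;

/* Built-in fallback: accepts 7-bit ASCII only. */
static size_t decode_ascii(const char *in, wchar_t *out, size_t out_len)
{
    size_t i = 0;
    if (out_len == 0) {
        return (size_t)-1;
    }
    for (; in[i] != '\0'; i++) {
        if ((unsigned char)in[i] >= 0x80 || i + 1 >= out_len) {
            return (size_t)-1;
        }
        out[i] = (wchar_t)in[i];
    }
    out[i] = L'\0';
    return i;
}

/* Example embedder hook: treats each byte as a Latin-1 code point. */
static size_t decode_latin1_hook(const char *in, wchar_t *out, size_t out_len)
{
    size_t i = 0;
    if (out_len == 0) {
        return (size_t)-1;
    }
    for (; in[i] != '\0'; i++) {
        if (i + 1 >= out_len) {
            return (size_t)-1;
        }
        out[i] = (wchar_t)(unsigned char)in[i];
    }
    out[i] = L'\0';
    return i;
}

/* Dispatches to the embedder's hook when present. */
static size_t demo_decode(const DemoPreConfig *config, const char *in,
                          wchar_t *out, size_t out_len)
{
    decode_hook hook = config->decode != NULL ? config->decode : decode_ascii;
    return hook(in, out, out_len);
}
```

With no hook installed, a non-ASCII byte is rejected; with `decode_latin1_hook` installed, the same input decodes successfully, which is the kind of control the paragraph above asks for.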
Example of Python initialization enabling the isolated mode::
    PyConfig config = PyConfig_INIT;
    config.isolated = 1;
Haven't we already used external values by this point that should have been isolated? I'd rather have the isolation up front. Or better yet, make isolation the default unless you call one of the "FromArgs" functions, and then we don't actually need the config setting at all.
PyConfig fields:
Before I start discussing individual groups of fields, I would really like to see the following significant change here (because it'll help keep us honest and prevent us breaking embedders in the future). Currently you have three functions, that take a PyConfig and optionally also use the environment/argv to figure out the settings:
* ``PyInitError Py_InitializeFromConfig(const PyConfig *config)``
* ``PyInitError Py_InitializeFromArgs(const PyConfig *config, int argc, char **argv)``
* ``PyInitError Py_InitializeFromWideArgs(const PyConfig *config, int argc, wchar_t **argv)``
I would much prefer to see this flipped around, so that there is one initialize function taking PyConfig, and two functions that will fill out the PyConfig based on the environment:

(note two of the "const"s are gone)

* ``PyInitError Py_SetConfigFromArgs(PyConfig *config, int argc, char **argv)``
* ``PyInitError Py_SetConfigFromWideArgs(PyConfig *config, int argc, wchar_t **argv)``
* ``PyInitError Py_InitializeFromConfig(const PyConfig *config)``

This means that callers who want to behave like Python will request an equivalent config and then use it. Those callers who want to be *nearly* like Python can change things in between. And the precedence rules get simpler because Py_SetConfig* just overwrites anything:

    PyConfig config = PyConfig_INIT;
    Py_SetConfigFromWideArgs(&config, argc, argv);
    /* optionally change any settings */
    Py_InitializeFromConfig(&config);

We could even split out PyMainConfig here and have another function to collect the settings to pass to Py_RunMain() (such as the script or module name, things to print on exit, etc.).

Our python.c then uses this configuration, so it gets a few lines longer than at present. But it becomes a more useful example for people who want a nearly-like-Python version, and also ensures that any new configuration we support will be available to embedders.

So with that said, here's what I think about the fields:
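
[Editor's sketch] The "fill the config from argv, adjust it, then initialize from the config only" flow proposed above can be modelled without the Python headers. `ArgsConfig` and both functions are hypothetical; the three options recognised (`-I`, `-q`, `-S`) are real CPython command line options, but the parsing here is a deliberately tiny stand-in.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical configuration with three flags. */
typedef struct {
    int isolated;     /* -I */
    int quiet;        /* -q */
    int site_import;  /* cleared by -S */
} ArgsConfig;

#define ARGS_CONFIG_INIT {0, 0, 1}

/* Step 1: overwrite fields from the command line. Returns -1 on an
   unknown option. Parsing stops at the first non-option argument. */
static int demo_set_config_from_args(ArgsConfig *config, int argc, char **argv)
{
    for (int i = 1; i < argc; i++) {
        if (strcmp(argv[i], "-I") == 0) {
            config->isolated = 1;
        }
        else if (strcmp(argv[i], "-q") == 0) {
            config->quiet = 1;
        }
        else if (strcmp(argv[i], "-S") == 0) {
            config->site_import = 0;
        }
        else if (argv[i][0] == '-') {
            return -1;
        }
        else {
            break;  /* script name: the rest belongs to the script */
        }
    }
    return 0;
}

/* Step 2: initialization reads the config and nothing else. */
static int demo_initialize_from_config(const ArgsConfig *config)
{
    int valid = (config->isolated == 0 || config->isolated == 1)
             && (config->quiet == 0 || config->quiet == 1)
             && (config->site_import == 0 || config->site_import == 1);
    return valid ? 0 : -1;
}
```

Between the two steps the caller can override any field, which is the "nearly like Python" case: every setting the command line can reach is also reachable by an embedder through the struct.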
* ``argv``: ``sys.argv``
* ``base_exec_prefix``: ``sys.base_exec_prefix``
* ``base_prefix``: ``sys.base_prefix``
* ``exec_prefix``: ``sys.exec_prefix``
* ``executable``: ``sys.executable``
* ``prefix``: ``sys.prefix``
* ``xoptions``: ``sys._xoptions``
I like all of these, as they nicely map to their sys members.
* ``module_search_path_env``: ``PYTHONPATH`` environment variable value
* ``module_search_paths``, ``use_module_search_paths``: ``sys.path``
Why not just have "path" to mirror sys.path? If we go to a Py_SetConfig approach then all the logic for inferring the path is in there, and if we don't then the FromArgs() function will either totally override or ignore it. Either way, no embedder needs to set these individually.
* ``home``: Python home
* ``program_name``: Program name
* ``program``: ``argv[0]`` or an empty string
* ``user_site_directory``: if non-zero, add user site directory to ``sys.path``
Similarly, these play a role when inferring the regular Python sys.path, but are not something you should need to use when embedding.
* ``dll_path`` (Windows only): Windows DLL path
I'd have to look up exactly how this is used, but I don't think it's configurable (certainly not by this point where it's already been loaded).
* ``filesystem_encoding``: Filesystem encoding, ``sys.getfilesystemencoding()``
* ``filesystem_errors``: Filesystem encoding errors, ``sys.getfilesystemencodeerrors()``
See above. If we need these earlier in initialization (and I agree we probably do), then let's just put them in the pre-initialize config.
* ``dump_refs``: if non-zero, display all objects still alive at exit
* ``inspect``: enter interactive mode after executing a script or a command
* ``interactive``: interactive mode
* ``malloc_stats``: if non-zero, dump memory allocation statistics at exit
* ``quiet``: quiet mode (ex: don't display the copyright and version messages even in interactive mode)
* ``show_alloc_count``: show allocation counts at exit
* ``show_ref_count``: show total reference count at exit
* ``site_import``: import the ``site`` module at startup?
* ``skip_source_first_line``: skip the first line of the source
These all seem to be flags specific to regular Python and not embedders (providing we have ways for embedders to achieve the same thing through their own API calls, but even if we don't, I'd still rather not codify them as public runtime API). Having them be an optional flag for Py_RunMain() or a PyMainConfig struct would keep them closer to where they belong.
* ``faulthandler``: if non-zero, call ``faulthandler.enable()``
* ``install_signal_handlers``: install signal handlers?
* ``tracemalloc``: if non-zero, call ``tracemalloc.start(value)``
Probably it's not a good idea to install signal handlers too early, but I think the other two should be able to start any time between pre-init and actual init, no? So make PyFaultHandler_Init() and PyTraceMalloc_Init() public and say "if you want them, call them".
* ``run_command``: ``-c COMMAND`` argument
* ``run_filename``: ``python3 SCRIPT`` argument
* ``run_module``: ``python3 -m MODULE`` argument
I think these should be in a separate configuration struct for Py_RunMain(), probably along with the section above too. (The other fields all seem fine to me.)
This PEP adds a new ``Py_UnixMain()`` function which takes command line arguments as bytes::
    int Py_UnixMain(int argc, char **argv)
I was part of the discussion about this, so I have some more context than what's in the PEP. Given your next example shows this function would be about six lines long, why do we want to add it? Better to deprecate Py_Main() completely and just copy/paste those six lines from elsewhere (which also helps in the case where you need to do one little thing more and end up having to find those six lines anyway, as in the same example).
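
[Editor's sketch] The "about six lines" wrapper discussed above boils down to decoding each `char *` argument and handing the result to the wide-character entry point. In this hedged model the standard `mbstowcs()` stands in for the locale-aware decoding that Python would use, and `demo_wide_main()` is a hypothetical stand-in for the wide entry point that simply counts option-like arguments so the result can be observed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical wide-character entry point: counts arguments starting with '-'. */
static int demo_wide_main(int argc, wchar_t **argv)
{
    int options = 0;
    for (int i = 1; i < argc; i++) {
        if (argv[i][0] == L'-') {
            options++;
        }
    }
    return options;
}

/* Decodes one argument with the current C locale. A multibyte string never
   decodes to more wide characters than it has bytes, so strlen()+1 suffices. */
static wchar_t *decode_arg(const char *arg)
{
    size_t max = strlen(arg) + 1;
    wchar_t *wide = malloc(max * sizeof(wchar_t));
    if (wide == NULL) {
        return NULL;
    }
    if (mbstowcs(wide, arg, max) == (size_t)-1) {
        free(wide);
        return NULL;
    }
    return wide;
}

/* Bytes-based wrapper: decode, call the wide entry point, release. */
static int demo_unix_main(int argc, char **argv)
{
    int result = -1;
    int decoded = 0;
    wchar_t **wargv = calloc((size_t)argc + 1, sizeof(wchar_t *));
    if (wargv == NULL) {
        return -1;
    }
    for (; decoded < argc; decoded++) {
        wargv[decoded] = decode_arg(argv[decoded]);
        if (wargv[decoded] == NULL) {
            break;
        }
    }
    if (decoded == argc) {
        result = demo_wide_main(argc, wargv);
    }
    for (int i = 0; i < decoded; i++) {
        free(wargv[i]);
    }
    free(wargv);
    return result;
}
```

The wrapper is short, but it also shows why the decoding step depends on the locale being configured first: `mbstowcs()` interprets the bytes according to whatever LC_CTYPE is active when it runs.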
Open Questions
==============
* Do we need to add an API for import ``inittab``?
I don't think so. Once embedders have a three-step initialization (pre-init, init, runmain) there are two valid places they can call their own init functions.
* What about the stable ABI? Should we add a version into ``PyPreConfig`` and ``PyConfig`` structures somehow? The Windows API is known for its ABI stability and it stores the structure size into the structure directly. Do the same?
Yeah, I think so. We already have the Py*_INIT macros, so we could just use a version number that we increment manually, and that will let us optionally ignore added members (or deprecated members that have not been physically removed). (Though I'd rather make it easier for applications to include a local copy of Python rather than have to safely deal with whatever is installed on the system. But that's a Windows-ism, so making both approaches work is important :) )
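The manually incremented version number mentioned above can be sketched in plain C. Everything here is hypothetical (``demo_config``, ``DEMO_CONFIG_V1``, ``DEMO_CONFIG_V2``, ``demo_config_normalize``) and is not the PEP 587 layout; it only illustrates how a version field set by the ``*_INIT`` macro lets a newer library ignore members that an older caller never knew about.

```c
#include <string.h>

/* Hypothetical versioned config; NOT the real PyConfig layout. */
#define DEMO_CONFIG_V1 1   /* fields: isolated */
#define DEMO_CONFIG_V2 2   /* adds:   utf8_mode */

typedef struct {
    int version;    /* written by the caller's INIT macro at compile time */
    int isolated;   /* present since V1 */
    int utf8_mode;  /* present since V2; must not be trusted for V1 callers */
} demo_config;

#define DEMO_CONFIG_INIT {DEMO_CONFIG_V2, 0, 0}

/* Copy `in` to `out`, reading only the fields that existed in in->version.
   Returns 0 on success, -1 if the version is unknown to this library. */
int demo_config_normalize(const demo_config *in, demo_config *out)
{
    if (in->version < DEMO_CONFIG_V1 || in->version > DEMO_CONFIG_V2) {
        return -1;
    }
    memset(out, 0, sizeof(*out));
    out->version = DEMO_CONFIG_V2;
    out->isolated = in->isolated;
    if (in->version >= DEMO_CONFIG_V2) {
        /* Only callers compiled against V2 are known to have set this. */
        out->utf8_mode = in->utf8_mode;
    }
    return 0;
}
```

Storing the structure size instead of a version, as the Windows API does, would follow the same pattern: the library checks the field before touching any member that might lie beyond what the caller allocated.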
* The PEP 432 stores ``PYTHONCASEOK`` into the config. Do we need to add something for that into ``PyConfig``? How would it be exposed at the Python level for ``importlib``? Passed as an argument to ``importlib._bootstrap._setup()`` maybe? It can be added later if needed.
Could we convert it into an xoption? It's very rarely used, to my knowledge.
* ``python._pth`` (Windows only)
I'd like to make this for all platforms, though my understanding was that since Python is not really relocatable on other OS's it isn't so important. And if the rest of this change happens it'll be easier to implement anyway :)

On Sun, 31 March 2019 at 01:49, Steve Dower <steve.dower@python.org> wrote:
Here is my first review of https://www.python.org/dev/peps/pep-0587/ and in general I think it's very good.
Ah nice, that's a good start :-) Thanks for reviewing it. Your email is long, and my answer makes it even longer, so I will reply in multiple emails.
``PyWideCharList`` is a list of ``wchar_t*`` strings.
I always forget whether "const" is valid in C99, but if it is, can we make this a list of const strings?
Short answer: no :-( This structure mostly exists to simplify the implementation. Sadly, "const PyWideCharList" doesn't automatically make PyWideCharList.items an array of "const wchar_t*". I tried some hacks to have an array of const strings... but it would be very complicated and not natural at all in C. Sadly, it's much simpler to have "wchar_t*" in practice.
I also prefer a name like ``PyWideStringList``, since that's what it is (the other places we use WideChar in the C API refer only to a single string, as far as I'm aware).
I'm fine with this name.
``PyInitError`` is a structure to store an error message or an exit code for the Python Initialization.
I love this struct! Currently it's private, but I wonder whether it's worth making it public as PyError (or PyErrorInfo)?
PEP 587 makes the structure public, but I'm not sure about calling it PyError because PyInitError also allows exiting Python with an exit status, which is something specific to the initialization. If you want to use a structure to report errors, I would prefer to add a new simpler PyError structure to only report an error message, but never exit Python. The PyInitError use case is really specific to Python initialization. Moreover, the API is inefficient since it is returned by copy, not by reference. That's fine for Python initialization which only happens once and is not part of "hot code". I'm not sure if PyError would need to store the C function name where the error is triggered. Usually, we try hard to hide Python internals from the user.
* ``exitcode`` (``int``): if greater than or equal to zero, argument passed to ``exit()``
Windows is likely to need/want negative exit codes, as many system errors are represented as 0x80000000|(source of error)|(specific code).
Hum, int was used in the Python 3.6 code base. We can change the PyInitError.exitcode type to DWORD on Windows, but use int on Unix. We can add a private field to check if the error is an error message or an exit code. Or maybe check if the error message is NULL or not. Py_INIT_ERR(MSG) must never be called with Py_INIT_ERR(NULL) and it should be called with a static string, not with a dynamically allocated string (since the API doesn't allow releasing memory).
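As a rough illustration of the "private field" idea, here is a minimal sketch in which a tag, rather than the sign of the exit code, tells an error message apart from an exit request. The type and function names (``demo_init_error``, ``demo_init_err``, and so on) are hypothetical stand-ins and not the PyInitError API.

```c
#include <stddef.h>

/* Hypothetical stand-in for PyInitError; all names are illustrative only. */
typedef enum {
    DEMO_INIT_OK = 0,
    DEMO_INIT_ERROR,   /* err_msg is meaningful */
    DEMO_INIT_EXIT     /* exitcode is meaningful and may be negative */
} demo_init_kind;

typedef struct {
    demo_init_kind kind;   /* the "private field" that tells the cases apart */
    const char *err_msg;   /* must point to a static string; never freed */
    int exitcode;
} demo_init_error;

demo_init_error demo_init_ok(void)
{
    demo_init_error err = {DEMO_INIT_OK, NULL, 0};
    return err;
}

demo_init_error demo_init_err(const char *static_msg)
{
    demo_init_error err = {DEMO_INIT_ERROR, static_msg, 0};
    return err;
}

demo_init_error demo_init_exit(int exitcode)
{
    demo_init_error err = {DEMO_INIT_EXIT, NULL, exitcode};
    return err;
}

/* Non-zero if initialization must stop (error message or exit request). */
int demo_init_failed(demo_init_error err)
{
    return err.kind != DEMO_INIT_OK;
}
```

With an explicit tag, a negative exit code is no longer ambiguous, which would accommodate the Windows-style codes mentioned in the review.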
* ``user_err`` (int): if non-zero, the error is caused by the user configuration, otherwise it's an internal Python error.
Maybe we could just encode this as "positive exitcode is user error, negative is internal error"? I'm pretty sure struct return values are passed by reference in most C calling conventions, so the size of the struct isn't a big deal, but without a proper bool type it may look like this is a second error code (like errno/winerror in a few places).
Honestly, I'm not sure that we really have to distinguish "user error" and "internal error". It's an old debate about calling abort()/DebugBreak() or not. It seems like most users are annoyed by getting a coredump on Unix when abort() is called. Maybe we should just remove Py_INIT_USER_ERR(), always use Py_INIT_ERR(), and never call abort()/DebugBreak() in Py_ExitInitError(). Or does anyone see a good reason to trigger a debugger on an initialization error? See https://bugs.python.org/issue19983 discussion: "When interrupted during startup, Python should not call abort() but exit()" Note: I'm not talking about Py_FatalError() here, this one will not change. Victor

Thanks for the replies. Anything I don't comment on means that I agree with you :) On 05Apr2019 0900, Victor Stinner wrote:
Honestly, I'm not sure that we really have to distinguish "user error" and "internal error". It's an old debate about calling abort()/DebugBreak() or not. It seems like most users are annoyed by getting a coredump on Unix when abort() is called.
I'm also annoyed by the crash reports on Windows when "encodings" cannot be found (because occasionally there are enough of them that the Windows team starts reviewing the issue, and I get pulled in to review and resolve their bugs).
Maybe we should just remove Py_INIT_USER_ERR(), always use Py_INIT_ERR(), and never call abort()/DebugBreak() in Py_ExitInitError().
Not calling abort() sounds fine to me. Embedders would likely prefer an error code rather than a crash, but IIRC they'd have to call Py_ExitInitError() to get the crash anyway, right?
Or does anyone see a good reason to trigger a debugger on an initialization error?
Only before returning from the point where the error occurs. By the time you've returned the error value all the useful context is gone.
Note: I'm not talking about Py_FatalError() here, this one will not change.
Does this get called as part of initialization? If not, I'm fine with it not changing. Cheers, Steve

About PyPreConfig and encodings.
The appendix is excellent, by the way. Very useful detail to have written down.
Thanks. The appendix is based on the comments in Include/cpython/coreconfig.h, which is now my reference documentation. By the way, this header file contains more information about PyConfig fields than PEP 587. For example, the comment on filesystem_encoding and filesystem_errors lists every single case and exception (it describes the implementation).
``PyPreConfig`` structure is used to pre-initialize Python:
* Set the memory allocator
* Configure the LC_CTYPE locale
* Set the UTF-8 mode
I think we should have the isolated flag in here - oh wait, we do - I think we should have the isolated/use_environment options listed in this list :)
My introduction paragraph only explains the changes made by Py_PreInitialize(): calling Py_PreInitialize() doesn't "isolate" Python. PyPreConfig.isolated is used to decide if Python reads environment variables or not. Examples: PYTHONMALLOC, PYTHONUTF8, PYTHONDEVMODE (which has an impact on PyPreConfig.allocator), PYTHONCOERCECLOCALE, etc. That's why isolated and use_environment are present in PyPreConfig and PyConfig. In practice, values should be equal in both structures. Moreover, if PyConfig.isolated is equal to 1, Py_InitializeFromConfig() updates _PyRuntime.preconfig.isolated to 1 ;-)
* ``PyInitError Py_PreInitialize(const PyPreConfig *config)``
* ``PyInitError Py_PreInitializeFromArgs(const PyPreConfig *config, int argc, char **argv)``
* ``PyInitError Py_PreInitializeFromWideArgs(const PyPreConfig *config, int argc, wchar_t **argv)``
I hope to one day be able to support multiple runtimes per process - can we have an opaque PyRuntime object exposed publicly now and passed into these functions?
I hesitated to include a "_PyRuntimeState*" parameter somewhere, but I chose to not do so. Currently, there is a single global variable _PyRuntime which has the type _PyRuntimeState. The _PyRuntime_Initialize() API is designed around this global variable. For example, _PyRuntimeState contains the registry of interpreters: you don't want to have multiple registries :-) I understood that we should only have a single instance of _PyRuntimeState. So IMHO it's fine to keep it private at this point. There is no need to expose it in the API.
(FWIW, I think we're a long way from being able to support multiple runtimes *simultaneously*, so the initial implementation here would be to have a PyRuntime_Create() that returns our global one once and then errors until it's finalised. The migration path is probably to enable switching of the current runtime via a dedicated function (i.e. one active at a time, probably with thread local storage), since we have no "context" parameter for C API functions, and obviously there are still complexities such as poorly written extension modules that nonetheless can be dealt with in embedding scenarios by simply not using them. This doesn't seem like an unrealistic future, *unless* we add a whole lot of new APIs now that can't allow it :) )
FYI I tried to design an internal API with a "context" to pass _PyRuntimeState, PyPreConfig, _PyConfig, the current interpreter, etc. => https://bugs.python.org/issue35265 My first need was to pass a memory allocator to Py_DecodeLocale(). There are 2 possible implementations:
* Modify *all* functions to add a new "context" parameter and modify *all* functions to pass this parameter to sub-functions.
* Store the current "context" as a thread local variable or something like that.
I wrote a proof-of-concept of the first option: the implementation was very painful to write: a lot of changes that look useless, and a lot of new private functions that exist only to pass the argument along. I had to modify way too much code. I gave up. For the second option: well, there is no API change needed! It can be done later. Moreover, we already have such an API! PyThreadState_Get() gets the Python thread state of the current thread: the current interpreter can be accessed from there.
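The second option (a per-thread "current context") might look roughly like the following. This is only a sketch under assumptions: ``demo_context`` and the two functions are invented for illustration, and C11 ``_Thread_local`` is used here in place of whatever mechanism CPython would actually choose.

```c
#include <stddef.h>

/* Hypothetical context object. In the discussion, the real candidates are
   _PyRuntimeState, the configs, the current interpreter, etc. */
typedef struct {
    int utf8_mode;
} demo_context;

/* One "current context" pointer per thread (C11 thread-local storage). */
static _Thread_local demo_context *demo_current_context = NULL;

/* Install `ctx` as the current context of this thread; return the old one. */
demo_context *demo_context_swap(demo_context *ctx)
{
    demo_context *old = demo_current_context;
    demo_current_context = ctx;
    return old;
}

/* A deeply nested helper can consult the context without any function on
   the call path having to take it as a parameter. */
int demo_decode_uses_utf8(void)
{
    return demo_current_context != NULL && demo_current_context->utf8_mode;
}
```

The trade-off is the one described above: no signatures change, at the price of hidden state that every caller has to set up correctly before use.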
``PyPreConfig`` fields:
* ``coerce_c_locale_warn``: if non-zero, emit a warning if the C locale is coerced.
* ``coerce_c_locale``: if equal to 2, coerce the C locale; if equal to 1, read the LC_CTYPE to decide if it should be coerced.
Can we use another value for coerce_c_locale to determine whether to warn or not? Save a field.
coerce_c_locale is already complex, it can have 4 values: -1, 0, 1 and 2. I prefer to keep a separate field. Moreover, I understood that you might want to coerce the C locale *and* get the warning, or get the warning but *not* coerce the locale.
* ``legacy_windows_fs_encoding`` (Windows only): if non-zero, set the Python filesystem encoding to ``"mbcs"``.
* ``utf8_mode``: if non-zero, enable the UTF-8 mode
Why not just set the encodings here?
For different technical reasons, you simply cannot specify an encoding name. You can also pass options to tell Python that you have some preferences (PyPreConfig and PyConfig fields). Python doesn't support arbitrary combinations of encoding and error handler. In practice, it only supports a narrow set of choices. The main implementation is the Py_EncodeLocale() and Py_DecodeLocale() functions, which use the C codec of the current locale encoding to implement the filesystem encoding, before the codec implemented in Python can be used. Basically, only the current locale encoding or UTF-8 are supported. If you want UTF-8, enable the UTF-8 Mode. To load the Python codec, you need importlib. importlib needs to access the filesystem which requires a codec to encode/decode file names (PyConfig.module_search_paths uses Unicode wchar_t* strings, but the C API only supports bytes char* strings). Py_PreInitialize() doesn't set the filesystem encoding. It initializes the LC_CTYPE locale and Python global configuration variables (Py_UTF8Mode and Py_LegacyWindowsFSEncodingFlag).
Obviously we are not ready to import most encodings after pre initialization, but I think that's okay. Embedders who set something outside the range of what can be used without importing encodings will get an error to that effect if we try.
You need a C implementation of the Python filesystem encoding very early in Python initialization. You cannot start with one encoding and "later" switch the encoding. I tried multiple times over the last 10 years and I always failed to do that. All attempts failed with mojibake at different levels. Unix pays the price of its history. Windows is a very different story: there are APIs to access the filesystem with Unicode strings, so there is no such "bootstrap problem" for importlib.
In fact, I'd be totally okay with letting embedders specify their own function pointer here to do encoding/decoding between Unicode and the OS preferred encoding.
In my experience, when someone wants to get a specific encoding: they only want UTF-8. There is now the UTF-8 Mode which ignores the locale and forces the usage of UTF-8. I'm not sure that there is a need to have a custom codec. Moreover, if there is an API to pass a codec in C, you will need to expose it somehow at the Python level for os.fsencode() and os.fsdecode(). Currently, Python ensures during the early stage of startup that codecs.lookup(sys.getfilesystemencoding()) works: there is an existing Python codec for the requested filesystem encoding. Victor

On Sat, Apr 6, 2019 at 1:13 AM Victor Stinner <vstinner@redhat.com> wrote:
``PyPreConfig`` fields:
* ``coerce_c_locale_warn``: if non-zero, emit a warning if the C locale is coerced.
* ``coerce_c_locale``: if equal to 2, coerce the C locale; if equal to 1, read the LC_CTYPE to decide if it should be coerced.
Can we use another value for coerce_c_locale to determine whether to warn or not? Save a field.
coerce_c_locale is already complex, it can have 4 values: -1, 0, 1 and 2. I prefer to keep a separate field.
Moreover, I understood that you might want to coerce the C locale *and* get the warning, or get the warning but *not* coerce the locale.
Are these configurations really needed? Applications embedding Python may not initialize the Python interpreter at startup. For example, vim initializes Python the first time Python is used. On the other hand, C locale coercion should be done as soon as the application starts. I think a dedicated API for coercing the C locale is better than preconfig.
    // When application starts:
    Py_CoerceCLocale(warn=0);
    // later...
    Py_Initialize();
-- Inada Naoki <songofacandy@gmail.com>

Maybe I should clarify in the PEP 587 Rationale what the use cases for the API are. Embedding Python is one kind of use case, but writing your own Python with a specific config like "isolated Python" or "system Python" is also a valid use case. For a custom Python, you might want to get C locale coercion and UTF-8 Mode. The most common case is to embed Python in an application like Blender or vim: the application already executes a lot of code and manipulates strings and encodings before Python is initialized, so Python must not coerce the C locale in that case. That's why Nick and I decided to disable C locale coercion and UTF-8 Mode by default when the C API is used. Victor On Saturday, 6 April 2019, Inada Naoki <songofacandy@gmail.com> wrote:
On Sat, Apr 6, 2019 at 1:13 AM Victor Stinner <vstinner@redhat.com> wrote:
``PyPreConfig`` fields:
* ``coerce_c_locale_warn``: if non-zero, emit a warning if the C locale is coerced.
* ``coerce_c_locale``: if equal to 2, coerce the C locale; if equal to 1, read the LC_CTYPE to decide if it should be coerced.
Can we use another value for coerce_c_locale to determine whether to warn or not? Save a field.
coerce_c_locale is already complex, it can have 4 values: -1, 0, 1 and 2. I prefer to keep a separate field.
Moreover, I understood that you might want to coerce the C locale *and* get the warning, or get the warning but *not* coerce the locale.
Are these configurations really needed?
Applications embedding Python may not initialize the Python interpreter at startup. For example, vim initializes Python the first time Python is used.
On the other hand, C locale coercion should be done as soon as the application starts.
I think a dedicated API for coercing the C locale is better than preconfig.
// When application starts: Py_CoerceCLocale(warn=0);
// later... Py_Initialize();
-- Inada Naoki <songofacandy@gmail.com>

On Sat, 6 Apr 2019 at 02:16, Victor Stinner <vstinner@redhat.com> wrote:
``PyPreConfig`` fields:
* ``coerce_c_locale_warn``: if non-zero, emit a warning if the C locale is coerced.
* ``coerce_c_locale``: if equal to 2, coerce the C locale; if equal to 1, read the LC_CTYPE to decide if it should be coerced.
Can we use another value for coerce_c_locale to determine whether to warn or not? Save a field.
coerce_c_locale is already complex, it can have 4 values: -1, 0, 1 and 2. I prefer to keep a separate field.
Moreover, I understood that you might want to coerce the C locale *and* get the warning, or get the warning but *not* coerce the locale.
Yeah, that's how they ended up being two different fields in the first place. However, I wonder if the two fields might be better named:
* warn_on_legacy_c_locale
* coerce_legacy_c_locale
Neither set: legacy C locale is left alone
Only warning flag set: complain about the legacy C locale on stderr
Only coercion flag set: silently attempt to coerce the legacy C locale to a UTF-8 based one
Both flags set: attempt the coercion, and then complain about it on stderr (regardless of whether the coercion succeeded or not)
The original PEP 580 implementation tried to keep the config API simpler by always complaining, but that turned out to break the world (plenty of contexts where things get upset by unexpected output on stderr).
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sun, 7 Apr 2019 at 12:45, Nick Coghlan <ncoghlan@gmail.com> wrote:
The original PEP 580 implementation tried to keep the config API simpler by always complaining, but that turned out to break the world (plenty of contexts where things get upset by unexpected output on stderr).
Err, PEP 538. No idea why my brain swapped in the wrong PEP number :) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 05Apr2019 0912, Victor Stinner wrote:
About PyPreConfig and encodings. [...]
* ``PyInitError Py_PreInitialize(const PyPreConfig *config)``
* ``PyInitError Py_PreInitializeFromArgs(const PyPreConfig *config, int argc, char **argv)``
* ``PyInitError Py_PreInitializeFromWideArgs(const PyPreConfig *config, int argc, wchar_t **argv)``
I hope to one day be able to support multiple runtimes per process - can we have an opaque PyRuntime object exposed publicly now and passed into these functions?
I hesitated to include a "_PyRuntimeState*" parameter somewhere, but I chose to not do so.
Currently, there is a single global variable _PyRuntime which has the type _PyRuntimeState. The _PyRuntime_Initialize() API is designed around this global variable. For example, _PyRuntimeState contains the registry of interpreters: you don't want to have multiple registries :-)
I understood that we should only have a single instance of _PyRuntimeState. So IMHO it's fine to keep it private at this point. There is no need to expose it in the API.
So I didn't want to expose that particular object right now, but just some sort of "void*" parameter in the new APIs (and require either NULL or a known value be passed). That gives us the freedom to enable multiple runtimes in the future without having to change the API shape.
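A reserved parameter of this kind could be as small as the sketch below. The function name and its second argument are hypothetical; the only point shown is the "NULL or a known value" check that keeps the API shape stable for a future runtime handle.

```c
#include <stddef.h>

/* Hypothetical signature: `runtime` is reserved for a future opaque runtime
   handle. Today the only accepted value is NULL. */
int demo_preinitialize(void *runtime, int utf8_mode)
{
    if (runtime != NULL) {
        return -1;   /* unknown handle: reject it instead of guessing */
    }
    (void)utf8_mode; /* a real implementation would apply the setting here */
    return 0;
}
```

Rejecting every non-NULL value now is what makes it safe to give such values a meaning later: no existing caller can be relying on them being ignored.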
FYI I tried to design an internal API with a "context" to pass _PyRuntimeState, PyPreConfig, _PyConfig, the current interpreter, etc. [...] There are 2 possible implementations:
* Modify *all* functions to add a new "context" parameter and modify *all* functions to pass this parameter to sub-functions. * Store the current "context" as a thread local variable or something like that. [...] For the second option: well, there is no API change needed! It can be done later. Moreover, we already have such API! PyThreadState_Get() gets the Python thread state of the current thread: the current interpreter can be accessed from there.
Yes, this is what I had in mind as a transition. I think eventually it would be best to have the context parameter, as thread-local variables have overhead and add significant complexity (particularly when debugging crashes), but making that change is huge.
``PyPreConfig`` fields:
* ``coerce_c_locale_warn``: if non-zero, emit a warning if the C locale is coerced.
* ``coerce_c_locale``: if equal to 2, coerce the C locale; if equal to 1, read the LC_CTYPE to decide if it should be coerced.
Can we use another value for coerce_c_locale to determine whether to warn or not? Save a field.
coerce_c_locale is already complex, it can have 4 values: -1, 0, 1 and 2. I prefer to keep a separate field.
Moreover, I understood that you might want to coerce the C locale *and* get the warning, or get the warning but *not* coerce the locale.
If we define meaningful constants, then it doesn't matter how many values it has. We could have PY_COERCE_LOCALE_AND_WARN, PY_COERCE_LOCALE_SILENTLY, PY_WARN_WITHOUT_COERCE etc. to represent the states. These actually make things simpler than trying to reason about how two similar parameters interact.
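One possible encoding of such constants is sketched below. The names are prefixed with ``DEMO_`` to make clear they are illustrative; the bit-flag values, the extra ``DEMO_LOCALE_LEAVE_ALONE`` state and the two helper functions are assumptions, not part of any proposal in this thread.

```c
/* Hypothetical single-field encoding of "coerce?" and "warn?" as bit flags.
   Names mirror the constants suggested above; values are an assumption. */
enum {
    DEMO_LOCALE_LEAVE_ALONE     = 0,
    DEMO_WARN_WITHOUT_COERCE    = 1 << 0,
    DEMO_COERCE_LOCALE_SILENTLY = 1 << 1,
    DEMO_COERCE_LOCALE_AND_WARN = (1 << 0) | (1 << 1)
};

/* Non-zero if a warning should be written to stderr for this mode. */
int demo_should_warn(int mode)
{
    return (mode & DEMO_WARN_WITHOUT_COERCE) != 0;
}

/* Non-zero if the legacy C locale should be coerced for this mode. */
int demo_should_coerce(int mode)
{
    return (mode & DEMO_COERCE_LOCALE_SILENTLY) != 0;
}
```

A caller then sets one field to one named constant, and the implementation can still recover the two independent decisions from it.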
* ``legacy_windows_fs_encoding`` (Windows only): if non-zero, set the Python filesystem encoding to ``"mbcs"``.
* ``utf8_mode``: if non-zero, enable the UTF-8 mode
Why not just set the encodings here?
For different technical reasons, you simply cannot specify an encoding name. You can also pass options to tell Python that you have some preferences (PyPreConfig and PyConfig fields).
Python doesn't support arbitrary combinations of encoding and error handler. In practice, it only supports a narrow set of choices. The main implementation is the Py_EncodeLocale() and Py_DecodeLocale() functions, which use the C codec of the current locale encoding to implement the filesystem encoding, before the codec implemented in Python can be used.
Basically, only the current locale encoding or UTF-8 are supported. If you want UTF-8, enable the UTF-8 Mode.
If we already had a trivial way to specify the default encodings as a string before any initialization has occurred, I think we would have made UTF-8 mode enabled by setting them to "utf-8" rather than a brand new flag. Again, we either have a huge set of flags to infer certain values at certain times, or we can just make them directly settable. If we make them settable, it's much easier for users to reason about what is going to happen.
To load the Python codec, you need importlib. importlib needs to access the filesystem which requires a codec to encode/decode file names (PyConfig.module_search_paths uses Unicode wchar_t* strings, but the C API only supports bytes char* strings).
Right, and the few places where we need an encoding *before* we can load any arbitrary ones we can easily compare the strings and fail if someone's trying to do something unusual (or if the platform can do the lookup itself, it could succeed). If we say "passing NULL means use the default" then we have that handled, and the actual encoding just gets set to the real default once we figure out what that is.
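The string comparison described above could be sketched as follows. The enum and function names are hypothetical, and the exact, case-sensitive match on ``"utf-8"`` is a simplifying assumption; a real implementation would presumably accept aliases.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result of resolving a requested filesystem encoding before
   any Python-level codec can be imported. */
typedef enum {
    DEMO_FSENC_LOCALE,      /* NULL was passed: use the platform default */
    DEMO_FSENC_UTF8,        /* "utf-8": handled by a built-in C codec */
    DEMO_FSENC_UNSUPPORTED  /* anything else: fail early with an error */
} demo_fsenc;

demo_fsenc demo_resolve_fs_encoding(const char *name)
{
    if (name == NULL) {
        return DEMO_FSENC_LOCALE;
    }
    if (strcmp(name, "utf-8") == 0) {
        return DEMO_FSENC_UTF8;
    }
    return DEMO_FSENC_UNSUPPORTED;
}
```

An embedder asking for an unusual encoding would get ``DEMO_FSENC_UNSUPPORTED`` at configuration time, rather than mojibake later.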
Py_PreInitialize() doesn't set the filesystem encoding. It initializes the LC_CTYPE locale and Python global configuration variables (Py_UTF8Mode and Py_LegacyWindowsFSEncodingFlag).
Right, I'm proposing a simplification here where it *does* set the filesystem encoding (even though it doesn't get used until Py_Initialize() is called). That way we can use the filesystem encoding to access the filesystem during initialization, provided it's one of the built-in supported ones (e.g. NULL, which means the C locale, or "utf-8" which means UTF-8) rather than relying on the tables in the standard library. Oh look, I said all this in my original email:
Obviously we are not ready to import most encodings after pre initialization, but I think that's okay. Embedders who set something outside the range of what can be used without importing encodings will get an error to that effect if we try.
You need a C implementation of the Python filesystem encoding very early in Python initialization. You cannot start with one encoding and "later" switch the encoding. I tried multiple times over the last 10 years and I always failed to do that. All attempts failed with mojibake at different levels.
Again, this is for embedders. Regular Python users will only ever request "NULL" or "utf-8", depending on the UTF-8 mode flag. And embedders have to make sure they get what they ask for and also can't change it later. The problems you've hit in the past have always been to do with trying to infer or guess the actual encoding, rather than simply letting someone tell you what it is (via config) and letting them deal with the failure.
In fact, I'd be totally okay with letting embedders specify their own function pointer here to do encoding/decoding between Unicode and the OS preferred encoding.
In my experience, when someone wants to get a specific encoding: they only want UTF-8. There is now the UTF-8 Mode which ignores the locale and forces the usage of UTF-8.
Your experience here sounds like it's limited to POSIX systems. I've wanted UTF-16 before, and been able to provide it (if Python had allowed me to provide a callback to encode/decode). And again, all this is about "why do we need to define a boolean that determines what the encoding is when we can just let people tell us what encoding they want". There's a good chance that an embedded Python isn't going to touch the real filesystem anyway.
I'm not sure that there is a need to have a custom codec. Moreover, if there is an API to pass a codec in C, you will need to expose it somehow at the Python level for os.fsencode() and os.fsdecode().
We need to expose those operations anyway, and os.fsencode/fsdecode have their own issues (particularly since there *are* ways to change filesystem encoding while running). Turning them into actual native functions that might call out to a host-provided callback would not be difficult.
Currently, Python ensures during the early stage of startup that codecs.lookup(sys.getfilesystemencoding()) works: there is an existing Python codec for the requested filesystem encoding.
Right, it's a validation step. But we can also make codecs.lookup("whatever the file system encoding is") return something based on os.fsencode() and os.fsdecode(). We're not actually beholden to the current implementations here - we are allowed to change them! ;)

Example of Python initialization enabling the isolated mode::
PyConfig config = PyConfig_INIT; config.isolated = 1;
Haven't we already used external values by this point that should have been isolated?
On this specific example, "config.isolated = 1;" ensures that Py_PreInitialize() is also called internally with "PyPreConfig.isolated = 1".
I'd rather have the isolation up front. Or better yet, make isolation the default unless you call one of the "FromArgs" functions, and then we don't actually need the config setting at all.
While there are supporters of an "isolated Python" (sometimes called "system python"), the fact that it doesn't exist in any Linux distribution nor on any other operating system (Windows, macOS, FreeBSD), whereas it's already doable in Python 3.6 with Py_IsolatedFlag=1, makes me think that users like the ability to control Python with environment variables and configuration files. I would prefer to leave Python as not isolated by default. It's just a matter of command line arguments.
* The PEP 432 stores ``PYTHONCASEOK`` into the config. Do we need to add something for that into ``PyConfig``? How would it be exposed at the Python level for ``importlib``? Passed as an argument to ``importlib._bootstrap._setup()`` maybe? It can be added later if needed.
Could we convert it into an xoption? It's very rarely used, to my knowledge.
The first question is if there is any need for an embedder to change this option. Currently, importlib._bootstrap_external._install() reads the environment variable and it's the only way to control the option. ... By the way, importlib reads the PYTHONCASEOK environment variable even if isolated mode is enabled (sys.flags.isolated is equal to 1). Is it a bug? :-) Victor

On 05Apr2019 0922, Victor Stinner wrote:
While there are supporters of an "isolated Python" (sometimes called "system python"), the fact that it doesn't exist in any Linux distribution nor on any other operating system (Windows, macOS, FreeBSD), whereas it's already doable in Python 3.6 with Py_IsolatedFlag=1 makes me think that users like the ability to control Python with environment variables and configuration files.
I would prefer to leave Python as not isolated by default. It's just a matter of command line arguments.
Not for embedders it isn't. When embedding you need to do a whole lot of special things to make sure that your private version of Python doesn't pick up settings relating to a regular Python install. We should make the Python runtime isolated by default, and only (automatically) pick up settings from the environment in the Python binary.
* The PEP 432 stores ``PYTHONCASEOK`` into the config. Do we need to add something for that into ``PyConfig``? How would it be exposed at the Python level for ``importlib``? Passed as an argument to ``importlib._bootstrap._setup()`` maybe? It can be added later if needed.
Could we convert it into an xoption? It's very rarely used, to my knowledge.
The first question is if there is any need for an embedder to change this option. Currently, importlib._bootstrap_external._install() reads the environment variable and it's the only way to control the option.
I think the first question should be "is there any reason to prevent an embedder from changing this option". In general, the answer is going to be no. We should expose all the options we rely on to embedders, or else they're going to have to find workarounds.
... By the way, importlib reads the PYTHONCASEOK environment variable even if isolated mode is enabled (sys.flags.isolated is equal to 1). Is it a bug? :-)
Yes, I think it's a bug. Perhaps this should become a proper configuration option, rather than a directly-read environment variable?

I think my biggest point (about halfway down) is that I'd rather use argv/environ/etc. to *initialize* PyConfig and then only use the config for initializing the runtime. That way it's more transparent for users and more difficult for us to add options that embedders can't access.
I chose to exclude the PyConfig_Read() function from the PEP to try to start with the bare minimum public API and see how far we can go with that. The core of the PEP 587 implementation is the PyPreConfig_Read() and PyConfig_Read() functions (currently called _PyPreConfig_Read() and _PyCoreConfig_Read()): they populate all fields so the read config becomes the reference config which will be applied. For example, PyConfig_Read() fills module_search_paths from other PyConfig fields: it will become sys.path. I spent a lot of time deeply reworking the implementation of PyConfig_Read() to make sure that it has no side effect. Reading and writing the configuration are now strictly separated. So it is safe to call PyConfig_Read(), modify PyConfig afterwards, and pass the modified config to Py_InitializeFromConfig(). Do you think that exposing PyConfig_Read() would solve some of your problems?
Currently you have three functions, that take a PyConfig and optionally also use the environment/argv to figure out the settings:
* ``PyInitError Py_InitializeFromConfig(const PyConfig *config)``
* ``PyInitError Py_InitializeFromArgs(const PyConfig *config, int argc, char **argv)``
* ``PyInitError Py_InitializeFromWideArgs(const PyConfig *config, int argc, wchar_t **argv)``
I would much prefer to see this flipped around, so that there is one initialize function taking PyConfig, and two functions that will fill out the PyConfig based on the environment:
(note two of the "const"s are gone)
* ``PyInitError Py_SetConfigFromArgs(PyConfig *config, int argc, char **argv)``
* ``PyInitError Py_SetConfigFromWideArgs(PyConfig *config, int argc, wchar_t **argv)``
* ``PyInitError Py_InitializeFromConfig(const PyConfig *config)``
This implementation evolved *A LOT* over the last few months. I was *very confused* until the pre-initialization phase was introduced, which solved a lot of bootstrap issues. After I wrote down the PEP and read it again, I came to the same conclusion: Py_InitializeFromConfig(config) should be enough, and we can add helper functions to set arguments on PyConfig (as you showed). Victor
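The "flipped" design discussed in this exchange (helpers that fill the config from arguments, plus one initialize function that takes a const config) can be sketched in standalone C. All names below are hypothetical and deliberately do not reuse the proposed Py_* names; only the command-line flags are real (-I for isolated mode, which implies -E, -E to ignore environment variables, and -v for verbosity, which may be repeated). Unknown options are rejected to keep the sketch short.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical config and runtime state, for illustration only. */
typedef struct {
    int isolated;
    int ignore_environment;
    int verbose;
} DemoArgsConfig;

typedef struct {
    int use_environment;
    int verbose;
} DemoState;

/* Helper: mutate the config based on command-line arguments.
 * Returns 0 on success, -1 on an option this sketch does not know. */
int demo_set_config_from_args(DemoArgsConfig *config, int argc, char **argv)
{
    assert(config != NULL);

    for (int i = 1; i < argc; i++) {
        if (strcmp(argv[i], "-I") == 0) {
            config->isolated = 1;
            config->ignore_environment = 1;   /* -I implies -E */
        }
        else if (strcmp(argv[i], "-E") == 0) {
            config->ignore_environment = 1;
        }
        else if (strcmp(argv[i], "-v") == 0) {
            config->verbose++;
        }
        else {
            return -1;
        }
    }
    return 0;
}

/* Single entry point: the config is const, so initialization cannot
 * quietly pick up settings that the caller never saw. */
void demo_initialize_from_config(DemoState *state, const DemoArgsConfig *config)
{
    assert(state != NULL && config != NULL);

    state->use_environment = !config->ignore_environment;
    state->verbose = config->verbose;
}
```

With this shape, every setting that argv can influence is visible in the structure before initialization starts, which is the transparency argument made earlier in the thread.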

For the PyMainConfig structure idea, I cannot comment at this point. I need more time to think about it.

About the "path configuration" fields, maybe a first step to enhance the API would be to add the following function:

    PyInitError PyConfig_ComputePath(PyConfig *config, const wchar_t *home);

where home can be NULL (and the PyConfig.module_search_paths_env field goes away: the function reads the PYTHONPATH env var internally). This function would "compute the path configuration", which is what's currently listed in _PyCoreConfig under:

    /* Path configuration outputs */
    int use_module_search_paths;      /* If non-zero, use module_search_paths */
    _PyWstrList module_search_paths;  /* sys.path paths. Computed if
                                         use_module_search_paths is equal to zero. */
    wchar_t *executable;              /* sys.executable */
    wchar_t *prefix;                  /* sys.prefix */
    wchar_t *base_prefix;             /* sys.base_prefix */
    wchar_t *exec_prefix;             /* sys.exec_prefix */
    wchar_t *base_exec_prefix;        /* sys.base_exec_prefix */
    #ifdef MS_WINDOWS
    wchar_t *dll_path;                /* Windows DLL path */
    #endif

Victor

On 05Apr2019 0936, Victor Stinner wrote:
For the PyMainConfig structure idea, I cannot comment at this point. I need more time to think about it.
About the "path configuration" fields, maybe a first step to enhance the API would be to add the following function:
PyInitError PyConfig_ComputePath(PyConfig *config, const wchar_t *home);
where home can be NULL (and PyConfig.module_search_paths_env field goes away: the function reads PYTHONPATH env var internally).
Yes, I like this. Maybe pass PYTHONPATH value in as an "additional paths" parameter?

Basically, this function would be the replacement for "Py_GetPath()" (which initializes paths to the defaults the first time it is called), and setting the path fields in PyConfig manually is the replacement for Py_SetPath() (or calling the various Py_Set*() functions to make the default logic infer the paths you want).

Similarly, PyConfig_ComputeFromArgv() and/or PyConfig_ComputeFromEnviron() functions would also directly replace the magic we have scattered all over the place right now. It would also make it more obvious to the callers which values take precedence, and easier to see that there should be no side effects. I think it's easier to document as well.

Cheers,
Steve
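As a rough illustration of the "compute the path configuration" idea with an "additional paths" parameter, here is a standalone sketch. It does not reproduce the real path logic: the names, the fixed-size arrays, the ':' separator, the "<home>/lib" layout and the "/usr/local" fallback are all assumptions chosen to keep the example small. What it shows is the shape of the call: the caller passes everything in explicitly, the additional paths come first, and the function has no inputs other than its arguments.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_MAX_PATHS 8
#define DEMO_PATH_LEN 128

/* Hypothetical output structure: a small fixed-size list of search paths. */
typedef struct {
    char paths[DEMO_MAX_PATHS][DEMO_PATH_LEN];
    int count;
} DemoPathConfig;

/* Append one path of the given length; returns -1 if it does not fit. */
static int demo_add_path(DemoPathConfig *config, const char *start, size_t len)
{
    if (config->count >= DEMO_MAX_PATHS || len >= DEMO_PATH_LEN) {
        return -1;
    }
    memcpy(config->paths[config->count], start, len);
    config->paths[config->count][len] = '\0';
    config->count++;
    return 0;
}

/* Compute the search paths from explicit inputs only.
 * home may be NULL (a default prefix is used), extra_paths may be NULL. */
int demo_compute_path(DemoPathConfig *config, const char *home,
                      const char *extra_paths)
{
    assert(config != NULL);
    config->count = 0;

    /* 1. Additional paths, split on ':', keep their order and come first. */
    if (extra_paths != NULL) {
        const char *start = extra_paths;
        while (*start != '\0') {
            const char *end = strchr(start, ':');
            size_t len = (end != NULL) ? (size_t)(end - start) : strlen(start);
            if (len > 0 && demo_add_path(config, start, len) < 0) {
                return -1;
            }
            if (end == NULL) {
                break;
            }
            start = end + 1;
        }
    }

    /* 2. Default library directory derived from home (assumed layout). */
    char lib_dir[DEMO_PATH_LEN];
    int written = snprintf(lib_dir, sizeof(lib_dir), "%s/lib",
                           (home != NULL) ? home : "/usr/local");
    if (written < 0 || (size_t)written >= sizeof(lib_dir)) {
        return -1;
    }
    return demo_add_path(config, lib_dir, (size_t)written);
}
```

An embedder that wants full control would skip this function and fill the list by hand, which matches the Py_GetPath() / Py_SetPath() split described above.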

Ah, I forgot to say that a major enhancement for the implementation is that I wrote a lot of new unit tests for the existing Python Initialization API. Python 3.7 has most of these tests. I wrote even more tests for my new private initialization API ;-) I wrote these tests to specify and validate the priority between the different ways to configure Python and the "rules" (a parameter setting other parameters):
https://github.com/python/peps/blob/master/pep-0587.rst#priority-and-rules
Victor
On Wed, 27 Mar 2019 at 18:48, Victor Stinner <vstinner@redhat.com> wrote:
Hi,
I would like to add a new C API to initialize Python. I would like your opinion on the whole API before making it public. The code is already implemented. Doc of the new API:
https://pythondev.readthedocs.io/init_config.html

On 27.03.2019 20:48, Victor Stinner wrote:
Hi,
I would like to add a new C API to initialize Python. I would like your opinion on the whole API before making it public. The code is already implemented. Doc of the new API:
https://pythondev.readthedocs.io/init_config.html
To make the API public, _PyWstrList, _PyInitError, _PyPreConfig, _PyCoreConfig and related functions should be made public.
By the way, I would suggest to rename "_PyCoreConfig" to just "PyConfig" :-) I don't think that "core init" vs "main init" is really relevant: more about that below.
Let's start with two examples using the new API.
Example of simple initialization to enable isolated mode:
    _PyCoreConfig config = _PyCoreConfig_INIT;
    config.isolated = 1;

    _PyInitError err = _Py_InitializeFromConfig(&config);

By my outsider observation, the `config' argument and return code are asking to be added to Py_Initialize instead, `_Py_InitializeFromConfig` and `_Py_PreInitialize` look redundant.

    if (_Py_INIT_FAILED(err)) {
        _Py_ExitInitError(err);
    }

    /* ... use Python API here ... */
    Py_Finalize();
-- Regards, Ivan
participants (8)
- Alexander Belopolsky
- Brett Cannon
- Inada Naoki
- Ivan Pozdeev
- Nick Coghlan
- Stephen J. Turnbull
- Steve Dower
- Victor Stinner