Re: Make UTF-8 mode more accessible for Windows users.

On 2/6/21, Christopher Barker <pythonchb@gmail.com> wrote:
Chiefly, I don't want to overload "pyvenv.cfg" with new behavior that's unrelated to virtual environments. I also dislike the way this file is found. If the parent directory is "C:\Program Files", then I'm not worried about finding "C:\Program Files\pyvenv.cfg" when the interpreter tries to open it. But this pattern is not safe in general when installed to an arbitrary directory, or with a portable distribution. The presence of a "._pth" file (Windows only) beside the DLL or executable bypasses the search for "pyvenv.cfg", among other things. The embedded distribution includes a ._pth that locks it down. This is another reason to use a different file to configure defaults for -X settings such as "utf8", a file that's guaranteed to always be read.
The idea to use the profile data directories %ProgramData% and %LocalAppData% was for symmetry with how this could be supported in POSIX, which doesn't use the application directory as Windows does. The application "python.cfg" (in the directory of the executable, including a virtual environment) can support a setting to isolate it from system and user "python.cfg" files.
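A rough sketch of the lookup order described above, for illustration only. The file name `python.cfg` comes from the message; the `Python` subdirectory under each profile directory, the function name `candidate_config_paths`, and the most-specific-first ordering are assumptions and not part of any agreed design.

```python
from pathlib import PureWindowsPath


def candidate_config_paths(exe_dir, environ):
    """Return hypothetical python.cfg locations, most specific first.

    Assumed order: application directory, then per-user %LocalAppData%,
    then machine-wide %ProgramData%. The "Python" subdirectory is made up
    for this example.
    """
    paths = [PureWindowsPath(exe_dir) / "python.cfg"]
    for var in ("LOCALAPPDATA", "PROGRAMDATA"):
        base = environ.get(var)
        if base:  # skip profile directories that are not defined
            paths.append(PureWindowsPath(base) / "Python" / "python.cfg")
    return paths


example_env = {
    "LOCALAPPDATA": r"C:\Users\alice\AppData\Local",
    "PROGRAMDATA": r"C:\ProgramData",
}
for path in candidate_config_paths(r"C:\venvs\demo\Scripts", example_env):
    print(path)
```

An "isolate" setting in the application-level file, as mentioned in the message, would simply stop the search after the first entry.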

On Sun, Feb 7, 2021 at 4:16 PM Eryk Sun <eryksun@gmail.com> wrote:
OK, then, how about just the same directory as python.exe? In this case, we need to put python.ini in the Scripts directory for venvs. It seems a bit odd, but it is much simpler than looking in the parent directory.
Thank you, I didn't know that. If we need to search a parent directory, we need to check ._pth too.
Should we support it in Unix? I don't think so. Command-line and environment variables are easy to use on Unix. And beginners should use a UTF-8 locale.
I know that. But I don't think it's enough reason to put a new config file in the user profile. If users don't have system privileges, they can still install another Python. A config file in the user profile is fragile. If all venvs start using the profile directory, it will soon become unmaintainable. We can just recommend a per-user install for new users. Regards, -- Inada Naoki <songofacandy@gmail.com>

On Sun, Feb 7, 2021 at 3:58 PM Inada Naoki <songofacandy@gmail.com> wrote:
Chiefly, I don't want to overload "pyvenv.cfg" with new behavior that's unrelated to virtual environments.
This is my point -- this is NOT unrelated to virtual environments -- UTF-8 mode, and other configuration parameters are very much part of the (generic term) environment. The whole point here is to be able to set a configuration on a virtual environment. And I think that the venv tool should probably grow a feature to turn it on (or off). So my take is that we have pyvenv.cfg already, so why not use it for all the configuration one might want for a particular "environment".
I also dislike the way this file is found.
...
Indeed, that is unfortunate. And may well make this impossible -- I agree that a general configuration file shouldn't be found there. Oh well -- more config files it is!
OK, then, how about just the same directory as python.exe? In this case, we need to put python.ini in the Scripts directory for venvs. It seems a bit odd, but it is much simpler than looking in the parent directory.
I think that would work.
The issue (to me anyway) is not where it is, but rather the whole idea of putting it outside Python, and in the user's space at all.
Should we support it in Unix? I don't think so. Command-line and environment variables are easy to use on Unix.
maybe, but we have many of the same issues -- we want the configuration tied to the environment, not to the user and all environments. And I'd rather have things done the same way on all platforms, rather than the native way on each platform, if I have to make a choice. That is, if there is a way to configure Python on Windows, I'd really like the SAME way to be available on all platforms.
And beginners should use a UTF-8 locale.
Beginners may not know how to do that / have a choice. This is a question I still don't know the answer to -- I think that most (all?) non Windows platforms currently supported use utf-8 -- but is that guaranteed? That is, might some platform come up that does need utf-8 mode? So why not have it available everywhere, even though it will be a no-op on most systems.
exactly -- I'm trying to imagine a case where a user doesn't have read access to the place the python.exe is, but DOES need to override this one thing. That is, a user can either control the python install they are using or they can't.
Config file in user profile is fragile. If all venvs start using profile directory, it becomes unmaintainable soon.
exactly -- if this is added, I will certainly not recommend anyone use it.
We can just recommend per-user install for new users.
or a virtual environment :-) Thanks Eryk for bringing clarity to these issues. -Chris B. -- Christopher Barker, PhD (Chris) Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

On Mon, Feb 8, 2021 at 3:58 PM Christopher Barker <pythonchb@gmail.com> wrote:
On Unix, there are N ways (e.g. .envrc). Is an (N+1)th way really worthwhile? At least, `python.cfg` (or `python.ini`) in the bin/ directory is not good for a Unix environment.
UTF-8 mode is provided for Unix because there are environments for *deployment*, like minimal Unix container images. They have only the C locale. For desktop use, I think all Unix environments suited for beginners use a UTF-8 locale by default. There is no guarantee. But if the default locale is not UTF-8, I don't think the environment is suited for beginners who are learning Python. Regards, -- Inada Naoki <songofacandy@gmail.com>
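To see which situation a given interpreter is in, the relevant settings can be inspected from Python itself. This is a minimal sketch; the helper name `encoding_report` is made up for the example, and `sys.flags.utf8_mode` requires Python 3.7 or later.

```python
import locale
import sys


def encoding_report():
    """Collect the settings that UTF-8 mode influences in this interpreter."""
    return {
        # nonzero when UTF-8 mode is active
        "utf8_mode": sys.flags.utf8_mode,
        # the locale-derived default text encoding
        "preferred_encoding": locale.getpreferredencoding(False),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


for name, value in encoding_report().items():
    print(f"{name}: {value}")
```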

On Sun, Feb 7, 2021 at 11:19 PM Inada Naoki <songofacandy@gmail.com> wrote:
yes -- I much prefer a "this is how you do it for Python" than a bunch of platform specific details. And is there a good way to do it for environments (of various sorts) ?
At least, `python.cfg` (or `python.ini`) in bin/ directory is not good for Unix environment.
hmm -- that is true (though is it THAT bad ?!?), though it would be fine for virtual environments. And as has been mentioned many times, this is generally not a great configuration to set globally anyway.
That's true, but not in Python's control. But this is not just newbies -- see above, deployment and test (CI) environments might need it too. Which is another good reason that having it be something that can be "turned on" by a virtual environment / requirements file would be very helpful. -Chris B. -- Christopher Barker, PhD (Chris)

On Tue, Feb 9, 2021 at 2:28 AM Christopher Barker <pythonchb@gmail.com> wrote:
Unlike on Windows, environment variables work fine for such use cases. On Unix, there are direnv, dotenv, and maybe more tools. They are not only for Python, but for projects.
Which is another good reason that having it be something that can be "turned on" by a virtual environment / requirements file would be very helpful.
There are direnv and dotenv. -- Inada Naoki <songofacandy@gmail.com>

On Mon, Feb 8, 2021 at 6:11 PM Inada Naoki <songofacandy@gmail.com> wrote:
Unlike Windows, environment variables work very fine for such use cases.
Windows has environment variables, doesn't it?
On Unix, direnv, dotenv, and maybe more tools are there.
I've been around Python for decades, and have never heard of these. Is this dotenv? https://pypi.org/project/python-dotenv/ From the looks of it, it works on Windows too. Or it's dangerously mis-documented, which is kinda my point. We're talking about this because people that do their work on *nix systems deliver code that does not work correctly on Windows. I think it's MUCH better to have ONE way to do something that works, for Python, on all platforms. That way people that only know one platform can still write and document code that can work on all platforms.
There are direnv and dotenv.
It looks to me like dotenv would have to be run after Python startup -- so wouldn't help here. direnv looks nifty, but again, not Python, and I can't quite see how it would help here, it seems to be about the current working directory.
You and Eryk certainly know the implementation details more than I, so I'll step back and talk about what I'd like to see:
1) Something that can be easily set up to be "environment" specific, where an environment can be a virtualenv, a venv, a pipenv (are they different??), a conda environment, or, hopefully whatever new environment system comes along.
2) Something that can be part of the standard environment creation step, not an extra step you need folks to do by hand. Ideally a package that could be put in a requirements file. That is, I could simply put "utf8_mode" in my requirements file(s) and anyone that installed those requirements into an environment would get it configured.
3) One way to do that that's the same on all platforms.
I *think* this is possible. -Chris B -- Christopher Barker, PhD (Chris)

On Tue, Feb 9, 2021 at 3:37 PM Christopher Barker <pythonchb@gmail.com> wrote:
But it doesn't work well for Windows users. Unix and Windows have different use cases.
I think it's MUCH better to have ONE way to do something that works, for Python, on all platforms. That way people that only know one platform can still write and document code that can work on all platforms.
This thread is only about making UTF-8 mode accessible for Windows users, because UTF-8 mode helps many Windows users but it is not accessible enough for them. Can you provide some realistic use cases where UTF-8 mode helps Unix users but is not accessible? If not, please focus on helping Windows users. Time is a limited resource. I have no time to discuss helping zero Unix users. -- Inada Naoki <songofacandy@gmail.com>

On Mon, Feb 8, 2021 at 10:49 PM Inada Naoki <songofacandy@gmail.com> wrote:
Well, there has been some talk of adding some of the other configuration options as well. But sure.
because UTF-8 mode helps many Windows users but it is not accessible enough for Windows users.
It's not just accessibility, but discoverability -- Windows users -- and even more so developers that don't generally use Windows often don't know utf-8 mode exists. That's why I'm pushing for a way for an application developer to be able to set up their project so that it will run under utf-8 mode everywhere. With only one way, and without having to add Windows specific code or documentation. As has been discussed, there are very few cases where it would make any difference under Linux (and zero for the Mac?) -- but why not have "one way to do it"?
Can you provide some realistic use cases where UTF-8 mode helps Unix users but it is not accessible?
It's not accessible to the application developer. It is to the deployer / devops person. These are often one and the same, but not always. My major project had exactly this problem -- the bare bones docker images used on the CI (and for deployment) were set up with an ASCII locale (or something like that) -- and our application failed. In the end we figured out how to configure the images for utf-8, but as it happens, I know Python, and don't know much Linux administration, and the linux sys admins didn't know Python much -- so it took a fair bit of back and forth to figure out. We use conda for CI and deployment -- if I had been able to put a "utf-mode" package in the conda requirements file, we wouldn't have had this issue, and our Windows users (yes we have those too) would also get their systems set up to "do the right thing" without their even knowing about it. Other folks use pipenv and the like -- it would be helpful to them if they could do the same thing with their requirements files as well.
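The container situation described above can be approximated locally by starting a child interpreter under the C locale. This is only a sketch: the helper name `preferred_encoding_in_child` is made up, and setting `LC_ALL=C` stands in for a bare-bones image rather than reproducing one.

```python
import os
import subprocess
import sys

CHILD_CODE = "import locale; print(locale.getpreferredencoding(False))"


def preferred_encoding_in_child(extra_env):
    """Start a fresh interpreter with extra environment variables set."""
    env = dict(os.environ)
    env.update(extra_env)
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Simulate a bare-bones image: C locale, but UTF-8 mode switched on.
print(preferred_encoding_in_child({"LC_ALL": "C", "PYTHONUTF8": "1"}))
```

The exact spelling of the reported name differs between Python versions, so compare it case-insensitively if you script against it.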
If not, please focus on helping Windows users.
Honestly, I'm trying to help Windows users here -- see above. Honestly, my Windows users are the biggest problem (they tend to be less tech savvy -- at least I had a Linux sysadmin to work with, my Windows users usually are not sysadmins). And it's not Linux users so much anyway -- it's Linux developers that want to support Windows users. Remember back in the day of Python 2, where opening a text file and binary file was no different on *nix? There was no shortage of bugs that didn't turn up in code tested on *nix until a Windows user came around -- but at least it was an easy fix -- a 'b' flag, and the code would work the same on all platforms. And in the end, if there is a single solution that can do the same thing in the same way on all platforms isn't that less maintenance and documentation work?
Time is a limited resource. I have no time to discuss helping zero Unix users.
By all means -- spend your time on what you think is important. You asked for others' opinions, I've given mine. If you don't agree, so be it. Thanks for your work on this -- anything you do will be an improvement. -Chris -- Christopher Barker, PhD (Chris)

On Tue, Feb 9, 2021 at 4:53 PM Christopher Barker <pythonchb@gmail.com> wrote:
It makes the problem too hard and complex. It means we would not be able to fix anything at all by Python 3.10. We can add Unix support later if it is really worthwhile. That would not be a backward-incompatible change.
When using docker, it's very easy to set an environment variable. You don't need to worry that "it will break an existing legacy Python application in the same container." You can just create one container for one application. So I don't think it is enough reason to add complexity. As I said before, the use case of UTF-8 mode is different between Unix and Windows.
We use conda for CI and deployment -- if I had been able to put a "utf-mode" package in the conda requirements file, we wouldn't have had this issue, and our Windows users (yes we have those too) would also get their systems set up to "do the right thing" without their even knowing about it.
Other folks use pipenv and the like -- it would be helpful to them if they could do the same thing with their requirements files as well.
Without a more concrete idea, such rough proposals lead this thread into a maze. Note that UTF-8 mode must be enabled before any path configuration on Unix. So it is almost impossible to enable UTF-8 mode using tools like pip. If your idea is just putting `python.ini` (or `python.cfg`) in the bin/ or Scripts/ directory from a pip/conda package, I think it is just a hack, not a best practice. It will cause file conflict errors very easily.
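The timing constraint can be observed directly: UTF-8 mode is decided during interpreter startup, so code that runs later (which includes anything an installed package could execute) cannot switch it on for the running process. A small sketch, with variable names chosen just for this example:

```python
import os
import subprocess
import sys

mode_before = sys.flags.utf8_mode

# Setting the variable now is too late for this process: the flag was
# fixed during interpreter startup.
os.environ["PYTHONUTF8"] = "1"
mode_after = sys.flags.utf8_mode

# A child interpreter inherits os.environ and reads the variable at startup.
child_mode = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.flags.utf8_mode)"],
    capture_output=True,
    text=True,
    check=True,
).stdout.strip()

print("this process:", mode_before, "->", mode_after)
print("child process:", child_mode)
```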

Here's a good blog post about setting env vars on Windows: https://www.dowdandassociates.com/blog/content/howto-set-an-environment-vari... It's not really much harder than on Unix platforms. The only catch is that Windows users will often not know about such env vars or how to use them, because on Windows you typically set up your configuration via the application and using the registry. Perhaps we could have both: an env var to enable UTF-8 mode and a registry key set by the installer. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Feb 09 2021)
Python Projects, Coaching and Support ... https://www.egenix.com/ Python Product Development ... https://consulting.egenix.com/
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 https://www.egenix.com/company/contact/ https://www.malemburg.com/

Exactly -- it's not so much that Windows itself has different capabilities, but that Windows conventions are different. And Windows users are different -- let's face it, you still need a greater level of "sophistication" to use Linux. And the Mac has a more consistent configuration guaranteed -- at least for this case, but also in general.
Perhaps we could have both: an env var to enable UTF-8 mode and a registry key set by the installer.
There already is an environment variable. As for the registry key -- much of the point of this thread is around the idea that people should generally not set it for all Python use on that machine, or that user, but rather have it be specific to the environment -- so I don't know that we want it to be easier to set it global to the user.
The point I've been pushing is that there are various people "in control" of this setting:
- The sysadmin
- The user
- The application developer
(sometimes one or two of these roles is the same person, but not always)
Clearly the sysadmin and user should have control over this setting -- so we may want to make it easier on users that may not be familiar with setting environment variables. But my focus is on the application developer: we currently have a way to specify what Python environment is needed to run an application: a requirements file. So I can specify to my users that in order to run this code, they need to install these requirements, and the code should work. What I would like is to be able to have utf-8 mode be part of that -- and not have to document a special extra step they need to take, and even more so, not have to document that special step only on Windows.
It's not a huge deal, but I'd rather it be clean -- and the other nice bit is that eventually, if/when utf-8 becomes the default in a future Python, this becomes a no-op and my users don't have to know anything has changed. In a way, what I'm looking for is a system-wide equivalent to a __future__ import -- maybe impossible, but it'd be nice. - Chris B Christopher Barker, PhD (Chris)

On Tue, Feb 9, 2021 at 7:42 PM M.-A. Lemburg <mal@egenix.com> wrote:
But it affects all Python installs. Can teachers recommend that students set the PYTHONUTF8 environment variable?
I don't want to recommend env vars and registry for conda and portable Python users... -- Inada Naoki <songofacandy@gmail.com>

On Tue, 9 Feb 2021 at 17:32, Inada Naoki <songofacandy@gmail.com> wrote:
Why is that an issue? In the first instance, do the sorts of "beginner" we're discussing here have multiple python installs? Would they need per-interpreter configuration of UTF-8 mode? Honestly, I find it far harder to configure environment variables on Unix (I have to do it per *shell*, for a start). Windows users don't often set environment variables, because Windows-native applications often use other means to determine their configuration - but it's not because the user *can't* set environment variables, or because it's "too hard".
I'm not sure what you mean here. Why is this different from (say) PYTHONPATH? How would conda and portable python users configure PYTHONPATH? Why is UTF-8 mode any different? Paul

On Wed, Feb 10, 2021 at 6:02 AM Paul Moore <p.f.moore@gmail.com> wrote:
Hmm, I was afraid of breaking applications that use an existing Python on the system. But if no one cares about that, I'm OK with just adding something like "enable-utf8-mode.bat" / "disable-utf8-mode.bat".
How often is PYTHONPATH needed at all? I have seen many people break their environment by setting PYTHONPATH. I don't recommend using it at all. On the other hand, I want teachers to be able to recommend enabling UTF-8 mode to students. That is the difference between PYTHONUTF8 and PYTHONPATH. -- Inada Naoki <songofacandy@gmail.com>

On Tue, Feb 9, 2021 at 1:04 PM Paul Moore <p.f.moore@gmail.com> wrote:
yes -- many, many tutorials, particularly about web frameworks, start with "make a new virtual environment". To the point that many of my students have thought that was a requirement to use, e.g. flask. Personally, I do not start out with environments with my beginning students -- they really only need one at the early stages. But other instructors do. Others have to work with a locked down system provided by their employer that might be an older version of Python, or need some particular configuration that they don't want to override. And all the examples given here of how to set environment variables and shortcuts, etc on Windows is EXACTLY the kind of information I don't want to have to provide for my students :-( -- I'm teaching Python, not Windows administration.
I don't want to recommend env vars and registry for conda and portable Python users...
and a lot of newbies learning Python for data science are starting out with conda as well ...
It's not -- using PYTHONPATH is a "bad idea" I never recommend it to anyone. It was a nightmare when folks have Python 2 and 3 on the same machine, but now, in the age of environments, it's still a really bad idea. It's really important to support configuration per environment these days. Ideally with any of the "environment" tools. -CHB -- Christopher Barker, PhD (Chris)

On Wed, 10 Feb 2021 at 07:14, Christopher Barker <pythonchb@gmail.com> wrote:
So get PYTHONUTF8 added to the environment activate script. That's a simple change to venv. And virtualenv, and conda - yes, it needs to happen in multiple places, but that's still easier IMO than proposing a change to Python's already complex (and slower than many of us would like) startup process.
So teach Python as it actually is, surely? If you teach people how to use "Python-with-UTF8-mode", won't they struggle when introduced to the real world where UTF8 mode isn't set? Won't they assume the default encoding for open() is UTF-8, and be confused when they are wrong? Yes, I know your job as an instructor is to omit confusing details, and UTF8 mode would help with that. I get that. But that's just one case. And anyway, would you not have to explain how to set UTF-8 mode for the training environment one way or another anyway? Sure, you may not have to explain how to set an environment variable. But you have to explain how to configure an ini file instead. Unless UTF-8 mode is the default, you have to explain how to configure the training environment one way or another - unless you provide a pre-packaged environment (in which case we're back to why not just set an env variable).
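Whichever way the training environment ends up configured, code that passes `encoding` explicitly behaves the same with or without UTF-8 mode, which is one way to sidestep the question of what `open()` defaults to. A minimal sketch (the file name and sample text are arbitrary):

```python
import tempfile
from pathlib import Path

text = "naïve café, 日本語"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"
    # Passing encoding explicitly makes the result independent of the
    # locale, the code page, and whether UTF-8 mode is enabled.
    path.write_text(text, encoding="utf-8")
    round_tripped = path.read_text(encoding="utf-8")
    raw_bytes = path.read_bytes()

print(round_tripped == text)
```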
So conda could set UTF-8 mode with "conda env --new --utf8". No changes to core Python interpreter startup needed.
Sure, PYTHONPATH was just an example. Environment variables are how you configure Python in many ways. I'm asking why UTF-8 mode is so special it needs a different configuration mechanism than every other setting for Python.
It's really important to support configuration per environment these days. Ideally with any of the "environment" tools.
That's a completely different discussion, and as you stated it, doesn't just apply to UTF-8 mode. It should be a different thread. And my immediate answer would be that you can do this by changing the activation scripts. Yes, that means each environment tool needs to be updated individually, but that would be a reasonable start. If the feature proves important, it could later be migrated into a core feature. Paul
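As a rough illustration of the activation-script approach, the helper below appends a `PYTHONUTF8` assignment to an existing activate script. It is a sketch under assumptions: the helper name `add_utf8_mode` is hypothetical, it is not an existing venv, virtualenv, or conda feature, it does not restore the previous value on deactivate, and real activate scripts may need the line placed elsewhere.

```python
import tempfile
from pathlib import Path

# Line to append, keyed by activate script name. These are ordinary
# shell assignments, not an existing venv feature.
UTF8_LINES = {
    "activate": "export PYTHONUTF8=1",
    "activate.bat": 'set "PYTHONUTF8=1"',
}


def add_utf8_mode(script_path):
    """Append the PYTHONUTF8 assignment to one activate script, once."""
    path = Path(script_path)
    line = UTF8_LINES[path.name]
    content = path.read_text(encoding="utf-8")
    if line in content:
        return False  # already patched
    if content and not content.endswith("\n"):
        content += "\n"
    path.write_text(content + line + "\n", encoding="utf-8")
    return True


with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "activate.bat"
    demo.write_text("@echo off\n", encoding="utf-8")
    first_call = add_utf8_mode(demo)
    second_call = add_utf8_mode(demo)  # no-op: the line is already there
    patched = demo.read_text(encoding="utf-8")

print(first_call, second_call)
```

Note that this only helps when the environment is actually activated; running the environment's interpreter directly bypasses the script.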

On Wed, Feb 10, 2021 at 5:33 PM Paul Moore <p.f.moore@gmail.com> wrote:
I am not sure this idea works fine. Is the activate script always called when venv is used on Windows? When I use venv on Unix, I often just execute .venv/bin/some-script without activating the venv.
Students may need to learn about encoding at some point. But when they learn how to read and write files for the first time, they don't need to know what an encoding is. VSCode, notepad, and PyCharm use UTF-8 by default. Students don't need to learn how to use encodings other than UTF-8 until they really need to.
We can add an "Enable the UTF-8 mode" checkbox to the installer. And we can have an "Enable the UTF-8 mode" tool in the Start menu. So students don't need to edit the ini file manually. The question is: should we recommend enabling UTF-8 mode globally by setting an environment variable, or provide a per-site UTF-8 mode setting?
They may not want to promote UTF-8 mode until official Python promotes UTF-8 mode. So I think venv should support UTF-8 mode first.
Because it solves many real-world problems that many Windows users suffer from. -- Inada Naoki <songofacandy@gmail.com>

On Wed, 10 Feb 2021 at 11:01, Inada Naoki <songofacandy@gmail.com> wrote:
So in your training course, tell users to activate the environment. Experienced users (like you) who can run scripts directly aren't the target of this change, are they? This is one of the frustrating points here, I'm not clear who the target is. When I say it wouldn't help me, I'm told I'm not the target. When I suggest an alternative, it apparently isn't useful because it wouldn't work for you...
Agreed.
If they only use ASCII files and a system codepage that is the same as ASCII for the first 127 characters, then it's irrelevant. If they read data from a legacy system, that is quite likely to be in the system codepage (most of the local files I use at work, for example, are not UTF-8). So I'd say that many students don't need to learn how to use *any* encoding until they need it. But I'm not a professional trainer, so my experience is limited.
Those options could set the environment variable. After all, that's what "Add Python to PATH" does, and people seem OK with that. No need for an ini file (that adds an extra file read to the startup time, as has already been mentioned as a downside).
What precisely do you mean by "per site"? Do you mean "per Python interpreter"? Do you view separate virtual environments as "sites"? Again, I don't understand who the target audience is here.
That's fair enough. Although I'd like to point out the parallel here - you're saying "environment tools might not want to make UTF8 the default until Python does". I'm saying "Python might not want to make UTF8 the default until the OS does". I'm not completely sure why your argument is stronger than mine :-)
Because it solves many real-world problems that many Windows users suffer from.
OK. My experience differs, but that's fine. But why wasn't this a consideration when UTF8 mode was first designed? At that point, an interpreter flag and an environment variable were considered sufficient. Why is that no longer true? Is it because the initial design of UTF8 mode ignored Windows? Why, if this is such a Windows-specific problem? Sigh. To be honest, I don't have the time (or the interest) to go back over all the history here. I think I'm just going to have to drop this discussion and wait to comment when a concrete proposal is put forward. PEP 597 is the only actual PEP on the table at the moment, everything else is just speculation, and I really can't keep up with the volume of discussion in the various threads. Paul

On Wed, Feb 10, 2021 at 8:39 PM Paul Moore <p.f.moore@gmail.com> wrote:
I am not sure here. It's not my training course. The target users are thousands of students. They may not use the command prompt at all.
I'm sorry about it. I didn't mean "it doesn't work for me". I just meant that I am not sure the activation script is always executed. I looked at vscode-python and found that it executes the activation script. I am not sure about PyCharm yet, but it works if it behaves like vscode-python. Another story is clicking .exe files in the Scripts/ directory. But that can be fixed by changing only the launcher exe. Adding per-venv UTF-8 mode is one attractive option. We can keep python.exe untouched.
But students don't know what ASCII is yet.
One installation is one site. One venv is one site. One conda env is one site. I don't know proper term for it, but I call it "site" because all of them have one "site-packages".
Oh, I don't propose changing the default encoding for now. Microsoft provides the "Beta: use unicode UTF-8 for worldwide language support in my PC" option. It affects all applications. It is similar to a global PYTHONUTF8 environment variable. Microsoft provides a UTF-8 code page (*) too. It affects only one application. It is similar to the per-site UTF-8 mode idea. (*) https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-cod... So what I am proposing is not more aggressive than Microsoft. Microsoft provides similar options already.
When I accepted the UTF-8 mode, the main target was server applications. Some Unix server OSes (especially "minimal" container images) only have the C locale. Since the target users are server-side programmers, a command-line argument and an environment variable are enough. I knew UTF-8 mode was interesting for Windows too. But Windows users were not the main target when I accepted it. After UTF-8 mode shipped, I noticed UTF-8 mode is very nice for Windows users who are learning Python.
Why, if this is such a Windows-specific problem?
Unix (macOS, iPadOS, Android, ChromeOS, and Linux) desktop users use a UTF-8 locale already. Students can learn Python in a "UTF-8 is default" environment. UTF-8 mode is used for server applications running in the C locale. Server-side programmers are familiar with the command line and environment variables. On the other hand, most students learning Python on Windows are not server-side programmers. They are not familiar with the command line and environment variables. And they currently suffer from UnicodeError, because the default encoding for text files is not UTF-8. That is the key difference.
I'm sorry about it. I have not chosen an actual implementation yet, so I cannot write a concrete PEP yet. -- Inada Naoki <songofacandy@gmail.com>
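The kind of failure students run into can be reproduced without Windows by naming a legacy code page explicitly. In this sketch `cp1252` is just one example of a non-UTF-8 code page, chosen for illustration; the actual default depends on the system's language settings.

```python
utf8_bytes = "é".encode("utf-8")

# Decoding UTF-8 bytes with a legacy Windows code page silently
# produces mojibake rather than the original text.
mojibake = utf8_bytes.decode("cp1252")
print(mojibake == "Ã©")

# Text that the legacy code page cannot represent fails outright.
try:
    "日本語".encode("cp1252")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
print(encode_failed)

# Strict ASCII decoding rejects the same UTF-8 bytes.
try:
    utf8_bytes.decode("ascii")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)
```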

On Wed, 10 Feb 2021 at 13:31, Inada Naoki <songofacandy@gmail.com> wrote:
I'm sorry about it. I have not chosen an actual implementation yet, so I cannot write a concrete PEP yet.
It's not a problem. I appreciate all of the time you're putting into considering the responses and keeping the discussion going. (And please don't think I was criticising the decision over UTF-8 mode, I genuinely didn't know the background, and "it was targeted at server environments" answers that question for me). I'm dropping out of the discussion because I can't afford the time to make sure I'm not forcing you to go over things that have already been discussed, and I don't want to waste your time by doing that. But I await the results of the discussion with interest :-) Paul

On Wed, Feb 10, 2021 at 12:33 AM Paul Moore <p.f.moore@gmail.com> wrote:
So get PYTHONUTF8 added to the environment activate script. That's a simple change to venv. And virtualenv, and conda
That's probably a good solution for venv and virtualenv -- essentially add it as another environment creation option. But conda, not so much. Conda can manage everything, not just Python. You can create a conda environment with no Python at all in it. So UTF-8 mode is not really a configuration of the environment itself, but rather a configuration of the Python package. It's also the philosophy of conda that it essentially installs stuff built in the usual way -- conda-build is pretty stupid actually. Granted, it does indeed have a lot of special case stuff for Python, so this probably could be done, but I'd rather see this kind of Python configuration done in a more friendly manner to third party package managers (anyone know anything about Chocolatey, for instance?) -CHB
NOTE: I use conda a lot, but DO NOT know all its ins and outs -- it may be possible to set environment variables with a package install -- given that the word "environment" shows up all over the place in conda docs, Google has not helped me answer that question.
a change to Python's already complex (and slower than many of us would like) startup process.
That IS a real issue, yes :-(
And anyway, would you not have to explain how to set UTF-8 mode for the training environment one way or another anyway?
That's why I'd like "one way to do it" on all platforms -- see other parts of this thread. -CHB -- Christopher Barker, PhD (Chris) Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

On 2/11/21, Christopher Barker <pythonchb@gmail.com> wrote:
Note that using a virtual environment does not require activation. A script can be deployed to run in a virtual environment by referring to the environment's executable in a shebang line, e.g.:
    #!path\to\venv\Scripts\python.exe
Or with a Windows shell link that runs
    path\to\venv\Scripts\python.exe path\to\script.py
Setting PYTHONUTF8 in the activate script does nothing to educate users about the default encoding in other contexts. The REPL shell could print a short message at startup that informs the user that Python is using UTF-8 mode, including a link to a web page that explains this in more detail.
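As a rough illustration of the startup-message idea above, here is a minimal sketch that builds such a notice from `sys.flags.utf8_mode`. The function name `utf8_mode_notice` and the wording of the message are hypothetical; nothing like this exists in the REPL today.

```python
import sys


def utf8_mode_notice(enabled=None):
    """Return a one-line startup notice, or '' when UTF-8 mode is off.

    `enabled` defaults to the running interpreter's own setting.
    """
    if enabled is None:
        enabled = bool(sys.flags.utf8_mode)
    if not enabled:
        return ""
    return ("Python is running in UTF-8 mode; see the 'Python UTF-8 Mode' "
            "section of https://docs.python.org/3/library/os.html")


if __name__ == "__main__":
    notice = utf8_mode_notice()
    if notice:
        print(notice)
```

Because the check reads the interpreter's own flag, it gives the same answer whether the mode came from an activate script, a user environment variable, or `-X utf8`.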

I looked at some Python courses for children. They won't use venvs. For example, they put a .py file in a specified directory, then run it in Minecraft or other graphical applications. Now I think we should promote putting PYTHONUTF8=1 in the user environment before thinking about complex per-site ideas. Since it's a user environment variable, it won't break legacy applications running in a parent account. Is anyone against adding "Enable UTF-8 mode" to the Start menu? -- Inada Naoki <songofacandy@gmail.com>

On 10.02.2021 08:15, Christopher Barker wrote:
That's fair, but please note that the idea is to have the Python installer take care of setting the env var globally, so no admin or user would need to bother with this.
It's really important to support configuration per environment these days. Ideally with any of the "environment" tools.
True, and those can easily override any globally set env vars. Note that you can set PYTHONUTF8=0 to disable an already globally set PYTHONUTF8=1. conda could manage this on a per env basis. venv could as well, via the .bat or .ps1 files to activate the environment on Windows. So technically, env vars are indeed an easy way to enable UTF-8 mode on a per installation and per venv basis, on all platforms Python supports. What's best: only tooling and installers would need to be adapted, not Python itself, since the UTF-8 mode and env var have already been around for quite some time. And for those who don't want to wait: setx PYTHONUTF8 1 does the trick in an admin command shell on Windows globally. set PYTHONUTF8 1 does the same locally in a user command shell or as part of the venv activate.bat file. It's not really all that hard :-) -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Feb 10 2021)
Python Projects, Coaching and Support ... https://www.egenix.com/ Python Product Development ... https://consulting.egenix.com/
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 https://www.egenix.com/company/contact/ https://www.malemburg.com/
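The override behaviour described in this message can be checked from Python itself. The sketch below starts a child interpreter with PYTHONUTF8 set to a given value and reports the child's `sys.flags.utf8_mode`; the helper name `child_utf8_mode` is made up for illustration.

```python
import os
import subprocess
import sys


def child_utf8_mode(value):
    """Run a child interpreter with PYTHONUTF8=value and report its mode."""
    env = dict(os.environ, PYTHONUTF8=value)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.flags.utf8_mode)"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    print("PYTHONUTF8=1 ->", child_utf8_mode("1"))
    print("PYTHONUTF8=0 ->", child_utf8_mode("0"))
```

Only the child's environment is changed, which is the same effect an activate script has on the processes started from that shell.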

On 2/10/21, M.-A. Lemburg <mal@egenix.com> wrote:
setx PYTHONUTF8 1
does the trick in an admin command shell on Windows globally.
The above command sets the variable only for the current user, which I'd recommend anyway. It does not require administrator access. To set a machine value, run `setx /M PYTHONUTF8 1`, which of course requires administrator access. Also, run `set PYTHONUTF8=1` in CMD or `$env:PYTHONUTF8=1` in PowerShell to set the variable in the current shell. Unrelated to UTF-8 mode and long-term plans to make UTF-8 the preferred encoding, what I want, from the perspective of writing applications and scripts (not libraries), is a -X option and/or environment variable to make locale._get_locale_encoding() behave like it does in POSIX. It should return the LC_CTYPE codeset of the current locale, not just the default locale. This would allow setlocale() in Windows to change the default for encoding=None, just as it does in POSIX. Technically it's not hard to implement in a way that's as reliable as nl_langinfo(CODESET) in POSIX. The code page of the current CRT locale is a public field. In Windows 10 the CRT has supported UTF-8 for 3 years -- regardless of the process active code page returned by GetACP(). Just call setlocale(LC_CTYPE, ".UTF-8") or setlocale(LC_CTYPE, (getdefaultlocale()[0], 'UTF-8')).

On 10.02.2021 23:10, Eryk Sun wrote:
Thanks for the correction.
That's what getlocale(LC_CTYPE) is intended for, unless I'm missing something. getdefaultlocale(), which uses _locale._getdefaultlocale() on Windows, is meant to determine the locale settings that setlocale(locale.LC_ALL, '') would set for the current process, without actually doing this. The reason we have this API is that setlocale() is not thread-safe and could therefore cause problems in other threads when simply calling setlocale(locale.LC_ALL, '') and then resetting it again if needed.
I think the main problem here is that open() doesn't use locale.getlocale()[1] as default for the encoding parameter, but instead locale.getpreferredencoding(False). The latter doesn't change when you adjust the locale for the current process on Windows:
On Linux, locale.getpreferredencoding(False) does return changes made using setlocale(). -- Marc-Andre Lemburg (Feb 11 2021)
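The difference described in this message can be observed with a small script. This is only a sketch: the helper name `encodings_after_setlocale` is hypothetical, the locale names tried are examples that may not exist on a given system, and the results depend on the platform and Python version, so no output is shown.

```python
import locale


def encodings_after_setlocale(name):
    """Switch LC_CTYPE to `name` and report what each API sees afterwards.

    Returns None if the locale is not available on this system.
    """
    previous = locale.setlocale(locale.LC_CTYPE)  # query only, no change
    try:
        locale.setlocale(locale.LC_CTYPE, name)
    except locale.Error:
        return None
    try:
        try:
            from_getlocale = locale.getlocale(locale.LC_CTYPE)[1]
        except ValueError:
            # getlocale() can fail to parse some locale names
            from_getlocale = None
        return {
            "getlocale": from_getlocale,
            "getpreferredencoding": locale.getpreferredencoding(False),
        }
    finally:
        # Restore the original setting; setlocale() is process-wide.
        locale.setlocale(locale.LC_CTYPE, previous)


if __name__ == "__main__":
    for name in ("en_US.UTF-8", "el_GR", "C"):
        print(name, encodings_after_setlocale(name))
```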

On 2/11/21, M.-A. Lemburg <mal@egenix.com> wrote:
Currently, locale.getpreferredencoding(False) is implemented as locale._get_locale_encoding(). This ultimately calls _Py_GetLocaleEncoding(), defined in "Python/fileutils.c". TextIOWrapper() calls this C function to get the encoding to use when encoding=None is passed. In POSIX, _Py_GetLocaleEncoding() calls nl_langinfo(CODESET), which returns the current LC_CTYPE encoding, not the default LC_CTYPE encoding. For example, in Linux:

    >>> setlocale(LC_CTYPE, 'en_US.UTF-8')
    'en_US.UTF-8'
    >>> _get_locale_encoding()
    'UTF-8'
    >>> open('test.txt').encoding
    'UTF-8'
    >>> setlocale(LC_CTYPE, 'en_US.ISO-8859-1')
    'en_US.ISO-8859-1'
    >>> _get_locale_encoding()
    'ISO-8859-1'
    >>> open('test.txt').encoding
    'ISO-8859-1'

In Windows, _Py_GetLocaleEncoding() just uses GetACP(), which returns the process ANSI code page. This is based on the CRT's default locale set by setlocale(LC_CTYPE, ""), which combines the user's default locale with the process ANSI code page. I'm not overjoyed about this combination in the default locale, since it's potentially inconsistent (e.g. Korean user locale with Latin 1252 process code page), but that ship sailed a long time ago. I'm not arguing to change locale.getdefaultlocale(). The problem is that locale._get_locale_encoding() in Windows is not returning the current LC_CTYPE locale encoding, in contrast to how it behaves in POSIX. I'd like an environment variable and/or -X option to fix this flaw. If enabled, and if the C runtime supports UTF-8 locales (as it has for the past 3 years in Windows 10), and the application warrants it (e.g. many open calls across many modules), then convenient use of UTF-8 would be one setlocale() call away. It's not for packages. Frankly, I don't see why it's a problem for a package developer to use encoding='utf-8' for files that need to use UTF-8. Developing libraries that are designed to work in arbitrary applications on multiple platforms is tedious work. Having to explicitly pass encoding='utf-8' goes with the territory, and it's a minor annoyance in the grand scheme of things.
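For library code, the explicit form mentioned above looks like the following minimal sketch. The file name and sample text are arbitrary; the point is only that the result does not depend on the locale, the ANSI code page, or UTF-8 mode.

```python
import os
import tempfile

text = "naïve café, Ελληνικά, 日本語"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    # Explicit encoding on both the write and the read side.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, encoding="utf-8") as f:
        round_trip = f.read()
    # Read the raw bytes to confirm what was actually stored on disk.
    with open(path, "rb") as f:
        raw = f.read()

print(round_trip == text)
```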
That's what getlocale(LC_CTYPE) is intended for, unless I'm missing something.
getlocale() can't be relied on to parse the correct codeset from the locale name, and it can even raise ValueError (more likely in Windows, e.g. with the native locale name "en-US"). The codeset should be queried directly using an API call, such as nl_langinfo(CODESET) in POSIX. In Windows, the C runtime's POSIX locale implementation doesn't include nl_langinfo(). There's ___lc_codepage_func(), but it's documented as an internal function. A ucrt locale record, however, does expose the code page as a public field, as documented in the public header "corecrt.h". Here's a prototype using ctypes:

    import os
    import ctypes

    ucrt = ctypes.CDLL('ucrtbase', use_errno=True)

    class _crt_locale_data_public(ctypes.Structure):
        _fields_ = (('_locale_pctype', ctypes.POINTER(ctypes.c_ushort)),
                    ('_locale_mb_cur_max', ctypes.c_int),
                    ('_locale_lc_codepage', ctypes.c_uint))

    class _crt_locale_pointers(ctypes.Structure):
        _fields_ = (('locinfo', ctypes.POINTER(_crt_locale_data_public)),
                    ('mbcinfo', ctypes.c_void_p))

    ucrt._get_current_locale.restype = ctypes.POINTER(_crt_locale_pointers)

    CP_UTF8 = 65001

    def _get_locale_encoding():
        locale = ucrt._get_current_locale()
        if not locale:
            errno = ctypes.get_errno()
            raise OSError(errno, os.strerror(errno))
        try:
            codepage = locale[0].locinfo[0]._locale_lc_codepage
        finally:
            ucrt._free_locale(locale)
        if codepage == 0:
            return 'latin-1'  # "C" locale
        if codepage == CP_UTF8:
            return 'utf-8'
        return f'cp{codepage}'

Examples with Python 3.9 in Windows 10:

    >>> setlocale(LC_CTYPE, 'C')
    'C'
    >>> _get_locale_encoding()
    'latin-1'
    >>> setlocale(LC_CTYPE, 'en_US')
    'en_US'
    >>> _get_locale_encoding()
    'cp1252'
    >>> setlocale(LC_CTYPE, 'el_GR')
    'el_GR'
    >>> _get_locale_encoding()
    'cp1253'
    >>> setlocale(LC_CTYPE, 'en_US.utf-8')
    'en_US.utf-8'
    >>> _get_locale_encoding()
    'utf-8'

On 11.02.2021 13:49, Eryk Sun wrote:
All that seems to be new in Python 3.10. This is not what's happening in Python 3.9. The _get_locale_encoding() function doesn't even exist.
Why an env variable ? You could simply open up a ticket to get this fixed, since 3.10 is not released yet.
Here's what I get with Python 3.9 on Windows 10:
Note that _get_locale_encoding() is not available, so using getlocale() instead. The returned values for the encoding look mostly correct to me, except the one for the 'C' locale which should be 'ascii'. getpreferredencoding() doesn't honor those changes, though. It returns 'cp1252' for me, or 'UTF-8' when using UTF-8 mode. Now, if I explicitly set the locale, I'd expect this to be used by Python for I/O as well. This currently doesn't happen and that's confusing:
Anyway, UTF-8 mode is the way to go these days, esp. if you want to write applications which are portable across platforms and behave the same on all. -- Marc-Andre Lemburg (Feb 11 2021)

On 2/11/21, M.-A. Lemburg <mal@egenix.com> wrote:
In previous versions, locale.getpreferredencoding(False) is functionally the same. In 3.10, the latter is implemented in C via locale._get_locale_encoding().
Why an env variable ? You could simply open up a ticket to get this fixed, since 3.10 is not released yet.
I thought it would be best to let users/administrators opt in to POSIX behavior. But maybe it should require opting out.
Windows code pages 1252 and 1253 are not the same as ISO-8859-1 and ISO-8859-7. getlocale() is just looking up the encoding of "en_US" and "el_GR" from the mapping in the locale module. That kind of best-guess result isn't right for locale._get_locale_encoding().
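To make the code page difference concrete, here is a small sketch using Python's own codecs. Byte 0x80 is chosen as the sample because it is a printable character in the Windows code pages and a C1 control code in the corresponding ISO-8859 encodings.

```python
# Pairs of (Windows code page, ISO-8859 encoding it is often confused with).
pairs = [("cp1252", "iso8859_1"), ("cp1253", "iso8859_7")]
sample = b"\x80"

for windows_cp, iso in pairs:
    as_windows = sample.decode(windows_cp)
    as_iso = sample.decode(iso)
    # Print both decodings side by side; they are different characters.
    print(windows_cp, repr(as_windows), "|", iso, repr(as_iso))
```

A best-guess mapping from a locale name to an ISO-8859 encoding would therefore decode such bytes differently from the code page the CRT actually uses.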
The returned values for the encoding look mostly correct to me, except the one for the 'C' locale which should be 'ascii'.
The "C" locale in the Windows CRT uses Latin-1 for LC_CTYPE. This is implemented for mbstowcs() by casting from char to wchar_t. It's similar for wcstombs(), and limited to Unicode ordinals below 256. However, the "C" locale isn't consistently Latin-1 across other categories. IIRC, LC_TIME in the "C" locale uses the process ANSI code page for time-zone names, and mojibake is common.
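The cast-based conversion described above has the same effect as Python's 'latin-1' codec. The sketch below illustrates that with the codec only; it does not call the CRT, and the helper name `fits_latin1` is made up for this example.

```python
# Decoding as Latin-1 maps each byte to the code point with the same
# value, which mirrors a char -> wchar_t cast.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")
same_as_cast = decoded == "".join(chr(b) for b in all_bytes)


def fits_latin1(s):
    """Return True if every character in `s` has an ordinal below 256."""
    try:
        s.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True


print(same_as_cast, fits_latin1("ÿ"), fits_latin1("€"))
```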
Globally setting PYTHONUTF8 forces all scripts to use UTF-8 as the default for open(). I'd like to let scripts opt in to using UTF-8 as the default for open() by way of an explicit setlocale() call such as setlocale(LC_CTYPE, (getdefaultlocale()[0], "UTF-8")) or, Windows only, setlocale(LC_CTYPE, ".UTF-8"). In POSIX, Python already tries coercing the "C" and "POSIX" locales (usually ASCII) to use UTF-8.

On 2/9/21, Inada Naoki <songofacandy@gmail.com> wrote:
Users can simply create a shortcut that targets `cmd /k set PYTHONUTF8=1`. Optionally change the shortcut's "start in" directory to the desired working directory.
Command-line modification of the persistent environment is rarely required. Using setx.exe is okay for setting simple variables in CMD [1], such as `setx PYTHONUTF8 1`, combined with `set PYTHONUTF8=1` for the current shell. To do this in the GUI in Windows 10, click on the start button (or tap the WIN key) to show the start menu; type "environ"; and click on "Edit environment variables for your account". In the window that opens, click the "New" button; type "PYTHONUTF8" as the name and "1" (without quotes) as the value. Click the "OK" button on the dialog, and then click the "OK" button on the editor window. To test the value, assuming you have the py launcher installed, press WIN+R to open the run dialog. Type "py", and in the Python shell confirm that executing `import locale; locale.getpreferredencoding()` returns 'UTF-8'. --- [1] I would feel remiss in discussing "setx.exe" without warning about naively trying to modify PATH. For example, DO NOT execute a command like `setx.exe PATH "C:\Program Files\Python39;%PATH%"`. This is wrong because it sets the current PATH value, including the system part, as the user "Path" value, truncated to 1024 characters, and without the original dependence on system variables and independent (REG_SZ) user variables. Properly modifying the persistent "Path" from CMD is difficult and requires careful use of both reg.exe and setx.exe. It's easier in PowerShell. It's far easier to use the GUI editor, which in Windows 10 even provides an exploded list view that makes it simple to add/remove directories and move them up and down in the list.
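The manual check at the end of these steps can also be scripted. This sketch runs a child interpreter and reports both `locale.getpreferredencoding(False)` and the default encoding that open() picks; the helper name `default_encodings` is hypothetical. Passing `-X utf8` forces UTF-8 mode for the child, and passing no options shows whatever the current environment gives.

```python
import subprocess
import sys

# Code run in the child: open a temporary file without an explicit encoding.
CHECK = (
    "import locale, os, tempfile\n"
    "fd, path = tempfile.mkstemp()\n"
    "os.close(fd)\n"
    "with open(path) as f:\n"
    "    print(locale.getpreferredencoding(False), f.encoding)\n"
    "os.remove(path)\n"
)


def default_encodings(*interpreter_options):
    """Return [preferred encoding, open() default] seen by a child Python."""
    result = subprocess.run(
        [sys.executable, *interpreter_options, "-c", CHECK],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()


if __name__ == "__main__":
    print("current environment:", default_encodings())
    print("with -X utf8:       ", default_encodings("-X", "utf8"))
```

The spelling of the reported name (for example 'UTF-8' versus 'utf-8') varies between Python versions, so compare case-insensitively.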

On Tue, Feb 9, 2021 at 12:28 AM Inada Naoki <songofacandy@gmail.com> wrote:
yes and no -- if we don't anticipate supporting Unix, then we may well come up with a solution that won't work well there. And if we do have a solution that will work well there, why not turn it on? But anyway, you are quite right that it's a very narrow use case, so if you think it'll really hold things up then we can abandon that.
Indeed, that is kind of the point of Docker :-) But the same issue could (though less likely) come up for other *nix deployments -- I'd still like to have one place to specify what my Python application needs to run. Not a huge deal, but would be good. Without a more concrete idea, such rough discussion will lead this thread into a maze.
Well, I've been trying to get help with a more concrete idea in this thread. That's why I've specified what I think are the requirements -- we can't have a solution without agreeing on the requirements first. I'm still not sure if the requirement to make it easily installable into an environment without an extra step hasn't been discussed because it's technically impossible / difficult, or if no one else thinks it's worth doing at all. I guess either way, it's time to abandon the idea.
Note that UTF-8 mode must be enabled before any path config on Unix. So it is almost impossible to enable UTF-8 mode using tools like pip.
That is the challenge, yes. But can pip put files outside of site-packages? I suspect not.
That is my idea -- at least for conda. But would it cause any more conflict than installing a package of a particular version? But this might be a case for using the pyvenv.cfg file -- that IS intended to be manipulated by the environment tool. Though yes, having it looked for outside of the dir where python lives is not good. Have you arrived at a concrete proposal at this point? -CHB

On Sun, Feb 7, 2021 at 3:58 PM Inada Naoki <songofacandy@gmail.com> wrote:
Chiefly, I don't want to overload "pyvenv.cfg" with new behavior that's unrelated to virtual environments.
This is my point -- this is NOT unrelated to virtual environments -- UTF-8 mode, and other configuration parameters are very much part of the (generic term) environment. The whole point here is to be able to set a configuration on a virtual environment. And I think that the venv tool should probably grow a feature to turn it on (or off). So my take is that we have pyvenv.cfg already, so why not use it for all the configuration one might want for a particular "environment".
I also dislike the way this file is found.
...
Indeed, that is unfortunate. And may well make this impossible -- I agree that a general configuration file shouldn't be found there. Oh well -- more config files it is! OK, then, how about just same to python.exe?
In this case, we need to put python.ini in Scripts directory for venvs. It seems a bit odd, but much simpler than looking in the parent directory.
I think that would work.
The issue (to me anyway) is not where is it, but rather the whole idea of putting it outside python, and in the user's space at all. Should we support it in Unix? I don't think so.
Command-line and environment variables are easy to use on Unix.
maybe, but we have many of the same issues -- we want the configuration tied to the environment, not to the user and all environments. And I'd rather have things done the same way on all platforms, rather than the native way on each platform, if I have to make a choice. That is, if there is a way to configure Python on Windows, I'd really like the SAME way to be available on all platforms.
And beginners should use a UTF-8 locale.
Beginners may not know how to do that / have a choice. This is a question I still don't know the answer to -- I think that most (all?) non Windows platforms currently supported use utf-8 -- but is that guaranteed? That is, might some platform come up that does need utf-8 mode? So why not have it available everywhere, even though it will be a no-op on most systems.
exactly -- I'm trying to imagine a case where a user doesn't have read access to the place the python.exe is, but DOES need to override this one thing. That is, a user can either control the python install they are using or they can't.
Config file in user profile is fragile. If all venvs start using profile directory, it become unmaintainable soon.
exactly -- if this is added, I will certainly not recommend anyone use it.
We can just recommend per-user install for new users.
or a virtual environment :-) Thanks Eryk for bringing clarity to these issues. -Chris B.

On Mon, Feb 8, 2021 at 3:58 PM Christopher Barker <pythonchb@gmail.com> wrote:
On Unix, there are N ways (e.g. .envrc). Is an (N+1)th way really worthwhile? At least, `python.cfg` (or `python.ini`) in bin/ directory is not good for Unix environment.
UTF-8 mode is provided for Unix because there are environments for *deployment*, like minimal Unix container images, that have only the C locale. For desktop use, I think all Unix environments suited for beginners use a UTF-8 locale by default. There is no guarantee. But if the default locale is not UTF-8, I don't think the environment is suited for beginners who are learning Python. Regards, -- Inada Naoki <songofacandy@gmail.com>

On Sun, Feb 7, 2021 at 11:19 PM Inada Naoki <songofacandy@gmail.com> wrote:
yes -- I much prefer a "this is how you do it for Python" than a bunch of platform specific details. And is there a good way to do it for environments (of various sorts) ? At least, `python.cfg` (or `python.ini`) in bin/ directory is not good
for Unix environment.
hmm -- that is true (though is it THAT bad?!?), though it would be fine for virtual environments. And as has been mentioned many times in this thread, this is generally not a great configuration to set globally anyway.
That's true, but not in Python's control. But this is not just newbies -- see above, deployment and test (CI) environments might need it too. Which is another good reason that having it be something that can be "turned on" by a virtual environment / requirements file would be very helpful. -Chris B.

On Tue, Feb 9, 2021 at 2:28 AM Christopher Barker <pythonchb@gmail.com> wrote:
Unlike Windows, environment variables work very fine for such use cases. On Unix, direnv, dotenv, and maybe more tools are there. It is not only for Python, but for projects.
Which is another good reason that having it be something that can be "turned on" by a virtual environment / requirements file would be very helpful.
There are direnv and dotenv. -- Inada Naoki <songofacandy@gmail.com>

On Mon, Feb 8, 2021 at 6:11 PM Inada Naoki <songofacandy@gmail.com> wrote:
Unlike Windows, environment variables work very fine for such use cases.
Windows has environment variables, doesn't it?
On Unix, direnv, dotenv, and maybe more tools are there.
I've been around Python for decades, and have never heard of these. Is this dotenv? https://pypi.org/project/python-dotenv/ From the looks of it, it works on Windows too. Or it's dangerously mis-documented, which is kind of my point. We're talking about this because people that do their work on *nix systems deliver code that does not work correctly on Windows. I think it's MUCH better to have ONE way to do something that works, for Python, on all platforms. That way people that only know one platform can still write and document code that can work on all platforms.
There are direnv and dotenv.
It looks to me like dotenv would have to be run after Python startup -- so wouldn't help here. direnv looks nifty, but again, not Python, and I can't quite see how it would help here, it seems to be about the current working directory. You and Eryk certainly know the implementation details more than I, so I'll step back and talk about what I'd like to see:
1) Something that can be easily set up to be "environment" specific, where an environment can be a virtualenv, a venv, a pipenv (are they different??), a conda environment, or, hopefully whatever new environment system comes along.
2) Something that can be part of the standard environment creation step, not an extra step you need folks to do by hand. Ideally a package that could be put in a requirements file. That is, I could simply put "utf8_mode" in my requirements file(s) and anyone that installed those requirements into an environment would get it configured.
3) One way to do that that's the same on all platforms.
I *think* this is possible. -Chris B
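The point above about dotenv running after Python startup can be checked directly. In this sketch a child interpreter starts with UTF-8 mode explicitly off, sets PYTHONUTF8=1 in its own `os.environ`, and then reports `sys.flags.utf8_mode`. The helper name `mode_after_late_setting` is made up for illustration.

```python
import os
import subprocess
import sys

# Code run in the child interpreter.
SNIPPET = (
    "import os, sys\n"
    "os.environ['PYTHONUTF8'] = '1'  # too late: startup has already run\n"
    "print(sys.flags.utf8_mode)\n"
)


def mode_after_late_setting():
    """Report the child's UTF-8 mode after it sets PYTHONUTF8 at runtime."""
    env = dict(os.environ, PYTHONUTF8="0")  # child starts with the mode off
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    print(mode_after_late_setting())
```

The mode is fixed during interpreter initialization, so anything that runs as ordinary Python code, including an installed package, can only affect processes started later.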

On Tue, Feb 9, 2021 at 3:37 PM Christopher Barker <pythonchb@gmail.com> wrote:
But it doesn't work well for Windows users. Unix and Windows have different use cases.
I think it's MUCH better to have ONE way to do something that works, for Python, on all platforms. That way people that only know one platform can still write and document code that can work on all platforms.
This thread is only about making UTF-8 mode accessible for Windows users, because UTF-8 mode helps many Windows users but it is not accessible enough for Windows users. Can you provide some realistic use cases where UTF-8 mode helps Unix users but it is not accessible? If not, please focus on helping Windows users. Time is a limited resource. I have no time to discuss about helping zero Unix users. -- Inada Naoki <songofacandy@gmail.com>

On Mon, Feb 8, 2021 at 10:49 PM Inada Naoki <songofacandy@gmail.com> wrote:
Well, there has been some talk of adding some of the other configuration options as well. But sure.
because UTF-8 mode helps many Windows users but it is not accessible enough for Windows users.
It's not just accessibility, but discoverability -- Windows users -- and even more so developers that don't generally use Windows -- often don't know utf-8 mode exists. That's why I'm pushing for a way for an application developer to be able to set up their project so that it will run under utf-8 mode everywhere. With only one way, and without having to add Windows specific code or documentation. As has been discussed, there are very few cases where it would make any difference under Linux (and zero for the Mac?) -- but why not have "one way to do it"?
Can you provide some realistic use cases where UTF-8 mode helps Unix users but it is not accessible?
It's not accessible to the application developer. It is to the deployer / devops person. These are often one and the same, but not always. My major project had exactly this problem -- the bare bones docker images used on the CI (and for deployment) were set up with an ASCII locale (or something like that) -- and our application failed. In the end we figured out how to configure the images for utf-8, but as it happens, I know Python, and don't know much Linux administration, and the linux sys admins didn't know Python much -- so it took a fair bit of back and forth to figure out. We use conda for CI and deployment -- if I had been able to put a "utf-mode" package in the conda requirements file, we wouldn't have had this issue, and our Windows users (yes we have those too) would also get their systems set up to "do the right thing" without their even knowing about it. Other folks use pipenv and the like -- it would be helpful to them if they could do the same thing with their requirements files as well.
If not, please focus on helping Windows users.
Honestly, I'm trying to help Windows users here -- see above. Honestly, my Windows users are the biggest problem (they tend to be less tech savvy -- at least I had a linux sysadmin to work with, my Windows users usually are not sysadmins). And it's not Linux users so much anyway -- it's linux developers that want to support Windows users. Remember back in the day of Python2, where opening a text file and binary file was no different on *nix? There was no shortage of bugs that didn't turn up in code tested on *nix until a Windows user came around -- but at least it was an easy fix -- add a 'b' flag, and the code would work the same on all platforms. And in the end, if there is a single solution that can do the same thing in the same way on all platforms, isn't that less maintenance and documentation work?
Time is a limited resource. I have no time to discuss about helping zero Unix users.
By all means -- spend your time on what you think is important. You asked for others' opinions, I've given mine. If you don't agree, so be it. Thanks for your work on this -- anything you do will be an improvement. -Chris

On Tue, Feb 9, 2021 at 4:53 PM Christopher Barker <pythonchb@gmail.com> wrote:
It makes the problem too hard and complex. It means we cannot fix anything at all by Python 3.10. We can add Unix support later if it is really worth it. It is not a backward incompatible change.
When using docker, it's very easy to put an environment variable. You don't need to worry about "it will break existing legacy Python application in same container." You can just create one container for one application. So I don't think it is enough reason to add complexity. As I said before, the use case of UTF-8 mode is different between Unix and Windows.
We use conda for CI and deployment -- if I had been able to put a "utf-mode" package in the conda requirements file, we wouldn't have had this issue, and our Windows users (yes we have those too) would also get their systems set up to "do the right thing" without their even knowing about it.
Other folks use pipenv and the like -- it would be helpful to them if they could do the same thing with their requirements files as well.
Without a more concrete idea, such rough discussion will lead this thread into a maze. Note that UTF-8 mode must be enabled before any path config on Unix. So it is almost impossible to enable UTF-8 mode using tools like pip. If your idea is just putting `python.ini` (or `python.cfg`) in bin/ or Scripts/ directory from pip/conda package, I think it is just a hack, not a best practice. It will cause file conflict error very easily.

Here's a good blog post about setting env vars on Windows: https://www.dowdandassociates.com/blog/content/howto-set-an-environment-vari... It's not really much harder than on Unix platforms. The only catch is that Windows users will often not know about such env vars or how to use them, because on Windows you typically set up your configuration via the application and using the registry. Perhaps we could have both: an env var to enable UTF-8 mode and a registry key set by the installer. -- Marc-Andre Lemburg (Feb 09 2021)

Exactly -- it's not so much that Windows itself has different capabilities, but that Windows conventions are different. And Windows users are different -- let's face it, you still need a greater level of "sophistication" to use Linux. And the Mac has a more consistent configuration guaranteed -- at least for this case, but also in general.
Perhaps we could have both: an env var to enable UTF-8 mode and a registry key set by the installer.
There already is an environment variable. As for the registry key -- much of the point of this thread is around the idea that people should generally not set it for all Python use on that machine, or that user, but rather have it be specific to the environment -- so I don't know that we want it to be easier to set it globally for the user. The point I've been pushing is that there are various people "in control" of this setting:
- The sysadmin
- The user
- The application developer
(sometimes one or two of these roles is the same person, but not always) Clearly the sysadmin and user should have control over this setting -- so we may want to make it easier on users that may not be familiar with setting environment variables. But my focus is on the application developer: we currently have a way to specify what Python environment is needed to run an application: a requirements file. So I can specify to my users that in order to run this code, they need to install these requirements, and the code should work. What I would like is to be able to have utf-8 mode be part of that -- and not have to document a special extra step they need to take, and even more so, not have to document that special step only on Windows. It's not a huge deal, but I'd rather it be clean -- and the other nice bit is that eventually, if/when utf-8 becomes the default in a future python, this becomes a no-op and my users don't have to know anything has changed. In a way, what I'm looking for is a system-wide equivalent to a __future__ import -- maybe impossible, but it'd be nice. - Chris B

On Tue, Feb 9, 2021 at 7:42 PM M.-A. Lemburg <mal@egenix.com> wrote:
But it affects all Python installs. Can teachers recommend setting the PYTHONUTF8 environment variable for students?
I don't want to recommend env vars and registry for conda and portable Python users... -- Inada Naoki <songofacandy@gmail.com>

On Tue, 9 Feb 2021 at 17:32, Inada Naoki <songofacandy@gmail.com> wrote:
Why is that an issue? In the first instance, do the sorts of "beginner" we're discussing here have multiple python installs? Would they need per-interpreter configuration of UTF-8 mode? Honestly, I find it far harder to configure environment variables on Unix (I have to do it per *shell*, for a start). Windows users don't often set environment variables, because Windows-native applications often use other means to determine their configuration - but it's not because the user *can't* set environment variables, or because it's "too hard".
I'm not sure what you mean here. Why is this different from (say) PYTHONPATH? How would conda and portable python users configure PYTHONPATH? Why is UTF-8 mode any different? Paul

On Wed, Feb 10, 2021 at 6:02 AM Paul Moore <p.f.moore@gmail.com> wrote:
Hmm, I was afraid to break applications using existing Python in the system. But if no one cares about it, I'm ok with just adding something like "enable-utf8-mode.bat" / "disable-utf8-mode.bat".
How often is PYTHONPATH needed at all? I have seen many people break their environment by setting PYTHONPATH. I don't recommend using it at all. On the other hand, I want teachers to be able to recommend enabling UTF-8 mode for students. That is the difference between PYTHONUTF8 and PYTHONPATH. -- Inada Naoki <songofacandy@gmail.com>

On Tue, Feb 9, 2021 at 1:04 PM Paul Moore <p.f.moore@gmail.com> wrote:
Yes -- many, many tutorials, particularly about web frameworks, start with "make a new virtual environment". To the point that many of my students have thought that was a requirement to use, e.g. flask. Personally, I do not start out with environments with my beginning students -- they really only need one at the early stages. But other instructors do. Others have to work with a locked down system provided by their employer that might be an older version of Python, or need some particular configuration that they don't want to override. And all the examples given here of how to set environment variables and shortcuts, etc. on Windows are EXACTLY the kind of information I don't want to have to provide for my students :-( -- I'm teaching Python, not Windows administration.
I don't want to recommend env vars and registry for conda and portable Python users...
and a lot of newbies learning Python for data science are starting out with conda as well ...
It's not -- using PYTHONPATH is a "bad idea" I never recommend it to anyone. It was a nightmare when folks have Python 2 and 3 on the same machine, but now, in the age of environments, it's still a really bad idea. It's really important to support configuration per environment these days. Ideally with any of the "environment" tools. -CHB -- Christopher Barker, PhD (Chris) Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

On Wed, 10 Feb 2021 at 07:14, Christopher Barker <pythonchb@gmail.com> wrote:
So get PYTHONUTF8 added to the environment activate script. That's a simple change to venv. And virtualenv, and conda - yes, it needs to happen in multiple places, but that's still easier IMO than proposing a change to Python's already complex (and slower than many of us would like) startup process.
So teach Python as it actually is, surely? If you teach people how to use "Python-with-UTF8-mode", won't they struggle when introduced to the real world where UTF8 mode isn't set? Won't they assume the default encoding for open() is UTF-8, and be confused when they are wrong? Yes, I know your job as an instructor is to omit confusing details, and UTF8 mode would help with that. I get that. But that's just one case. And anyway, would you not have to explain how to set UTF-8 mode for the training environment one way or another anyway? Sure, you may not have to explain how to set an environment variable. But you have to explain how to configure an ini file instead. Unless UTF-8 mode is the default, you have to explain how to configure the training environment one way or another - unless you provide a pre-packaged environment (in which case we're back to why not just set an env variable).
So conda could set UTF-8 mode with "conda env --new --utf8". No changes to core Python interpreter startup needed.
Sure, PYTHONPATH was just an example. Environment variables are how you configure Python in many ways. I'm asking why UTF-8 mode is so special it needs a different configuration mechanism than every other setting for Python.
It's really important to support configuration per environment these days. Ideally with any of the "environment" tools.
That's a completely different discussion, and as you stated it, doesn't just apply to UTF-8 mode. It should be a different thread. And my immediate answer would be that you can do this by changing the activation scripts. Yes, that means each environment tool needs to be updated individually, but that would be a reasonable start. If the feature proves important, it could later be migrated into a core feature. Paul

On Wed, Feb 10, 2021 at 5:33 PM Paul Moore <p.f.moore@gmail.com> wrote:
I am not sure this idea works well. Is the activate script always called when venv is used on Windows? When I use venv on Unix, I often just execute .venv/bin/some-script without activating the venv.
Students may need to learn about encoding at some point. But when they learn "how to read/write a file" for the first time, they don't need to know what an encoding is. VSCode, notepad, PyCharm use UTF-8 by default. Students don't need to learn how to use encodings other than UTF-8 until they really need to.
We can add an "Enable the UTF-8 mode" checkbox to the installer. And we can have an "Enable the UTF-8 mode" tool in the start menu. So students don't need to edit the ini file manually. The question is: should we recommend enabling UTF-8 mode globally by setting an environment variable, or provide a per-site UTF-8 mode setting?
They may not want to promote UTF-8 mode until official Python promotes UTF-8 mode. So I think venv should support UTF-8 mode first.
Because it solves many real-world problems that many Windows users suffer from. -- Inada Naoki <songofacandy@gmail.com>

On Wed, 10 Feb 2021 at 11:01, Inada Naoki <songofacandy@gmail.com> wrote:
So in your training course, tell users to activate the environment. Experienced users (like you) who can run scripts directly aren't the target of this change, are they? This is one of the frustrating points here, I'm not clear who the target is. When I say it wouldn't help me, I'm told I'm not the target. When I suggest an alternative, it apparently isn't useful because it wouldn't work for you...
Agreed.
If they only use ASCII files and a system codepage that is the same as ASCII for the first 128 characters, then it's irrelevant. If they read data from a legacy system, that is quite likely to be in the system codepage (most of the local files I use at work, for example, are not UTF-8). So I'd say that many students don't need to learn how to use *any* encoding until they need it. But I'm not a professional trainer, so my experience is limited.
Those options could set the environment variable. After all, that's what "Add Python to PATH" does, and people seem OK with that. No need for an ini file (that adds an extra file read to the startup time, as has already been mentioned as a downside).
What precisely do you mean by "per site"? Do you mean "per Python interpreter"? Do you view separate virtual environments as "sites"? Again, I don't understand who the target audience is here.
That's fair enough. Although I'd like to point out the parallel here - you're saying "environment tools might not want to make UTF8 the default until Python does". I'm saying "Python might not want to make UTF8 the default until the OS does". I'm not completely sure why your argument is stronger than mine :-)
Because it solves many real-world problems that many Windows users suffer from.
OK. My experience differs, but that's fine. But why wasn't this a consideration when UTF8 mode was first designed? At that point, an interpreter flag and an environment variable were considered sufficient. Why is that no longer true? Is it because the initial design of UTF8 mode ignored Windows? Why, if this is such a Windows-specific problem? Sigh. To be honest, I don't have the time (or the interest) to go back over all the history here. I think I'm just going to have to drop this discussion and wait to comment when a concrete proposal is put forward. PEP 597 is the only actual PEP on the table at the moment, everything else is just speculation, and I really can't keep up with the volume of discussion in the various threads. Paul

On Wed, Feb 10, 2021 at 8:39 PM Paul Moore <p.f.moore@gmail.com> wrote:
I am not sure here. It's not my training course. The target users are thousands of students. They may not use the command prompt at all.
I'm sorry about it. I didn't mean "it doesn't work for me". I just meant that I am not sure the activation script is always executed. I looked at vscode-python and found that it executes the activation script. I am not sure about PyCharm yet, but it works if it works like vscode-python. Another story is clicking .exe files in the Scripts/ directory. But it can be fixed by changing only the launcher exe. Adding per-venv UTF-8 mode is one attractive option. We can keep python.exe untouched.
But students don't know what is ASCII yet.
One installation is one site. One venv is one site. One conda env is one site. I don't know proper term for it, but I call it "site" because all of them have one "site-packages".
Oh, I don't propose changing the default encoding for now. Microsoft provides the "Beta: use unicode UTF-8 for worldwide language support in my PC" option. It affects all applications. It is similar to the global PYTHONUTF8 environment variable. Microsoft provides a UTF-8 code page (*) too. It affects only one application. It is similar to the per-site UTF-8 mode idea. (*) https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-cod... So what I am proposing is not more aggressive than Microsoft. Microsoft provides similar options already.
When I accepted the UTF-8 mode, the main target was server applications. Some Unix server OSes (especially "minimal" container images) only have the C locale. Since the target users were server-side programmers, a command-line argument and an environment variable were enough. I knew UTF-8 mode was interesting for Windows too. But Windows users were not the main target when I accepted it. After UTF-8 mode shipped, I noticed that UTF-8 mode is very nice for Windows users who are learning Python.
Why, if this is such a Windows-specific problem?
Unix (macOS, iPadOS, Android, ChromeOS, and Linux) desktop users use a UTF-8 locale already. Students can learn Python in a "UTF-8 is default" environment. UTF-8 mode is used for server applications running in the C locale. Server-side programmers are familiar with the command line and environment variables. On the other hand, most students learning Python on Windows are not server-side programmers. They are not familiar with the command line and environment variables. And they currently suffer from UnicodeError, because the default encoding for text files is not UTF-8. That is the key difference.
I'm sorry about it. I have not chosen an actual implementation yet, so I cannot write a concrete PEP yet. -- Inada Naoki <songofacandy@gmail.com>

On Wed, 10 Feb 2021 at 13:31, Inada Naoki <songofacandy@gmail.com> wrote:
I'm sorry about it. I have not chosen an actual implementation yet, so I cannot write a concrete PEP yet.
It's not a problem. I appreciate all of the time you're putting into considering the responses and keeping the discussion going. (And please don't think I was criticising the decision over UTF-8 mode, I genuinely didn't know the background, and "it was targeted at server environments" answers that question for me). I'm dropping out of the discussion because I can't afford the time to make sure I'm not forcing you to go over things that have already been discussed, and I don't want to waste your time by doing that. But I await the results of the discussion with interest :-) Paul

On Wed, Feb 10, 2021 at 12:33 AM Paul Moore <p.f.moore@gmail.com> wrote:
So get PYTHONUTF8 added to the environment activate script. That's a simple change to venv. And virtualenv, and conda
That's probably a good solution for venv and virtualenv -- essentially add it as another environment creation option. But conda, not so much. Conda can manage everything, not just Python. You can create a conda environment with no Python at all in it. So UTF-8 mode is not really a configuration of the environment itself, but rather a configuration of the Python package. It's also the philosophy of conda that it essentially installs stuff built in the usual way -- conda-build is pretty stupid actually. Granted, it does indeed have a lot of special case stuff for Python, so this probably could be done, but I'd rather see this kind of Python configuration done in a more friendly manner to third party package managers (anyone know anything about Chocolatey, for instance?) -CHB
NOTE: I use conda a lot, but DO NOT know all its ins and outs -- it may be possible to set environment variables with a package install -- given that the word "environment" shows up all over the place in conda docs, Google has not helped me answer that question.
a change to Python's already complex (and slower than many of us would like) startup process.
That IS a real issue, yes :-(
And anyway, would you not have to explain how to set UTF-8 mode for the training environment one way or another anyway?
That's why I'd like "one way to do it" on all platforms -- see other parts of this thread. -CHB -- Christopher Barker, PhD (Chris) Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

On 2/11/21, Christopher Barker <pythonchb@gmail.com> wrote:
Note that using a virtual environment does not require activation. A script can be deployed to run in a virtual environment by referring to the environment's executable in a shebang line, e.g.:
    #!path\to\venv\Scripts\python.exe
Or with a Windows shell link that runs
    path\to\venv\Scripts\python.exe path\to\script.py
Setting PYTHONUTF8 in the activate script does nothing to educate users about the default encoding in other contexts. The REPL shell could print a short message at startup that informs the user that Python is using UTF-8 mode, including a link to a web page that explains this in more detail.
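A minimal sketch of the kind of startup notice suggested above. The function name and the wording of the message are hypothetical, and nothing like this exists in the interpreter today; the sketch only relies on sys.flags.utf8_mode, which is available since Python 3.7.

```python
import sys

def utf8_mode_notice():
    """Return a one-line notice if UTF-8 mode is enabled, else None.

    Hypothetical helper for illustration only; not part of Python.
    """
    if sys.flags.utf8_mode:
        return ("Note: Python is running in UTF-8 mode; open() defaults to "
                "UTF-8 instead of the locale encoding.")
    return None

notice = utf8_mode_notice()
if notice is not None:
    print(notice)
```

A real implementation would presumably live in the interpreter's interactive startup path; a user could approximate it today from a sitecustomize module.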

I looked at some Python courses for children. They won't use venvs. For example, they put a .py file in a specified directory, then run it in Minecraft or other graphical applications. Now I think we should promote putting PYTHONUTF8=1 in the user environment before thinking about complex per-site ideas. Since it's a user environment variable, it won't break legacy applications running in a parent account. Is anyone against adding "Enable UTF-8 mode" in the Start menu? -- Inada Naoki <songofacandy@gmail.com>

On 10.02.2021 08:15, Christopher Barker wrote:
That's fair, but please note that the idea is to have the Python installer take care of setting the env var globally, so no admin or user would need to bother with this.
It's really important to support configuration per environment these days. Ideally with any of the "environment" tools.
True, and those can easily override any globally set env vars. Note that you can set PYTHONUTF8=0 to disable an already globally set PYTHONUTF8=1. conda could manage this on a per env basis. venv could as well, via the .bat or .ps1 files to activate the environment on Windows. So technically, env vars are indeed an easy way to enable UTF-8 mode on a per installation and per venv basis, on all platforms Python supports. What's best: only tooling and installers would need to be adapted, not Python itself, since the UTF-8 mode and env var have already been around for quite some time.
And for those who don't want to wait:
    setx PYTHONUTF8 1
does the trick in an admin command shell on Windows globally.
    set PYTHONUTF8=1
does the same locally in a user command shell or as part of the venv activate.bat file. It's not really all that hard :-)
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Feb 10 2021)
Python Projects, Coaching and Support ... https://www.egenix.com/ Python Product Development ... https://consulting.egenix.com/
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 https://www.egenix.com/company/contact/ https://www.malemburg.com/
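As a rough sketch of the per-process override described above, the following launches child interpreters with PYTHONUTF8 set explicitly and reports whether UTF-8 mode is active in each. The helper name is made up for illustration; the only interpreter features assumed are the PYTHONUTF8 variable and sys.flags.utf8_mode.

```python
import os
import subprocess
import sys

def child_utf8_mode(value):
    """Run a child interpreter with PYTHONUTF8=value; return its utf8_mode flag."""
    # Copy the inherited environment and override only PYTHONUTF8, which is
    # what an activation script or a tool wrapper would do.
    env = dict(os.environ, PYTHONUTF8=value)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.flags.utf8_mode)"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())

print("PYTHONUTF8=1 ->", child_utf8_mode("1"))
print("PYTHONUTF8=0 ->", child_utf8_mode("0"))
```

Because the variable is read at interpreter startup, changing os.environ inside an already running process has no effect on that process; only newly started interpreters see the new value.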

On 2/10/21, M.-A. Lemburg <mal@egenix.com> wrote:
setx PYTHONUTF8 1
does the trick in an admin command shell on Windows globally.
The above command sets the variable only for the current user, which I'd recommend anyway. It does not require administrator access. To set a machine value, run `setx /M PYTHONUTF8 1`, which of course requires administrator access. Also, run `set PYTHONUTF8=1` in CMD or `$env:PYTHONUTF8=1` in PowerShell to set the variable in the current shell.
Unrelated to UTF-8 mode and long-term plans to make UTF-8 the preferred encoding, what I want, from the perspective of writing applications and scripts (not libraries), is a -X option and/or environment variable to make locale._get_locale_encoding() behave like it does in POSIX. It should return the LC_CTYPE codeset of the current locale, not just the default locale. This would allow setlocale() in Windows to change the default for encoding=None, just as it does in POSIX.
Technically it's not hard to implement in a way that's as reliable as nl_langinfo(CODESET) in POSIX. The code page of the current CRT locale is a public field. In Windows 10 the CRT has supported UTF-8 for 3 years -- regardless of the process active code page returned by GetACP(). Just call setlocale(LC_CTYPE, ".UTF-8") or setlocale(LC_CTYPE, (getdefaultlocale()[0], 'UTF-8')).

On 10.02.2021 23:10, Eryk Sun wrote:
Thanks for the correction.
That's what getlocale(LC_CTYPE) is intended for, unless I'm missing something. getdefaultlocale(), which uses _locale._getdefaultlocale() on Windows, is meant to determine the locale settings that setlocale(locale.LC_ALL, '') would set for the current process, without actually doing this. The reason we have this API is that setlocale() is not thread-safe and could therefore cause problems in other threads when simply calling setlocale(locale.LC_ALL, '') and then resetting it again if needed.
I think the main problem here is that open() doesn't use locale.getlocale()[1] as default for the encoding parameter, but instead locale.getpreferredencoding(False). The latter doesn't change when you adjust the locale for the current process on Windows:
On Linux, locale.getpreferredencoding(False) does return changes made using setlocale(). -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Feb 11 2021)

On 2/11/21, M.-A. Lemburg <mal@egenix.com> wrote:
Currently, locale.getpreferredencoding(False) is implemented as locale._get_locale_encoding(). This ultimately calls _Py_GetLocaleEncoding(), defined in "Python/fileutils.c". TextIOWrapper() calls this C function to get the encoding to use when encoding=None is passed.
In POSIX, _Py_GetLocaleEncoding() calls nl_langinfo(CODESET), which returns the current LC_CTYPE encoding, not the default LC_CTYPE encoding. For example, in Linux:
    >>> setlocale(LC_CTYPE, 'en_US.UTF-8')
    'en_US.UTF-8'
    >>> _get_locale_encoding()
    'UTF-8'
    >>> open('test.txt').encoding
    'UTF-8'
    >>> setlocale(LC_CTYPE, 'en_US.ISO-8859-1')
    'en_US.ISO-8859-1'
    >>> _get_locale_encoding()
    'ISO-8859-1'
    >>> open('test.txt').encoding
    'ISO-8859-1'
In Windows, _Py_GetLocaleEncoding() just uses GetACP(), which returns the process ANSI code page. This is based on the CRT's default locale set by setlocale(LC_CTYPE, ""), which combines the user's default locale with the process ANSI code page. I'm not overjoyed about this combination in the default locale, since it's potentially inconsistent (e.g. Korean user locale with Latin 1252 process code page), but that ship sailed a long time ago.
I'm not arguing to change locale.getdefaultlocale(). The problem is that locale._get_locale_encoding() in Windows is not returning the current LC_CTYPE locale encoding, in contrast to how it behaves in POSIX. I'd like an environment variable and/or -X option to fix this flaw. If enabled, and if the C runtime supports UTF-8 locales (as it has for the past 3 years in Windows 10), and the application warrants it (e.g. many open calls across many modules), then convenient use of UTF-8 would be one setlocale() call away.
It's not for packages. Frankly, I don't see why it's a problem for a package developer to use encoding='utf-8' for files that need to use UTF-8. Developing libraries that are designed to work in arbitrary applications on multiple platforms is tedious work. Having to explicitly pass encoding='utf-8' goes with the territory, and it's a minor annoyance in the grand scheme of things.
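To illustrate the point about library code, here is a small sketch, not taken from the thread, that compares the encoding open() picks when the argument is omitted with an explicit encoding='utf-8'. The file name and sample text are arbitrary.

```python
import locale
import os
import tempfile

text = "caf\u00e9 \u03b1\u03b2\u03b3"  # non-ASCII sample text

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")

    # Explicit encoding: the bytes on disk are the same on every platform.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, encoding="utf-8") as f:
        round_trip = f.read()

    # Omitted encoding: the default depends on the locale (or on UTF-8 mode).
    with open(path) as f:
        default_encoding = f.encoding

print("default for open():", default_encoding)
print("getpreferredencoding(False):", locale.getpreferredencoding(False))
```

On a Windows machine without UTF-8 mode the two printed values would typically name the ANSI code page, while the explicit round trip is unaffected.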
That's what getlocale(LC_CTYPE) is intended for, unless I'm missing something.
getlocale() can't be relied on to parse the correct codeset from the locale name, and it can even raise ValueError (more likely in Windows, e.g. with the native locale name "en-US"). The codeset should be queried directly using an API call, such as nl_langinfo(CODESET) in POSIX. In Windows, the C runtime's POSIX locale implementation doesn't include nl_langinfo(). There's ___lc_codepage_func(), but it's documented as an internal function. A ucrt locale record, however, does expose the code page as a public field, as documented in the public header "corecrt.h". Here's a prototype using ctypes:
    import os
    import ctypes

    ucrt = ctypes.CDLL('ucrtbase', use_errno=True)

    class _crt_locale_data_public(ctypes.Structure):
        _fields_ = (('_locale_pctype', ctypes.POINTER(ctypes.c_ushort)),
                    ('_locale_mb_cur_max', ctypes.c_int),
                    ('_locale_lc_codepage', ctypes.c_uint))

    class _crt_locale_pointers(ctypes.Structure):
        _fields_ = (('locinfo', ctypes.POINTER(_crt_locale_data_public)),
                    ('mbcinfo', ctypes.c_void_p))

    ucrt._get_current_locale.restype = ctypes.POINTER(_crt_locale_pointers)

    CP_UTF8 = 65001

    def _get_locale_encoding():
        locale = ucrt._get_current_locale()
        if not locale:
            errno = ctypes.get_errno()
            raise OSError(errno, os.strerror(errno))
        try:
            codepage = locale[0].locinfo[0]._locale_lc_codepage
        finally:
            ucrt._free_locale(locale)
        if codepage == 0:
            return 'latin-1' # "C" locale
        if codepage == CP_UTF8:
            return 'utf-8'
        return f'cp{codepage}'
Examples with Python 3.9 in Windows 10:
    >>> setlocale(LC_CTYPE, 'C')
    'C'
    >>> _get_locale_encoding()
    'latin-1'
    >>> setlocale(LC_CTYPE, 'en_US')
    'en_US'
    >>> _get_locale_encoding()
    'cp1252'
    >>> setlocale(LC_CTYPE, 'el_GR')
    'el_GR'
    >>> _get_locale_encoding()
    'cp1253'
    >>> setlocale(LC_CTYPE, 'en_US.utf-8')
    'en_US.utf-8'
    >>> _get_locale_encoding()
    'utf-8'
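The final mapping step of the prototype can be exercised on any platform. The sketch below factors it into a separate helper (the name codepage_to_encoding is hypothetical) and checks that each resulting name is a codec Python knows; it does not call into the C runtime.

```python
import codecs

CP_UTF8 = 65001

def codepage_to_encoding(codepage):
    """Map a CRT code page number to a Python codec name.

    Mirrors the last three branches of the ctypes prototype; the helper
    itself is an illustration, not part of the locale module.
    """
    if codepage == 0:
        return "latin-1"  # "C" locale
    if codepage == CP_UTF8:
        return "utf-8"
    return f"cp{codepage}"

for cp in (0, 1252, 1253, CP_UTF8):
    name = codepage_to_encoding(cp)
    # codecs.lookup() raises LookupError if the codec name is unknown.
    print(cp, name, codecs.lookup(name).name)
```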

On 11.02.2021 13:49, Eryk Sun wrote:
All that seems to be new in Python 3.10. This is not what's happening in Python 3.9. The _get_locale_encoding() function doesn't even exist.
Why an env variable ? You could simply open up a ticket to get this fixed, since 3.10 is not released yet.
Here's what I get with Python 3.9 on Windows 10:
Note that _get_locale_encoding() is not available, so using getlocale() instead. The returned values for the encoding look mostly correct to me, except the one for the 'C' locale which should be 'ascii'. getpreferredencoding() doesn't honor those changes, though. It returns 'cp1252' for me, or 'UTF-8' when using UTF-8 mode. Now, if I explicitly set the locale, I'd expect this to be used by Python for I/O as well. This currently doesn't happen and that's confusing:
Anyway, UTF-8 mode is the way to go these days, esp. if you want to write applications which are portable across platforms and behave the same on all. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Feb 11 2021)

On 2/11/21, M.-A. Lemburg <mal@egenix.com> wrote:
In previous versions, locale.getpreferredencoding(False) is functionally the same. In 3.10, the latter is implemented in C via locale._get_locale_encoding().
Why an env variable ? You could simply open up a ticket to get this fixed, since 3.10 is not released yet.
I thought it would be best to let users/administrators opt in to POSIX behavior. But maybe it should require opting out.
Windows code pages 1252 and 1253 are not the same as ISO-8859-1 and ISO-8859-7. getlocale() is just looking up the encoding of "en_US" and "el_GR" from the mapping in the locale module. That kind of best-guess result isn't right for locale._get_locale_encoding().
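The difference between the Windows code pages and the ISO-8859 encodings is easy to check with Python's own codecs. This is a small sketch, with a made-up helper name, that lists the byte values each pair of codecs decodes differently.

```python
def differing_bytes(enc_a, enc_b):
    """Return the byte values that two single-byte codecs decode differently."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        # errors="replace" maps undefined bytes to U+FFFD instead of raising.
        a = raw.decode(enc_a, errors="replace")
        b = raw.decode(enc_b, errors="replace")
        if a != b:
            diffs.append(value)
    return diffs

cp1252_vs_latin1 = differing_bytes("cp1252", "latin-1")
cp1253_vs_greek = differing_bytes("cp1253", "iso8859-7")

print("cp1252 vs latin-1:", [hex(v) for v in cp1252_vs_latin1])
print("cp1253 vs iso8859-7:", [hex(v) for v in cp1253_vs_greek])
# Byte 0x80 is the euro sign in cp1252 but a C1 control character in Latin-1.
print(ascii(b"\x80".decode("cp1252")), ascii(b"\x80".decode("latin-1")))
```

Reading a cp1252 file with the Latin-1 codec therefore succeeds silently but yields control characters where the file holds typographic quotes or the euro sign, which is why a best-guess mapping from the locale name is not a safe substitute for querying the code page.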
The returned values for the encoding look mostly correct to me, except the one for the 'C' locale which should be 'ascii'.
The "C" locale in the Windows CRT uses Latin-1 for LC_CTYPE. This is implemented for mbstowcs() by casting from char to wchar_t. It's similar for wcstombs(), and limited to Unicode ordinals below 256. However, the "C" locale isn't consistently Latin-1 across other categories. IIRC, LC_TIME in the "C" locale uses the process ANSI code page for time-zone names, and mojibake is common.
Globally setting PYTHONUTF8 forces all scripts to use UTF-8 as the default for open(). I'd like to let scripts opt in to using UTF-8 as the default for open() by way of an explicit setlocale() call such as setlocale(LC_CTYPE, (getdefaultlocale()[0], "UTF-8")) or, Windows only, setlocale(LC_CTYPE, ".UTF-8"). In POSIX, Python already tries coercing the "C" and "POSIX" locales (usually ASCII) to use UTF-8.

On 2/9/21, Inada Naoki <songofacandy@gmail.com> wrote:
Users can simply create a shortcut that targets `cmd /k set PYTHONUTF8=1`. Optionally change the shortcut's "start in" directory to the desired working directory.
Command-line modification of the persistent environment is rarely required. Using setx.exe is okay for setting simple variables in CMD [1], such as `setx PYTHONUTF8 1`, combined with `set PYTHONUTF8=1` for the current shell.
To do this in the GUI in Windows 10, click on the start button (or tap the WIN key) to show the start menu; type "environ"; and click on "Edit environment variables for your account". In the window that opens, click the "New" button; type "PYTHONUTF8" as the name and "1" (without quotes) as the value. Click the "OK" button on the dialog, and then click the "OK" button on the editor window.
To test the value, assuming you have the py launcher installed, press WIN+R to open the run dialog. Type "py", and in the Python shell confirm that executing `import locale; locale.getpreferredencoding()` returns 'UTF-8'.
---
[1] I would feel remiss in discussing "setx.exe" without warning about naively trying to modify PATH. For example, DO NOT execute a command like `setx.exe PATH "C:\Program Files\Python39;%PATH%"`. This is wrong because it sets the current PATH value, including the system part, as the user "Path" value, truncated to 1024 characters, and without the original dependence on system variables and independent (REG_SZ) user variables. Properly modifying the persistent "Path" from CMD is difficult and requires careful use of both reg.exe and setx.exe. It's easier in PowerShell. It's far easier to use the GUI editor, which in Windows 10 even provides an exploded list view that makes it simple to add/remove directories and move them up and down in the list.

On Tue, Feb 9, 2021 at 12:28 AM Inada Naoki <songofacandy@gmail.com> wrote:
Yes and no -- if we don't anticipate supporting Unix, then we may well come up with a solution that won't work well there. And if we do have a solution that will work well there, why not turn it on? But anyway, you are quite right that it's a very narrow use case, so if you think it'll really hold things up then we can abandon that.
Indeed, that is kind of the point of Docker :-) But the same issue could (though less likely) come up for other *nix deployments -- I'd still like to have one place to specify what my Python application needs to run. Not a huge deal, but would be good.
Without a more concrete idea, such rough ideas lead this thread into a maze.
Well, I've been trying to get help with a more concrete idea in this thread. That's why I've specified what I think are the requirements -- we can't have a solution without agreeing on the requirements first. I'm still not sure if the requirement to make it easily installable into an environment without an extra step hasn't been discussed because it's technically impossible / difficult, or if no one else thinks it's worth doing at all. I guess either way, it's time to abandon the idea.
Note that UTF-8 mode must be enabled before any path config on Unix. So it is almost impossible to enable UTF-8 mode using tools like pip.
That is the challenge, yes. But can pip put files outside of site-packages? I suspect not.
That is my idea -- at least for conda. But would it cause any more conflict than installing a package of a particular version? But this might be a case for using the pyvenv.cfg file -- that IS intended to be manipulated by the environment tool. Though yes, having it looked for outside of the dir where python lives is not good. Have you arrived at a concrete proposal at this point? -CHB -- Christopher Barker, PhD (Chris) Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
participants (5): Christopher Barker, Eryk Sun, Inada Naoki, M.-A. Lemburg, Paul Moore