Python-Dev
Threads by month
- ----- 2024 -----
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2023 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2022 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2021 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2020 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2019 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2018 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2017 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2016 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2015 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2014 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2013 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2012 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2011 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2010 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2009 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2008 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2007 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2006 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2005 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2004 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2003 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2002 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2001 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2000 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 1999 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
April 2017
- 39 participants
- 33 discussions
I'm back, I've re-read the PEP, and I've re-read the long thread with "(no
subject)".
I think Georg Brandl nailed it:
"""
*I like the "sequence and dict flattening" part of the PEP, mostly because
itis consistent and should be easy to understand, but the comprehension
syntaxenhancements seem to be bad for readability and "comprehending" what
the codedoes.The call syntax part is a mixed bag on the one hand it is nice
to be consistent with the extended possibilities in literals (flattening),
but on the other hand there would be small but annoying inconsistencies
anyways (e.g. the duplicate kwarg case above).*
"""
Greg Ewing followed up explaining that the inconsistency between dict
flattening and call syntax is inherent in the pre-existing different rules
for dicts vs. keyword args: {'a':1, 'a':2} results in {'a':2}, while f(a=1,
a=2) is an error. (This form is a SyntaxError; the dynamic case f(a=1,
**{'a': 1}) is a TypeError.)
For me, allowing f(*a, *b) and f(**d, **e) and all the other combinations
for function calls proposed by the PEP is an easy +1 -- it's a
straightforward extension of the existing pattern, and anybody who knows
what f(x, *a) does will understand f(x, *a, y, *b). Guessing what f(**d,
**e) means shouldn't be hard either. Understanding the edge case for
duplicate keys with f(**d, **e) is a little harder, but the error messages
are pretty clear, and it is not a new edge case.
The sequence and dict flattening syntax proposals are also clean and
logical -- we already have *-unpacking on the receiving side, so allowing
*x in tuple expressions reads pretty naturally (and the similarity with *a
in argument lists certainly helps). From here, having [a, *x, b, *y] is
also natural, and then the extension to other displays is natural: {a, *x,
b, *y} and {a:1, **d, b:2, **e}. This, too, gets a +1 from me.
So that leaves comprehensions. IIRC, during the development of the patch we
realized that f(*x for x in xs) is sufficiently ambiguous that we decided
to disallow it -- note that f(x for x in xs) is already somewhat of a
special case because an argument can only be a "bare" generator expression
if it is the only argument. The same reasoning doesn't apply (in that form)
to list, set and dict comprehensions -- while f(x for x in xs) is identical
in meaning to f((x for x in xs)), [x for x in xs] is NOT the same as [(x
for x in xs)] (that's a list of one element, and the element is a generator
expression).
The basic premise of this part of the proposal is that if you have a few
iterables, the new proposal (without comprehensions) lets you create a list
or generator expression that iterates over all of them, essentially
flattening them:
>>> xs = [1, 2, 3]
>>> ys = ['abc', 'def']
>>> zs = [99]
>>> [*xs, *ys, *zs]
[1, 2, 3, 'abc', 'def', 99]
>>>
But now suppose you have a list of iterables:
>>> xss = [[1, 2, 3], ['abc', 'def'], [99]]
>>> [*xss[0], *xss[1], *xss[2]]
[1, 2, 3, 'abc', 'def', 99]
>>>
Wouldn't it be nice if you could write the latter using a comprehension?
>>> xss = [[1, 2, 3], ['abc', 'def'], [99]]
>>> [*xs for xs in xss]
[1, 2, 3, 'abc', 'def', 99]
>>>
This is somewhat seductive, and the following is even nicer: the *xs
position may be an expression, e.g.:
>>> xss = [[1, 2, 3], ['abc', 'def'], [99]]
>>> [*xs[:2] for xs in xss]
[1, 2, 'abc', 'def', 99]
>>>
On the other hand, I had to explore the possibilities here by experimenting
in the interpreter, and I discovered some odd edge cases (e.g. you can
parenthesize the starred expression, but that seems a syntactic accident).
All in all I am personally +0 on the comprehension part of the PEP, and I
like that it provides a way to "flatten" a sequence of sequences, but I
think very few people in the thread have supported this part. Therefore I
would like to ask Neil to update the PEP and the patch to take out the
comprehension part, so that the two "easy wins" can make it into Python 3.5
(basically, I am accepting two-thirds of the PEP :-). There is some time
yet until alpha 2.
I would also like code reviewers (Benjamin?) to start reviewing the patch
<http://bugs.python.org/issue2292>, taking into account that the
comprehension part needs to be removed.
--
--Guido van Rossum (python.org/~guido)
7
17
It has been a while since I posted a copy of PEP 1 to the mailing
lists and newsgroups. I've recently done some updating of a few
sections, so in the interest of gaining wider community participation
in the Python development process, I'm posting the latest revision of
PEP 1 here. A version of the PEP is always available on-line at
http://www.python.org/peps/pep-0001.html
Enjoy,
-Barry
-------------------- snip snip --------------------
PEP: 1
Title: PEP Purpose and Guidelines
Version: $Revision: 1.36 $
Last-Modified: $Date: 2002/07/29 18:34:59 $
Author: Barry A. Warsaw, Jeremy Hylton
Status: Active
Type: Informational
Created: 13-Jun-2000
Post-History: 21-Mar-2001, 29-Jul-2002
What is a PEP?
PEP stands for Python Enhancement Proposal. A PEP is a design
document providing information to the Python community, or
describing a new feature for Python. The PEP should provide a
concise technical specification of the feature and a rationale for
the feature.
We intend PEPs to be the primary mechanisms for proposing new
features, for collecting community input on an issue, and for
documenting the design decisions that have gone into Python. The
PEP author is responsible for building consensus within the
community and documenting dissenting opinions.
Because the PEPs are maintained as plain text files under CVS
control, their revision history is the historical record of the
feature proposal[1].
Kinds of PEPs
There are two kinds of PEPs. A standards track PEP describes a
new feature or implementation for Python. An informational PEP
describes a Python design issue, or provides general guidelines or
information to the Python community, but does not propose a new
feature. Informational PEPs do not necessarily represent a Python
community consensus or recommendation, so users and implementors
are free to ignore informational PEPs or follow their advice.
PEP Work Flow
The PEP editor, Barry Warsaw <peps(a)python.org>, assigns numbers
for each PEP and changes its status.
The PEP process begins with a new idea for Python. It is highly
recommended that a single PEP contain a single key proposal or new
idea. The more focussed the PEP, the more successfully it tends
to be. The PEP editor reserves the right to reject PEP proposals
if they appear too unfocussed or too broad. If in doubt, split
your PEP into several well-focussed ones.
Each PEP must have a champion -- someone who writes the PEP using
the style and format described below, shepherds the discussions in
the appropriate forums, and attempts to build community consensus
around the idea. The PEP champion (a.k.a. Author) should first
attempt to ascertain whether the idea is PEP-able. Small
enhancements or patches often don't need a PEP and can be injected
into the Python development work flow with a patch submission to
the SourceForge patch manager[2] or feature request tracker[3].
The PEP champion then emails the PEP editor <peps(a)python.org> with
a proposed title and a rough, but fleshed out, draft of the PEP.
This draft must be written in PEP style as described below.
If the PEP editor approves, he will assign the PEP a number, label
it as standards track or informational, give it status 'draft',
and create and check-in the initial draft of the PEP. The PEP
editor will not unreasonably deny a PEP. Reasons for denying PEP
status include duplication of effort, being technically unsound,
not providing proper motivation or addressing backwards
compatibility, or not in keeping with the Python philosophy. The
BDFL (Benevolent Dictator for Life, Guido van Rossum) can be
consulted during the approval phase, and is the final arbitrator
of the draft's PEP-ability.
If a pre-PEP is rejected, the author may elect to take the pre-PEP
to the comp.lang.python newsgroup (a.k.a. python-list(a)python.org
mailing list) to help flesh it out, gain feedback and consensus
from the community at large, and improve the PEP for
re-submission.
The author of the PEP is then responsible for posting the PEP to
the community forums, and marshaling community support for it. As
updates are necessary, the PEP author can check in new versions if
they have CVS commit permissions, or can email new PEP versions to
the PEP editor for committing.
Standards track PEPs consists of two parts, a design document and
a reference implementation. The PEP should be reviewed and
accepted before a reference implementation is begun, unless a
reference implementation will aid people in studying the PEP.
Standards Track PEPs must include an implementation - in the form
of code, patch, or URL to same - before it can be considered
Final.
PEP authors are responsible for collecting community feedback on a
PEP before submitting it for review. A PEP that has not been
discussed on python-list(a)python.org and/or python-dev(a)python.org
will not be accepted. However, wherever possible, long open-ended
discussions on public mailing lists should be avoided. Strategies
to keep the discussions efficient include, setting up a separate
SIG mailing list for the topic, having the PEP author accept
private comments in the early design phases, etc. PEP authors
should use their discretion here.
Once the authors have completed a PEP, they must inform the PEP
editor that it is ready for review. PEPs are reviewed by the BDFL
and his chosen consultants, who may accept or reject a PEP or send
it back to the author(s) for revision.
Once a PEP has been accepted, the reference implementation must be
completed. When the reference implementation is complete and
accepted by the BDFL, the status will be changed to `Final.'
A PEP can also be assigned status `Deferred.' The PEP author or
editor can assign the PEP this status when no progress is being
made on the PEP. Once a PEP is deferred, the PEP editor can
re-assign it to draft status.
A PEP can also be `Rejected'. Perhaps after all is said and done
it was not a good idea. It is still important to have a record of
this fact.
PEPs can also be replaced by a different PEP, rendering the
original obsolete. This is intended for Informational PEPs, where
version 2 of an API can replace version 1.
PEP work flow is as follows:
Draft -> Accepted -> Final -> Replaced
^
+----> Rejected
v
Deferred
Some informational PEPs may also have a status of `Active' if they
are never meant to be completed. E.g. PEP 1.
What belongs in a successful PEP?
Each PEP should have the following parts:
1. Preamble -- RFC822 style headers containing meta-data about the
PEP, including the PEP number, a short descriptive title
(limited to a maximum of 44 characters), the names, and
optionally the contact info for each author, etc.
2. Abstract -- a short (~200 word) description of the technical
issue being addressed.
3. Copyright/public domain -- Each PEP must either be explicitly
labelled as placed in the public domain (see this PEP as an
example) or licensed under the Open Publication License[4].
4. Specification -- The technical specification should describe
the syntax and semantics of any new language feature. The
specification should be detailed enough to allow competing,
interoperable implementations for any of the current Python
platforms (CPython, JPython, Python .NET).
5. Motivation -- The motivation is critical for PEPs that want to
change the Python language. It should clearly explain why the
existing language specification is inadequate to address the
problem that the PEP solves. PEP submissions without
sufficient motivation may be rejected outright.
6. Rationale -- The rationale fleshes out the specification by
describing what motivated the design and why particular design
decisions were made. It should describe alternate designs that
were considered and related work, e.g. how the feature is
supported in other languages.
The rationale should provide evidence of consensus within the
community and discuss important objections or concerns raised
during discussion.
7. Backwards Compatibility -- All PEPs that introduce backwards
incompatibilities must include a section describing these
incompatibilities and their severity. The PEP must explain how
the author proposes to deal with these incompatibilities. PEP
submissions without a sufficient backwards compatibility
treatise may be rejected outright.
8. Reference Implementation -- The reference implementation must
be completed before any PEP is given status 'Final,' but it
need not be completed before the PEP is accepted. It is better
to finish the specification and rationale first and reach
consensus on it before writing code.
The final implementation must include test code and
documentation appropriate for either the Python language
reference or the standard library reference.
PEP Template
PEPs are written in plain ASCII text, and should adhere to a
rigid style. There is a Python script that parses this style and
converts the plain text PEP to HTML for viewing on the web[5].
PEP 9 contains a boilerplate[7] template you can use to get
started writing your PEP.
Each PEP must begin with an RFC822 style header preamble. The
headers must appear in the following order. Headers marked with
`*' are optional and are described below. All other headers are
required.
PEP: <pep number>
Title: <pep title>
Version: <cvs version string>
Last-Modified: <cvs date string>
Author: <list of authors' real names and optionally, email addrs>
* Discussions-To: <email address>
Status: <Draft | Active | Accepted | Deferred | Final | Replaced>
Type: <Informational | Standards Track>
* Requires: <pep numbers>
Created: <date created on, in dd-mmm-yyyy format>
* Python-Version: <version number>
Post-History: <dates of postings to python-list and python-dev>
* Replaces: <pep number>
* Replaced-By: <pep number>
The Author: header lists the names and optionally, the email
addresses of all the authors/owners of the PEP. The format of the
author entry should be
address(a)dom.ain (Random J. User)
if the email address is included, and just
Random J. User
if the address is not given. If there are multiple authors, each
should be on a separate line following RFC 822 continuation line
conventions. Note that personal email addresses in PEPs will be
obscured as a defense against spam harvesters.
Standards track PEPs must have a Python-Version: header which
indicates the version of Python that the feature will be released
with. Informational PEPs do not need a Python-Version: header.
While a PEP is in private discussions (usually during the initial
Draft phase), a Discussions-To: header will indicate the mailing
list or URL where the PEP is being discussed. No Discussions-To:
header is necessary if the PEP is being discussed privately with
the author, or on the python-list or python-dev email mailing
lists. Note that email addresses in the Discussions-To: header
will not be obscured.
Created: records the date that the PEP was assigned a number,
while Post-History: is used to record the dates of when new
versions of the PEP are posted to python-list and/or python-dev.
Both headers should be in dd-mmm-yyyy format, e.g. 14-Aug-2001.
PEPs may have a Requires: header, indicating the PEP numbers that
this PEP depends on.
PEPs may also have a Replaced-By: header indicating that a PEP has
been rendered obsolete by a later document; the value is the
number of the PEP that replaces the current document. The newer
PEP must have a Replaces: header containing the number of the PEP
that it rendered obsolete.
PEP Formatting Requirements
PEP headings must begin in column zero and the initial letter of
each word must be capitalized as in book titles. Acronyms should
be in all capitals. The body of each section must be indented 4
spaces. Code samples inside body sections should be indented a
further 4 spaces, and other indentation can be used as required to
make the text readable. You must use two blank lines between the
last line of a section's body and the next section heading.
You must adhere to the Emacs convention of adding two spaces at
the end of every sentence. You should fill your paragraphs to
column 70, but under no circumstances should your lines extend
past column 79. If your code samples spill over column 79, you
should rewrite them.
Tab characters must never appear in the document at all. A PEP
should include the standard Emacs stanza included by example at
the bottom of this PEP.
A PEP must contain a Copyright section, and it is strongly
recommended to put the PEP in the public domain.
When referencing an external web page in the body of a PEP, you
should include the title of the page in the text, with a
footnote reference to the URL. Do not include the URL in the body
text of the PEP. E.g.
Refer to the Python Language web site [1] for more details.
...
[1] http://www.python.org
When referring to another PEP, include the PEP number in the body
text, such as "PEP 1". The title may optionally appear. Add a
footnote reference that includes the PEP's title and author. It
may optionally include the explicit URL on a separate line, but
only in the References section. Note that the pep2html.py script
will calculate URLs automatically, e.g.:
...
Refer to PEP 1 [7] for more information about PEP style
...
References
[7] PEP 1, PEP Purpose and Guidelines, Warsaw, Hylton
http://www.python.org/peps/pep-0001.html
If you decide to provide an explicit URL for a PEP, please use
this as the URL template:
http://www.python.org/peps/pep-xxxx.html
PEP numbers in URLs must be padded with zeros from the left, so as
to be exactly 4 characters wide, however PEP numbers in text are
never padded.
Reporting PEP Bugs, or Submitting PEP Updates
How you report a bug, or submit a PEP update depends on several
factors, such as the maturity of the PEP, the preferences of the
PEP author, and the nature of your comments. For the early draft
stages of the PEP, it's probably best to send your comments and
changes directly to the PEP author. For more mature, or finished
PEPs you may want to submit corrections to the SourceForge bug
manager[6] or better yet, the SourceForge patch manager[2] so that
your changes don't get lost. If the PEP author is a SF developer,
assign the bug/patch to him, otherwise assign it to the PEP
editor.
When in doubt about where to send your changes, please check first
with the PEP author and/or PEP editor.
PEP authors who are also SF committers, can update the PEPs
themselves by using "cvs commit" to commit their changes.
Remember to also push the formatted PEP text out to the web by
doing the following:
% python pep2html.py -i NUM
where NUM is the number of the PEP you want to push out. See
% python pep2html.py --help
for details.
Transferring PEP Ownership
It occasionally becomes necessary to transfer ownership of PEPs to
a new champion. In general, we'd like to retain the original
author as a co-author of the transferred PEP, but that's really up
to the original author. A good reason to transfer ownership is
because the original author no longer has the time or interest in
updating it or following through with the PEP process, or has
fallen off the face of the 'net (i.e. is unreachable or not
responding to email). A bad reason to transfer ownership is
because you don't agree with the direction of the PEP. We try to
build consensus around a PEP, but if that's not possible, you can
always submit a competing PEP.
If you are interested assuming ownership of a PEP, send a message
asking to take over, addressed to both the original author and the
PEP editor <peps(a)python.org>. If the original author doesn't
respond to email in a timely manner, the PEP editor will make a
unilateral decision (it's not like such decisions can be
reversed. :).
References and Footnotes
[1] This historical record is available by the normal CVS commands
for retrieving older revisions. For those without direct access
to the CVS tree, you can browse the current and past PEP revisions
via the SourceForge web site at
http://cvs.sourceforge.net/cgi-bin/cvsweb.cgi/python/nondist/peps/?cvsroot=…
[2] http://sourceforge.net/tracker/?group_id=5470&atid=305470
[3] http://sourceforge.net/tracker/?atid=355470&group_id=5470&func=browse
[4] http://www.opencontent.org/openpub/
[5] The script referred to here is pep2html.py, which lives in
the same directory in the CVS tree as the PEPs themselves.
Try "pep2html.py --help" for details.
The URL for viewing PEPs on the web is
http://www.python.org/peps/
[6] http://sourceforge.net/tracker/?group_id=5470&atid=305470
[7] PEP 9, Sample PEP Template
http://www.python.org/peps/pep-0009.html
Copyright
This document has been placed in the public domain.
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
End:
8
14
congrats on 3.5! Alas, windows 7 users are having problems installing it
by Laura Creighton 15 Sep '19
by Laura Creighton 15 Sep '19
15 Sep '19
webmaster has already heard from 4 people who cannot install it.
I sent them to the bug tracker or to python-list but they seem
not to have gone either place. Is there some guide I should be
sending them to, 'how to debug installation problems'?
Laura
5
6
Hello all,
Over in Ubuntu, we've gotten reports about some performance regressions in
Python 2.7 when moving from Trusty (14.04 LTS) to Xenial (16.04 LTS).
Trusty's version is based on 2.7.6 while Xenial's version is based on 2.7.12
with bits of .13 cherry picked.
We've not been able to identify any change in Python itself (or the
Debian/Ubuntu deltas) which could account for this, so the investigation has
led to various gcc compiler options and version differences. In particular
disabling LTO (link-time optimization) seems to have a positive impact, but
doesn't completely regain the loss.
Louis (Cc'd here) has done a ton of work to measure and analyze the problem,
but we've more or less hit a roadblock, so we're taking the issue public to
see if anybody on this mailing list has further ideas. A detailed analysis is
available in this Google doc:
https://docs.google.com/document/d/1zrV3OIRSo99fd2Ty4YdGk_scmTRDmVauBprKL8e…
The document should be public for comments and editing.
If you have any thoughts, or other lines of investigation you think are
worthwhile pursuing, please add your comments to the document.
Cheers,
-Barry
11
21
Python FTP Injections Allow for Firewall Bypass (oss-security advisory)
by nospam@curso.re 20 Jun '17
by nospam@curso.re 20 Jun '17
20 Jun '17
Hello,
I have just noticed that an FTP injection advisory has been made public
on the oss-security list.
The author says that he an exploit exists but it won't be published
until the code is patched
You may be already aware, but it would be good to understand what is the
position of the core developers about this.
The advisory is linked below (with some excerpts in this message):
http://blog.blindspotsecurity.com/2017/02/advisory-javapython-ftp-injection…
Protocol injection flaws like this have been an area of research of mine
for the past few couple of years and as it turns out, this FTP protocol
injection allows one to fool a victim's firewall into allowing TCP
connections from the Internet to the vulnerable host's system on any
"high" port (1024-65535). A nearly identical vulnerability exists in
Python's urllib2 and urllib libraries. In the case of Java, this attack
can be carried out against desktop users even if those desktop users do
not have the Java browser plugin enabled.
As of 2017-02-20, the vulnerabilities discussed here have not been patched
by the associated vendors, despite advance warning and ample time to do
so.
[...]
Python's built-in URL fetching library (urllib2 in Python 2 and urllib in
Python 3) is vulnerable to a nearly identical protocol stream injection,
but this injection appears to be limited to attacks via directory names
specified in the URL.
[...]
The Python security team was notified in January 2016. Information
provided included an outline of the possibility of FTP/firewall attacks.
Despite repeated follow-ups, there has been no apparent action on their
part.
Best regards,
-- Stefano
P.S.
I am posting from gmane, I hope that this is OK.
9
9
16 Jun '17
Hi
I would like to change struct.Struct.format type from bytes to str. I
don't expect that anyone uses this attribute, and struct.Struct()
constructor accepts both bytes and str.
http://bugs.python.org/issue21071
It's just to be convenient: more functions accept str than bytes in
Python 3. Example: print() (python3 -bb raises an exceptions if you
pass bytes to print).
Is anything opposed to breaking the backward compatibility?
Victor
3
3
Hi.
Here's PEP 545, ready to be reviewed!
This is the follow-up to the "PEP: Python Documentation Translations" thread on python-ideas [1]_,
itself a follow-up of the "Translated Python documentation" thread on python-dev [2]_.
This PEP describes the steps to make existing and future translations of the Python documentation official and accessible on docs.python.org.
You can read it rendered here https://www.python.org/dev/peps/pep-0545/ or inline here, I'll happily get your feedback.
========================================
PEP: 545
Title: Python Documentation Translations
Version: $Revision$
Last-Modified: $Date$
Author: Victor Stinner <victor.stinner(a)gmail.com>,
Inada Naoki <songofacandy(a)gmail.com>,
Julien Palard <julien(a)palard.fr>
Status: Draft
Type: Process
Content-Type: text/x-rst
Created: 04-Mar-2017
Abstract
========
The intent of this PEP is to make existing translations of the Python
Documentation more accessible and discoverable. By doing so, we hope
to attract and motivate new translators and new translations.
Translated documentation will be hosted on python.org. Examples of
two active translation teams:
* http://docs.python.org/fr/: French
* http://docs.python.org/ja/: Japanese
http://docs.python.org/en/ will redirect to http://docs.python.org/.
Sources of translated documentation will be hosted in the Python
organization on GitHub: https://github.com/python/. Contributors will
have to sign the Python Contributor Agreement (CLA) and the license
will be the PSF License.
Motivation
==========
On the French ``#python-fr`` IRC channel on freenode, it's not rare to
meet people who don't speak English and so are unable to read the
Python official documentation. Python wants to be widely available
to all users in any language: this is also why Python 3 supports
any non-ASCII identifiers:
https://www.python.org/dev/peps/pep-3131/#rationale
There are a least 3 groups of people who are translating the Python
documentation to their native language (French [16]_ [17]_ [18]_,
Japanese [19]_ [20]_, Spanish [21]_) even though their translations
are not visible on d.p.o. Other, less visible and less organized
groups, are also translating the documentation, we've heard of Russian
[26]_, Chinese and Korean. Others we haven't found yet might also
exist. This PEP defines rules describing how to move translations on
docs.python.org so they can easily be found by developers, newcomers
and potential translators.
The Japanese team has (as of March 2017) translated ~80% of the
documentation, the French team ~20%. French translation went from 6%
to 23% in 2016 [13]_ with 7 contributors [14]_, proving a translation
team can be faster than the rate the documentation mutates.
Quoting Xiang Zhang about Chinese translations:
I have seen several groups trying to translate part of our official
doc. But their efforts are disperse and quickly become lost because
they are not organized to work towards a single common result and
their results are hold anywhere on the Web and hard to find. An
official one could help ease the pain.
Rationale
=========
Translation
-----------
Issue tracker
'''''''''''''
Considering that issues opened about translations may be written in
the translation language, which can be considered noise but at least
is inconsistent, issues should be placed outside `bugs.python.org
<https://bugs.python.org/>`_ (b.p.o).
As all translation must have their own github project (see `Repository
for Po Files`_), they must use the associated github issue tracker.
Considering the noise induced by translation issues redacted in any
languages which may beyond every warnings land in b.p.o, triage will
have to be done. Considering that translations already exist and are
not actually a source of noise in b.p.o, an unmanageable amount of
work is not to be expected. Considering that Xiang Zhang and Victor
Stinner are already triaging, and Julien Palard is willing to help on
this task, noise on b.p.o is not to be expected.
Also, language team coordinators (see `Language Team`_) should help
with triaging b.p.o by properly indicating, in the language of the
issue author if required, the right issue tracker.
Branches
''''''''
Translation teams should focus on last stable versions, and use tools
(scripts, translation memory, …) to automatically translate what is
done in one branch to other branches.
.. note::
Translation memories are a kind of database of previously translated
paragraphs, even removed ones. See also `Sphinx Internationalization
<http://www.sphinx-doc.org/en/stable/intl.html>`_.
The three currently stable branches that will be translated are [12]_:
2.7, 3.5, and 3.6. The scripts to build the documentation of older
branches needs to be modified to support translation [12]_, whereas
these branches now only accept security-only fixes.
The development branch (master) should have a lower translation priority
than stable branches. But docsbuild-scripts should build it anyway so
it is possible for a team to work on it to be ready for the next
release.
Hosting
-------
Domain Name, Content negotiation and URL
''''''''''''''''''''''''''''''''''''''''
Different translations can be identified by changing one of the
following: Country Code Top Level Domain (CCTLD),
path segment, subdomain or content negotiation.
Buying a CCTLD for each translations is expensive, time-consuming, and
sometimes almost impossible when already registered, this solution
should be avoided.
Using subdomains like "es.docs.python.org" or "docs.es.python.org" is
possible but confusing ("is it `es.docs.python.org` or
`docs.es.python.org`?"). Hyphens in subdomains like
`pt-br.doc.python.org` is uncommon and SEOMoz [23]_ correlated the
presence of hyphens as a negative factor. Usage of underscores in
subdomain is prohibited by the RFC1123 [24]_, section 2.1. Finally,
using subdomains means creating TLS certificates for each
language. This not only requires more maintenance but will also cause
issues in language switcher if, as for version switcher, we want a
preflight to check if the translation exists in the given version:
preflight will probably be blocked by same-origin-policy. Wildcard
TLS certificates are very expensive.
Using content negotiation (HTTP headers ``Accept-Language`` in the
request and ``Vary: Accept-Language``) leads to a bad user experience
where they can't easily change the language. According to Mozilla:
"This header is a hint to be used when the server has no way of
determining the language via another way, like a specific URL, that is
controlled by an explicit user decision." [25]_. As we want to be
able to easily change the language, we should not use the content
negotiation as a main language determination, so we need something
else.
Last solution is to use the URL path, which looks readable, allows
for an easy switch from a language to another, and nicely accepts
hyphens. Typically something like: "docs.python.org/de/" or, by
using a hyphen: "docs.python.org/pt-BR/".
As for the version, sphinx-doc does not support compiling for multiple
languages, so we'll have full builds rooted under a path, exactly like
we're already doing with versions.
So we can have "docs.python.org/de/3.6/" or
"docs.python.org/3.6/de/". A question that arises is:
"Does the language contain multiple versions or does the version contain
multiple languages?". As versions exist in any case and translations
for a given version may or may not exist, we may prefer
"docs.python.org/3.6/de/", but doing so scatters languages everywhere.
Having "/de/3.6/" is clearer, meaning: "everything under /de/ is written
in German". Having the version at the end is also a habit taken by
readers of the documentation: they like to easily change the version
by changing the end of the path.
So we should use the following pattern:
"docs.python.org/LANGUAGE_TAG/VERSION/".
The current documentation is not moved to "/en/", insted
"docs.python.org/en/" will redirect to "docs.python.org".
Language Tag
''''''''''''
A common notation for language tags is the IETF Language Tag [3]_
[4]_ based on ISO 639, although gettext uses ISO 639 tags with
underscores (ex: ``pt_BR``) instead of dashes to join tags [5]_
(ex: ``pt-BR``). Examples of IETF Language Tags: ``fr`` (French),
``ja`` (Japanese), ``pt-BR`` (Orthographic formulation of 1943 -
Official in Brazil).
It is more common to see dashes instead of underscores in URLs [6]_,
so we should use IETF language tags, even if sphinx uses gettext
internally: URLs are not meant to leak the underlying implementation.
It's uncommon to see capitalized letters in URLs, and docs.python.org
doesn't use any, so it may hurt readability by attracting the eye on it,
like in: "https://docs.python.org/pt-BR/3.6/library/stdtypes.html".
RFC 5646 (Tags for Identifying Languages (IETF)) section-2.1 [7]_
states that tags are not case sensitive. As the RFC allows lower case,
and it enhances readability, we should use lowercased tags like
``pt-br``.
It's redundant to display both language and country code if they're
the same, for example: "de-DE" or "fr-FR". Although it might make
sense, respectively meaning "German as spoken in Germany" and "French
as spoken in France", it's not useful information for the reader. So
we may drop these redundancies and only keep the country code for
cases where it makes sense, for example, "pt-BR" for "Portuguese as
spoken in Brazil".
So we should use IETF language tags, lowercased, like ``/fr/``,
``/pt-br/``, ``/de/`` and so on.
Fetching And Building Translations
''''''''''''''''''''''''''''''''''
Currently docsbuild-scripts are building the documentation [8]_.
These scripts should be modified to fetch and build translations.
Building new translations is like building new versions so, while we're
adding complexity it is not that much.
Two steps should be configurable distinctively: Building a new language,
and adding it to the language switcher. This allows a transition step
between "we accepted the language" and "it is translated enough to be
made public". During this step, translators can review their
modifications on d.p.o without having to build the documentation
locally.
From the translation repositories, only the ``.po`` files should be
opened by the docsbuild-script to keep the attack surface and probable
bug sources at a minimum. This means no translation can patch sphinx
to advertise their translation tool. (This specific feature should be
handled by sphinx anyway [9]_).
Community
---------
Mailing List
''''''''''''
The `doc-sig`_ mailing list will be used to discuss cross-language
changes on translated documentation.
There is also the i18n-sig list but it's more oriented towards i18n APIs
[1]_ than translating the Python documentation.
.. _i18n-sig: https://mail.python.org/mailman/listinfo/i18n-sig
.. _doc-sig: https://mail.python.org/mailman/listinfo/doc-sig
Chat
''''
Due to the Python community being highly active on IRC, we should
create a new IRC channel on freenode, typically #python-doc for
consistency with the mailing list name.
Each language coordinator can organize their own team, even by choosing
another chat system if the local usage asks for it. As local teams
will write in their native languages, we don't want each team in a
single channel. It's also natural for the local teams to reuse
their local channels like "#python-fr" for French translators.
Repository for PO Files
'''''''''''''''''''''''
Considering that each translation team may want to use different
translation tools, and that those tools should easily be synchronized
with git, all translations should expose their ``.po`` files via a git
repository.
Considering that each translation will be exposed via git
repositories, and that Python has migrated to GitHub, translations
will be hosted on github.
For consistency and discoverability, all translations should be in the
same github organization and named according to a common pattern.
Given that we want translations to be official, and that Python
already has a github organization, translations should be hosted as
projects of the `Python GitHub organization`_.
For consistency, translation repositories should be called
``python-docs-LANGUAGE_TAG`` [22]_, using the language tag used in
paths: without region subtag if redundent, and lowercased.
The docsbuild-scripts may enforce this rule by refusing to fetch
outside of the Python organization or a wrongly named repository.
The CLA bot may be used on the translation repositories, but with a
limited effect as local coordinators may synchronize themselves with
translations from an external tool, like transifex, and loose track
of who translated what in the process.
Versions can be hosted on different repositories, different directories
or different branches. Storing them on different repositories will
probably pollute the Python github organization. As it
is typical and natural to use branches to separate versions, branches
should be used to do so.
.. _Python GitHub organization: https://github.com/python/
Translation tools
'''''''''''''''''
Most of the translation work is actually done on Transifex [15]_.
Other tools may be used later https://pontoon.mozilla.org/
and http://zanata.org/
Contributor Agreement
'''''''''''''''''''''
Contributions to translated documentation will be requested to sign
the Python Contributor Agreement (CLA):
https://www.python.org/psf/contrib/contrib-form/
Language Team
'''''''''''''
Each language team should have one coordinator responsible for:
- Managing the team.
- Choosing and managing the tools the team will use (chat, mailing list, …).
- Ensure contributors understand and agree with the CLA.
- Ensure quality (grammar, vocabulary, consistency, filtering spam, ads, …).
- Redirect issues posted on b.p.o to the correct GitHub issue tracker
for the language.
The license will be the `PSF License
<https://docs.python.org/3/license.html>`_, and copyright should be
transferable to PSF later.
Alternatives
------------
Simplified English
''''''''''''''''''
It would be possible to introduce a "simplified English" version like
wikipedia did [10]_, as discussed on python-dev [11]_, targeting
English learners and children.
Pros: It yields a single translation, theoretically readable by
everyone and reviewable by current maintainers.
Cons: Subtle details may be lost, and translators from English to English
may be hard to find as stated by Wikipedia:
> The main English Wikipedia has 5 million articles, written by nearly
140K active users; the Swedish Wikipedia is almost as big, 3M articles
from only 3K active users; but the Simple English Wikipedia has just
123K articles and 871 active users. That's fewer articles than
Esperanto!
Changes
=======
Migrate GitHub Repositories
---------------------------
We (authors of this PEP) already own French and Japanese Git repositories,
so moving them to the Python documentation organization will not be a
problem. We'll however be following the `New Translation Procedure`_.
Patch docsbuild-scripts to Compile Translations
-----------------------------------------------
Docsbuild-script must be patched to:
- List the language tags to build along with the branches to build.
- List the language tags to display in the language switcher.
- Find translation repositories by formatting
``github.com:python/python-docs-{language_tag}.git`` (See
`Repository for Po Files`_)
- Build translations for each branch and each language.
Patched docsbuild-scripts must only open ``.po`` files from
translation repositories.
List coordinators in the devguide
---------------------------------
Add a page or a section with an empty list of coordinators to the
devguide, each new coordinator will be added to this list.
Create sphinx-doc Language Switcher
-----------------------------------
Highly similar to the version switcher, a language switcher must be
implemented. This language switcher must be configurable to hide or
show a given language.
The language switcher will only have to update or add the language
segment to the path like the current version switcher does. Unlike
the version switcher, no preflight are required as destination page
always exists (translations does not add or remove pages).
Untranslated (but existing) pages still exists, they should however be
rendered as so, see `Enhance Rendering of Untranslated and Fuzzy
Translations`_.
Update sphinx-doc Version Switcher
----------------------------------
The ``patch_url`` function of the version switcher in
``version_switch.js`` have to be updated to understand and allow the
presence of the language segment in the path.
Enhance Rendering of Untranslated and Fuzzy Translations
--------------------------------------------------------
It's an opened sphinx issue [9]_, but we'll need it so we'll have to
work on it. Translated, fuzzy, and untranslated paragraphs should be
differentiated. (Fuzzy paragraphs have to warn the reader what he's
reading may be out of date.)
New Translation Procedure
=========================
Designate a Coordinator
-----------------------
The first step is to designate a coordinator, see `Language Team`_,
The coordinator must sign the CLA.
The coordinator should be added to the list of translation coordinators
on the devguide.
Create Github Repository
------------------------
Create a repository named "python-docs-{LANGUAGE_TAG}" (IETF language
tag, without redundent region subtag, with a dash, and lowercased.) on
the Python github organization (See `Repository For Po Files`_.), and
grant the language coordinator push rights to this repository.
Add support for translations in docsbuild-scripts
-------------------------------------------------
As soon as the translation hits its first commits, update the
docsbuild-scripts configuration to build the translation (but not
displaying it in the language switcher).
Add Translation to the Language Switcher
----------------------------------------
As soon as the translation hits:
- 100% of bugs.html with proper links to the language repository
issue tracker.
- 100% of tutorial.
- 100% of library/functions (builtins).
the translation can be added to the language switcher.
Previous Discussions
====================
- `[Python-ideas] Cross link documentation translations (January, 2016)`_
- `[Python-ideas] Cross link documentation translations (January, 2016)`_
- `[Python-ideas] https://docs.python.org/fr/ ? (March 2016)`_
.. _[Python-ideas] Cross link documentation translations (January, 2016):
https://mail.python.org/pipermail/python-ideas/2016-January/038010.html
.. _[Python-Dev] Translated Python documentation (Febrary 2016):
https://mail.python.org/pipermail/python-dev/2017-February/147416.html
.. _[Python-ideas] https://docs.python.org/fr/ ? (March 2016):
https://mail.python.org/pipermail/python-ideas/2016-March/038879.html
References
==========
.. [1] [I18n-sig] Hello Python members, Do you have any idea about
Python documents?
(https://mail.python.org/pipermail/i18n-sig/2013-September/002130.html)
.. [2] [Doc-SIG] Localization of Python docs
(https://mail.python.org/pipermail/doc-sig/2013-September/003948.html)
.. [3] Tags for Identifying Languages
(http://tools.ietf.org/html/rfc5646)
.. [4] IETF language tag
(https://en.wikipedia.org/wiki/IETF_language_tag)
.. [5] GNU Gettext manual, section 2.3.1: Locale Names
(https://www.gnu.org/software/gettext/manual/html_node/Locale-Names.html)
.. [6] Semantic URL: Slug
(https://en.wikipedia.org/wiki/Semantic_URL#Slug)
.. [7] Tags for Identifying Languages: Formatting of Language Tags
(https://tools.ietf.org/html/rfc5646#section-2.1.1)
.. [8] Docsbuild-scripts github repository
(https://github.com/python/docsbuild-scripts/)
.. [9] i18n: Highlight untranslated paragraphs
(https://github.com/sphinx-doc/sphinx/issues/1246)
.. [10] Wikipedia: Simple English
(https://simple.wikipedia.org/wiki/Main_Page)
.. [11] Python-dev discussion about simplified english
(https://mail.python.org/pipermail/python-dev/2017-February/147446.html)
.. [12] Passing options to sphinx from Doc/Makefile
(https://github.com/python/cpython/commit/57acb82d275ace9d9d854b156611e641f6…)
.. [13] French translation progression
(https://mdk.fr/pycon2016/#/11)
.. [14] French translation contributors
(https://github.com/AFPy/python_doc_fr/graphs/contributors?from=2016-01-01&t…)
.. [15] Python-doc on Transifex
(https://www.transifex.com/python-doc/)
.. [16] French translation
(https://www.afpy.org/doc/python/)
.. [17] French translation github
(https://github.com/AFPy/python_doc_fr)
.. [18] French mailing list
(http://lists.afpy.org/mailman/listinfo/traductions)
.. [19] Japanese translation
(http://docs.python.jp/3/)
.. [20] Japanese github
(https://github.com/python-doc-ja/python-doc-ja)
.. [21] Spanish translation
(http://docs.python.org.ar/tutorial/3/index.html)
.. [22] [Python-Dev] Translated Python documentation: doc vs docs
(https://mail.python.org/pipermail/python-dev/2017-February/147472.html)
.. [23] Domains - SEO Best Practices | Moz
(https://moz.com/learn/seo/domain)
.. [24] Requirements for Internet Hosts -- Application and Support
(https://www.ietf.org/rfc/rfc1123.txt)
.. [25] Accept-Language
(https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Language)
.. [26] Документация Python 2.7!
(http://python-lab.ru/documentation/index.html)
Copyright
=========
This document has been placed in the public domain.
..
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End:
========================================
.. [1] [Python-ideas] PEP: Python Documentation Translations
(https://mail.python.org/pipermail/python-ideas/2017-March/045226.html)
.. [2] [Python-Dev] Translated Python documentation
(https://mail.python.org/pipermail/python-dev/2017-February/147416.html)
Bests,
--
Julien Palard
https://mdk.fr
6
11
Hi folks,
Late last year I started working on a change to the CPython CLI (*not* the
shared library) to get it to coerce the legacy C locale to something based
on UTF-8 when a suitable locale is available.
After a couple of rounds of iteration on linux-sig and python-ideas, I'm
now bringing it to python-dev as a concrete proposal for Python 3.7.
For most folks, reading the Abstract plus the draft docs updates in the
reference implementation will tell you everything you need to know (if the
C.UTF-8, C.utf8 or UTF-8 locales are available, the CLI will automatically
attempt to coerce the legacy C locale to one of those rather than
persisting with the latter's default assumption of ASCII as the preferred
text encoding).
However, the full PEP goes into a lot more detail on:
* exactly what's broken about CPython's behaviour in the legacy C locale
* why I'm in favour of this particular approach to fixing it (i.e. it
integrates better with other C/C++ components, as well as being amenable to
redistributor backports for 3.6, and environment based configuration for
3.5 and earlier)
* why I think implementing both this change *and* Victor's more
comprehensive "PYTHONUTF8 mode" proposal in PEP 540 will be better than
implementing just one or the other (in some situations, ignoring the
platform locale subsystem entirely really is the right approach, and that's
the aspect PEP 540 tackles, while this PEP tackles the situations where the
C locale behaviour is broken, but you still need to be consistent with the
platform settings).
Cheers,
Nick.
==================================
PEP: 538
Title: Coercing the legacy C locale to a UTF-8 based locale
Version: $Revision$
Last-Modified: $Date$
Author: Nick Coghlan <ncoghlan(a)gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 28-Dec-2016
Python-Version: 3.7
Post-History: 03-Jan-2017 (linux-sig),
07-Jan-2017 (python-ideas),
05-Mar-2017 (python-dev)
Abstract
========
An ongoing challenge with Python 3 on \*nix systems is the conflict between
needing to use the configured locale encoding by default for consistency
with
other C/C++ components in the same process and those invoked in
subprocesses,
and the fact that the standard C locale (as defined in POSIX:2001) typically
implies a default text encoding of ASCII, which is entirely inadequate for
the
development of networked services and client applications in a multilingual
world.
PEP 540 proposes a change to CPython's handling of the legacy C locale such
that CPython will assume the use of UTF-8 in such environments, rather than
persisting with the demonstrably problematic assumption of ASCII as an
appropriate encoding for communicating with operating system interfaces.
This is a good approach for cases where network encoding interoperability
is a more important concern than local encoding interoperability.
However, it comes at the cost of making CPython's encoding assumptions
diverge
from those of other C and C++ components in the same process, as well as
those
of components running in subprocesses that share the same environment.
It also requires changes to the internals of how CPython itself works,
rather
than using existing configuration settings that are supported by Python
versions prior to Python 3.7.
Accordingly, this PEP proposes that independently of the UTF-8 mode proposed
in PEP 540, the way the CPython implementation handles the default C locale
be
changed such that:
* unless the new ``PYTHONCOERCECLOCALE`` environment variable is set to
``0``,
the standalone CPython binary will automatically attempt to coerce the
``C``
locale to the first available locale out of ``C.UTF-8``, ``C.utf8``, or
``UTF-8``
* if the locale is successfully coerced, and PEP 540 is not accepted, then
``PYTHONIOENCODING`` (if not otherwise set) will be set to
``utf-8:surrogateescape``.
* if the locale is successfully coerced, and PEP 540 *is* accepted, then
``PYTHONUTF8`` (if not otherwise set) will be set to ``1``
* if the subsequent runtime initialization process detects that the legacy
``C`` locale remains active (e.g. none of ``C.UTF-8``, ``C.utf8`` or
``UTF-8``
are available, locale coercion is disabled, or the runtime is embedded in
an
application other than the main CPython binary), and the ``PYTHONUTF8``
feature defined in PEP 540 is also disabled (or not implemented), it will
emit a warning on stderr that use of the legacy ``C`` locale's default
ASCII
text encoding may cause various Unicode compatibility issues
With this change, any \*nix platform that does *not* offer at least one of
the
``C.UTF-8``, ``C.utf8`` or ``UTF-8`` locales as part of its standard
configuration would only be considered a fully supported platform for
CPython
3.7+ deployments when either the new ``PYTHONUTF8`` mode defined in PEP 540
is
used, or else a suitable locale other than the default ``C`` locale is
configured explicitly (e.g. `en_AU.UTF-8`, ``zh_CN.gb18030``).
Redistributors (such as Linux distributions) with a narrower target audience
than the upstream CPython development team may also choose to opt in to this
locale coercion behaviour for the Python 3.6.x series by applying the
necessary
changes as a downstream patch when first introducing Python 3.6.0.
Background
==========
While the CPython interpreter is starting up, it may need to convert from
the ``char *`` format to the ``wchar_t *`` format, or from one of those
formats
to ``PyUnicodeObject *``, in a way that's consistent with the locale
settings
of the overall system. It handles these cases by relying on the operating
system to do the conversion and then ensuring that the text encoding name
reported by ``sys.getfilesystemencoding()`` matches the encoding used during
this early bootstrapping process.
On Apple platforms (including both Mac OS X and iOS), this is
straightforward,
as Apple guarantees that these operations will always use UTF-8 to do the
conversion.
On Windows, the limitations of the ``mbcs`` format used by default in these
conversions proved sufficiently problematic that PEP 528 and PEP 529 were
implemented to bypass the operating system supplied interfaces for binary
data
handling and force the use of UTF-8 instead.
On Android, many components, including CPython, already assume the use of
UTF-8
as the system encoding, regardless of the locale setting. However, this
isn't
the case for all components, and the discrepancy can cause problems in some
situations (for example, when using the GNU readline module [16_]).
On non-Apple and non-Android \*nix systems, these operations are handled
using
the C locale system in glibc, which has the following characteristics [4_]:
* by default, all processes start in the ``C`` locale, which uses ``ASCII``
for these conversions. This is almost never what anyone doing multilingual
text processing actually wants (including CPython and C/C++ GUI
frameworks).
* calling ``setlocale(LC_ALL, "")`` reconfigures the active locale based on
the locale categories configured in the current process environment
* if the locale requested by the current environment is unknown, or no
specific
locale is configured, then the default ``C`` locale will remain active
The specific locale category that covers the APIs that CPython depends on is
``LC_CTYPE``, which applies to "classification and conversion of characters,
and to multibyte and wide characters" [5_]. Accordingly, CPython includes
the
following key calls to ``setlocale``:
* in the main ``python`` binary, CPython calls ``setlocale(LC_ALL, "")`` to
configure the entire C locale subsystem according to the process
environment.
It does this prior to making any calls into the shared CPython library
* in ``Py_Initialize``, CPython calls ``setlocale(LC_CTYPE, "")``, such that
the configured locale settings for that category *always* match those set
in
the environment. It does this unconditionally, and it *doesn't* revert the
process state change in ``Py_Finalize``
(This summary of the locale handling omits several technical details related
to exactly where and when the text encoding declared as part of the locale
settings is used - see PEP 540 for further discussion, as these particular
details matter more when decoupling CPython from the declared C locale than
they do when overriding the locale with one based on UTF-8)
These calls are usually sufficient to provide sensible behaviour, but they
can
still fail in the following cases:
* SSH environment forwarding means that SSH clients may sometimes forward
client locale settings to servers that don't have that locale installed.
This
leads to CPython running in the default ASCII-based C locale
* some process environments (such as Linux containers) may not have any
explicit locale configured at all. As with unknown locales, this leads to
CPython running in the default ASCII-based C locale
The simplest way to deal with this problem for currently released versions
of
CPython is to explicitly set a more sensible locale when launching the
application. For example::
LC_ALL=C.UTF-8 LANG=C.UTF-8 python3 ...
The ``C.UTF-8`` locale is a full locale definition that uses ``UTF-8`` for
the
``LC_CTYPE`` category, and the same settings as the ``C`` locale for all
other
categories (including ``LC_COLLATE``). It is offered by a number of Linux
distributions (including Debian, Ubuntu, Fedora, Alpine and Android) as an
alternative to the ASCII-based C locale.
Mac OS X and other \*BSD systems have taken a different approach, and
instead
of offering a ``C.UTF-8`` locale, instead offer a partial ``UTF-8`` locale
that
only defines the ``LC_CTYPE`` category. On such systems, the preferred
environmental locale adjustment is to set ``LC_CTYPE=UTF-8`` rather than to
set
``LC_ALL`` or ``LANG``. [17_]
In the specific case of Docker containers and similar technologies, the
appropriate locale setting can be specified directly in the container image
definition.
Another common failure case is developers specifying ``LANG=C`` in order to
see otherwise translated user interface messages in English, rather than the
more narrowly scoped ``LC_MESSAGES=C``.
Relationship with other PEPs
============================
This PEP shares a common problem statement with PEP 540 (improving Python
3's
behaviour in the default C locale), but diverges markedly in the proposed
solution:
* PEP 540 proposes to entirely decouple CPython's default text encoding from
the C locale system in that case, allowing text handling inconsistencies
to
arise between CPython and other C/C++ components running in the same
process
and in subprocesses. This approach aims to make CPython behave less like a
locale-aware C/C++ application, and more like C/C++ independent language
runtimes like the JVM, .NET CLR, Go, Node.js, and Rust
* this PEP proposes to override the legacy C locale with a more recently
defined locale that uses UTF-8 as its default text encoding. This means
that
the text encoding override will apply not only to CPython, but also to any
locale aware extension modules loaded into the current process, as well
as to
locale aware C/C++ applications invoked in subprocesses that inherit their
environment from the parent process. This approach aims to retain
CPython's
traditional strong support for integration with other components written
in C and C++, while actively helping to push forward the adoption and
standardisation of the C.UTF-8 locale as a Unicode-aware replacement for
the legacy C locale in the wider C/C++ ecosystem
After reviewing both PEPs, it became clear that they didn't actually
conflict
at a technical level, and the proposal in PEP 540 offered a superior option
in
cases where no suitable locale was available, as well as offering a better
reference behaviour for platforms where the notion of a "locale encoding"
doesn't make sense (for example, embedded systems running MicroPython rather
than the CPython reference interpreter).
Meanwhile, this PEP offered improved compatibility with other C/C++
components,
and an approach more amenable to being backported to Python 3.6 by
downstream
redistributors.
As a result, this PEP was amended to refer to PEP 540 as a complementary
solution that offered improved behaviour both when locale coercion
triggered,
as well as when none of the standard UTF-8 based locales were available.
The availability of PEP 540 also meant that the ``LC_CTYPE=en_US.UTF-8``
legacy
fallback was removed from the list of UTF-8 locales tried as a coercion
target,
with CPython instead relying solely on the proposed PYTHONUTF8 mode in such
cases.
Motivation
==========
While Linux container technologies like Docker, Kubernetes, and OpenShift
are
best known for their use in web service development, the related container
formats and execution models are also being adopted for Linux command line
application development. Technologies like Gnome Flatpak [7_] and
Ubunty Snappy [8_] further aim to bring these same techniques to Linux GUI
application development.
When using Python 3 for application development in these contexts, it isn't
uncommon to see text encoding related errors akin to the following::
$ docker run --rm fedora:25 python3 -c 'print("ℙƴ☂ℌøἤ")'
Unable to decode the command from the command line:
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2' in
position 7: surrogates not allowed
$ docker run --rm ncoghlan/debian-python python3 -c 'print("ℙƴ☂ℌøἤ")'
Unable to decode the command from the command line:
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2' in
position 7: surrogates not allowed
Even though the same command is likely to work fine when run locally::
$ python3 -c 'print("ℙƴ☂ℌøἤ")'
ℙƴ☂ℌøἤ
The source of the problem can be seen by instead running the ``locale``
command
in the three environments::
$ locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
LANG=en_AU.UTF-8
LC_CTYPE="en_AU.UTF-8"
LC_ALL=
$ docker run --rm fedora:25 locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
LANG=
LC_CTYPE="POSIX"
LC_ALL=
$ docker run --rm ncoghlan/debian-python locale | grep -E
'LC_ALL|LC_CTYPE|LANG'
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_ALL=
In this particular example, we can see that the host system locale is set to
"en_AU.UTF-8", so CPython uses UTF-8 as the default text encoding. By
contrast,
the base Docker images for Fedora and Debian don't have any specific locale
set, so they use the POSIX locale by default, which is an alias for the
ASCII-based default C locale.
The simplest way to get Python 3 (regardless of the exact version) to behave
sensibly in Fedora and Debian based containers is to run it in the
``C.UTF-8``
locale that both distros provide::
$ docker run --rm -e LANG=C.UTF-8 fedora:25 python3 -c 'print("ℙƴ☂ℌøἤ")'
ℙƴ☂ℌøἤ
$ docker run --rm -e LANG=C.UTF-8 ncoghlan/debian-python python3 -c
'print("ℙƴ☂ℌøἤ")'
ℙƴ☂ℌøἤ
$ docker run --rm -e LANG=C.UTF-8 fedora:25 locale | grep -E
'LC_ALL|LC_CTYPE|LANG'
LANG=C.UTF-8
LC_CTYPE="C.UTF-8"
LC_ALL=
$ docker run --rm -e LANG=C.UTF-8 ncoghlan/debian-python locale | grep
-E 'LC_ALL|LC_CTYPE|LANG'
LANG=C.UTF-8
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_ALL=
The Alpine Linux based Python images provided by Docker, Inc, already use
the
C.UTF-8 locale by default::
$ docker run --rm python:3 python3 -c 'print("ℙƴ☂ℌøἤ")'
ℙƴ☂ℌøἤ
$ docker run --rm python:3 locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
LANG=C.UTF-8
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_ALL=
Similarly, for custom container images (i.e. those adding additional
content on
top of a base distro image), a more suitable locale can be set in the image
definition so everything just works by default. However, it would provide a
much
nicer and more consistent user experience if CPython were able to just deal
with this problem automatically rather than relying on redistributors or end
users to handle it through system configuration changes.
While the glibc developers are working towards making the C.UTF-8 locale
universally available for use by glibc based applications like CPython [6_],
this unfortunately doesn't help on platforms that ship older versions of
glibc
without that feature, and also don't provide C.UTF-8 as an on-disk locale
the
way Debian and Fedora do. For these platforms, the mechanism proposed in
PEP 540 at least allows CPython itself to behave sensibly, albeit without
any
mechanism to get other C/C++ components that decode binary streams as text
to
do the same.
Design Principles
=================
The above motivation leads to the following core design principles for the
proposed solution:
* if a locale other than the default C locale is explicitly configured,
we'll
continue to respect it
* if we're changing the locale setting without an explicit config option,
we'll
emit a warning on stderr that we're doing so rather than silently changing
the process configuration. This will alert application and system
integrators
to the change, even if they don't closely follow the PEP process or Python
release announcements. However, to minimize the chance of introducing new
problems for end users, we'll do this *without* using the warnings
system, so
even running with ``-Werror`` won't turn it into a runtime exception
* any changes made will use *existing* configuration options
To minimize the negative impact on systems currently correctly configured to
use GB-18030 or another partially ASCII compatible universal encoding leads
to
an additional design principle:
* if a UTF-8 based Linux container is run on a host that is explicitly
configured to use a non-UTF-8 encoding, and tries to exchange locally
encoded data with that host rather than exchanging explicitly UTF-8
encoded
data, CPython will endeavour to correctly round-trip host provided data
that
is concatenated or split solely at common ASCII compatible code points,
but
may otherwise emit nonsensical results.
Specification
=============
To better handle the cases where CPython would otherwise end up attempting
to operate in the ``C`` locale, this PEP proposes that CPython automatically
attempt to coerce the legacy ``C`` locale to a UTF-8 based locale when it is
run as a standalone command line application.
It further proposes to emit a warning on stderr if the legacy ``C`` locale
is in effect at the point where the language runtime itself is initialized,
and the PEP 540 UTF-8 encoding override is also disabled, in order to warn
system and application integrators that they're running CPython in an
unsupported configuration.
Legacy C locale coercion in the standalone Python interpreter binary
--------------------------------------------------------------------
When run as a standalone application, CPython has the opportunity to
reconfigure the C locale before any locale dependent operations are executed
in the process.
This means that it can change the locale settings not only for the CPython
runtime, but also for any other C/C++ components running in the current
process (e.g. as part of extension modules), as well as in subprocesses that
inherit their environment from the current process.
After calling ``setlocale(LC_ALL, "")`` to initialize the locale settings in
the current process, the main interpreter binary will be updated to include
the following call::
const char *ctype_loc = setlocale(LC_CTYPE, NULL);
This cryptic invocation is the API that C provides to query the current
locale
setting without changing it. Given that query, it is possible to check for
exactly the ``C`` locale with ``strcmp``::
ctype_loc != NULL && strcmp(ctype_loc, "C") == 0 # true only in the C
locale
This call also returns ``"C"`` when either no particular locale is set, or
the
nominal locale is set to an alias for the ``C`` locale (such as ``POSIX``).
Given this information, CPython can then attempt to coerce the locale to one
that uses UTF-8 rather than ASCII as the default encoding.
Three such locales will be tried:
* ``C.UTF-8`` (available at least in Debian, Ubuntu, and Fedora 25+, and
expected to be available by default in a future version of glibc)
* ``C.utf8`` (available at least in HP-UX)
* ``UTF-8`` (available in at least some \*BSD variants)
For ``C.UTF-8`` and ``C.utf8``, the coercion will be implemented by actually
setting the ``LANG`` and ``LC_ALL`` environment variables to the candidate
locale name, such that future calls to ``setlocale()`` will see them, as
will
other components looking for those settings (such as GUI development
frameworks).
For the platforms where it is defined, ``UTF-8`` is a partial locale that
only
defines the ``LC_CTYPE`` category. Accordingly, only the ``LC_CTYPE``
environment variable would be set when using this fallback option.
To adjust automatically to future changes in locale availability, these
checks
will be implemented at runtime on all platforms other than Mac OS X and
Windows,
rather than attempting to determine which locales to try at compile time.
If the locale settings are changed successfully, and the
``PYTHONIOENCODING``
environment variable is currently unset, then it will be forced to
``PYTHONIOENCODING=utf-8:surrogateescape``.
When this locale coercion is activated, the following warning will be
printed on stderr, with the warning containing whichever locale was
successfully configured::
Python detected LC_CTYPE=C, LC_ALL & LANG set to C.UTF-8 (set another
locale or PYTHONCOERCECLOCALE=0 to disable this locale coercion
behaviour).
When falling back to the ``UTF-8`` locale, the message would be slightly
different::
Python detected LC_CTYPE=C, LC_CTYPE set to UTF-8 (set another locale
or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).
In combination with PEP 540, this locale coercion will mean that the
standard
Python binary *and* locale aware C/C++ extensions should once again "just
work"
in the three main failure cases we're aware of (missing locale
settings, SSH forwarding of unknown locales, and developers explicitly
requesting ``LANG=C``), as long as the target platform provides at least one
of the candidate UTF-8 based environments.
If ``PYTHONCOERCECLOCALE=0`` is set, or none of the candidate locales is
successfully configured, then initialization will continue as usual in the C
locale and the Unicode compatibility warning described in the next section
will
be emitted just as it would for any other application.
The interpreter will always check for the ``PYTHONCOERCECLOCALE``
environment
variable (even when running under the ``-E`` or ``-I`` switches), as the
locale
coercion check necessarily takes place before any command line argument
processing.
Changes to the runtime initialization process
---------------------------------------------
By the time that ``Py_Initialize`` is called, arbitrary locale-dependent
operations may have taken place in the current process. This means that
by the time it is called, it is *too late* to switch to a different locale -
doing so would introduce inconsistencies in decoded text, even in the
context
of the standalone Python interpreter binary.
Accordingly, when ``Py_Initialize`` is called and CPython detects that the
configured locale is still the default ``C`` locale *and* the ``PYTHONUTF8``
feature from PEP 540 is disabled, the following warning will
be issued::
Python runtime initialized with LC_CTYPE=C (a locale with default ASCII
encoding), which may cause Unicode compatibility problems. Using C.UTF-8
C.utf8, or UTF-8 (if available) as alternative Unicode-compatible
locales is recommended.
In this case, no actual change will be made to the locale settings.
Instead, the warning informs both system and application integrators that
they're running Python 3 in a configuration that we don't expect to work
properly.
The second sentence providing recommendations would be conditionally
compiled
based on the operating system (e.g. recommending ``LC_CTYPE=UTF-8`` on \*BSD
systems.
New build-time configuration options
------------------------------------
While both of the above behaviours would be enabled by default, they would
also have new associated configuration options and preprocessor definitions
for the benefit of redistributors that want to override those default
settings.
The locale coercion behaviour would be controlled by the flag
``--with[out]-c-locale-coercion``, which would set the
``PY_COERCE_C_LOCALE``
preprocessor definition.
The locale warning behaviour would be controlled by the flag
``--with[out]-c-locale-warning``, which would set the
``PY_WARN_ON_C_LOCALE``
preprocessor definition.
On platforms where they would have no effect (e.g. Mac OS X, iOS, Android,
Windows) these preprocessor variables would always be undefined.
Platform Support Changes
========================
A new "Legacy C Locale" section will be added to PEP 11 that states:
* as of CPython 3.7, the legacy C locale is only supported when operating in
"UTF-8" mode. Any Unicode handling issues that occur only in that locale
and cannot be reproduced in an appropriately configured non-ASCII locale
will
be closed as "won't fix"
* as of CPython 3.7, \*nix platforms are expected to provide at least one of
``C.UTF-8`` (full locale), ``C.utf8`` (full locale) or ``UTF-8`` (
``LC_CTYPE``-only locale) as an alternative to the legacy ``C`` locale.
Any Unicode related integration problems with C/C++ extensions that occur
only in that locale and cannot be reproduced in an appropriately
configured
non-ASCII locale will be closed as "won't fix".
Rationale
=========
Improving the handling of the C locale
--------------------------------------
It has been clear for some time that the C locale's default encoding of
``ASCII`` is entirely the wrong choice for development of modern networked
services. Newer languages like Rust and Go have eschewed that default
entirely,
and instead made it a deployment requirement that systems be configured to
use
UTF-8 as the text encoding for operating system interfaces. Similarly,
Node.js
assumes UTF-8 by default (a behaviour inherited from the V8 JavaScript
engine)
and requires custom build settings to indicate it should use the system
locale settings for locale-aware operations. Both the JVM and the .NET CLR
use UTF-16-LE as their primary encoding for passing text between
applications
and the underlying platform.
The challenge for CPython has been the fact that in addition to being used
for
network service development, it is also extensively used as an embedded
scripting language in larger applications, and as a desktop application
development language, where it is more important to be consistent with other
C/C++ components sharing the same process, as well as with the user's
desktop
locale settings, than it is with the emergent conventions of modern network
service development.
The core premise of this PEP is that for *all* of these use cases, the
assumption of ASCII implied by the default "C" locale is the wrong choice,
and furthermore that the following assumptions are valid:
* in desktop application use cases, the process locale will *already* be
configured appropriately, and if it isn't, then that is an operating
system
or embedding application level problem that needs to be reported to and
resolved by the operating system provider or application developer
* in network service development use cases (especially those based on Linux
containers), the process locale may not be configured *at all*, and if it
isn't, then the expectation is that components will impose their own
default
encoding the way Rust, Go and Node.js do, rather than trusting the legacy
C
default encoding of ASCII the way CPython currently does
Defaulting to "surrogateescape" error handling on the standard IO streams
-------------------------------------------------------------------------
By coercing the locale away from the legacy C default and its assumption of
ASCII as the preferred text encoding, this PEP also disables the implicit
use
of the "surrogateescape" error handler on the standard IO streams that was
introduced in Python 3.5 ([15_]), as well as the automatic use of
``surrogateescape`` when operating in PEP 540's UTF-8 mode.
Rather than introducing yet another configuration option to address that,
this PEP proposes to use the existing ``PYTHONIOENCODING`` setting to ensure
that the ``surrogateescape`` handler is enabled when the interpreter is
required to make assumptions regarding the expected filesystem encoding.
The aim of this behaviour is to attempt to ensure that operating system
provided text values are typically able to be transparently passed through a
Python 3 application even if it is incorrect in assuming that that text has
been encoded as UTF-8.
In particular, GB 18030 [12_] is a Chinese national text encoding standard
that handles all Unicode code points, that is formally incompatible with
both
ASCII and UTF-8, but will nevertheless often tolerate processing as
surrogate
escaped data - the points where GB 18030 reuses ASCII byte values in an
incompatible way are likely to be invalid in UTF-8, and will therefore be
escaped and opaque to string processing operations that split on or search
for
the relevant ASCII code points. Operations that don't involve splitting on
or
searching for particular ASCII or Unicode code point values are almost
certain to work correctly.
Similarly, Shift-JIS [13_] and ISO-2022-JP [14_] remain in widespread use in
Japan, and are incompatible with both ASCII and UTF-8, but will tolerate
text
processing operations that don't involve splitting on or searching for
particular ASCII or Unicode code point values.
As an example, consider two files, one encoded with UTF-8 (the default
encoding
for ``en_AU.UTF-8``), and one encoded with GB-18030 (the default encoding
for
``zh_CN.gb18030``)::
$ python3 -c 'open("utf8.txt", "wb").write("ℙƴ☂ℌøἤ\n".encode("utf-8"))'
$ python3 -c 'open("gb18030.txt",
"wb").write("ℙƴ☂ℌøἤ\n".encode("gb18030"))'
On disk, we can see that these are two very different files::
$ python3 -c 'print("UTF-8: ", open("utf8.txt", "rb").read().strip());
\
print("GB18030:", open("gb18030.txt",
"rb").read().strip())'
UTF-8:
b'\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4\n'
GB18030:
b'\x816\xbd6\x810\x9d0\x817\xa29\x816\xbc4\x810\x8b3\x816\x8d6\n'
That nevertheless can both be rendered correctly to the terminal as long as
they're decoded prior to printing::
$ python3 -c 'print("UTF-8: ", open("utf8.txt", "r",
encoding="utf-8").read().strip()); \
print("GB18030:", open("gb18030.txt", "r",
encoding="gb18030").read().strip())'
UTF-8: ℙƴ☂ℌøἤ
GB18030: ℙƴ☂ℌøἤ
By contrast, if we just pass along the raw bytes, as ``cat`` and similar
C/C++
utilities will tend to do::
$ LANG=en_AU.UTF-8 cat utf8.txt gb18030.txt
ℙƴ☂ℌøἤ
�6�6�0�0�7�9�6�4�0�3�6�6
Even setting a specifically Chinese locale won't help in getting the
GB-18030 encoded file rendered correctly::
$ LANG=zh_CN.gb18030 cat utf8.txt gb18030.txt
ℙƴ☂ℌøἤ
�6�6�0�0�7�9�6�4�0�3�6�6
The problem is that the *terminal* encoding setting remains UTF-8,
regardless
of the nominal locale. A GB18030 terminal can be emulated using the
``iconv``
utility::
$ cat utf8.txt gb18030.txt | iconv -f GB18030 -t UTF-8
鈩櫰粹槀鈩屆羔激
ℙƴ☂ℌøἤ
This reverses the problem, such that the GB18030 file is rendered correctly,
but the UTF-8 file has been converted to unrelated hanzi characters, rather
than
the expected rendering of "Python" as non-ASCII characters.
With the emulated GB18030 terminal encoding, assuming UTF-8 in Python
results
in *both* files being displayed incorrectly::
$ python3 -c 'print("UTF-8: ", open("utf8.txt", "r",
encoding="utf-8").read().strip()); \
print("GB18030:", open("gb18030.txt", "r",
encoding="gb18030").read().strip())' \
| iconv -f GB18030 -t UTF-8
UTF-8: 鈩櫰粹槀鈩屆羔激
GB18030: 鈩櫰粹槀鈩屆羔激
However, setting the locale correctly means that the emulated GB18030
terminal
now displays both files as originally intended::
$ LANG=zh_CN.gb18030 \
python3 -c 'print("UTF-8: ", open("utf8.txt", "r",
encoding="utf-8").read().strip()); \
print("GB18030:", open("gb18030.txt", "r",
encoding="gb18030").read().strip())' \
| iconv -f GB18030 -t UTF-8
UTF-8: ℙƴ☂ℌøἤ
GB18030: ℙƴ☂ℌøἤ
The rationale for retaining ``surrogateescape`` as the default IO encoding
is
that it will preserve the following helpful behaviour in the C locale::
$ cat gb18030.txt \
| LANG=C python3 -c "import sys; print(sys.stdin.read())" \
| iconv -f GB18030 -t UTF-8
ℙƴ☂ℌøἤ
Rather than reverting to the exception seen when a UTF-8 based locale is
explicitly configured::
$ cat gb18030.txt \
| python3 -c "import sys; print(sys.stdin.read())" \
| iconv -f GB18030 -t UTF-8
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib64/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 0:
invalid start byte
Note: an alternative to setting ``PYTHONIOENCODING`` as the PEP currently
proposes would be to instead *always* default to ``surrogateescape`` on the
standard streams, and require the use of ``PYTHONIOENCODING=:strict`` to
request
text encoding validation during stream processing. Adopting such an approach
would bring Python 3 more into line with typical C/C++ tools that pass along
the raw bytes without checking them for conformance to their nominal
encoding,
and would hence also make the last example display the desired output::
$ cat gb18030.txt \
| PYTHONIOENCODING=:surrogateescape python3 -c "import sys;
print(sys.stdin.read())" \
| iconv -f GB18030 -t UTF-8
ℙƴ☂ℌøἤ
Dropping official support for ASCII based text handling in the legacy C
locale
------------------------------------------------------------------------------
We've been trying to get strict bytes/text separation to work reliably in
the
legacy C locale for over a decade at this point. Not only haven't we been
able
to get it to work, neither has anyone else - the only viable alternatives
identified have been to pass the bytes along verbatim without eagerly
decoding
them to text (C/C++, Python 2.x, Ruby, etc), or else to ignore the nominal
C/C++ locale encoding entirely and assume the use of either UTF-8 (PEP 540,
Rust, Go, Node.js, etc) or UTF-16-LE (JVM, .NET CLR).
While this PEP ensures that developers that need to do so can still opt-in
to
running their Python code in the legacy C locale, it also makes clear that
we
*don't* expect Python 3's Unicode handling to be reliable in that
configuration,
and the recommended alternative is to use a more appropriate locale setting.
Providing implicit locale coercion only when running standalone
---------------------------------------------------------------
Over the course of Python 3.x development, multiple attempts have been made
to improve the handling of incorrect locale settings at the point where the
Python interpreter is initialised. The problem that emerged is that this is
ultimately *too late* in the interpreter startup process - data such as
command
line arguments and the contents of environment variables may have already
been
retrieved from the operating system and processed under the incorrect ASCII
text encoding assumption well before ``Py_Initialize`` is called.
The problems created by those inconsistencies were then even harder to
diagnose
and debug than those created by believing the operating system's claim that
ASCII was a suitable encoding to use for operating system interfaces. This
was
the case even for the default CPython binary, let alone larger C/C++
applications that embed CPython as a scripting engine.
The approach proposed in this PEP handles that problem by moving the locale
coercion as early as possible in the interpreter startup sequence when
running
standalone: it takes place directly in the C-level ``main()`` function, even
before calling in to the `Py_Main()`` library function that implements the
features of the CPython interpreter CLI.
The ``Py_Initialize`` API then only gains an explicit warning (emitted on
``stderr``) when it detects use of the ``C`` locale, and relies on the
embedding application to specify something more reasonable.
Querying LC_CTYPE for C locale detection
----------------------------------------
``LC_CTYPE`` is the actual locale category that CPython relies on to drive
the
implicit decoding of environment variables, command line arguments, and
other
text values received from the operating system.
As such, it makes sense to check it specifically when attempting to
determine
whether or not the current locale configuration is likely to cause Unicode
handling problems.
Setting both LANG & LC_ALL for C.UTF-8 locale coercion
------------------------------------------------------
Python is often used as a glue language, integrating other C/C++ ABI
compatible
components in the current process, and components written in arbitrary
languages in subprocesses.
Setting ``LC_ALL`` to ``C.UTF-8`` imposes a locale setting override on all
C/C++ components in the current process and in any subprocesses that inherit
the current environment. This is important to handle cases where the problem
has arisen from a setting like ``LC_CTYPE=UTF-8`` being provided on a system
where no ``UTF-8`` locale is defined (e.g. when a Mac OS X ssh client is
configured to forward locale settings, and the user logs into a Linux
server).
Setting ``LANG`` to ``C.UTF-8`` ensures that even components that only check
the ``LANG`` fallback for their locale settings will still use ``C.UTF-8``.
Together, these should ensure that when the locale coercion is activated,
the
switch to the C.UTF-8 locale will be applied consistently across the current
process and any subprocesses that inherit the current environment.
Allowing restoration of the legacy behaviour
--------------------------------------------
The CPython command line interpreter is often used to investigate faults
that
occur in other applications that embed CPython, and those applications may
still
be using the C locale even after this PEP is implemented.
Providing a simple on/off switch for the locale coercion behaviour makes it
much easier to reproduce the behaviour of such applications for debugging
purposes, as well as making it easier to reproduce the behaviour of older
3.x
runtimes even when running a version with this change applied.
Implementation
==============
A draft implementation of the change (including test cases and
documentation)
is linked from issue 28180 [1_], which is an end user request that
``sys.getfilesystemencoding()`` default to ``utf-8`` rather than ``ascii``.
This patch is now being maintained as the ``pep538-coerce-c-locale`` feature
branch [18_] in Nick Coghlan's fork of the CPython repository on GitHub.
NOTE: As discussed in [1_], the currently posted draft implementation has
some
known issues on Android.
Backporting to earlier Python 3 releases
========================================
Backporting to Python 3.6.0
---------------------------
If this PEP is accepted for Python 3.7, redistributors backporting the
change
specifically to their initial Python 3.6.0 release will be both allowed and
encouraged. However, such backports should only be undertaken either in
conjunction with the changes needed to also provide a suitable locale by
default, or else specifically for platforms where such a locale is already
consistently available.
Backporting to other 3.x releases
---------------------------------
While the proposed behavioural change is seen primarily as a bug fix
addressing
Python 3's current misbehaviour in the default ASCII-based C locale, it
still
represents a reasonably significant change in the way CPython interacts with
the C locale system. As such, while some redistributors may still choose to
backport it to even earlier Python 3.x releases based on the needs and
interests of their particular user base, this wouldn't be encouraged as a
general practice.
However, configuring Python 3 *environments* (such as base container
images) to use these configuration settings by default is both allowed
and recommended.
Acknowledgements
================
The locale coercion approach proposed in this PEP is inspired directly by
Armin Ronacher's handling of this problem in the ``click`` command line
utility development framework [2_]::
$ LANG=C python3 -c 'import click; cli = click.command()(lambda:None);
cli()'
Traceback (most recent call last):
...
RuntimeError: Click will abort further execution because Python 3 was
configured to use ASCII as encoding for the environment. Either run
this
under Python 2 or consult http://click.pocoo.org/python3/ for mitigation
steps.
This system supports the C.UTF-8 locale which is recommended.
You might be able to resolve your issue by exporting the
following environment variables:
export LC_ALL=C.UTF-8
export LANG=C.UTF-8
The change was originally proposed as a downstream patch for Fedora's
system Python 3.6 package [3_], and then reformulated as a PEP for Python
3.7
with a section allowing for backports to earlier versions by redistributors.
The initial draft was posted to the Python Linux SIG for discussion [10_]
and
then amended based on both that discussion and Victor Stinner's work in
PEP 540 [11_].
The "ℙƴ☂ℌøἤ" string used in the Unicode handling examples throughout this
PEP
is taken from Ned Batchelder's excellent "Pragmatic Unicode" presentation
[9_].
Stephen Turnbull has long provided valuable insight into the text encoding
handling challenges he regularly encounters at the University of Tsukuba
(筑波大学).
References
==========
.. [1] CPython: sys.getfilesystemencoding() should default to utf-8
(http://bugs.python.org/issue28180)
.. [2] Locale configuration required for click applications under Python 3
(http://click.pocoo.org/5/python3/#python-3-surrogate-handling)
.. [3] Fedora: force C.UTF-8 when Python 3 is run under the C locale
(https://bugzilla.redhat.com/show_bug.cgi?id=1404918)
.. [4] GNU C: How Programs Set the Locale
(
https://www.gnu.org/software/libc/manual/html_node/Setting-the-Locale.html)
.. [5] GNU C: Locale Categories
(
https://www.gnu.org/software/libc/manual/html_node/Locale-Categories.html)
.. [6] glibc C.UTF-8 locale proposal
(https://sourceware.org/glibc/wiki/Proposals/C.UTF-8)
.. [7] GNOME Flatpak
(http://flatpak.org/)
.. [8] Ubuntu Snappy
(https://www.ubuntu.com/desktop/snappy)
.. [9] Pragmatic Unicode
(http://nedbatchelder.com/text/unipain.html)
.. [10] linux-sig discussion of initial PEP draft
(https://mail.python.org/pipermail/linux-sig/2017-January/000014.html)
.. [11] Feedback notes from linux-sig discussion and PEP 540
(https://github.com/python/peps/issues/171)
.. [12] GB 18030
(https://en.wikipedia.org/wiki/GB_18030)
.. [13] Shift-JIS
(https://en.wikipedia.org/wiki/Shift_JIS)
.. [14] ISO-2022
(https://en.wikipedia.org/wiki/ISO/IEC_2022)
.. [15] Use "surrogateescape" error handler for sys.stdin and sys.stdout on
UNIX for the C locale
(https://bugs.python.org/issue19977)
.. [16] test_readline.test_nonascii fails on Android
(http://bugs.python.org/issue28997)
.. [17] UTF-8 locale discussion on "locale.getdefaultlocale() fails on Mac
OS X with default language set to English"
(http://bugs.python.org/issue18378#msg215215)
.. [18] GitHub branch diff for ``ncoghlan:pep538-coerce-c-locale``
(
https://github.com/python/cpython/compare/master...ncoghlan:pep538-coerce-c…
)
Copyright
=========
This document has been placed in the public domain under the terms of the
CC0 1.0 license: https://creativecommons.org/publicdomain/zero/1.0/
--
Nick Coghlan | ncoghlan(a)gmail.com | Brisbane, Australia
10
37
Hi everyone,
I possibly found a bug in class __init__ and would like to fix it. So I'm
looking for a mentor to help me.
`class Foo:
def __init__(self, bar=[]):
self.list = bar
spam_1 = Foo()
spam_2 = Foo()
spam_1.list.append(42)
print(spam_2.list)`
At least I think it's a bug. Maybe it's a feature..
Best Regards, Jus
--
Justus Schwabedal
Handy (D): +49 177 939 5281
email: jschwabedal(a)googlemail.com
skype: justus1802
Görlitzer Str. 22
01099 Dresden, Sachsen
Germany
Steinkreuzstr. 23
53757 Sankt Augustin, NRW
Germany
9
9
Hi all,
I'm now looking for cProfile/profile lib's issue, and have solve a series
of dependent problem, here is the list:
#9285 - Add a profile decorator to profile and cProfile
#30113 - Allow helper functions to wrap sys.setprofile
#18971 - Use argparse in the profile/cProfile modules
#30118 - Add unittest for cProfile/profile command line interface
It can divide into two categories, first is the context manager problem,
and the second is optparse to argparse problem.
1. context manager problem:
Relative issue: #9285, #30113
Relative PR: #287, #1212, #1253
This is an issue since 2010, and stop at profile can't simply add a
__enter__ and __exit__ to make it a context manager. The main problem is,
sys.setprofile() will hit the return and get bad return in profile
dispatch_return function. The solution is to insert a simulate call in the
helper function, to provide the context between helper frame and where the
profile is defined.
2. optparse to argparse problem:
Relative issue: #18971, #30118
Relative PR: #1232
Relative patch: profile_argparse.patch
Serhiy have provide the patch of replace optparse to argparse in the
profile and cProfile, but block by Ezio request of unittest for command
line interface. The unittest is provide at #1232, and need to be reivew. If
the unittest is add and argparse patch is apply, we can then solve more
problem, e.g.: #23420, #29238, #21862
This is what I've investigated for cProfile / profile library now,
to be move on, it will need others to review the work.
Thanks!
Best Regards,
Louie.
3
3