Is this safe enough? Re: [Python-checkins] cpython: _Py_Identifier are always ASCII strings
I realize that _Py_Identifier is a private name, and that PEP 3131 requires anything (except test cases) in the standard library to stick with ASCII ... but somehow, that feels like too long of a chain. I would prefer to see _Py_Identifier renamed to _Py_ASCII_Identifier, or at least a comment stating that Identifiers will (per PEP 3131) always be ASCII -- preferably with an assert to back that up.

-jJ

On Sat, Feb 4, 2012 at 7:46 PM, victor.stinner <python-checkins@python.org> wrote:
http://hg.python.org/cpython/rev/d2c1521ad0a1
changeset: 74772:d2c1521ad0a1
user: Victor Stinner <victor.stinner@haypocalc.com>
date: Sun Feb 05 01:45:45 2012 +0100
summary: _Py_Identifier are always ASCII strings

files:
  Objects/unicodeobject.c | 5 ++---
  1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/Objects/unicodeobject.c b/Objects/unicodeobject.c
--- a/Objects/unicodeobject.c
+++ b/Objects/unicodeobject.c
@@ -1744,9 +1744,8 @@ _PyUnicode_FromId(_Py_Identifier *id)
 {
     if (!id->object) {
-        id->object = PyUnicode_DecodeUTF8Stateful(id->string,
-                                                  strlen(id->string),
-                                                  NULL, NULL);
+        id->object = unicode_fromascii((unsigned char*)id->string,
+                                       strlen(id->string));
         if (!id->object)
             return NULL;
         PyUnicode_InternInPlace(&id->object);
-- Repository URL: http://hg.python.org/cpython
I would prefer to see _Py_Identifier renamed to _Py_ASCII_Identifier, or at least a comment stating that Identifiers will (per PEP 3131) always be ASCII -- preferably with an assert to back that up.
Please ... no. This is a *convenience* interface, whose sole purpose is to make something more convenient. Adding naming clutter destroys this objective. I'd rather restore support for allowing UTF-8 source here (I don't think that requiring ASCII really improves much), than rename the macro.

The ASCII requirement is actually more in the C compiler than in Python. Since not all of the C compilers that we compile Python with support non-ASCII identifiers, failure to comply with the ASCII requirement will trigger a C compilation failure.

Regards,
Martin
2012/2/6 Jim Jewett <jimjjewett@gmail.com>:
I realize that _Py_Identifier is a private name, and that PEP 3131 requires anything (except test cases) in the standard library to stick with ASCII ... but somehow, that feels like too long of a chain.
I would prefer to see _Py_Identifier renamed to _Py_ASCII_Identifier, or at least a comment stating that Identifiers will (per PEP 3131) always be ASCII -- preferably with an assert to back that up.
_Py_IDENTIFIER(xxx) defines a variable called PyId_xxx, so xxx can only be ASCII: the C language doesn't accept non-ASCII identifiers. I thought that the _Py_IDENTIFIER() macro was the only way to create an identifier, and so ASCII was enough... but there is also _Py_static_string. _Py_static_string(name, value) lets you specify an arbitrary string, so you may pass a non-ASCII value. I don't see any use case where you need a non-ASCII value in Python core.
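To make the difference between the two macros concrete, here is a minimal sketch that does not depend on Python.h. The names sketch_identifier, SKETCH_STATIC_STRING and SKETCH_IDENTIFIER are hypothetical stand-ins modeled on the description above; they are not the actual CPython definitions, and the struct layout is an assumption.

```c
#include <string.h>

/* Hypothetical stand-in for _Py_Identifier: a C string plus a slot for
   the lazily created object (a void* here, to avoid needing Python.h). */
typedef struct sketch_identifier {
    const char *string;
    void *object;
} sketch_identifier;

/* Modeled on _Py_static_string(name, value): the value is an arbitrary
   string literal, so nothing prevents it from holding non-ASCII bytes. */
#define SKETCH_STATIC_STRING(varname, value) \
    static sketch_identifier varname = { value, NULL }

/* Modeled on _Py_IDENTIFIER(xxx): the variable is named PyId_xxx and
   the string is the stringified token, so xxx must be a valid C
   identifier for the compiler in use. */
#define SKETCH_IDENTIFIER(varname) \
    SKETCH_STATIC_STRING(PyId_##varname, #varname)

SKETCH_IDENTIFIER(append);                     /* PyId_append, "append" */
SKETCH_STATIC_STRING(dotted_name, "os.path");  /* not a C identifier */
```

With the first macro the string is constrained by what the compiler accepts as an identifier; with the second it is constrained only by what fits in a string literal, which is why the ASCII assumption is weaker there.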
-        id->object = PyUnicode_DecodeUTF8Stateful(id->string,
-                                                  strlen(id->string),
-                                                  NULL, NULL);
+        id->object = unicode_fromascii((unsigned char*)id->string,
+                                       strlen(id->string));
This is just an optimization. If you think that _Py_static_string() is useful, I can revert my change. Otherwise, _Py_static_string() should be removed. Victor
On Mon, 6 Feb 2012 22:57:46 +0100 Victor Stinner <victor.stinner@haypocalc.com> wrote:
-        id->object = PyUnicode_DecodeUTF8Stateful(id->string,
-                                                  strlen(id->string),
-                                                  NULL, NULL);
+        id->object = unicode_fromascii((unsigned char*)id->string,
+                                       strlen(id->string));
This is just an optimization.
Is the optimization even worthwhile? This code is typically called once for every static string. Regards Antoine.
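The "called once" point follows from the caching in the diff: the conversion runs only while id->object is still NULL. A rough sketch of that pattern, combined with the kind of assert requested earlier in the thread, might look like the following. Everything prefixed sketch_ is hypothetical, and a heap-allocated char* copy stands in for the interned unicode object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for _Py_Identifier. */
typedef struct sketch_identifier {
    const char *string;
    char *object;              /* stands in for the cached PyObject* */
} sketch_identifier;

static int sketch_conversions = 0;   /* counts actual conversions */

/* Return 1 if every byte of s is in the 7-bit ASCII range, else 0. */
static int sketch_is_ascii(const char *s)
{
    for (; *s != '\0'; s++) {
        if ((unsigned char)*s > 0x7F)
            return 0;
    }
    return 1;
}

/* Create the object on first use and cache it, in the same shape as
   the "if (!id->object)" test in _PyUnicode_FromId. */
static char *sketch_from_id(sketch_identifier *id)
{
    if (id->object == NULL) {
        size_t len;
        /* The check asked for in the thread: input must be ASCII. */
        assert(sketch_is_ascii(id->string));
        len = strlen(id->string);
        id->object = malloc(len + 1);
        if (id->object == NULL)
            return NULL;
        memcpy(id->object, id->string, len + 1);
        sketch_conversions++;
    }
    return id->object;
}
```

Calling sketch_from_id twice on the same identifier converts only once, so any speedup in the conversion itself is paid back at most once per static string.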
_Py_IDENTIFIER(xxx) defines a variable called PyId_xxx, so xxx can only be ASCII: the C language doesn't accept non-ASCII identifiers.
That's not exactly true. In C89, source code is in the "source character set", which is implementation-defined, except that it must contain the "basic character set". I believe that it allows for implementation-defined characters in identifiers.

In C99, this is extended to include "universal character names" (\u escapes). They may appear in identifiers as long as the characters named are listed in annex D.59 (which I cannot locate).

In C 2011, annexes D.1 and D.2 specify the characters that you can use in an identifier:

D.1 Ranges of characters allowed
1. 00A8, 00AA, 00AD, 00AF, 00B2−00B5, 00B7−00BA, 00BC−00BE, 00C0−00D6, 00D8−00F6, 00F8−00FF
2. 0100−167F, 1681−180D, 180F−1FFF
3. 200B−200D, 202A−202E, 203F−2040, 2054, 2060−206F
4. 2070−218F, 2460−24FF, 2776−2793, 2C00−2DFF, 2E80−2FFF
5. 3004−3007, 3021−302F, 3031−303F
6. 3040−D7FF
7. F900−FD3D, FD40−FDCF, FDF0−FE44, FE47−FFFD
8. 10000−1FFFD, 20000−2FFFD, 30000−3FFFD, 40000−4FFFD, 50000−5FFFD, 60000−6FFFD, 70000−7FFFD, 80000−8FFFD, 90000−9FFFD, A0000−AFFFD, B0000−BFFFD, C0000−CFFFD, D0000−DFFFD, E0000−EFFFD

D.2 Ranges of characters disallowed initially
1. 0300−036F, 1DC0−1DFF, 20D0−20FF, FE20−FE2F

Regards,
Martin
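As an illustration of how the D.1 and D.2 ranges quoted above could be applied, here is a small lookup sketch. The function and type names are hypothetical, the tables are transcribed from the list in this message rather than from the standard itself, and the sketch covers only the extended characters (ASCII letters, digits and the underscore are handled by the ordinary identifier grammar).

```c
#include <stddef.h>

typedef struct { unsigned long lo, hi; } sketch_range;

/* D.1 items 1 to 7, transcribed from the list above. */
static const sketch_range sketch_allowed[] = {
    {0x00A8, 0x00A8}, {0x00AA, 0x00AA}, {0x00AD, 0x00AD}, {0x00AF, 0x00AF},
    {0x00B2, 0x00B5}, {0x00B7, 0x00BA}, {0x00BC, 0x00BE}, {0x00C0, 0x00D6},
    {0x00D8, 0x00F6}, {0x00F8, 0x00FF},
    {0x0100, 0x167F}, {0x1681, 0x180D}, {0x180F, 0x1FFF},
    {0x200B, 0x200D}, {0x202A, 0x202E}, {0x203F, 0x2040}, {0x2054, 0x2054},
    {0x2060, 0x206F},
    {0x2070, 0x218F}, {0x2460, 0x24FF}, {0x2776, 0x2793}, {0x2C00, 0x2DFF},
    {0x2E80, 0x2FFF},
    {0x3004, 0x3007}, {0x3021, 0x302F}, {0x3031, 0x303F},
    {0x3040, 0xD7FF},
    {0xF900, 0xFD3D}, {0xFD40, 0xFDCF}, {0xFDF0, 0xFE44}, {0xFE47, 0xFFFD}
};

/* D.2: allowed in an identifier, but not as its first character. */
static const sketch_range sketch_not_initial[] = {
    {0x0300, 0x036F}, {0x1DC0, 0x1DFF}, {0x20D0, 0x20FF}, {0xFE20, 0xFE2F}
};

static int sketch_in_table(unsigned long cp, const sketch_range *t, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (cp >= t[i].lo && cp <= t[i].hi)
            return 1;
    }
    return 0;
}

/* Return 1 if cp falls in one of the D.1 ranges, else 0. */
static int sketch_ucn_allowed(unsigned long cp)
{
    /* D.1 item 8: planes 1 to 14, minus the last two code points of each. */
    if (cp >= 0x10000UL && cp <= 0xEFFFDUL)
        return (cp & 0xFFFFUL) <= 0xFFFDUL;
    return sketch_in_table(cp, sketch_allowed,
                           sizeof sketch_allowed / sizeof sketch_allowed[0]);
}

/* Return 1 if cp is in D.1 and not excluded from first position by D.2. */
static int sketch_ucn_allowed_initially(unsigned long cp)
{
    return sketch_ucn_allowed(cp) &&
           !sketch_in_table(cp, sketch_not_initial,
                            sizeof sketch_not_initial /
                            sizeof sketch_not_initial[0]);
}
```

For example, U+00E9 passes both checks, U+0301 (a combining accent) is allowed but not in first position, and U+00D7 is rejected because it falls in the gap between 00D6 and 00D8.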
2012/2/7 "Martin v. Löwis" <martin@v.loewis.de>:
_Py_IDENTIFIER(xxx) defines a variable called PyId_xxx, so xxx can only be ASCII: the C language doesn't accept non-ASCII identifiers.
That's not exactly true. In C89, source code is in the "source character set", which is implementation-defined, except that it must contain the "basic character set". I believe that it allows for implementation-defined characters in identifiers.
Hum, I hope that these C89 compilers use UTF-8.
In C99, this is extended to include "universal character names" (\u escapes). They may appear in identifiers as long as the characters named are listed in annex D.59 (which I cannot locate).
Does C99 specify the encoding? Can we expect UTF-8? Python is supposed to work on many platforms, and so must support a lot of compilers, not only compilers supporting non-ASCII identifiers. Victor
Why do we still care about C89? It is 2012 and we're talking about Python 3. What compiler on what platform that anyone actually cares about does not support C99? -gps
2012/2/7 Gregory P. Smith <greg@krypto.org>
Why do we still care about C89? It is 2012 and we're talking about Python 3. What compiler on what platform that anyone actually cares about does not support C99?
The Microsoft compilers on Windows do not support C99:
- Declarations must be at the start of a block
- No designated initializers for structures
- ASCII-only identifiers: http://msdn.microsoft.com/en-us/library/e7f8y25b.aspx

--
Amaury Forgeot d'Arc
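For readers unfamiliar with the first two restrictions, the sketch below shows code written to stay within them, with the C99 forms it avoids noted in comments. The names sketch_point, sketch_offset and sketch_sum are made up for illustration, and the sketch assumes only the limitations listed above.

```c
#include <stddef.h>

typedef struct sketch_point {
    int x;
    int y;
} sketch_point;

/* C99 would allow a designated initializer:
       sketch_point p = { .y = 2, .x = 1 };
   Without it, initializers are positional, in member order. */
static const sketch_point sketch_offset = { 1, 2 };

static int sketch_sum(const int *values, size_t count)
{
    /* All declarations come before the first statement of the block.
       C99 would allow "for (size_t i = 0; ...)" instead. */
    size_t i;
    int total = 0;

    for (i = 0; i < count; i++)
        total += values[i];
    return total;
}
```

Code in this style compiles under both C89 and C99 compilers, so the restrictions cost some convenience but not portability.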
Am 07.02.2012 20:10, schrieb Gregory P. Smith:
Why do we still care about C89? It is 2012 and we're talking about Python 3. What compiler on what platform that anyone actually cares about does not support C99?
As Amaury says: Visual Studio still doesn't support C99. The story is both funny and sad: In Visual Studio 2002, the release notes included a comment that they couldn't consider C99 (in 2002), because of lack of time, and the standard came so quickly. In 2003, they kept this notice. In VS 2005 (IIRC), they said that there is too little customer demand for C99 so that they didn't implement it; they recommended to use C++ or C#, anyway. Now C2011 has been published. Regards, Martin
Does C99 specify the encoding? Can we expect UTF-8?
No, it's implementation-defined. However, that really doesn't matter much for the macro (it does matter for the Mercurial repository): The files on disk are mapped, in an implementation-defined manner, into the source character set. All processing is done there, including any stringification. Then, for string literals, the source character set is converted into the execution character set. So for the definition of the _Py_IDENTIFIER macro, what really matters is the run-time encoding of the stringified identifiers.
Python is supposed to work on many platforms, and so must support a lot of compilers, not only compilers supporting non-ASCII identifiers.
And your point is? Regards, Martin
participants (7)
- "Martin v. Löwis"
- Amaury Forgeot d'Arc
- Antoine Pitrou
- Gregory P. Smith
- Jim Jewett
- martin@v.loewis.de
- Victor Stinner