Re: [Python-Dev] [Python-checkins] cpython: Implement PEP 393.
Is there some reason str.format had such major surgery done to it? It appears parts of it were removed from stringlib. I had not even thought to look at the code before it was merged, as it never occurred to me anyone would do that. I left it in stringlib even in 3.x because there's the occasional talk of adding bytes.bformat, and since all of the code works well with stringlib (since it was used by str and unicode in 2.x), it made sense to leave it there. In addition, there are outstanding patches that are now broken. I'd prefer it return to how it used to be, and just the minimum changes required for PEP 393 be made to it.
Thanks. Eric.
On 9/28/2011 2:35 AM, martin.v.loewis wrote:
http://hg.python.org/cpython/rev/8beaa9a37387
changeset: 72475:8beaa9a37387
user: Martin v. Löwis
date: Wed Sep 28 07:41:54 2011 +0200
summary: Implement PEP 393.
files:
Doc/c-api/unicode.rst | 9 +
Include/Python.h | 5 +
Include/complexobject.h | 5 +-
Include/floatobject.h | 5 +-
Include/longobject.h | 6 +-
Include/pyerrors.h | 6 +
Include/pyport.h | 3 +
Include/unicodeobject.h | 783 +-
Lib/json/decoder.py | 3 +-
Lib/test/json_tests/test_scanstring.py | 11 +-
Lib/test/test_codeccallbacks.py | 7 +-
Lib/test/test_codecs.py | 4 +
Lib/test/test_peepholer.py | 4 -
Lib/test/test_re.py | 7 +
Lib/test/test_sys.py | 38 +-
Lib/test/test_unicode.py | 41 +-
Makefile.pre.in | 6 +-
Misc/NEWS | 2 +
Modules/_codecsmodule.c | 8 +-
Modules/_csv.c | 2 +-
Modules/_ctypes/_ctypes.c | 6 +-
Modules/_ctypes/callproc.c | 8 -
Modules/_ctypes/cfield.c | 64 +-
Modules/_cursesmodule.c | 7 +-
Modules/_datetimemodule.c | 13 +-
Modules/_dbmmodule.c | 12 +-
Modules/_elementtree.c | 31 +-
Modules/_io/_iomodule.h | 2 +-
Modules/_io/stringio.c | 69 +-
Modules/_io/textio.c | 352 +-
Modules/_json.c | 252 +-
Modules/_pickle.c | 4 +-
Modules/_sqlite/connection.c | 19 +-
Modules/_sre.c | 382 +-
Modules/_testcapimodule.c | 2 +-
Modules/_tkinter.c | 70 +-
Modules/arraymodule.c | 8 +-
Modules/md5module.c | 10 +-
Modules/operator.c | 27 +-
Modules/pyexpat.c | 11 +-
Modules/sha1module.c | 10 +-
Modules/sha256module.c | 10 +-
Modules/sha512module.c | 10 +-
Modules/sre.h | 4 +-
Modules/syslogmodule.c | 14 +-
Modules/unicodedata.c | 28 +-
Modules/zipimport.c | 141 +-
Objects/abstract.c | 4 +-
Objects/bytearrayobject.c | 147 +-
Objects/bytesobject.c | 127 +-
Objects/codeobject.c | 15 +-
Objects/complexobject.c | 19 +-
Objects/dictobject.c | 20 +-
Objects/exceptions.c | 26 +-
Objects/fileobject.c | 17 +-
Objects/floatobject.c | 19 +-
Objects/longobject.c | 84 +-
Objects/moduleobject.c | 9 +-
Objects/object.c | 10 +-
Objects/setobject.c | 40 +-
Objects/stringlib/count.h | 9 +-
Objects/stringlib/eq.h | 23 +-
Objects/stringlib/fastsearch.h | 4 +-
Objects/stringlib/find.h | 31 +-
Objects/stringlib/formatter.h | 1516 --
Objects/stringlib/localeutil.h | 27 +-
Objects/stringlib/partition.h | 12 +-
Objects/stringlib/split.h | 26 +-
Objects/stringlib/string_format.h | 1385 --
Objects/stringlib/stringdefs.h | 2 +
Objects/stringlib/ucs1lib.h | 35 +
Objects/stringlib/ucs2lib.h | 34 +
Objects/stringlib/ucs4lib.h | 34 +
Objects/stringlib/undef.h | 10 +
Objects/stringlib/unicode_format.h | 1416 ++
Objects/stringlib/unicodedefs.h | 2 +
Objects/typeobject.c | 18 +-
Objects/unicodeobject.c | 6112 ++++++++---
Objects/uniops.h | 91 +
PC/_subprocess.c | 61 +-
PC/import_nt.c | 2 +-
PC/msvcrtmodule.c | 8 +-
PC/pyconfig.h | 4 -
PC/winreg.c | 8 +-
Parser/tokenizer.c | 6 +-
Python/_warnings.c | 16 +-
Python/ast.c | 61 +-
Python/bltinmodule.c | 26 +-
Python/ceval.c | 17 +-
Python/codecs.c | 44 +-
Python/compile.c | 89 +-
Python/errors.c | 4 +-
Python/formatter_unicode.c | 1445 ++-
Python/getargs.c | 46 +-
Python/import.c | 347 +-
Python/marshal.c | 4 +-
Python/peephole.c | 18 -
Python/symtable.c | 8 +-
Python/traceback.c | 59 +-
Tools/gdb/libpython.py | 27 +-
configure | 65 +-
configure.in | 46 +-
pyconfig.h.in | 6 -
Am 29.09.2011 01:21, schrieb Eric V. Smith:
Is there some reason str.format had such major surgery done to it?
Yes: I couldn't figure out how to do it any other way. The formatting code had a few basic assumptions which now break (unless you keep using the legacy API). Primarily, the assumption is that there is a notion of a "STRINGLIB_CHAR" which is the element of a string representation. With PEP 393, no such type exists anymore - it depends on the individual object what the element type for the representation is.
In other cases, I worked around that by compiling the stringlib three times, for Py_UCS1, Py_UCS2, and Py_UCS4. For one, this gives considerable code bloat, which I didn't like for the formatting code (as that is already a considerable amount of code). More importantly, this approach wouldn't have worked well, anyway, since the formatting combines multiple Unicode objects (especially with the OutputString buffer), and different inputs may have different representations. On top of that, OutputString needs widening support: it starts out with a narrow string and widens step-by-step whenever an input string is wider than the current output (or not at all, if the input strings are all ASCII).
It would have been possible to keep the basic structure by doing all formatting in Py_UCS4. This would incur significant memory and runtime overhead.
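A minimal sketch of the widening idea described above, modeled in Python rather than in the C that CPython actually uses. The names WideningBuffer and kind_for are hypothetical and do not correspond to anything in the changeset; the sketch only shows a buffer that starts at one byte per character and re-encodes itself at 2 or 4 bytes per character when a wider input arrives.

```python
def kind_for(ch: str) -> int:
    """Smallest element width in bytes (1, 2 or 4) that can hold ch."""
    cp = ord(ch)
    if cp < 0x100:
        return 1
    if cp < 0x10000:
        return 2
    return 4


class WideningBuffer:
    """Hypothetical model of an output buffer that widens on demand."""

    def __init__(self) -> None:
        self.kind = 1            # bytes per character, starts narrow
        self.data = bytearray()  # fixed-width little-endian code points

    def _widen(self, new_kind: int) -> None:
        # Re-encode every stored character at the new, larger width.
        old_kind = self.kind
        widened = bytearray()
        for i in range(0, len(self.data), old_kind):
            cp = int.from_bytes(self.data[i:i + old_kind], "little")
            widened += cp.to_bytes(new_kind, "little")
        self.data = widened
        self.kind = new_kind

    def append(self, text: str) -> None:
        # Widen only if this input needs more room than the buffer has.
        needed = max((kind_for(c) for c in text), default=1)
        if needed > self.kind:
            self._widen(needed)
        for c in text:
            self.data += ord(c).to_bytes(self.kind, "little")

    def to_str(self) -> str:
        k = self.kind
        return "".join(
            chr(int.from_bytes(self.data[i:i + k], "little"))
            for i in range(0, len(self.data), k)
        )
```

If every appended string is ASCII, the buffer never leaves the one-byte representation, which is the case the step-by-step widening is meant to keep cheap.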
In addition, there are outstanding patches that are now broken.
I'm sorry about that. Try applying them to the new files, though - patch may still be able to figure out how to integrate them, as the algorithms and function structure haven't changed.
I'd prefer it return to how it used to be, and just the minimum changes required for PEP 393 be made to it.
Please try for yourself. On string_format.h, I think there is zero chance, unless you want to compromise on efficiency (in addition to the already-present compromise on code cleanliness, due to the fact that the code is more general than it needs to be). On formatter.h, it may actually be possible to restore what it was - in particular if you can make a guarantee that all number formatting always outputs ASCII strings only (which I'm not so sure about, as the thousands separator could be any character, in principle). Without that guarantee, it may indeed be reasonable to compile formatter.h in Py_UCS4, since the resulting strings will be small, so the overhead is probably negligible.
Regards, Martin
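The concern about the thousands separator can be sketched as follows. This is an illustration only: group_digits and max_code_point are hypothetical helpers, not the formatter.h code, and U+202F (NARROW NO-BREAK SPACE) is used here merely as an assumed example of a non-ASCII separator.

```python
def group_digits(n: int, sep: str) -> str:
    """Insert sep between groups of three digits (hypothetical helper)."""
    digits = str(abs(n))
    groups = []
    while digits:
        groups.append(digits[-3:])   # take the last three digits
        digits = digits[:-3]
    body = sep.join(reversed(groups))
    return ("-" if n < 0 else "") + body


def max_code_point(s: str) -> int:
    """Largest code point in s; this decides the element width needed."""
    return max(map(ord, s), default=0)


ascii_result = group_digits(1234567, ",")       # separator is ASCII
wide_result = group_digits(1234567, "\u202f")   # assumed non-ASCII separator
```

With the comma, every character of the result fits in one byte; with the assumed U+202F separator, the same digits no longer fit a one-byte-per-character representation, so an ASCII-only guarantee would not hold.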
On 10/1/2011 9:26 AM, "Martin v. Löwis" wrote:
Am 29.09.2011 01:21, schrieb Eric V. Smith:
Is there some reason str.format had such major surgery done to it?
Yes: I couldn't figure out how to do it any other way. The formatting code had a few basic assumptions which now break (unless you keep using the legacy API). Primarily, the assumption is that there is a notion of a "STRINGLIB_CHAR" which is the element of a string representation. With PEP 393, no such type exists anymore - it depends on the individual object what the element type for the representation is.
Martin: Thanks so much for your thoughtful answer. You've obviously given this more thought than I have. From your answer, it does indeed sound like string_format.h needs to be removed from stringlib. I'll have to think more about formatter.h. On the other hand, not having this code in stringlib would certainly be liberating! Maybe I'll take this opportunity to clean it up and simplify it now that it's free of the stringlib constraints. Eric.
On Sat, Oct 1, 2011 at 4:07 PM, Eric V. Smith wrote:
On the other hand, not having this code in stringlib would certainly be liberating! Maybe I'll take this opportunity to clean it up and simplify it now that it's free of the stringlib constraints.
Yeah, don't sacrifice speed in str.format for a still-hypothetical-and-potentially-never-going-to-happen bytes formatting variant. If the latter does happen, the use cases would be different enough that I'm not even sure the mini-language should remain entirely the same (e.g. you'd likely want direct access to some of the struct module formatting more so than str-style formats). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
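To illustrate the difference in use cases, here is a small sketch contrasting a str-style format with a struct-module layout. The function names and the record layout are hypothetical, chosen only for the example; str.format and struct.pack are the standard library calls.

```python
import struct


def text_record(name: str, value: int) -> str:
    # str-style formatting: alignment, padding, and a hex presentation type.
    return "{:>6}|{:04x}".format(name, value)


def binary_record(tag: int, value: int) -> bytes:
    # struct-style formatting: "<" = little-endian with no padding,
    # "H" = unsigned 16-bit integer, "I" = unsigned 32-bit integer.
    return struct.pack("<HI", tag, value)


line = text_record("id", 255)
blob = binary_record(1, 0xDEADBEEF)
```

The text format describes how a value should look to a reader, while the struct format describes an exact byte layout, which is closer to what binary data usually needs.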
participants (3)
- "Martin v. Löwis"
- Eric V. Smith
- Nick Coghlan