For anyone who is interested in more details on the CRT changes, there's a blog post from my colleague who worked on most of them at http://blogs.msdn.com/b/vcblog/archive/2014/06/10/the-great-crt-refactoring....

I wanted to call out one section and add some details:

In order to unify these different CRTs [desktop, phone, etc], we have split the CRT into three pieces:

1. VCRuntime (vcruntime140.dll): This DLL contains all of the runtime functionality required for things like process startup and exception handling, and functionality that is coupled to the compiler for one reason or another. We may need to make breaking changes to this library in the future.
2. AppCRT (appcrt140.dll): This DLL contains all of the functionality that is usable on all platforms. This includes the heap, the math library, the stdio and locale libraries, most of the string manipulation functions, the time library, and a handful of other functions. We will maintain backwards compatibility for this part of the CRT.
3. DesktopCRT (desktopcrt140.dll): This DLL contains all of the functionality that is usable only by desktop apps. Notably, this includes the functions for working with multibyte strings, the exec and spawn process management functions, and the direct-to-console I/O functions. We will maintain backwards compatibility for this part of the CRT.

The builds of Python I've already made are indeed linked against these three DLLs, though it happens transparently. Most of the APIs are from the AppCRT, which is a good sign as it will simplify portability to other Windows-based platforms (though the direct references to the Win32 API will arise again to complicate this).

Very few functions are imported from VCRuntime, which is the only part that *may* have breaking changes in the future (that's the current promise, and I'd expect it to be strengthened one way or the other by release).
Apart from the standard memcpy/strcpy type functions (which may be moved in later builds), these other imports are compiler helpers:

* void terminate(void) (currently exported as a decorated C++ function, but that's going to be fixed)
* __vcrt_TerminateProcess
* __vcrt_UnhandledException
* __vcrt_cleanup_type_info_names
* _except_handler4_common
* _local_unwind4

I've checked with our CRT dev and he says that these don't keep any state (and won't cause problems like we've seen in the past with FILE*), and are only there to deal with potential C++ exceptions - they are included at a point where it is impossible to tell whether C++ is involved, and so can't be removed.

My builds pass almost all of regrtest.py and the only issues are with Tcl/tk and OpenSSL, which need to update their compiler version detection. I've built them with changes, though as usual Tcl/tk is a real pain.

I ran a quick test with profile-guided optimization (PGO, pronounced "pogo"), which has supposedly been improved since VC9, and saw a very unscientific 20% speed improvement on pybench.py and 10% size reduction in python35.dll. I'm not sure what we used to get from VC9, but it certainly seems worth enabling provided it doesn't break anything. (Interestingly, PGO decided that only 1% of functions needed to be compiled for speed. Not sure if I can find out which ones those are but if anyone's interested I can give it a shot?)

Cheers,
Steve
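As a rough illustration of the three-way CRT split described in the message above, the following sketch flags which imported names come from the one DLL that carries no compatibility promise. The DLL names and promises are taken from the quoted blog text; the `imports_at_risk` helper, the shape of the import table, and the placement of `malloc` and `printf` in the AppCRT are assumptions made for the example, not output from any real tool.

```python
# Stability promises as quoted from the blog post in the message above.
CRT_PIECES = {
    "vcruntime140.dll": "may have breaking changes",
    "appcrt140.dll": "backwards compatible",
    "desktopcrt140.dll": "backwards compatible",
}


def imports_at_risk(import_table):
    """Return imported names that come from a DLL without a compatibility promise."""
    risky = []
    for dll, names in import_table.items():
        if CRT_PIECES.get(dll.lower()) == "may have breaking changes":
            risky.extend(names)
    return sorted(risky)


# Hypothetical import table; a real one would have to come from a tool
# that dumps the imports of python35.dll.
example_imports = {
    "vcruntime140.dll": ["_local_unwind4", "__vcrt_TerminateProcess"],
    "appcrt140.dll": ["malloc", "printf"],  # illustrative placement only
}

print(imports_at_risk(example_imports))
```

Keeping the promise per DLL in one table means the check only needs updating if the promise for VCRuntime is strengthened later, as the message anticipates.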
On 10/06/2014 12:30, Steve Dower wrote:
I ran a quick test with profile-guided optimization (PGO, pronounced "pogo"), which has supposedly been improved since VC9, and saw a very unscientific 20% speed improvement on pybench.py and 10% size reduction in python35.dll. I'm not sure what we used to get from VC9, but it certainly seems worth enabling provided it doesn't break anything. (Interestingly, PGO decided that only 1% of functions needed to be compiled for speed. Not sure if I can find out which ones those are but if anyone's interested I can give it a shot?)

I would recommend using the non-trivial suite of benchmarks at http://hg.python.org/benchmarks (both for the profiling and the benchmarking, though you may want to use additional workloads for profiling too).

Regards

Antoine.
Antoine Pitrou wrote:
On 10/06/2014 12:30, Steve Dower wrote:
I ran a quick test with profile-guided optimization (PGO, pronounced "pogo"), which has supposedly been improved since VC9, and saw a very unscientific 20% speed improvement on pybench.py and 10% size reduction in python35.dll. I'm not sure what we used to get from VC9, but it certainly seems worth enabling provided it doesn't break anything. (Interestingly, PGO decided that only 1% of functions needed to be compiled for speed. Not sure if I can find out which ones those are but if anyone's interested I can give it a shot?)
I would recommend using the non-trivial suite of benchmarks at http://hg.python.org/benchmarks (both for the profiling and the benchmarking, though you may want to use additional workloads for profiling too)
Regards
Antoine.
Thanks. I knew there was a proper set somewhere, but didn't manage to track it down in the minute or so I spent looking :)

Cheers,
Steve
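Once timings from a proper benchmark suite are in hand, the before/after comparison discussed here could be computed along these lines. This is a minimal sketch: the `relative_speedup` helper is hypothetical, the timings are invented rather than measured, and it assumes "speed improvement" means the fractional reduction in median run time.

```python
import statistics


def relative_speedup(baseline_times, candidate_times):
    """Return the fractional reduction in median run time (0.20 means 20% less time)."""
    base = statistics.median(baseline_times)
    cand = statistics.median(candidate_times)
    return (base - cand) / base


# Invented wall-clock timings in seconds, for illustration only.
non_pgo_runs = [10.2, 10.0, 10.1]
pgo_runs = [8.1, 8.0, 8.2]

print(f"{relative_speedup(non_pgo_runs, pgo_runs):.1%}")
```

Using the median of several runs rather than a single run makes the figure less sensitive to one noisy measurement, which matters when the result is described as "very unscientific".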
2014-06-10 18:30 GMT+02:00 Steve Dower
I ran a quick test with profile-guided optimization (PGO, pronounced "pogo"), which has supposedly been improved since VC9, and saw a very unscientific 20% speed improvement on pybench.py and 10% size reduction in python35.dll. I'm not sure what we used to get from VC9, but it certainly seems worth enabling provided it doesn't break anything. (Interestingly, PGO decided that only 1% of functions needed to be compiled for speed. Not sure if I can find out which ones those are but if anyone's interested I can give it a shot?)
If we upgrade the compiler on Windows, some optimizer options could perhaps be enabled again. Previous Visual Studio (2010?) bugs:

* http://bugs.python.org/issue15993
* http://bugs.python.org/issue8847#msg166935

Victor
On 10.06.14 18:30, Steve Dower wrote:
I ran a quick test with profile-guided optimization (PGO, pronounced "pogo"), which has supposedly been improved since VC9, and saw a very unscientific 20% speed improvement on pybench.py and 10% size reduction in python35.dll. I'm not sure what we used to get from VC9, but it certainly seems worth enabling provided it doesn't break anything. (Interestingly, PGO decided that only 1% of functions needed to be compiled for speed. Not sure if I can find out which ones those are but if anyone's interested I can give it a shot?)
You probably ran too little Python code. See PCbuild/build_pgo.bat for what used to be part of the release process. It takes quite some time, but it rebuilt more than 1% (IIRC).

FWIW, I stopped using PGO for the official releases when it was demonstrated to generate bad code. In my experience, a compiler that generates bad code has lost trust "forever", so it will be hard to justify re-enabling PGO (like "but it really works this time"). I wasn't sad when I found a justification to skip the profiling, since it significantly held up the release process.

Regards,
Martin
Martin v. Löwis wrote:
On 10.06.14 18:30, Steve Dower wrote:
I ran a quick test with profile-guided optimization (PGO, pronounced "pogo"), which has supposedly been improved since VC9, and saw a very unscientific 20% speed improvement on pybench.py and 10% size reduction in python35.dll. I'm not sure what we used to get from VC9, but it certainly seems worth enabling provided it doesn't break anything. (Interestingly, PGO decided that only 1% of functions needed to be compiled for speed. Not sure if I can find out which ones those are but if anyone's interested I can give it a shot?)
You probably ran too little Python code. See PCbuild/build_pgo.bat for what used to be part of the release process. It takes quite some time, but it rebuilt more than 1% (IIRC).
That's almost certainly the case. I didn't run anywhere near enough to call it good, though I'd only really expect the size to get worse and the speed to get better.
FWIW, I stopped using PGO for the official releases when it was demonstrated to generate bad code. In my experience, a compiler that generates bad code has lost trust "forever", so it will be hard to justify re-enabling PGO (like "but it really works this time"). I wasn't sad when I found a justification to skip the profiling, since it significantly held up the release process.
Yeah, and it seems the bad code is still there. I suspect it's actually due to optimizing for space rather than speed, and not due to PGO directly, but either way I'll be trying to get it fixed.

[EARLIER EMAIL]
By "keep around", I'd be fine with "in a subdirectory of PC". PCbuild should either switch for sure, or not switch at all. People had proposed to come up with a "PCbuildN" directory (N=10, N=14, or whatever) to maintain two build environments simultaneously; I'd be -1 on such a plan. There needs to be one official toolset to build Python X.Y with, and it needs to be either VS 2010 or VS 2014, but not both.
That's what I have planned. Right now it's in my sandbox and I've just replaced the existing PCbuild contents (rather wholesale - I took the opportunity to simplify the files, which is important to me as I spend most of my time editing them by hand rather than through VS). When/if I merge, the version in PC\VS10.0 will be exactly what was there at merge time.
Regards, Martin
And thanks, I appreciate the context and suggestions. Cheers, Steve
On Tue, Jun 10, 2014 at 9:30 AM, Steve Dower
I ran a quick test with profile-guided optimization (PGO, pronounced "pogo"), which has supposedly been improved since VC9, and saw a very unscientific 20% speed improvement on pybench.py and 10% size reduction in python35.dll. I'm not sure what we used to get from VC9, but it certainly seems worth enabling provided it doesn't break anything. (Interestingly, PGO decided that only 1% of functions needed to be compiled for speed. Not sure if I can find out which ones those are but if anyone's interested I can give it a shot?)
For what it's worth, we build Google's internal Python interpreters with
gcc's flavour of PGO and are seeing somewhat more than 20% performance
increase for Python 2.7. (We train using most of the testsuite, not
pybench, and I believe the Debian/Ubuntu packages also do this.) I believe
almost all of that is from speedups to the main eval loop, which is a huge
switch in a bigger loop with complicated jump logic. It wouldn't surprise
me if VS's PGO only decided to optimize that eval loop :)
--
Thomas Wouters
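To make the "huge switch in a bigger loop with complicated jump logic" shape concrete, here is a toy stack-machine loop. It is only a sketch of the general structure: the real eval loop is C code, the opcodes below are invented for the example, and an if/elif chain stands in for the C switch.

```python
# Opcode numbers and names are invented for this sketch.
PUSH, ADD, MUL, JUMP_IF_ZERO, HALT = range(5)


def run(code):
    """Execute a list of (opcode, argument) pairs on a small stack machine."""
    stack = []
    pc = 0
    while True:                      # the "bigger loop"
        op, arg = code[pc]
        pc += 1
        if op == PUSH:               # each branch plays the role of a switch case
            stack.append(arg)
        elif op == ADD:
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == MUL:
            right = stack.pop()
            left = stack.pop()
            stack.append(left * right)
        elif op == JUMP_IF_ZERO:     # jump logic: overwrite the program counter
            if stack.pop() == 0:
                pc = arg
        elif op == HALT:
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode {op}")


# (2 + 3) * 4
program = [(PUSH, 2), (PUSH, 3), (ADD, None), (PUSH, 4), (MUL, None), (HALT, None)]
print(run(program))
```

Every instruction passes through the same dispatch point, so which branches are hot depends entirely on the workload being run. That is one plausible reason a profile-driven optimizer would concentrate on this one function.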
participants (5): Martin v. Löwis, Antoine Pitrou, Steve Dower, Thomas Wouters, Victor Stinner