[Python-Dev] PEP 414 - some numbers from the Django port

Sat Mar 3 03:28:55 CET 2012

PEP 414 mentions the use of function wrappers and talks about both their
obtrusiveness and their performance impact on Python code. In the Django
Python 3 port, I've used unicode_literals, so the ported code has no u
prefixes, and a function wrapper adorns native strings where they are needed.
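For readers unfamiliar with the approach, here is a minimal sketch of what such a native string wrapper could look like. The name native and the ASCII default are assumptions made for illustration; the post does not show the helper actually used in the port. In a real module, "from __future__ import unicode_literals" would be the first statement, so every unadorned literal is text on both 2.x and 3.x.

```python
import sys


def native(s, encoding="ascii"):
    """Return s as the interpreter's native str type.

    Hypothetical helper: the name and the default encoding are
    illustrative, not taken from the Django port.
    """
    if isinstance(s, str):
        # Already native: a byte string on 2.x, a text string on 3.x.
        return s
    if sys.version_info[0] >= 3:
        # On 3.x anything else is expected to be bytes: decode to text.
        return s.decode(encoding)
    # On 2.x with unicode_literals, literals are unicode: encode to bytes.
    return s.encode(encoding)


# Usage: wrap only the few literals that must be native strings.
header_name = native("Content-Type")
```

Because only the call sites that genuinely need a native string are wrapped, the marker stays rare in the code base.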

Though the port is still a work in progress, it passes all tests on 2.x and
3.x with the SQLite adapter, with only a small number of tests skipped
specifically during the porting exercise (generally due to representational
differences).

I'd like to share some numbers from this port to see what people here think
about them.

Firstly, on obtrusiveness: out of a total of 1872 source files, the native
string marker appears in only 30 files - 18 files in Django itself, and 12
files in the test suite. This is less than 2% of files, so the native string
markers are not especially invasive when looking at code. There are only 76
lines in the ported Django which contain native string markers.

Secondly, on performance: I ran the following three steps 6 times:

1. Run the test suite on unported Django using Python 2.7.2 ("vanilla")
2. Run the test suite on the ported Django using Python 2.7.2 ("ported")
3. Run the test suite on the ported Django using Python 3.2.2 ("ported3")

Django skips some tests when dependencies aren't installed (e.g. PIL for
Python 3.2), which is one reason the test counts below differ. The raw
numbers, in seconds elapsed for each test run, are given below:

vanilla (4659 tests): 468.586 486.231 467.584 464.916 480.530 475.457
ported (4655 tests):  467.350 480.902 479.276 478.748 478.115 486.044
ported3 (4609 tests): 463.161 470.423 463.833 448.097 456.727 504.402

If we allow for the different numbers of tests run by dividing each elapsed
time by the number of tests and multiplying by 100 (giving seconds per 100
tests), we get:

vanilla-weighted: 10.057 10.436 10.036  9.979 10.314 10.205
ported-weighted:  10.040 10.331 10.296 10.285 10.271 10.441
ported3-weighted: 10.049 10.207 10.064  9.722  9.909 10.944
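The normalization above can be reproduced with a few lines of Python. This is only a sketch: the dictionary layout and the function name per_100_tests are illustrative choices, while the timings and test counts are the ones listed above. The results agree with the weighted figures to within rounding in the last digit.

```python
# Raw timings (seconds) and test counts, copied from the figures above.
RUNS = {
    "vanilla": (4659, [468.586, 486.231, 467.584, 464.916, 480.530, 475.457]),
    "ported": (4655, [467.350, 480.902, 479.276, 478.748, 478.115, 486.044]),
    "ported3": (4609, [463.161, 470.423, 463.833, 448.097, 456.727, 504.402]),
}


def per_100_tests(elapsed, tests):
    """Seconds per 100 tests: elapsed / tests * 100."""
    return elapsed / tests * 100.0


# Map each configuration name to its list of normalized timings.
weighted = {
    name: [per_100_tests(t, tests) for t in timings]
    for name, (tests, timings) in RUNS.items()
}

for name in ("vanilla", "ported", "ported3"):
    print(name + "-weighted:", " ".join("%6.3f" % w for w in weighted[name]))
```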

If I run these through ministat, it tells me there is no significant
difference between these data sets at the 95% confidence level:

$ ministat -w 74 vanilla-weighted ported-weighted ported3-weighted 
x vanilla-weighted
+ ported-weighted
* ported3-weighted
+--------------------------------------------------------------------------+
|                    *             +                                       |
|*          *   x   **        *   ++x+      *                             *|
||_______________|___M____|AA_M___AM___|__|_________|                      |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   6         9.979        10.436        10.205     10.171167    0.17883782
+   6         10.04        10.441        10.296     10.277333    0.13148485
No difference proven at 95.0% confidence
*   6         9.722        10.944        10.064     10.149167    0.42250274
No difference proven at 95.0% confidence
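As a cross-check on the ministat verdict, the comparison can be approximated by hand. The sketch below assumes a pooled-variance two-sample Student's t-test, which is the kind of test ministat is understood to apply; the function name pooled_t is illustrative. The critical value 2.228 is the standard two-sided 95% point of the t distribution with 10 degrees of freedom (6 + 6 - 2).

```python
from math import sqrt
from statistics import mean, variance

# Weighted timings (seconds per 100 tests), copied from the figures above.
vanilla = [10.057, 10.436, 10.036, 9.979, 10.314, 10.205]
ported = [10.040, 10.331, 10.296, 10.285, 10.271, 10.441]
ported3 = [10.049, 10.207, 10.064, 9.722, 9.909, 10.944]

# Two-sided 95% critical value of Student's t with 10 degrees of freedom.
T_CRIT = 2.228


def pooled_t(a, b):
    """Two-sample t statistic for mean(b) - mean(a), pooled variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(b) - mean(a)) / sqrt(pooled_var * (1.0 / na + 1.0 / nb))


t_ported = pooled_t(vanilla, ported)    # roughly 1.17
t_ported3 = pooled_t(vanilla, ported3)  # roughly -0.12

for label, t in (("ported", t_ported), ("ported3", t_ported3)):
    verdict = "difference" if abs(t) > T_CRIT else "no difference proven"
    print("%-8s t = %6.3f  %s" % (label, t, verdict))
```

Both statistics fall well inside the critical value, which is consistent with the "No difference proven" lines in the ministat output.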

So, looking at a large project in a relevant problem domain, unicode_literals
and native string markers would appear not to adversely impact readability or
performance.

Your comments would be appreciated.

Regards,

Vinay Sajip