SF patch 864863: Bisect C implementation

Dmitry Vasiliev has submitted a C implementation for the bisect module. My thoughts are to accept it with the following changes: * leave the original code intact for teaching purposes * import the C version from _bisect * make sure the C code is not list specific and will work with any container supporting __len__(), __getitem__(), and insert(). Do you guys have any objections? Raymond Hettinger

Dmitry Vasiliev has submitted a C implementation for the bisect module.
My thoughts are to accept it with the following changes:
* leave the original code intact for teaching purposes * import the C version from _bisect * make sure the C code is not list specific and will work with any container supporting __len__(), __getitem__(), and insert().
Do you guys have any objections?
Only that I wished you had done the same for heapq.py. --Guido van Rossum (home page: http://www.python.org/~guido/)

[Raymond]
Dmitry Vasiliev has submitted a C implementation for the bisect module.
My thoughts are to accept it with the following changes:
* leave the original code intact for teaching purposes
+1
* import the C version from _bisect
+1
* make sure the C code is not list specific and will work with any container supporting __len__(), __getitem__(), and insert().
-0.6. "Gratuitous hyper-generality", IMO -- I'm willing to bet nobody has ever written a production class, with that collection of methods, that would also be suitable for using with bisect. "list subtype" is good enough, and will leave the code clearer, quicker, and more maintainable.
Do you guys have any objections?
Six tenths of one, as above <wink>. [Guido]
Only that I wished you had done the same for heapq.py.
It's not too late. IIRC, neither you nor I could make time to review the heapq plans before the C code got checked in. I was able to make a little time after, and mourned the loss of the educational value of the Python version (to Raymond, in pvt). That's a big enough point that it deserves public airing, though: everyone (except Guido, I guess) please remember what it was like when you first learned Python! Mounds and mounds of library code written in Python, doing interesting things with interesting techniques, and written at a high enough level that you weren't wholly buried under details. Python's C code is generally clean and educational too, but has a *much* smaller potential audience than our pure-Python modules. Nobody can sanely accuse me of not caring about speed (see a decade of speed-obsessed Python checkins for counterexamples <wink>), but I cry a little each time a piece of the system gets recoded in C. Keeping the original Python code around is a nice compromise (pioneered, IIRC, by string.py, a loooong time ago).

[Raymond]
Dmitry Vasiliev has submitted a C implementation for the bisect module.
My thoughts are to accept it with the following changes:
* leave the original code intact for teaching purposes * import the C version from _bisect * make sure the C code is not list specific and will work with any container supporting __len__(), __getitem__(), and insert().
Do you guys have any objections?
[Guido]
Only that I wished you had done the same for heapq.py.
It's not too late. IIRC, neither you nor I could make time to review
heapq plans before the C code got checked in. I was able to make a
[Tim] the little
time after, and mourned the loss of the educational value of the Python version (to Raymond, in pvt).
I would be happy to bring back the python version and make it coexist with the C version. IMO, it was a programming pearl unto itself and efforts were already made to preserve the extensive module docstring and make the code match the original line for line with the variable names and logic intact. The bisect module warrants even more consideration because it is more basic and more venerable. [Tim]
Nobody can sanely accuse me of not caring about speed (see a decade of speed-obsessed Python checkins for counterexamples <wink>), but I cry a little each time a piece of the system gets recoded in C. Keeping the original Python code around is a nice compromise (pioneered, IIRC, by string.py, a loooong time ago).
Also, with the PyPy project, there is a chance that cleanly coded pure python modules will survive the test of time longer than their C counterparts. In the itertools module, I documented pure python equivalents to make the documentation more specific, to enchance understanding, and to make it easier to backport. For the most part, that effort was successful. The only downside is the double maintenance problem of keeping the two in sync. Given the stability of bisect, that is not much of an issue. Raymond Hettinger ################################################################# ################################################################# ################################################################# ##### ##### ##### ################################################################# ################################################################# #################################################################

[Raymond]
I would be happy to bring back the python version and make it coexist with the C version.
string.py was the first poster child for this, but it's changed a lot over the years. Originally, string.py was pure Python. As the years went by, it *remained* pure Python, but more and more of the functions got overriden, at the bottom of the file, by importing C functions with the same names. That's kinda lost now, because the module functions turned into string methods, and most of the Python code that remains in string.py is trivial delegation to the string methods now. Several years ago, it all looked a lot like maketrans() looks today (a complete implementation in Python, and then at the bottom that's thrown away). I don't really care whether the Python implementations get tested. And I definitely don't want more messes like pickle+cPickle and StringIO+cStringIO -- they're *always* out of synch in some way, and I doubt that will ever be repaired. Putting them under Demo doesn't do any good, because nobody maintains that directory; we don't even ship that directory in the Windows installer, so putting stuff there renders it invisible to most of Python's users.
IMO, it was a programming pearl unto itself and efforts were already made to preserve the extensive module docstring and make the code match the original line for line with the variable names and logic intact.
Most Python users don't know C, and will never see the C implementation.
... Also, with the PyPy project, there is a chance that cleanly coded pure python modules will survive the test of time longer than their C counterparts. In the itertools module, I documented pure python equivalents to make the documentation more specific, to enchance understanding, and to make it easier to backport. For the most part, that effort was successful.
I love the Python-workalike docs for itertools! That's a very helpful form of documentation, building on Python knowledge to supply a more precise explanation than is possible with informal English. OTOH, the itertools functions have (unlike, e.g., pickle) pretty simple, well-defined jobs. Needing to present half a page (or more) of Python code instead wouldn't make for pleasant docs. For that reason, I was happy to leave datetime.py to die in the sandbox.
The only downside is the double maintenance problem of keeping the two in sync. Given the stability of bisect, that is not much of an issue.
The double-maintenance problem is severe for non-trivial code. The current string module, and bisect module, functions are trivial. heapq is somewhere in the middle, more like the string module used to be. I'd be perfectly happy if heapq.py were restored just as it was, and then a new section added at the bottom to replace it all with C functions. I don't think putting the Python implementations of heapq functions in the docs would be satisfying, because the details of their implementations don't shed light on the *purpose* of the functions. In contrast, the Python implementations of most itertools functions define the purpose of those functions more succinctly than English. The implementations of heapq functions are much more about a specific way of implementing a full binary tree than they are about heap operations.

"Tim" == Tim Peters <tim.one@comcast.net> writes:
Tim> [Raymond] >> I would be happy to bring back the python version and make it coexist >> with the C version. Tim> Putting them under Demo doesn't do any good, because nobody Tim> maintains that directory; we don't even ship that directory in the Tim> Windows installer, so putting stuff there renders it invisible to Tim> most of Python's users. I still don't see why putting the original Python implementations under Demo somewhere is necessarily bad. They would still be available in source distributions and they could be tested from there without being installed - at least on Unix-like systems - by tweaking the "test" target in the Makefile. Something like: test: all platform # first two runs - test all modules which will be installed -find $(srcdir)/Lib -name '*.py[co]' -print | xargs rm -f -$(TESTPYTHON) $(TESTPROG) $(TESTOPTS) $(TESTPYTHON) $(TESTPROG) $(TESTOPTS) # second two runs - as above but force the Python versions of # certain modules to be used -find $(srcdir)/Lib -name '*.py[co]' -print | xargs rm -f PYTHONPATH=$(srcdir)/Demo/lib -$(TESTPYTHON) $(TESTPROG) $(TESTOPTS) PYTHONPATH=$(srcdir)/Demo/lib $(TESTPYTHON) $(TESTPROG) $(TESTOPTS) might be sufficient. (You might have to rename some .so files in certain cases. Also, statically linked modules would pose problems.) >> In the itertools module, I documented pure python equivalents ... Tim> I love the Python-workalike docs for itertools! That's fine, but they can't easily be tested from there. Tim> I'd be perfectly happy if heapq.py were restored just as it was, Tim> and then a new section added at the bottom to replace it all with C Tim> functions. That's also tough to test. I believe there were situations over the years where the Python implementations of some functions in string.py and the corresponding C implementations in stropmodule.c got subtly out-of-sync. In part I think that's because the Python versions never got any exercise during the unit tests. The hack to suppress importing functions from strop would be ugly. I think you'd have to set an environment variable then test its presence in string.py before importing symbols from strop. If we are going to optimize standard library modules by implementing C counterparts but don't want the Python and C implementations to drift apart, we have to be able to test the Python versions somehow. Whatever that mechanism turns out to be has to be fairly convenient to use and shouldn't clutter up Python modules with checks solely used by the test framework. Skip

At 07:54 02.01.2004 -0600, Skip Montanaro wrote:
If we are going to optimize standard library modules by implementing C counterparts but don't want the Python and C implementations to drift apart, we have to be able to test the Python versions somehow. Whatever that mechanism turns out to be has to be fairly convenient to use and shouldn't clutter up Python modules with checks solely used by the test framework.
I think that something like: <heapq.py> heapq python code... try: from _heapq import * # or list the names explicitly except ImportError: pass </heapq.py> can be made to work. The tricky part is more in test_heapq, but I think it can be rewritten in some form like: import heapq # import likely the C code test(heapq) # or heapq could be a global heapq = test.support.get_pure_python_version(heapq) test(heapq) or some variation. Assuming some conventions like _heapq/heapq etc, get_pure_python_version would return a module not in sys.modules populated just with the python code, basically tricking the last import momentarily to fail. Barring some strange intermodule dependency issues which should not be the case for this kind of modules this should be workable. regards.

At 15:47 02.01.2004 +0100, Samuele Pedroni wrote:
import heapq # import likely the C code test(heapq) # or heapq could be a global heapq = test.support.get_pure_python_version(heapq)
oops, that should have been test_support
test(heapq)
here's a tentative minimally tested(*) version of get_pure_python_version: def get_pure_python_version(mod,bltin_mod_name=None): DUMMY = object() mod_name = mod.__name__ if bltin_mod_name is None: bltin_mod_name = '_'+mod_name import inspect,sys,new pure_py_ver = new.module('pure-python-'+mod_name) # ? saved_bltin_mod = sys.modules.get(bltin_mod_name,DUMMY) # import _<mod_name> should fail momentarily sys.modules[bltin_mod_name] = None # or an empty module execfile(inspect.getsourcefile(mod),pure_py_ver.__dict__) if saved_bltin_mod is DUMMY: del sys.modules[bltin_mod_name] else: sys.modules[bltin_mod_name] = saved_bltin_mod return pure_py_ver we could also detect the case were there's no builtin version, which means that the first test likely already tested just the pure Python version. *: tested with: <array2.py> class array(object): def __init__(*args,**kws): pass def __repr__(self): return "<fake pure python array>" try: from array import array except ImportError: pass </array2.py> import array2 ppy = get_pure_python_version(array2)

At 16:31 02.01.2004 +0100, Samuele Pedroni wrote:
import array2 ppy = get_pure_python_version(array2)
when you have a real interactive session transcript why not simply copy it (grr):
import array2 array2.array <type 'array.array'> array2.array('i') array('i') ppy = get_pure_python_version(array2,'array') ppy.array <class 'pure-python-array2.array'> ppy.array('i') <fake pure python array>

I think that something like:
<heapq.py> heapq python code...
try: from _heapq import * # or list the names explicitly except ImportError: pass </heapq.py>
can be made to work.
I like this too.
The tricky part is more in test_heapq, but I think it can be rewritten in some form like:
import heapq # import likely the C code test(heapq) # or heapq could be a global heapq = test.support.get_pure_python_version(heapq) test(heapq)
or some variation.
Note that you could force the _heapq import to fail by putting an explicit None in sys.modules, like so: import heapq ...test the C impl... sys.modules['_heapq'] = None reload(heapq) # Now 'from _heapq import ...' will fail ...test the Python impl... # Restore the default situation del sys.modules['_heapq'] reload(heapq) I wish cStringIO and cPickle would go away, but unfortunately the Python versions have features that the C code don't have. Hopefully the use of new-style classes can avoid this issue for new module pairs. --Guido van Rossum (home page: http://www.python.org/~guido/)

[Guido]
... I wish cStringIO and cPickle would go away, but unfortunately the Python versions have features that the C code don't have.
Alas, for pickle it's worse than that -- cPickle has lots of undocumented features that pickle.py knows nothing about, and those features are used in Zope. I believe cStringIO also has undocumented features in real use ... yup, cStringIO objects at least have an object.reset() method not in the Python version, equivalent to object.seek(0) (but, I guess, saves the expense of doing a LOAD_CONST on 0 <heh>). cStringIO also provides a C-level API.
Hopefully the use of new-style classes can avoid this issue for new module pairs.
Or you can just stop accepting C code written by Zope Corp <wink>.

Tim> Nobody can sanely accuse me of not caring about speed (see a decade Tim> of speed-obsessed Python checkins for counterexamples <wink>), but Tim> I cry a little each time a piece of the system gets recoded in C. Tim> Keeping the original Python code around is a nice compromise Tim> (pioneered, IIRC, by string.py, a loooong time ago). It would be nice to keep the old Python implementations around for two reasons. First, there's the obvious educational value. Second, the PyPy folks will know where to find it when they need it. I think a Demo/Lib subdirectory (with appropriate README) would be a reasonable place to put such code out to pasture. It could be made available in source distributions but never installed. Skip

On Wed, 2003-12-31 at 00:08, Skip Montanaro wrote:
It would be nice to keep the old Python implementations around for two reasons. First, there's the obvious educational value. Second, the PyPy folks will know where to find it when they need it. I think a Demo/Lib subdirectory (with appropriate README) would be a reasonable place to put such code out to pasture. It could be made available in source distributions but never installed.
The problem is that relegating stuff to Demo would very likely lead to bitrot, which is not a good thing. Another advantage of having current, working (and unit tested) pure-Python alternatives is that they are much easier to modify if you're experimenting with new features, or bug fixes, etc. OTOH, some of the really ancient Python modules will occasionally need attention to bring them up to modern coding standards, idioms, and practices. -Barry

Barry> The problem is that relegating stuff to Demo would very likely Barry> lead to bitrot, which is not a good thing. True. Note, however, that since these files would be in a source distribution there's no reason they couldn't be factored into "make test" somehow. Barry> Another advantage of having current, working (and unit tested) Barry> pure-Python alternatives is that they are much easier to modify Barry> if you're experimenting with new features, or bug fixes, etc. Agreed. Skip

[Skip Montanaro]
It would be nice to keep the old Python implementations around for two reasons. First, there's the obvious educational value. Second, the PyPy folks will know where to find it when they need it. I think a Demo/Lib subdirectory (with appropriate README) would be a reasonable place to put such code out to pasture. It could be made available in source distributions but never installed.
[Barry]
The problem is that relegating stuff to Demo would very likely lead to bitrot, which is not a good thing. Another advantage of having current, working (and unit tested) pure-Python alternatives is that they are much easier to modify if you're experimenting with new features, or bug fixes, etc.
OTOH, some of the really ancient Python modules will occasionally need attention to bring them up to modern coding standards, idioms, and practices.
There a several ways to go: * have two visible modules like StringIO and cStringIO -- personally, I see the approach as clutter but it makes it possible to unittest both. * put the C module behind the scenes and import it into the python module while leaving the old code conditionally excluded or just commented out. * if the code is short, move it to the docs -- the itertools module has pure python equivalents in the docs -- that makes the code accessible and makes the docs more informative -- this might work for bisect and heapq where the amount of code is small. Raymond ################################################################# ################################################################# ################################################################# ##### ##### ##### ################################################################# ################################################################# #################################################################

Raymond Hettinger wrote:
[Barry]
The problem is that relegating stuff to Demo would very likely lead to bitrot, which is not a good thing. Another advantage of having
current,
working (and unit tested) pure-Python alternatives is that they are
much
easier to modify if you're experimenting with new features, or bug fixes, etc.
OTOH, some of the really ancient Python modules will occasionally need attention to bring them up to modern coding standards, idioms, and practices.
There a several ways to go:
* have two visible modules like StringIO and cStringIO -- personally, I see the approach as clutter but it makes it possible to unittest both.
I agree with Raymond on this option; having two versions of the same thing when there is no difference sans performance just clutters up the stdlib. Kind of goes against TIOOWTDI.
* put the C module behind the scenes and import it into the python module while leaving the old code conditionally excluded or just commented out.
But that has the same issue as moving it to Demo since it still won't be used and thus tested.
* if the code is short, move it to the docs -- the itertools module has pure python equivalents in the docs -- that makes the code accessible and makes the docs more informative -- this might work for bisect and heapq where the amount of code is small.
This is the best solution, but I am afraid that for the modules we are discussing this does not work because of the amount of code. I am afraid there is no good solution to this. Either we move the code to a place where it is not really used (such as Demo, but we can add a flag to regrtest.py to test modules in that directory so the bitrot is not horrendous) or not used at all which begins to wear away the usefulness of keeping the code around in the first place. Perhaps we need to just consider the fact that some modules have such beautiful code in Python that we just leave them as such. Not everything needs to be coded in C (and this is in no way meant to not show appreciation to Raymond and anyone else who has converted Python code to C! Some modules do have a basic performance need like bisect that should be dealt with when possible). The main reason I was able to get involved in python-dev was that I was able to jump in and help with the Python code without much of a learning curve since it did not require learning anything new beyond basic practices of the list. Maintainability is a really important thing and each C module makes that a little bit much harder. I think now that this issue has come up it should be enough for people to be aware that some people care enough about the Python code that this probably won't be an issue in the future. If someone has doubts as to whether someone will miss the Python code then they can just ask the list for a quick opinion. But I trust everyone with checkin rights who feels up to replacing a module to not be rash about it so I don't see a need for a vote for every module. As for dealing with the modules that have already been converted, I say put them in Demo. There is no need to do the importing in a Python module wrapper if there was enough of a performance reason to write it in C. And if the code is wanted as reference stick it in the C code's comments. OK, back to the Summary. -Brett

"Raymond Hettinger" <python@rcn.com> writes:
* have two visible modules like StringIO and cStringIO -- personally, I see the approach as clutter but it makes it possible to unittest both.
At least for pickle, and at least historically, pickle has offered something cPickle hasn't: the ability to subclass. Cheers, mwh -- 34. The string is a stark data structure and everywhere it is passed there is much duplication of process. It is a perfect vehicle for hiding information. -- Alan Perlis, http://www.cs.yale.edu/homes/perlis-alan/quotes.html

Raymond Hettinger wrote:
* if the code is short, move it to the docs -- the itertools module has pure python equivalents in the docs -- that makes the code accessible and makes the docs more informative -- this might work for bisect and heapq where the amount of code is small.
+1 -- Dmitry Vasiliev (dima at hlabs.spb.ru)

On Tue, 2003-12-30 at 22:23, Tim Peters wrote:
That's a big enough point that it deserves public airing, though: everyone (except Guido, I guess) please remember what it was like when you first learned Python! Mounds and mounds of library code written in Python, doing interesting things with interesting techniques, and written at a high enough level that you weren't wholly buried under details. Python's C code is generally clean and educational too, but has a *much* smaller potential audience than our pure-Python modules. Nobody can sanely accuse me of not caring about speed (see a decade of speed-obsessed Python checkins for counterexamples <wink>), but I cry a little each time a piece of the system gets recoded in C. Keeping the original Python code around is a nice compromise (pioneered, IIRC, by string.py, a loooong time ago).
Yes, yes, yes. I'd be bummed if modules like pickle.py got lost or out of sync with its C cousin. -Barry
participants (10)
-
Barry Warsaw
-
Brett
-
Dmitry Vasiliev
-
Guido van Rossum
-
Michael Hudson
-
Raymond Hettinger
-
Raymond Hettinger
-
Samuele Pedroni
-
Skip Montanaro
-
Tim Peters