Speeding up 2to3: Results from a GSOC Project

Hi everyone, I haven't had a chance to introduce myself yet. I'm George Boutsioukis, a CS student from Greece and I'm currently enrolled as a GSOC student for the PSF. The task I was involved with for the past few weeks was speeding up the 2to3 tool. For those who are not aware of 2to3's internals, the tool matches a series of tree patterns to a python syntax tree and applies a code transformation on each match. The real bottleneck of this tool is the tree pattern matching algorithm, since the current version traverses all nodes of the tree, top-to-bottom, and checks each node against a set of tree patterns. The new algorithm I've implemented reduces the candidate nodes for matching. The process works in three steps: 1) Each pattern tree is reduced to the most 'characteristic' path, i.e. the rarest path you'll encounter on a real code tree. (this is simpler than it sounds) 2) From the set of the above linear patterns an Aho-Corasick automaton is constructed that is capable of inclusively matching all patterns(meaning that it may mark a wrong candidate but will never miss an instance) 3) The automaton is run on each leaf until the root of the tree is reached and the resulting match set is returned. And finally we apply each fixer for each node in the match set. Of course the process is a bit more complicated since we have to recheck the transformed code for new matches etc. Since it is quite a hard concept to illustrate in a post, tell me if you have any questions or need more info. The benefit of this somewhat complex process is substantial; the total run time is reduced by roughly 50%. The new module is still experimental code, but mature enough to pass all tests. I propose to include the new module with the next version of Python, perhaps as an explicit option for 2to3 until it can be thoroughly tested. The code repository can be found here: https://code.google.com/p/2to3-speedup2/source/browse/ (the code is a fork from py3k's 2to3 and was tested with version 3.1.2) The blog I've been using for GSOC(containing more details on the matching algorithm): http://george-gsoc.blogspot.com/ Regards, George

Great! CS FTW! --Guido On Fri, Jul 30, 2010 at 1:28 PM, George Boutsioukis <gboutsioukis@gmail.com> wrote:
Hi everyone, I haven't had a chance to introduce myself yet. I'm George Boutsioukis, a CS student from Greece and I'm currently enrolled as a GSOC student for the PSF. The task I was involved with for the past few weeks was speeding up the 2to3 tool.
For those who are not aware of 2to3's internals, the tool matches a series of tree patterns to a python syntax tree and applies a code transformation on each match. The real bottleneck of this tool is the tree pattern matching algorithm, since the current version traverses all nodes of the tree, top-to-bottom, and checks each node against a set of tree patterns.
The new algorithm I've implemented reduces the candidate nodes for matching. The process works in three steps:
1) Each pattern tree is reduced to the most 'characteristic' path, i.e. the rarest path you'll encounter on a real code tree. (this is simpler than it sounds)
2) From the set of the above linear patterns an Aho-Corasick automaton is constructed that is capable of inclusively matching all patterns(meaning that it may mark a wrong candidate but will never miss an instance)
3) The automaton is run on each leaf until the root of the tree is reached and the resulting match set is returned. And finally we apply each fixer for each node in the match set. Of course the process is a bit more complicated since we have to recheck the transformed code for new matches etc.
Since it is quite a hard concept to illustrate in a post, tell me if you have any questions or need more info.
The benefit of this somewhat complex process is substantial; the total run time is reduced by roughly 50%. The new module is still experimental code, but mature enough to pass all tests. I propose to include the new module with the next version of Python, perhaps as an explicit option for 2to3 until it can be thoroughly tested.
The code repository can be found here: https://code.google.com/p/2to3-speedup2/source/browse/ (the code is a fork from py3k's 2to3 and was tested with version 3.1.2)
The blog I've been using for GSOC(containing more details on the matching algorithm): http://george-gsoc.blogspot.com/
Regards, George
_______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/guido%40python.org
-- --Guido van Rossum (python.org/~guido)

Love it! BTW, it's not a good idea to have an import statement under 3 level of loops: https://code.google.com/p/2to3-speedup2/source/browse/trunk/lib2to3/refactor... -- Alexandre
participants (3)
-
Alexandre Vassalotti
-
George Boutsioukis
-
Guido van Rossum