GSoC: Speed Up 2to3 in Pattern Matching

Hi all,

First, apologies for the length of this proposal. My name is Gombiuda Jiang, and I am eager to hear all kinds of comments and advice. Thank you for taking the time to read it.

Idea Background
============================

As the PSF pays more and more attention to porting projects from Python 2.x to Python 3.x, developers in the community have started putting effort into porting their own projects. Fortunately, the `2to3` tool automates much of this work. However, `2to3` has its limitations, and among them, the speed of conversion can be improved by replacing the current implementation. More specifically, `2to3` is slow because its pattern matching uses backtracking. Backtracking is easy to understand and implement, which is why it was chosen in the first place; moreover, it is implemented with recursion rather than iteration. As the workload of `2to3` grows, this weakness shows itself: more and more stack frames are created for the recursive calls. It is therefore worth adopting a different algorithm to speed up the pattern-matching process.

Improvement
============================

Pattern matching is a classic problem with several efficient algorithms. One of them is the `McNaughton-Yamada-Thompson` algorithm, which is based on NFA (Nondeterministic Finite Automaton) theory. It can efficiently decide whether a string (a token stream works just as well) matches a given regular expression, which meets the basic requirements of pattern matching in `2to3`. After implementing the algorithm and passing the test cases that certify it works correctly, we can, if necessary, improve it further by converting to a DFA, which may cost more memory and pattern-compilation time but matches faster.
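To make the idea concrete, the construction at the heart of the algorithm can be sketched as follows. This is a minimal illustration, not `lib2to3` code: the `State`, `literal`, `concat`, `alternate`, `star`, and `match` names are my own, and the automaton here works on plain characters rather than pattern tokens.

```python
# Minimal sketch of the McNaughton-Yamada-Thompson construction:
# each operator builds an NFA fragment with one start and one
# accept state, glued together with epsilon (None) edges.

class State(object):
    def __init__(self):
        self.edges = []      # list of (label, target); label None = epsilon

def literal(c):
    s, e = State(), State()
    s.edges.append((c, e))
    return s, e

def concat(a, b):
    a[1].edges.append((None, b[0]))
    return a[0], b[1]

def alternate(a, b):
    s, e = State(), State()
    s.edges.append((None, a[0]))
    s.edges.append((None, b[0]))
    a[1].edges.append((None, e))
    b[1].edges.append((None, e))
    return s, e

def star(a):
    s, e = State(), State()
    s.edges.append((None, a[0]))
    s.edges.append((None, e))
    a[1].edges.append((None, a[0]))
    a[1].edges.append((None, e))
    return s, e

def eps_closure(states):
    # All states reachable from `states` via epsilon edges only.
    stack, seen = list(states), set(states)
    while stack:
        st = stack.pop()
        for label, tgt in st.edges:
            if label is None and tgt not in seen:
                seen.add(tgt)
                stack.append(tgt)
    return seen

def match(nfa, string):
    # Simulate the NFA without backtracking: track the whole set of
    # live states, advancing all of them on each input symbol.
    start, accept = nfa
    current = eps_closure({start})
    for c in string:
        nxt = {tgt for st in current for label, tgt in st.edges if label == c}
        current = eps_closure(nxt)
    return accept in current

# Pattern (a|b)*c:
nfa = concat(star(alternate(literal('a'), literal('b'))), literal('c'))
print(match(nfa, 'abbac'))   # True
print(match(nfa, 'abd'))     # False
```

The key property is in `match`: because the simulation carries a set of live states forward, each input symbol is examined once, so matching runs in time proportional to the input length instead of the exponential worst case of backtracking.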
The details are as follows. In the original implementation, a fixer's pattern string is first tokenized and then compiled into a pattern syntax tree, which is then matched against the syntax tree of the Python 2.x source code using backtracking. Here we keep the original tokenization, but feed its result into an `NFAGenerator` to generate the NFA. Finally, we feed a node of the syntax tree built from the Python 2.x source into the NFA, and it outputs `True` or `False`. The remaining step (the replacement process) is unchanged. So the new pieces are an `NFAGenerator` class and an `NFAPattern` class that represents the generator's output. Here is a model of the two classes:

    class NFAPattern(object):
        """
        Attributes:
        - `content`: the original pattern to be matched
        - `graph`: a 2D array storing the transition diagram

        Methods:
        - `optimize(self)`: may be implemented later using DFA
          conversion or a graph-reduction algorithm
        - `match(self, node, result)`: traverses the tree, finds all
          matching nodes, and appends them to `result`
        """

        def __init__(self, content):
            self.graph = None
            self.content = content

        def optimize(self):
            # Maybe implement later using DFA conversion or a
            # graph-reduction algorithm.
            pass

        def match(self, node, result=None):
            # Find all matching nodes (as roots) and append them to
            # `result` to be passed back to the caller.
            pass


    class NFAGenerator(object):
        """
        Basically the same as the original `PatternCompiler`, except
        for the implementation of the compile_node() method.
        """

        def __init__(self):
            # Implement like the original `PatternCompiler`.
            pass

        def compile_pattern(self, input, debug=False):
            tokens = tokenize_wrapper(input)
            root = self.driver.parse_tokens(tokens, debug=debug)
            return self.compile_node(root)

        def compile_node(self, node):
            # Implement the `McNaughton-Yamada-Thompson` algorithm to
            # compile a pattern string into an NFA.
            pass

With this improvement, the matching process will be faster than before.
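To illustrate how `NFAPattern.match` could travel the tree and collect every matching node, here is a minimal sketch. The `Node` class and the `accepts` predicate are hypothetical stand-ins: a real implementation would run the compiled NFA at each node instead of the toy lambda used here.

```python
# Sketch of the tree walk behind NFAPattern.match: try the automaton
# at every node and collect the ones it accepts.

class Node(object):
    def __init__(self, type, children=None):
        self.type = type
        self.children = children or []

def collect_matches(node, accepts, result=None):
    """Depth-first walk; append every node for which `accepts`
    (a stand-in for NFA simulation) returns True."""
    if result is None:
        result = []
    if accepts(node):
        result.append(node)
    for child in node.children:
        collect_matches(child, accepts, result)
    return result

# A toy syntax tree with two print statements at different depths.
tree = Node('file_input', [
    Node('print_stmt', [Node('NAME'), Node('STRING')]),
    Node('expr_stmt', [Node('print_stmt')]),
])
hits = collect_matches(tree, lambda n: n.type == 'print_stmt')
print(len(hits))   # prints 2
```

The `result=None` default mirrors the `match(self, node, result=None)` signature in the class model above: callers either receive a fresh list or pass their own to accumulate matches across calls.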
But there is still potential for improvement: parallelization. Simply imagine that we have a bunch of files to be converted. Since there is no ordering relationship between the files, multiple threads can easily be used. The changes would be:

    # Inside refactor.py
    import os
    import threading

    class RefactoringTool(object):
        # ....
        THREADS = 10

        def refactor_dir(self, dir_name, write=False, doctests_only=False):
            lock = threading.Lock()

            def generate_filenames(dir_name):
                for dirpath, dirnames, filenames in os.walk(dir_name):
                    for name in filenames:
                        if not name.startswith(".") and name.endswith(".py"):
                            yield os.path.join(dirpath, name)

            def refactor_one_file(it):
                while True:
                    # Serialize access to the shared filename generator.
                    with lock:
                        try:
                            fullname = next(it)
                        except StopIteration:
                            break
                    self.refactor_file(fullname, write, doctests_only)

            it = generate_filenames(dir_name)
            threads = []
            for i in range(self.THREADS):
                threads.append(threading.Thread(target=refactor_one_file,
                                                args=(it,)))
            for thread in threads:
                thread.start()
            for thread in threads:
                thread.join()
        # ....

Time Schedule of GSoC
============================

April 27~May 6:     Communicate with the mentor to gather more requirements.
May 7~May 16:       Read more documentation to pin down the implementation.
May 17~May 26:      Write the skeleton of the new implementation.
May 27~June 6:      Write test cases, and write a benchmark tool for measuring efficiency.
June 7~July 6:      Write the new implementation.
July 7~July 11:     Testing and bug fixing.
July 12~July 16:    Mid-term evaluation.
July 17~July 26:    Improve from NFA to DFA, or just reduce the NFA graph, or add multithreading support.
July 27~August 6:   Scrub code and test.
August 7~August 16: Write documentation.
August 17:          Pencils down.

About Me
============================

Name: Gombiuda Jiang
University: Sun Yat-sen University, China
Major: Software Engineering
Email/Gtalk: gombiuda@gmail.com

Knowledge Background:
- 9 years of programming experience: 6 years focused on learning algorithms and data structures, and 3 years of software development.
- Some research experience in face detection and recognition using Hidden Markov Models.
- Some research experience in text clustering with artificial intelligence.
- Fundamental courses at the software school, such as `Software Engineering`, `Operating System`, ...

Prizes:
- First prize in NOIP, primary school district group, which earned me bonus points for high-school admission.
- First prize in NOIP, high school district group, which earned me a recommendation to Sun Yat-sen University.
- Third prize in the `Software Innovation Competition` at Sun Yat-sen University.
- Second prize in the `Commercial Software Innovation Competition` held by Citizen Bank.

Development Experience:
- An RSS news reader written in C++ in 2007.
- A commercial website for the Citizen Bank competition using Java SSH.
- An educational online-judge website using Python and Django, which is in use at our software school.
- Some small tools, such as a file format converter, an automatic Internet voter, a cheater for the Travian game, and so on.

Why Me?
============================

I have been using Linux as my default system for years and I like open source very much; I want to keep contributing my energy to the community. I have been using Python for nearly 2 years and I like it very much for its efficiency, so I have read part of the Python source code and come to understand its internal structure.
As a result, I would like to take part in developing Python projects, especially the core and the built-in libraries, and with strong algorithmic skills, I think I can manage it!

Future View
============================

With the success of this project, I would like to join the group of Python developers. If you see my potential, please let me in; I will do my best to learn and work hard, and I will not disappoint you. Finally, thanks again for reading my proposal.

--
Sincerely,
Gombiuda Jiang