Fredrik Lundh writes:
-- regexps: has anyone compared the new unicode-aware regexp package in Tcl with pcre?

I looked at it a bit when Tcl 8.1 was in beta; it derives from Henry Spencer's 1998-vintage code, which seems to try to do a lot of optimization and analysis. It may even compile DFAs instead of NFAs when possible, though it's hard for me to be sure. This might give it a substantial speed advantage over engines that do less analysis, but I haven't benchmarked it.

The code is easy to read, but difficult to understand because the theory underlying the analysis isn't explained in the comments; one feels there should be an accompanying paper to explain how everything works, and it's why I'm not sure if it really is producing DFAs for some expressions.

Tcl seems to represent everything as UTF-8 internally, so there's only one regex engine; there's [...]. The code is scattered over more files:

amarok generic>ls re*.[ch]
regc_color.c   regc_locale.c  regcustom.h    regerrs.h      regfree.c
regc_cvec.c    regc_nfa.c     rege_dfa.c     regex.h        regfronts.c
regc_lex.c     regcomp.c      regerror.c     regexec.c      regguts.h
amarok generic>wc -l re*.[ch]
    742 regc_color.c
    170 regc_cvec.c
   1010 regc_lex.c
    781 regc_locale.c
   1528 regc_nfa.c
   2124 regcomp.c
     85 regcustom.h
    627 rege_dfa.c
     82 regerror.c
     18 regerrs.h
    308 regex.h
    952 regexec.c
     25 regfree.c
     56 regfronts.c
    388 regguts.h
   8896 total
amarok generic>

This would be an issue for using it with Python, since all these files would wind up scattered around the Modules directory. For comparison, pypcre.c is around 4700 lines of code.

--
A.M. Kuchling                   http://starship.python.net/crew/amk/
Things need not have happened to be true. Tales and dreams are the
shadow-truths that will endure when mere facts are dust and ashes, and forgot.
    -- Neil Gaiman, _Sandman_ #19: _A Midsummer Night's Dream_
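
P.S. To make the DFA-versus-NFA speed point a little more concrete, here is a rough sketch of the general idea. Nothing in it is taken from Spencer's code or from pcre; the toy automaton for (a?){n}a{n} and the function names backtrack_match and stateset_match are invented purely for illustration. It counts how much work a backtracking search does compared with tracking the set of live states, which is roughly what a DFA-style matcher amounts to.

```python
# Toy automaton for the pattern (a?){n}a{n}, assumed for illustration only:
# states 0..2n, state 2n accepts. States below n are "optional": they may
# consume an 'a' or be skipped for free. States n..2n-1 must consume an 'a'.

def backtrack_match(n, text):
    """Backtracking search; returns (matched, number_of_calls)."""
    steps = 0
    accept = 2 * n

    def walk(state, pos):
        nonlocal steps
        steps += 1
        if state == accept:
            return pos == len(text)  # require a full match
        # Greedy choice first: consume an 'a' if one is available.
        if pos < len(text) and text[pos] == "a" and walk(state + 1, pos + 1):
            return True
        # Optional states may also be skipped without consuming input.
        if state < n:
            return walk(state + 1, pos)
        return False

    return walk(0, 0), steps


def stateset_match(n, text):
    """Track all live states at once; returns (matched, states_visited)."""
    steps = 0
    accept = 2 * n

    def closure(states):
        # Follow the free edges: an optional state i (< n) can reach i + 1.
        out = set(states)
        for s in states:
            while s < n:
                s += 1
                out.add(s)
        return out

    current = closure({0})
    for ch in text:
        steps += len(current)
        if ch != "a":  # every labelled edge in this toy automaton is on 'a'
            current = set()
            break
        current = closure({s + 1 for s in current if s < accept})
    return accept in current, steps


if __name__ == "__main__":
    for n in (4, 8, 12):
        text = "a" * n
        print(n, backtrack_match(n, text), stateset_match(n, text))
```

On inputs like "a" * n the backtracking version has to try every combination of taking or skipping the optional a's before it succeeds, so its call count grows exponentially with n, while the state-set version does at most (2n + 1) units of work per input character. Whether Tcl's engine actually gets this benefit is exactly what I can't tell from the source.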