I looked at it a bit when Tcl 8.1 was in beta; it derives from Henry Spencer's 1998-vintage code, which seems to try to do a lot of optimization and analysis. It may even compile DFAs instead of NFAs when possible, though it's hard for me to be sure. This might give it a substantial speed advantage over engines that do less analysis, but I haven't benchmarked it. The code is easy to read, but difficult to understand because the theory underlying the analysis isn't explained in the comments; one feels there should be an accompanying paper to explain how everything works, and it's why I'm not sure if it really is producing DFAs for some expressions.
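For readers unfamiliar with the NFA/DFA distinction mentioned above, here is a minimal sketch of the general idea. It is not taken from Henry Spencer's code and says nothing about how his engine actually works. The pattern `(a|b)*abb`, the transition table, and all function names are made up for illustration. The NFA version tracks a set of live states per input character; the DFA version precomputes those sets once (the textbook subset construction) so that matching becomes one table lookup per character.

```python
# Hypothetical hand-built NFA for (a|b)*abb; illustrative only.
# State 0 loops on 'a' and 'b'; the trailing "abb" runs through states 1, 2, 3.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, ACCEPT = 0, 3


def step(states, ch):
    """Return the set of NFA states reachable from `states` on character `ch`."""
    return frozenset().union(*(NFA.get((s, ch), set()) for s in states))


def nfa_match(text):
    """Simulate the NFA directly: recompute the live state set per character."""
    states = frozenset({START})
    for ch in text:
        states = step(states, ch)
    return ACCEPT in states


def build_dfa(alphabet="ab"):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start = frozenset({START})
    table = {}
    seen = {start}
    todo = [start]
    while todo:
        current = todo.pop()
        for ch in alphabet:
            target = step(current, ch)
            table[(current, ch)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return start, table


def dfa_match(text, start, table):
    """Match with the precomputed table: one dictionary lookup per character."""
    state = start
    for ch in text:
        # Characters outside the alphabet lead to the empty (dead) state.
        state = table.get((state, ch), frozenset())
    return ACCEPT in state


if __name__ == "__main__":
    start, table = build_dfa()
    for sample in ["abb", "aabb", "abab", ""]:
        print(repr(sample), nfa_match(sample), dfa_match(sample, start, table))
```

Both matchers give the same answers. The difference is only where the work happens: the DFA pays for the set computations once at compile time, which is the kind of speed advantage speculated about above.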
Tcl seems to represent everything as UTF-8 internally, so there's only one regex engine; there's .
Hmm... I looked when Tcl 8.1 was in alpha, and I *think* that at that point the regex engine was compiled twice, once for 8-bit chars and once for 16-bit chars. But this may have changed. I've noticed that Perl is taking the same position (everything is UTF-8 internally). On the other hand, Java distinguishes 16-bit chars from 8-bit bytes. Python is currently in the Java camp. This might be a good time to make sure that we're still convinced that this is the right thing to do!
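To make the representation question concrete, here is a small sketch in present-day Python (an assumption; it is not code from Tcl, Perl, or Java, and the helper name `widths` is made up). It shows that the same characters take a varying number of bytes in UTF-8 but a fixed single 16-bit unit each in UTF-16 for these samples, which is why an engine working on UTF-8 bytes has to decode variable-width sequences wherever it matches "one character".

```python
def widths(text):
    """Return (char, utf8_byte_count, utf16_unit_count) for each character."""
    result = []
    for ch in text:
        utf8_bytes = len(ch.encode("utf-8"))
        # utf-16-le omits the byte-order mark; each code unit is 2 bytes.
        utf16_units = len(ch.encode("utf-16-le")) // 2
        result.append((ch, utf8_bytes, utf16_units))
    return result


if __name__ == "__main__":
    # Sample characters chosen for illustration: ASCII, Latin-1 range, euro sign.
    for ch, n8, n16 in widths("a\u00ef\u20ac"):
        print(f"U+{ord(ch):04X}: {n8} UTF-8 byte(s), {n16} UTF-16 unit(s)")
```

With a single UTF-8 engine there is one code path but per-character decoding; with separate 8-bit and 16-bit builds each engine can index fixed-width units, at the cost of compiling the code twice.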
The code is scattered over more files:
amarok generic>ls re*.[ch]
regc_color.c   regc_locale.c  regcustom.h    regerrs.h      regfree.c
regc_cvec.c    regc_nfa.c     rege_dfa.c     regex.h        regfronts.c
regc_lex.c     regcomp.c      regerror.c     regexec.c      regguts.h
amarok generic>wc -l re*.[ch]
    742 regc_color.c
    170 regc_cvec.c
   1010 regc_lex.c
    781 regc_locale.c
   1528 regc_nfa.c
   2124 regcomp.c
     85 regcustom.h
    627 rege_dfa.c
     82 regerror.c
     18 regerrs.h
    308 regex.h
    952 regexec.c
     25 regfree.c
     56 regfronts.c
    388 regguts.h
   8896 total
amarok generic>
This would be an issue for using it with Python, since all these files would wind up scattered around the Modules directory. For comparison, pypcre.c is around 4700 lines of code.
I'm sure that if it's good code, we'll find a way. Perhaps a more interesting question is whether it is Perl5 compatible. I contacted Henry Spencer at the time and he was willing to let us use his code.

--Guido van Rossum (home page: http://www.python.org/~guido/)