[Python-Dev] Why Foo is better than Baz

Guido van Rossum guido at CNRI.Reston.VA.US
Mon May 3 17:32:09 CEST 1999


> 	I looked at it a bit when Tcl 8.1 was in beta; it derives from
> Henry Spencer's 1998-vintage code, which seems to try to do a lot of
> optimization and analysis.  It may even compile DFAs instead of NFAs
> when possible, though it's hard for me to be sure.  This might give it
> a substantial speed advantage over engines that do less analysis, but
> I haven't benchmarked it.  The code is easy to read, but difficult to
> understand because the theory underlying the analysis isn't explained
> in the comments; one feels there should be an accompanying paper to
> explain how everything works, and it's why I'm not sure if it really
> is producing DFAs for some expressions.
> 
> 	Tcl seems to represent everything as UTF-8 internally, so
> there's only one regex engine; there's .

Hmm...  I looked when Tcl 8.1 was in alpha, and I *think* that at that 
point the regex engine was compiled twice, once for 8-bit chars and
once for 16-bit chars.  But this may have changed.

I've noticed that Perl is taking the same position (everything is
UTF-8 internally).  On the other hand, Java distinguishes 16-bit chars 
from 8-bit bytes.  Python is currently in the Java camp.  This might
be a good time to make sure that we're still convinced that this is
the right thing to do!
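
To make the trade-off between the two camps concrete, here is a minimal
sketch of how storage size differs between a UTF-8 representation and a
fixed 16-bit one.  It is written against a present-day Python 3
interpreter, not the 1999 one, and the helper name encoded_sizes is
hypothetical, chosen only for illustration.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, with no byte-order mark."""
    utf8_len = len(text.encode("utf-8"))       # 1 to 4 bytes per code point
    utf16_len = len(text.encode("utf-16-le"))  # 2 bytes per BMP code point
    return utf8_len, utf16_len


# Pure ASCII: UTF-8 needs one byte per character, 16-bit chars need two.
print(encoded_sizes("regex"))

# One non-ASCII Latin-1 character (U+00EF) costs two bytes in UTF-8.
print(encoded_sizes("na\u00efve"))
```

The flip side, which the sketch does not show, is that a UTF-8 string no
longer has a fixed width per character, so indexing by character position
is not a simple pointer offset the way it is with 16-bit chars.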

> The code is scattered over
> more files:
> 
> amarok generic>ls re*.[ch]
> regc_color.c    regc_locale.c   regcustom.h     regerrs.h       regfree.c
> regc_cvec.c     regc_nfa.c      rege_dfa.c      regex.h         regfronts.c
> regc_lex.c      regcomp.c       regerror.c      regexec.c       regguts.h
> amarok generic>wc -l re*.[ch]
>      742 regc_color.c
>      170 regc_cvec.c
>     1010 regc_lex.c
>      781 regc_locale.c
>     1528 regc_nfa.c
>     2124 regcomp.c
>       85 regcustom.h
>      627 rege_dfa.c
>       82 regerror.c
>       18 regerrs.h
>      308 regex.h
>      952 regexec.c
>       25 regfree.c
>       56 regfronts.c
>      388 regguts.h
>     8896 total
> amarok generic>
> 
> 	This would be an issue for using it with Python, since all
> these files would wind up scattered around the Modules directory.  For
> comparison, pypcre.c is around 4700 lines of code.

I'm sure that if it's good code, we'll find a way.  Perhaps a more
interesting question is whether it is Perl5 compatible.  I contacted
Henry Spencer at the time and he was willing to let us use his code.

--Guido van Rossum (home page: http://www.python.org/~guido/)




