Guido van Rossum writes:
Hmm... I looked when Tcl 8.1 was in alpha, and I *think* that at that point the regex engine was compiled twice, once for 8-bit chars and once for 16-bit chars. But this may have changed.
It doesn't seem to currently; the code in tclRegexp.c looks like this:

    /* Remember the UTF-8 string so Tcl_RegExpRange() can convert the
     * matches from character to byte offsets.
     */
    regexpPtr->string = string;

    Tcl_DStringInit(&stringBuffer);
    uniString = Tcl_UtfToUniCharDString(string, -1, &stringBuffer);
    numChars = Tcl_DStringLength(&stringBuffer) / sizeof(Tcl_UniChar);

    /* Perform the regexp match. */
    result = TclRegExpExecUniChar(interp, re, uniString, numChars, -1,
            ((string > start) ? REG_NOTBOL : 0));

ISTR the Spencer engine does, however, define a small and large representation for NFAs and have two versions of the engine, one for each representation. Perhaps that's what you're thinking of.
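The comment in that snippet mentions converting matches from character offsets back to byte offsets in the UTF-8 string. As a rough illustration of what that conversion involves, here is a minimal sketch in present-day Python; the helper name char_span_to_byte_span is hypothetical, and this is not how Tcl_RegExpRange() is actually implemented.

```python
import re


def char_span_to_byte_span(text, start, end):
    """Map a (start, end) character span in `text` to the matching
    byte span in the UTF-8 encoding of `text`.

    Hypothetical helper for illustration only.
    """
    # Bytes occupied by everything before the match.
    byte_start = len(text[:start].encode("utf-8"))
    # Bytes occupied by the match itself.
    byte_end = byte_start + len(text[start:end].encode("utf-8"))
    return byte_start, byte_end


# The regex engine works on characters, so the span is in characters.
text = "na\u00efve caf\u00e9"
match = re.search("caf\u00e9", text)
char_span = match.span()

# Non-ASCII characters take more than one byte in UTF-8, so the
# byte span differs from the character span.
byte_span = char_span_to_byte_span(text, *char_span)
encoded = text.encode("utf-8")
recovered = encoded[byte_span[0]:byte_span[1]].decode("utf-8")
```

Slicing the encoded bytes with the converted span yields the same text the character-level match found, which is the property the C comment is relying on.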
I've noticed that Perl is taking the same position (everything is UTF-8 internally). On the other hand, Java distinguishes 16-bit chars from 8-bit bytes. Python is currently in the Java camp. This might be a good time to make sure that we're still convinced that this is the right thing to do!
I don't know. There's certainly the fundamental dichotomy that strings are sometimes used to represent characters, where changing encodings on input and output is reasonable, and sometimes used to hold chunks of binary data, where any changes are incorrect. Perhaps Paul Prescod is right, and we should try to get some other data type (array.array()) for holding binary data, as distinct from strings.
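To make the dichotomy concrete, here is a small sketch, assuming a present-day Python 3 where text strings and byte sequences are already separate types. The sample values are made up for illustration; only array.array itself comes from the suggestion above.

```python
from array import array

# Character data: the same text can be written out in different
# encodings, and each one decodes back to the same string.
text = "caf\u00e9"
as_latin1 = text.encode("latin-1")
as_utf8 = text.encode("utf-8")
same_text = as_latin1.decode("latin-1") == as_utf8.decode("utf-8")

# Binary data: an array of unsigned bytes has no encoding, and its
# contents must survive input and output unchanged.
raw = array("B", [0x89, 0x50, 0x4E, 0x47, 0xFF, 0x00])
payload = raw.tobytes()
round_trip = array("B", payload)

# Treating the binary payload as text and re-encoding it under a
# different encoding silently changes the bytes.
mangled = payload.decode("latin-1").encode("utf-8")
```

Here as_latin1 and as_utf8 differ byte for byte yet represent the same text, while mangled is no longer equal to payload: re-encoding is harmless for characters and destructive for binary data.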
I'm sure that if it's good code, we'll find a way. Perhaps a more interesting question is whether it is Perl5 compatible. I contacted Henry Spencer at the time and he was willing to let us use his code.
Mostly Perl-compatible, though it doesn't look like the 5.005 features are there, and I haven't checked for every single 5.004 feature. Adding missing features might be problematic, because I don't really understand what the code is doing at a high level. Also, is there a user community for this code? Do any other projects use it? Philip Hazel has been quite helpful with PCRE, an important thing when making modifications to the code.
Should I make a point of looking at what using the Spencer engine would entail? It might not be too difficult (an evening or two, maybe?) to write a re.py that sat on top of the Spencer code; that would at least let us do some benchmarking.
-- 
A.M. Kuchling http://starship.python.net/crew/amk/
In Einstein's theory of relativity the observer is a man who sets out in quest of truth armed with a measuring-rod. In quantum theory he sets out with a sieve.
    -- Sir Arthur Eddington