[Python-Dev] re with Unicode broken?

Sjoerd Mullender sjoerd.mullender@oratrix.com
Fri, 13 Jul 2001 17:54:07 +0200


On Fri, Jul 13 2001 "Fredrik Lundh" wrote:

> sjoerd wrote:
> 
> > This is not for the faint of heart.
> >
> > My validating XML parser doesn't work anymore, even though I didn't
> > change a thing (except update Python from CVS).
> 
> when did you last update without problems?

I have no idea.  I update regularly (only on the main branch), but I
don't run the program very often.

> the likely cause for this is MvL's "big char set" patch, which
> I checked in on July 6.
> 
> here's a workaround: tweak sre_compile.py so it doesn't generate
> BIGCHARSET op codes. in _optimize_charset, change this:
> 
>     except IndexError:
>         # character set contains unicode characters
>         return _optimize_unicode(charset, fixup)
>     # compress character map
> 
> to
> 
>     except IndexError:
>         # character set contains unicode characters
>         return charset # WORKAROUND: no compression
>     # compress character map
> 
> I'll look into this over the weekend.

Yes, this works.
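
A quick sanity check along these lines can confirm that character
classes with non-Latin-1 ranges still match after the change. This is
only a sketch in present-day Python 3 syntax. The ranges are a small
illustrative subset of the XML 1.0 name characters, not the full _Name
pattern, and the names NAME_START, NAME_CHAR and is_name are made up
for the example.

```python
import re

# Illustrative subset of the XML 1.0 name-start ranges (assumption:
# NOT the complete production, just enough to exercise wide ranges).
NAME_START = "A-Za-z_:\u00c0-\u00d6\u00d8-\u00f6\u4e00-\u9fa5"
# Name characters additionally allow digits, '.', '-' and U+00B7.
NAME_CHAR = NAME_START + "0-9.\\-\u00b7"

name = re.compile("[%s][%s]*" % (NAME_START, NAME_CHAR))


def is_name(s):
    """Return True if the whole string matches the reduced name pattern."""
    return name.fullmatch(s) is not None


if __name__ == "__main__":
    for candidate in ["\u00e9lan", "\u4e2d\u6587", "a-b.c", "1abc", "\u00d7x"]:
        print(repr(candidate), is_name(candidate))
```

U+00D7 (the multiplication sign) sits between the two Latin-1 ranges, so
it makes a handy negative test for an off-by-one in the character map.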


While you're looking at this, maybe you can also look at speeding up
regular expression compilation?  :-)

Importing the module with my XML parser takes an inordinate amount of
time.  This is entirely due to compiling all the regular expressions.
There are a lot of them, and since many of them use the _Name pattern
that I included in my previous message, they tend to be big.
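
One way to keep the cost out of import time, without changing the
expressions themselves, is to defer compilation until a pattern is
first used. The following is a minimal sketch of that idea, not
something re provides. The class name LazyPattern and its attributes
are hypothetical.

```python
import re


class LazyPattern:
    """Compile the regular expression on first use instead of at import."""

    def __init__(self, source, flags=0):
        self.source = source
        self.flags = flags
        self._compiled = None  # filled in by the first match/search call

    def _get(self):
        # Compile once, then reuse the compiled pattern object.
        if self._compiled is None:
            self._compiled = re.compile(self.source, self.flags)
        return self._compiled

    def match(self, string, pos=0):
        return self._get().match(string, pos)

    def search(self, string, pos=0):
        return self._get().search(string, pos)


# Defining the pattern at module level costs almost nothing ...
ident = LazyPattern("[A-Za-z_][A-Za-z0-9_]*")

if __name__ == "__main__":
    # ... the compile happens here, on the first call.
    print(ident.match("abc1 def").group(0))
```

This only moves the work around: a program that ends up using every
expression pays the same total, just later.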

Unfortunately, I can't use any abbreviations that re might provide for
Unicode character sets, since then I don't know for sure that my
expressions are compatible with the XML definition.

Maybe it's possible to add a way of saving precompiled expressions in
the Python file?
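
For what it's worth, pickling is not a shortcut here in present-day
CPython: as far as I know, a pickled pattern object carries only the
pattern source and flags, and is compiled again when it is loaded. The
sketch below just makes that visible. The pattern is an arbitrary
stand-in and the variable names are illustrative.

```python
import pickle
import re

source = "[A-Za-z_][A-Za-z0-9_.-]*"  # arbitrary stand-in pattern
compiled = re.compile(source)

# Serialize the compiled pattern object.
data = pickle.dumps(compiled)

# The pickle contains the pattern text itself, so loading it has to go
# through the regular expression compiler again.
source_is_in_pickle = source.encode("utf-8") in data

# Loading gives back an equivalent, usable pattern object.
restored = pickle.loads(data)

if __name__ == "__main__":
    print(source_is_in_pickle)
    print(restored.pattern == source, restored.flags == compiled.flags)
```

So a real saving would need the compiled form itself to be stored,
which is what I am asking about.
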
-- Sjoerd Mullender <sjoerd.mullender@oratrix.com>