[Patches] [ python-Patches-417084 ] sre: Speed up Unicode charsets

Wed, 09 May 2001 09:19:18 -0700

Patches item #417084, was updated on 2001-04-18 08:34
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=305470&aid=417084&group_id=5470

Category: core (C code)
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Martin v. Löwis (loewis)
>Assigned to: Fredrik Lundh (effbot)
Summary: sre: Speed up Unicode charsets

Initial Comment:
Matching against large Unicode charsets (e.g. the one
that defines an XML name, from xml.utils.characters) is
quite slow, since a linear search over the ranges is
performed.

The patch compiles a Unicode character class into a
BIGCHARSET opcode, using a compression technique
similar to the one that the expat parser uses (see
comment in sre_parse).
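
As a rough illustration of the idea (this is not the code from the
patch; all function names are hypothetical, and it is written for
current Python 3 rather than the Python of this report), a two-level
table can replace the linear range search. The 65536 code points are
split into 256 blocks of 256, each block is stored as a 32-byte bitmap,
identical blocks are stored only once, and a 256-entry index maps the
high byte of a character to its block:

```python
# Illustrative sketch only: names are hypothetical, not taken from sre.

def in_ranges(ranges, code):
    """Linear search: test each (lo, hi) range in turn."""
    for lo, hi in ranges:
        if lo <= code <= hi:
            return True
    return False


def build_table(ranges):
    """Compress a set of 16-bit ranges into (index, blocks).

    index  -- 256 entries, one per high byte, each a block number
    blocks -- unique 32-byte bitmaps, one bit per low-byte value
    """
    bitmaps = [bytearray(32) for _ in range(256)]
    for lo, hi in ranges:
        for code in range(lo, hi + 1):
            bitmaps[code >> 8][(code & 0xFF) >> 3] |= 1 << (code & 7)
    index = []
    blocks = []
    seen = {}
    for bitmap in bitmaps:
        key = bytes(bitmap)
        if key not in seen:          # store each distinct block once
            seen[key] = len(blocks)
            blocks.append(key)
        index.append(seen[key])
    return index, blocks


def in_table(table, code):
    """Membership test in constant time: one index lookup, one bit test."""
    index, blocks = table
    block = blocks[index[code >> 8]]
    return bool(block[(code & 0xFF) >> 3] & (1 << (code & 7)))


# A few sample ranges (letters plus a large ideograph range), chosen
# only for illustration.
ranges = [(0x41, 0x5A), (0x61, 0x7A), (0xC0, 0xD6), (0x4E00, 0x9FA5)]
table = build_table(ranges)
print(in_table(table, ord("H")), in_table(table, ord(" ")))
```

Because most blocks of a large charset are either entirely empty or
entirely full, deduplication keeps the table small, while the cost of a
lookup no longer depends on the number of ranges.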

With the patch, runtime for

import time, re, xml.utils.characters
u = u"Hallo welt"
e = xml.utils.characters.re_Name  # compiled pattern for an XML Name
t = time.time()
for i in xrange(1000000):
    e.match(u)
print time.time() - t  # elapsed seconds

could be reduced to 45% of its original value. Even when
doing full parsing using CVS xmlproc, a 4% speedup can
still be observed.
