[Python-Dev] Unicode charmap decoders slow

Walter Dörwald walter at livinglogic.de
Wed Oct 5 10:21:06 CEST 2005

Am 05.10.2005 um 00:08 schrieb Martin v. Löwis:

> Walter Dörwald wrote:
>>> This array would have to be sparse, of course.
>> For encoding yes, for decoding no.
> [...]
>> For decoding it should be sufficient to use a unicode string of   
>> length 256. u"\ufffd" could be used for "maps to undefined". Or  
>> the  string might be shorter and byte values greater than the  
>> length of  the string are treated as "maps to undefined" too.
> Right. That's what I meant with "sparse": you somehow need to  
> represent
> "no value".

OK, but I don't think that we really need a sparse data structure for  
that. I used the following script to check that:
import sys, os.path, glob, encodings

has = 0
hasnt = 0

for enc in glob.glob("%s/*.py" % os.path.dirname(encodings.__file__)):
   enc = enc.rsplit(".")[-2].rsplit("/")[-1]
     __import__("encodings.%s" % enc)
     codec = sys.modules["encodings.%s" % enc]
     if hasattr(codec, "decoding_map"):
       print codec
       for i in xrange(0, 256):
         if codec.decoding_map.get(i, None) is not None:
           has += 1
           hasnt += 1
print "assigned values:", has, "unassigned values:", hasnt
It reports that in all the charmap codecs there are 15292 assigned  
byte values and only 324 unassigned ones. I.e. only about 2% of the  
byte values map to "undefined". Storing those codepoints in the array  
as U+FFFD would only need 648 (or 1296 for wide builds) additional  
bytes. I don 't think a sparse data structure could beat that.

>> This might work, although nobody has complained about charmap   
>> encoding yet. Another option would be to generate a big switch   
>> statement in C and let the compiler decide about the best data   
>> structure.
> I would try to avoid generating C code at all costs. Maintaining  
> the build processes will just be a nightmare.

Sounds resonable.

    Walter Dörwald

More information about the Python-Dev mailing list