[Python-Dev] Unicode charmap decoders slow

Wed Oct 5 10:21:06 CEST 2005

Am 05.10.2005 um 00:08 schrieb Martin v. Löwis:

> Walter Dörwald wrote:
>
>>> This array would have to be sparse, of course.
>>>
>> For encoding yes, for decoding no.
>>
> [...]
>
>> For decoding it should be sufficient to use a unicode string of   
>> length 256. u"\ufffd" could be used for "maps to undefined". Or  
>> the  string might be shorter and byte values greater than the  
>> length of  the string are treated as "maps to undefined" too.
>
> Right. That's what I meant with "sparse": you somehow need to  
> represent
> "no value".

OK, but I don't think that we really need a sparse data structure for  
that. I used the following script to check that:
-----
import sys, os.path, glob, encodings

has = 0
hasnt = 0

for enc in glob.glob("%s/*.py" % os.path.dirname(encodings.__file__)):
   enc = enc.rsplit(".")[-2].rsplit("/")[-1]
   try:
     __import__("encodings.%s" % enc)
     codec = sys.modules["encodings.%s" % enc]
   except:
     pass
   else:
     if hasattr(codec, "decoding_map"):
       print codec
       for i in xrange(0, 256):
         if codec.decoding_map.get(i, None) is not None:
           has += 1
         else:
           hasnt += 1
print "assigned values:", has, "unassigned values:", hasnt
----
It reports that in all the charmap codecs there are 15292 assigned  
byte values and only 324 unassigned ones. I.e. only about 2% of the  
byte values map to "undefined". Storing those codepoints in the array  
as U+FFFD would only need 648 (or 1296 for wide builds) additional  
bytes. I don 't think a sparse data structure could beat that.

>> This might work, although nobody has complained about charmap   
>> encoding yet. Another option would be to generate a big switch   
>> statement in C and let the compiler decide about the best data   
>> structure.
> I would try to avoid generating C code at all costs. Maintaining  
> the build processes will just be a nightmare.

Sounds resonable.

Bye,
    Walter Dörwald