[Python-Dev] Unicode charmap decoders slow
Walter Dörwald
walter at livinglogic.de
Wed Oct 5 10:21:06 CEST 2005
Am 05.10.2005 um 00:08 schrieb Martin v. Löwis:
> Walter Dörwald wrote:
>
>>> This array would have to be sparse, of course.
>>>
>> For encoding yes, for decoding no.
>>
> [...]
>
>> For decoding it should be sufficient to use a unicode string of
>> length 256. u"\ufffd" could be used for "maps to undefined". Or
>> the string might be shorter and byte values greater than the
>> length of the string are treated as "maps to undefined" too.
>
> Right. That's what I meant with "sparse": you somehow need to
> represent
> "no value".
OK, but I don't think that we really need a sparse data structure for
that. I used the following script to check that:
-----
import sys, os.path, glob, encodings
has = 0
hasnt = 0
for enc in glob.glob("%s/*.py" % os.path.dirname(encodings.__file__)):
enc = enc.rsplit(".")[-2].rsplit("/")[-1]
try:
__import__("encodings.%s" % enc)
codec = sys.modules["encodings.%s" % enc]
except:
pass
else:
if hasattr(codec, "decoding_map"):
print codec
for i in xrange(0, 256):
if codec.decoding_map.get(i, None) is not None:
has += 1
else:
hasnt += 1
print "assigned values:", has, "unassigned values:", hasnt
----
It reports that in all the charmap codecs there are 15292 assigned
byte values and only 324 unassigned ones. I.e. only about 2% of the
byte values map to "undefined". Storing those codepoints in the array
as U+FFFD would only need 648 (or 1296 for wide builds) additional
bytes. I don 't think a sparse data structure could beat that.
>> This might work, although nobody has complained about charmap
>> encoding yet. Another option would be to generate a big switch
>> statement in C and let the compiler decide about the best data
>> structure.
> I would try to avoid generating C code at all costs. Maintaining
> the build processes will just be a nightmare.
Sounds resonable.
Bye,
Walter Dörwald
More information about the Python-Dev
mailing list