Unicode charmap decoders slow

Is there a faster way to transcode from 8-bit chars (charmaps) to utf-8 than going through unicode()?

I'm writing a small card-file program. As a test, I use a 53 MB MBox file, in mac-roman encoding. My program reads and parses the file into messages in about 3 to 5 seconds (Wow! Go Python!), but takes about 14 seconds to iterate over the cards and convert them to utf-8:

    for i in xrange(len(cards)):
        u = unicode(cards[i], encoding)
        cards[i] = u.encode('utf-8')

The time is nearly all in the unicode() call. It's not so much how much time it takes, but that it takes 4 times as long as the real work, just to do table lookups.

Looking at the source (which, if I have it right, is PyUnicode_DecodeCharmap() in unicodeobject.c), I think it is doing a dictionary lookup for each character. I would have thought that it would make and cache a LUT the size of the charmap (and hook the relevant dictionary stuff to delete the cached LUT if the dictionary is changed). (You may consider this a request for enhancement. ;)

I thought of using U"".translate(), but the unicode version is defined to be slow, and anyway I can't find any way to just shove my 8-bit data into a unicode string without translation. Is there some similar approach? I'm almost (but not quite) ready to try it in Pyrex.

I'm new to Python. I didn't google anything relevant on python.org or in groups. I posted this in comp.lang.python yesterday and got a couple of responses, but I think this may be too sophisticated a question for that group.

I'm not a member of this list, so please copy me on replies so I don't have to hunt them down in the archive.
____________________________________________________________________
TonyN.:' <mailto:tonynelson@georgeanelson.com> ' <http://www.georgeanelson.com/>

As the OP suggests, decoding with a codec like mac-roman or iso8859-1 is very slow compared to encoding or decoding with utf-8. Here I'm working with 53k of data instead of 53 megs. (Note: this is a laptop, so it's possible that thermal or battery management features affected these numbers a bit, but by a factor of 3 at most)

$ timeit.py -s "s='a'*53*1024; u=unicode(s)" "u.encode('utf-8')"
1000 loops, best of 3: 591 usec per loop
$ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('utf-8')"
1000 loops, best of 3: 1.25 msec per loop
$ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('mac-roman')"
100 loops, best of 3: 13.5 msec per loop
$ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('iso8859-1')"
100 loops, best of 3: 13.6 msec per loop

With utf-8 encoding as the baseline, we have:
    decode('utf-8')      2.1x as long
    decode('mac-roman')  22.8x as long
    decode('iso8859-1')  23.0x as long

Perhaps this is an area that is ripe for optimization.

Jeff

Am 04.10.2005 um 04:25 schrieb jepler@unpythonic.net:
As the OP suggests, decoding with a codec like mac-roman or iso8859-1 is very slow compared to encoding or decoding with utf-8. Here I'm working with 53k of data instead of 53 megs. (Note: this is a laptop, so it's possible that thermal or battery management features affected these numbers a bit, but by a factor of 3 at most)
$ timeit.py -s "s='a'*53*1024; u=unicode(s)" "u.encode('utf-8')"
1000 loops, best of 3: 591 usec per loop
$ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('utf-8')"
1000 loops, best of 3: 1.25 msec per loop
$ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('mac-roman')"
100 loops, best of 3: 13.5 msec per loop
$ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('iso8859-1')"
100 loops, best of 3: 13.6 msec per loop
With utf-8 encoding as the baseline, we have:
    decode('utf-8')      2.1x as long
    decode('mac-roman')  22.8x as long
    decode('iso8859-1')  23.0x as long
Perhaps this is an area that is ripe for optimization.
For charmap decoding we might be able to use an array (e.g. a tuple, or an array.array?) of codepoints instead of a dictionary. Or we could implement this array as a C array (i.e. gencodec.py would generate C code).

Bye,
Walter Dörwald

At 9:37 AM +0200 10/4/05, Walter Dörwald wrote:
Am 04.10.2005 um 04:25 schrieb jepler@unpythonic.net:
As the OP suggests, decoding with a codec like mac-roman or iso8859-1 is very slow compared to encoding or decoding with utf-8. Here I'm working with 53k of data instead of 53 megs. (Note: this is a laptop, so it's possible that thermal or battery management features affected these numbers a bit, but by a factor of 3 at most)
$ timeit.py -s "s='a'*53*1024; u=unicode(s)" "u.encode('utf-8')"
1000 loops, best of 3: 591 usec per loop
$ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('utf-8')"
1000 loops, best of 3: 1.25 msec per loop
$ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('mac-roman')"
100 loops, best of 3: 13.5 msec per loop
$ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('iso8859-1')"
100 loops, best of 3: 13.6 msec per loop
With utf-8 encoding as the baseline, we have:
    decode('utf-8')      2.1x as long
    decode('mac-roman')  22.8x as long
    decode('iso8859-1')  23.0x as long
Perhaps this is an area that is ripe for optimization.
For charmap decoding we might be able to use an array (e.g. a tuple, or an array.array?) of codepoints instead of a dictionary.
Or we could implement this array as a C array (i.e. gencodec.py would generate C code).
Fine -- as long as it still allows changing code points. I add the missing "Apple logo" code point to mac-roman in order to permit round-tripping (0xF0 <=> 0xF8FF, per Apple docs). (New bug #1313051.)

If an all-C implementation wouldn't permit changing codepoints, I suggest instead just /caching/ the translation in C arrays stored with the codec object. The cache would be invalidated on any write to the codec's mapping dictionary, and rebuilt the next time anything was translated. This would maintain the present semantics, work with current codecs, and still provide the desired speed improvement.

But is there really no way to say this fast in pure Python? The way a one-to-one byte mapping can be done with "".translate()?
____________________________________________________________________
TonyN.:' <mailto:tonynelson@georgeanelson.com> ' <http://www.georgeanelson.com/>

Tony Nelson wrote:
But is there really no way to say this fast in pure Python? The way a one-to-one byte mapping can be done with "".translate()?
Well, .translate isn't exactly pure Python. One-to-one between bytes and Unicode code points simply can't work. Just try all alternatives yourself and see if you can get any better than charmap_decode. Some would argue that charmap_decode *is* fast. Regards, Martin

Walter Dörwald wrote:
For charmap decoding we might be able to use an array (e.g. a tuple, or an array.array?) of codepoints instead of a dictionary.
This array would have to be sparse, of course. Using an array.array would be more efficient, I guess - but we would need a C API for arrays (to validate the type code, and to get ob_item).
Or we could implement this array as a C array (i.e. gencodec.py would generate C code).
For decoding, we would not get any better than array.array, except for startup cost. For encoding, having a C trie might give considerable speedup. _codecs could offer an API to convert the current dictionaries into lookup-efficient structures, and the conversion would be done when importing the codec. For the trie, two levels (higher and lower byte) would probably be sufficient: I believe most encodings only use 2 "rows" (256 code point blocks), very few more than three. Regards, Martin
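[A rough pure-Python model of the two-level trie described above might look like the sketch below. It assumes the encoding_map maps BMP code points to byte values, as the generated charmap codecs do; the real proposal is a C data structure, and the function names here are made up.]

    # Rough pure-Python model of the two-level encoding trie: level 1 is
    # indexed by the high byte of the code point, level 2 by the low byte.
    # Most charmaps only touch two or three 256-code-point "rows".
    def build_encoding_trie(encoding_map):
        trie = [None] * 256                  # level 1: indexed by the high byte
        for codepoint, byte in encoding_map.items():
            if codepoint is None or byte is None or codepoint > 0xFFFF:
                continue                     # skip unmapped / non-BMP entries
            hi, lo = codepoint >> 8, codepoint & 0xFF
            if trie[hi] is None:
                trie[hi] = [None] * 256      # level 2: indexed by the low byte
            trie[hi][lo] = byte
        return trie

    def trie_encode_char(trie, codepoint):
        row = trie[codepoint >> 8]
        if row is None:
            return None                      # unmapped; a real codec would call the error handler
        return row[codepoint & 0xFF]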

Am 04.10.2005 um 21:50 schrieb Martin v. Löwis:
Walter Dörwald wrote:
For charmap decoding we might be able to use an array (e.g. a tuple, or an array.array?) of codepoints instead of a dictionary.
This array would have to be sparse, of course.
For encoding yes, for decoding no.
Using an array.array would be more efficient, I guess - but we would need a C API for arrays (to validate the type code, and to get ob_item).
For decoding it should be sufficient to use a unicode string of length 256. u"\ufffd" could be used for "maps to undefined". Or the string might be shorter and byte values greater than the length of the string are treated as "maps to undefined" too.
Or we could implement this array as a C array (i.e. gencodec.py would generate C code).
For decoding, we would not get any better than array.array, except for startup cost.
Yes.
For encoding, having a C trie might give considerable speedup. _codecs could offer an API to convert the current dictionaries into lookup-efficient structures, and the conversion would be done when importing the codec.
For the trie, two levels (higher and lower byte) would probably be sufficient: I believe most encodings only use 2 "rows" (256 code point blocks), very few more than three.
This might work, although nobody has complained about charmap encoding yet. Another option would be to generate a big switch statement in C and let the compiler decide about the best data structure. Bye, Walter Dörwald

Walter Dörwald wrote:
This array would have to be sparse, of course.
For encoding yes, for decoding no. [...] For decoding it should be sufficient to use a unicode string of length 256. u"\ufffd" could be used for "maps to undefined". Or the string might be shorter and byte values greater than the length of the string are treated as "maps to undefined" too.
Right. That's what I meant with "sparse": you somehow need to represent "no value".
This might work, although nobody has complained about charmap encoding yet. Another option would be to generate a big switch statement in C and let the compiler decide about the best data structure.
I would try to avoid generating C code at all costs. Maintaining the build processes will just be a nightmare. Regards, Martin

Am 05.10.2005 um 00:08 schrieb Martin v. Löwis:
Walter Dörwald wrote:
This array would have to be sparse, of course.
For encoding yes, for decoding no.
[...]
For decoding it should be sufficient to use a unicode string of length 256. u"\ufffd" could be used for "maps to undefined". Or the string might be shorter and byte values greater than the length of the string are treated as "maps to undefined" too.
Right. That's what I meant with "sparse": you somehow need to represent "no value".
OK, but I don't think that we really need a sparse data structure for that. I used the following script to check that:

-----
import sys, os.path, glob, encodings

has = 0
hasnt = 0
for enc in glob.glob("%s/*.py" % os.path.dirname(encodings.__file__)):
    enc = enc.rsplit(".")[-2].rsplit("/")[-1]
    try:
        __import__("encodings.%s" % enc)
        codec = sys.modules["encodings.%s" % enc]
    except:
        pass
    else:
        if hasattr(codec, "decoding_map"):
            print codec
            for i in xrange(0, 256):
                if codec.decoding_map.get(i, None) is not None:
                    has += 1
                else:
                    hasnt += 1
print "assigned values:", has, "unassigned values:", hasnt
-----

It reports that in all the charmap codecs there are 15292 assigned byte values and only 324 unassigned ones, i.e. only about 2% of the byte values map to "undefined". Storing those codepoints in the array as U+FFFD would only need 648 (or 1296 for wide builds) additional bytes. I don't think a sparse data structure could beat that.
This might work, although nobody has complained about charmap encoding yet. Another option would be to generate a big switch statement in C and let the compiler decide about the best data structure. I would try to avoid generating C code at all costs. Maintaining the build processes will just be a nightmare.
Sounds reasonable. Bye, Walter Dörwald

Martin v. Löwis wrote:
Another option would be to generate a big switch statement in C and let the compiler decide about the best data structure.
I would try to avoid generating C code at all costs. Maintaining the build processes will just be a nightmare.
We could automate this using distutils; however I'm not sure whether this would then also work on Windows. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 05 2005)

Martin v. Löwis wrote:
M.-A. Lemburg wrote:
I would try to avoid generating C code at all costs. Maintaining the build processes will just be a nightmare.
We could automate this using distutils; however I'm not sure whether this would then also work on Windows.
It wouldn't.
Could you elaborate why not ? Using distutils on Windows is really easy... -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 05 2005)

M.-A. Lemburg wrote:
It wouldn't.
Could you elaborate why not ? Using distutils on Windows is really easy...
The current build process for Windows simply doesn't provide it. You expect to select "Build/All" from the menu (or some such), and expect all code to be compiled. The VC build process only considers VC project files.

Maybe it is possible to hack up a project file to invoke distutils as the build process, but no such project file is currently available, nor is it known whether it is possible to create one. Whatever the build process: it should work properly with debug and release builds, with alternative compilers (such as the Itanium compiler), and place the files so that debugging from the VStudio environment is possible. All of this is not the case today, and nobody has worked on making it possible. I very much doubt distutils in its current form could handle it.

Regards,
Martin

Martin v. Löwis wrote:
M.-A. Lemburg wrote:
It wouldn't.
Could you elaborate why not ? Using distutils on Windows is really easy...
The current build process for Windows simply doesn't provide it. You expect to select "Build/All" from the menu (or some such), and expect all code to be compiled. The VC build process only considers VC project files.
Maybe it is possible to hack up a project file to invoke distutils as the build process, but no such project file is currently available, nor is it known whether it is possible to create one. Whatever the build process: it should work properly with debug and release builds, with alternative compilers (such as the Itanium compiler), and place the files so that debugging from the VStudio environment is possible. All of this is not the case today, and nobody has worked on making it possible. I very much doubt distutils in its current form could handle it.
I see, so you have to create a VC project file for each codec - that would indeed be hard to maintain. For Unix platforms this would be no problem at all, since all extension modules are built using distutils there anyway. Thanks for the explanation. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 05 2005)

[Martin v. Loewis wrote]
Maybe it is possible to hack up a project file to invoke distutils as the build process, but no such project file is currently available, nor is it known whether it is possible to create one.
This is essentially what the "_ssl" project does, no? It defers to "build_ssl.py" to do the build work. I didn't see what the full build requirements were earlier in this thread though, so I may be missing something. Trent -- Trent Mick TrentM@ActiveState.com

Trent Mick wrote:
[Martin v. Loewis wrote]
Maybe it is possible to hack up a project file to invoke distutils as the build process, but no such project file is currently available, nor is it known whether it is possible to create one.
This is essentially what the "_ssl" project does, no?
More or less, yes. It does support both debug and release builds. It does not support Itanium builds (at least not the way the other projects do); as a result, the Itanium build currently just doesn't offer SSL. More importantly, build_ssl.py is not based on distutils. Instead, it is manually hacked up - a VBScript file would have worked as well. So if you were to create many custom build scripts (one per codec), you might just as well generate the VS project files directly. Regards, Martin

[Recipient list not trimmed, as my replies must be vetted by a moderator, which seems to delay them. :]

At 11:48 PM +0200 10/4/05, Walter Dörwald wrote:
Am 04.10.2005 um 21:50 schrieb Martin v. Löwis:
Walter Dörwald wrote:
For charmap decoding we might be able to use an array (e.g. a tuple, or an array.array?) of codepoints instead of a dictionary.
This array would have to be sparse, of course.
For encoding yes, for decoding no.
Using an array.array would be more efficient, I guess - but we would need a C API for arrays (to validate the type code, and to get ob_item).
For decoding it should be sufficient to use a unicode string of length 256. u"\ufffd" could be used for "maps to undefined". Or the string might be shorter and byte values greater than the length of the string are treated as "maps to undefined" too.
With Unicode using more than 64K codepoints now, it might be more forward looking to use a table of 256 32-bit values, with no need for tricky values. There is no need to add any C code to the codecs; just add some more code to the existing C function (which, if I have it right, is PyUnicode_DecodeCharmap() in unicodeobject.c). ...
For encoding, having a C trie might give considerable speedup. _codecs could offer an API to convert the current dictionaries into lookup-efficient structures, and the conversion would be done when importing the codec.
For the trie, two levels (higher and lower byte) would probably be sufficient: I believe most encodings only use 2 "rows" (256 code point blocks), very few more than three.
This might work, although nobody has complained about charmap encoding yet. Another option would be to generate a big switch statement in C and let the compiler decide about the best data structure.
I'm willing to complain. :) I might allow saving of my (53 MB) MBox file. (Not that editing received mail makes as much sense as searching it.)

Encoding can be made fast using a simple hash table with external chaining. There are max 256 codepoints to encode, and they will normally be well distributed in their lower 8 bits. Hash on the low 8 bits (just mask), and chain to an area with 256 entries. Modest storage, normally short chains, therefore fast encoding.

At 12:08 AM +0200 10/5/05, Martin v. Löwis wrote:
I would try to avoid generating C code at all costs. Maintaining the build processes will just be a nightmare.
I agree; also I don't think the generated codecs need to be changed at all. All the changes can be made to the existing C functions, by adding caching per a reply of mine that hasn't made it to the list yet. Well, OK, something needs to hook writes to the codec's dictionary, but I /think/ that only needs Python code. I say:
...I suggest instead just /caching/ the translation in C arrays stored with the codec object. The cache would be invalidated on any write to the codec's mapping dictionary, and rebuilt the next time anything was translated. This would maintain the present semantics, work with current codecs, and still provide the desired speed improvement.
Note that this caching is done by new code added to the existing C functions (which, if I have it right, are in unicodeobject.c). No architectural changes are made and no existing codecs need to be changed; everything will just work, and usually work faster. The memory requirements are very modest: one 256-entry array of 32-bit Unicode values, plus a hash table with 256 1-byte slots and 256 chain entries, each holding a 4-byte Unicode value, a byte output value, a byte chain index, and probably 2 bytes of filler, for a hash table size of 2304 bytes per codec.
____________________________________________________________________
TonyN.:' <mailto:tonynelson@georgeanelson.com> ' <http://www.georgeanelson.com/>
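[A pure-Python model of the hash-and-chain encode cache Tony sketches above: hash a code point on its low 8 bits and resolve collisions by external chaining. The real proposal packs this into a few small C arrays; plain lists stand in for them here, and the function names are made up.]

    # Model of the proposed encode cache: bucket by the low 8 bits of the
    # code point; chains are normally very short because at most 256 code
    # points are mapped and their low bytes are usually well distributed.
    def build_encode_cache(encoding_map):
        buckets = [[] for i in xrange(256)]
        for codepoint, byte in encoding_map.items():
            if codepoint is None or byte is None:
                continue                          # skip unmapped entries
            buckets[codepoint & 0xFF].append((codepoint, byte))
        return buckets

    def cached_encode_char(buckets, codepoint):
        for cp, byte in buckets[codepoint & 0xFF]:  # normally a very short chain
            if cp == codepoint:
                return byte
        return None                               # unmapped; a real codec would invoke error handling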

Tony Nelson wrote:
For decoding it should be sufficient to use a unicode string of length 256. u"\ufffd" could be used for "maps to undefined". Or the string might be shorter and byte values greater than the length of the string are treated as "maps to undefined" too.
With Unicode using more than 64K codepoints now, it might be more forward looking to use a table of 256 32-bit values, with no need for tricky values.
You might be missing the point. \ufffd is REPLACEMENT CHARACTER, which would indicate that the byte with that index is really unused in that encoding.
Encoding can be made fast using a simple hash table with external chaining. There are max 256 codepoints to encode, and they will normally be well distributed in their lower 8 bits. Hash on the low 8 bits (just mask), and chain to an area with 256 entries. Modest storage, normally short chains, therefore fast encoding.
This is what is currently done: a hash map with 256 keys. You are complaining about the performance of that algorithm. The issue of external chaining is likely irrelevant: there likely are no collisions, even though Python uses open addressing.
...I suggest instead just /caching/ the translation in C arrays stored with the codec object. The cache would be invalidated on any write to the codec's mapping dictionary, and rebuilt the next time anything was translated. This would maintain the present semantics, work with current codecs, and still provide the desired speed improvement.
That is not implementable. You cannot catch writes to the dictionary.
Note that this caching is done by new code added to the existing C functions (which, if I have it right, are in unicodeobject.c). No architectural changes are made; no existing codecs need to be changed; everything will just work
Please try to implement it. You will find that you cannot. I don't see how regenerating/editing the codecs could be avoided. Regards, Martin

Martin v. Löwis wrote:
Tony Nelson wrote:
For decoding it should be sufficient to use a unicode string of length 256. u"\ufffd" could be used for "maps to undefined". Or the string might be shorter and byte values greater than the length of the string are treated as "maps to undefined" too.
With Unicode using more than 64K codepoints now, it might be more forward looking to use a table of 256 32-bit values, with no need for tricky values.
You might be missing the point. \ufffd is REPLACEMENT CHARACTER, which would indicate that the byte with that index is really unused in that encoding.
OK, here's a patch that implements this enhancement to PyUnicode_DecodeCharmap(): http://www.python.org/sf/1313939

The mapping argument to PyUnicode_DecodeCharmap() can be a unicode string and is used as a decoding table. Speed looks like this:

python2.4 -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('utf-8')"
1000 loops, best of 3: 538 usec per loop
python2.4 -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('mac-roman')"
100 loops, best of 3: 3.85 msec per loop
./python-cvs -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('utf-8')"
1000 loops, best of 3: 539 usec per loop
./python-cvs -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('mac-roman')"
1000 loops, best of 3: 623 usec per loop

Creating the decoding_map as a string should probably be done by gencodec.py directly. This way the first import of the codec would be faster too.

Bye,
Walter Dörwald
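[For illustration, assuming the patch is applied (i.e. the mapping passed to codecs.charmap_decode may be a unicode string), a generated codec module might then look roughly like the sketch below. The decoding_table here is a made-up three-entry fragment, not a real charset.]

    # Sketch of a codec module once charmap_decode accepts a unicode string
    # as its mapping (as the patch proposes).  The table is a made-up fragment.
    import codecs

    decoding_table = (
        u"\ufffd"      # 0x00 -> undefined in this hypothetical charset
        u"\u0041"      # 0x01 -> LATIN CAPITAL LETTER A
        u"\u20ac"      # 0x02 -> EURO SIGN
    ) + u"\ufffd" * 253  # remaining byte values left unassigned in the sketch

    class Codec(codecs.Codec):
        def decode(self, input, errors='strict'):
            # The string table is indexed directly by byte value in C.
            return codecs.charmap_decode(input, errors, decoding_table)
        # Encoding is unchanged by the patch: it still goes through the
        # dictionary-based encoding_map (not shown here).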

Walter Dörwald wrote:
OK, here's a patch that implements this enhancement to PyUnicode_DecodeCharmap(): http://www.python.org/sf/1313939
Looks nice!
Creating the decoding_map as a string should probably be done by gencodec.py directly. This way the first import of the codec would be faster too.
Hmm. How would you represent the string in source code? As a Unicode literal? With \u escapes, or in a UTF-8 source file? Or as a UTF-8 string, with an explicit decode call? I like the current dictionary style for being readable, as it also adds the Unicode character names into comments. Regards, Martin

Martin v. Löwis wrote:
Walter Dörwald wrote:
OK, here's a patch that implements this enhancement to PyUnicode_DecodeCharmap(): http://www.python.org/sf/1313939
Looks nice!
Indeed (except for the choice of the "map this character to undefined" code point). Hye-Shik, could you please provide some timeit figures for the fastmap encoding ?
Creating the decoding_map as a string should probably be done by gencodec.py directly. This way the first import of the codec would be faster too.
Hmm. How would you represent the string in source code? As a Unicode literal? With \u escapes, or in a UTF-8 source file? Or as a UTF-8 string, with an explicit decode call?
I like the current dictionary style for being readable, as it also adds the Unicode character names into comments.
Not only that: it also allows 1-n and 1-0 mappings, which was part of the idea of using a mapping object (such as a dictionary) as the basis for the codec. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 05 2005)

On 10/6/05, M.-A. Lemburg <mal@egenix.com> wrote:
Hye-Shik, could you please provide some timeit figures for the fastmap encoding ?
(before applying Walter's patch, charmap decoder)
% ./python Lib/timeit.py -s "s='a'*53*1024; e='iso8859_10'; u=unicode(s, e)" "s.decode(e)"
100 loops, best of 3: 3.35 msec per loop

(applied the patch, improved charmap decoder)
% ./python Lib/timeit.py -s "s='a'*53*1024; e='iso8859_10'; u=unicode(s, e)" "s.decode(e)"
1000 loops, best of 3: 1.11 msec per loop

(the fastmap decoder)
% ./python Lib/timeit.py -s "s='a'*53*1024; e='iso8859_10_fc'; u=unicode(s, e)" "s.decode(e)"
1000 loops, best of 3: 1.04 msec per loop

(utf-8 decoder)
% ./python Lib/timeit.py -s "s='a'*53*1024; e='utf_8'; u=unicode(s, e)" "s.decode(e)"
1000 loops, best of 3: 851 usec per loop

Walter's decoder and the fastmap decoder run in mostly the same way, so the performance difference is quite minor. Perhaps the minor difference comes from the existence of a wrapper function in each codec; the fastmap codec provides functions usable as Codecs.{en,de}code directly.

(encoding, charmap codec)
% ./python Lib/timeit.py -s "s='a'*53*1024; e='iso8859_10'; u=unicode(s, e)" "u.encode(e)"
100 loops, best of 3: 3.51 msec per loop

(encoding, fastmap codec)
% ./python Lib/timeit.py -s "s='a'*53*1024; e='iso8859_10_fc'; u=unicode(s, e)" "u.encode(e)"
1000 loops, best of 3: 536 usec per loop

(encoding, utf-8 codec)
% ./python Lib/timeit.py -s "s='a'*53*1024; e='utf_8'; u=unicode(s, e)" "u.encode(e)"
1000 loops, best of 3: 1.5 msec per loop

If the encoding optimization can be easily done in Walter's approach, the fastmap codec would be too expensive a way to reach the objective, because we must maintain not only fastmap but also charmap for backward compatibility.

Hye-Shik

Hye-Shik Chang wrote:
If the encoding optimization can be easily done in Walter's approach, the fastmap codec would be too expensive a way to reach the objective, because we must maintain not only fastmap but also charmap for backward compatibility.
IMO, whether a new function is added or whether the existing function becomes polymorphic (depending on the type of table being passed) is a minor issue. Clearly, the charmap API needs to stay for backwards compatibility; in terms of code size or maintenance, I would actually prefer separate functions.

One issue apparently is people tweaking the existing dictionaries, with additional entries they think belong there. I don't think we need to preserve compatibility with that approach in 2.5, but I also think that breakage should be obvious: the dictionary should either go away completely at run-time, or be stored under a different name, so that any attempt at modifying the dictionary gives an exception instead of having no interesting effect.

I envision a layout of the codec files like this:

    decoding_dict = ...
    decoding_map, encoding_map = codecs.make_lookup_tables(decoding_dict)

I think it should be possible to build efficient tables in a single pass over the dictionary, so startup time should be fairly small (given that the dictionaries are currently built incrementally, anyway, due to the way dictionary literals work).

Regards,
Martin
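[codecs.make_lookup_tables does not exist; the sketch below shows one possible single-pass implementation of such a helper, given the current decoding dictionaries, with the decode table built as a unicode string along the lines of Walter's patch.]

    # Hypothetical helper along the lines sketched above.  It converts a
    # sparse decoding dictionary into a dense decode table (a unicode string
    # indexed by byte value) plus an encoding dictionary, in one pass.
    def make_lookup_tables(decoding_dict):
        chars = []
        encoding_map = {}
        for byte in xrange(256):
            codepoint = decoding_dict.get(byte)
            if codepoint is None:
                chars.append(u"\ufffd")            # byte value unassigned in this charset
            else:
                chars.append(unichr(codepoint))
                if codepoint not in encoding_map:  # keep the first byte for duplicated
                    encoding_map[codepoint] = byte # code points (a real helper might differ)
        decoding_map = u"".join(chars)             # dense table, indexed by byte value
        return decoding_map, encoding_map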

Martin v. Löwis wrote:
Hye-Shik Chang wrote:
If the encoding optimization can be easily done in Walter's approach, the fastmap codec would be too expensive a way to reach the objective, because we must maintain not only fastmap but also charmap for backward compatibility.
IMO, whether a new function is added or whether the existing function becomes polymorphic (depending on the type of table being passed) is a minor issue. Clearly, the charmap API needs to stay for backwards compatibility; in terms of code size or maintenance, I would actually prefer separate functions.
OK, I can update the patch accordingly. Any suggestions for the name? PyUnicode_DecodeCharmapString?
One issue apparently is people tweaking the existing dictionaries, with additional entries they think belong there. I don't think we need to preserve compatibility with that approach in 2.5, but I also think that breakage should be obvious: the dictionary should either go away completely at run-time, or be stored under a different name, so that any attempt of modifying the dictionary gives an exception instead of having no interesting effect.
IMHO it should be stored under a different name, because there are codecs (cp037, koi8_r, iso8859_11) that reuse existing dictionaries. Or we could have a function that recreates the dictionary from the string.
I envision a layout of the codec files like this:
decoding_dict = ...
decoding_map, encoding_map = codecs.make_lookup_tables(decoding_dict)
Apart from the names (and the fact that encoding_map is still a dictionary), that's what my patch does.
I think it should be possible to build efficient tables in a single pass over the dictionary, so startup time should be fairly small (given that the dictionaries are currently built incrementally, anyway, due to the way dictionary literals work).
Bye, Walter Dörwald

Walter Dörwald wrote:
Martin v. Löwis wrote:
Hye-Shik Chang wrote:
If the encoding optimization can be easily done in Walter's approach, the fastmap codec would be too expensive a way to reach the objective, because we must maintain not only fastmap but also charmap for backward compatibility.
IMO, whether a new function is added or whether the existing function becomes polymorphic (depending on the type of table being passed) is a minor issue. Clearly, the charmap API needs to stay for backwards compatibility; in terms of code size or maintenance, I would actually prefer separate functions.
OK, I can update the patch accordingly. Any suggestions for the name? PyUnicode_DecodeCharmapString?
No, you can factor this part out into a separate C function - there's no need to add a completely new entry point just for this optimization. Later on we can then also add support for compressed tables to the codec in the same way.
One issue apparently is people tweaking the existing dictionaries, with additional entries they think belong there. I don't think we need to preserve compatibility with that approach in 2.5, but I also think that breakage should be obvious: the dictionary should either go away completely at run-time, or be stored under a different name, so that any attempt of modifying the dictionary gives an exception instead of having no interesting effect.
IMHO it should be stored under a different name, because there are codecs (c037, koi8_r, iso8859_11), that reuse existing dictionaries.
Only koi8_u reuses the dictionary from koi8_r - and it's easy to recreate the codec from a standard mapping file.
Or we could have a function that recreates the dictionary from the string.
Actually, I'd prefer that these operations be done by the codec generator script, so that we don't have additional startup time. The dictionaries should then no longer be generated; the tables would be written out directly instead. I'd like the comments to stay, though. This can be done like this (using string concatenation applied by the compiler):

    decoding_charmap = (
        u'x' # 0x0000 -> 0x0078 LATIN SMALL LETTER X
        u'y' # 0x0001 -> 0x0079 LATIN SMALL LETTER Y
        ...
    )

Either way, monkey patching the codec won't work anymore. Doesn't really matter, though, as this was never officially supported. We've always told people to write their own codecs if they need to modify an existing one and then hook it into the system using either a new codec search function or by adding an appropriate alias.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 06 2005)

M.-A. Lemburg wrote:
[...]
Or we could have a function that recreates the dictionary from the string.
Actually, I'd prefer that these operations be done by the codec generator script, so that we don't have additional startup time. The dictionaries should then no longer be generated; the tables would be written out directly instead. I'd like the comments to stay, though. This can be done like this (using string concatenation applied by the compiler):
decoding_charmap = (
    u'x' # 0x0000 -> 0x0078 LATIN SMALL LETTER X
    u'y' # 0x0001 -> 0x0079 LATIN SMALL LETTER Y
    ...
)
I'd prefer that too.
Either way, monkey patching the codec won't work anymore. Doesn't really matter, though, as this was never officially supported.
We've always told people to write their own codecs if they need to modify an existing one and then hook it into the system using either a new codec search function or by adding an appropriate alias.
OK, so can someone update gencodec.py and recreate the charmap codecs? BTW, is codecs.make_encoding_map part of the official API, or can we change it to expect a string instead of a dictionary? Bye, Walter Dörwald

Hye-Shik Chang wrote:
On 10/6/05, M.-A. Lemburg <mal@egenix.com> wrote:
Hye-Shik, could you please provide some timeit figures for the fastmap encoding ?
Thanks for the timings.
(before applying Walter's patch, charmap decoder)
% ./python Lib/timeit.py -s "s='a'*53*1024; e='iso8859_10'; u=unicode(s, e)" "s.decode(e)" 100 loops, best of 3: 3.35 msec per loop
(applied the patch, improved charmap decoder)
% ./python Lib/timeit.py -s "s='a'*53*1024; e='iso8859_10'; u=unicode(s, e)" "s.decode(e)" 1000 loops, best of 3: 1.11 msec per loop
(the fastmap decoder)
% ./python Lib/timeit.py -s "s='a'*53*1024; e='iso8859_10_fc'; u=unicode(s, e)" "s.decode(e)" 1000 loops, best of 3: 1.04 msec per loop
(utf-8 decoder)
% ./python Lib/timeit.py -s "s='a'*53*1024; e='utf_8'; u=unicode(s, e)" "s.decode(e)" 1000 loops, best of 3: 851 usec per loop
Walter's decoder and the fastmap decoder run in mostly the same way, so the performance difference is quite minor. Perhaps the minor difference comes from the existence of a wrapper function in each codec; the fastmap codec provides functions usable as Codecs.{en,de}code directly.
(encoding, charmap codec)
% ./python Lib/timeit.py -s "s='a'*53*1024; e='iso8859_10'; u=unicode(s, e)" "u.encode(e)" 100 loops, best of 3: 3.51 msec per loop
(encoding, fastmap codec)
% ./python Lib/timeit.py -s "s='a'*53*1024; e='iso8859_10_fc'; u=unicode(s, e)" "u.encode(e)" 1000 loops, best of 3: 536 usec per loop
(encoding, utf-8 codec)
% ./python Lib/timeit.py -s "s='a'*53*1024; e='utf_8'; u=unicode(s, e)" "u.encode(e)" 1000 loops, best of 3: 1.5 msec per loop
I wonder why the UTF-8 codec is slower than the fastmap codec in this case.
If the encoding optimization can be easily done in Walter's approach, the fastmap codec would be too expensive a way to reach the objective, because we must maintain not only fastmap but also charmap for backward compatibility.
Indeed. Let's go with a patched charmap codec then. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 06 2005)

On 10/6/05, M.-A. Lemburg <mal@egenix.com> wrote:
Hye-Shik Chang wrote:
(encoding, fastmap codec)
% ./python Lib/timeit.py -s "s='a'*53*1024; e='iso8859_10_fc'; u=unicode(s, e)" "u.encode(e)" 1000 loops, best of 3: 536 usec per loop
(encoding, utf-8 codec)
% ./python Lib/timeit.py -s "s='a'*53*1024; e='utf_8'; u=unicode(s, e)" "u.encode(e)" 1000 loops, best of 3: 1.5 msec per loop
I wonder why the UTF-8 codec is slower than the fastmap codec in this case.
I guess that resizing made the difference. The fastmap encoder doesn't resize the output buffer at all in the test case, while the UTF-8 encoder allocates 4*53*1024 bytes and resizes it to 53*1024 bytes in the end. Hye-Shik

Martin v. Löwis wrote:
Walter Dörwald wrote:
OK, here's a patch that implements this enhancement to PyUnicode_DecodeCharmap(): http://www.python.org/sf/1313939
Looks nice!
Creating the decoding_map as a string should probably be done by gencodec.py directly. This way the first import of the codec would be faster too.
Hmm. How would you represent the string in source code? As a Unicode literal? With \u escapes,
Yes, simply by outputting repr(decoding_string).
or in a UTF-8 source file?
This might get unreadable, if your editor can't detect the coding header.
Or as a UTF-8 string, with an explicit decode call?
This is another possibility, but is unreadable too. But we might add the real codepoints as comments.
I like the current dictionary style for being readable, as it also adds the Unicode character names into comments.
We could use:

    decoding_string = (
        u"\u009c" # 0x0004 -> U+009C: CONTROL
        u"\u0009" # 0x0005 -> U+0009: HORIZONTAL TABULATION
        ...
    )

However, the current approach has the advantage that only those byte values that differ from the identity mapping have to be specified.

Bye,
Walter Dörwald

At 8:36 AM +0200 10/5/05, Martin v. Löwis wrote:
Tony Nelson wrote: ...
Encoding can be made fast using a simple hash table with external chaining. There are max 256 codepoints to encode, and they will normally be well distributed in their lower 8 bits. Hash on the low 8 bits (just mask), and chain to an area with 256 entries. Modest storage, normally short chains, therefore fast encoding.
This is what is currently done: a hash map with 256 keys. You are complaining about the performance of that algorithm. The issue of external chaining is likely irrelevant: there likely are no collisions, even though Python uses open addressing.
I think I'm complaining about the implementation, though on decode, not encode. In any case, there are likely to be collisions in my scheme. Over the next few days I will try to do it myself, but I will need to learn Pyrex, some of the Python C API, and more about Python to do it.
...I suggest instead just /caching/ the translation in C arrays stored with the codec object. The cache would be invalidated on any write to the codec's mapping dictionary, and rebuilt the next time anything was translated. This would maintain the present semantics, work with current codecs, and still provide the desired speed improvement.
That is not implementable. You cannot catch writes to the dictionary.
I should have been more clear. I am thinking about using a proxy object in the codec's 'encoding_map' and 'decoding_map' slots that will forward all the dictionary stuff. The proxy will delete the cache on any call which changes the dictionary contents. There are proxy classes and dictproxy (I don't know how it's implemented yet), so it seems doable, at least as far as I've gotten so far.
Note that this caching is done by new code added to the existing C functions (which, if I have it right, are in unicodeobject.c). No architectural changes are made; no existing codecs need to be changed; everything will just work
Please try to implement it. You will find that you cannot. I don't see how regenerating/editing the codecs could be avoided.
Will do!
____________________________________________________________________
TonyN.:' <mailto:tonynelson@georgeanelson.com> ' <http://www.georgeanelson.com/>
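[A minimal sketch of the proxy idea Tony describes above, in pure Python. The cached table and its rebuild are left abstract, and none of this comes from an actual patch; the class name is made up.]

    # Minimal sketch of the mapping proxy: it behaves like the codec's
    # mapping dictionary but drops its cached lookup table on any write,
    # so the decode/encode fast path can rebuild the table lazily.
    class CachingMapProxy(object):
        def __init__(self, mapping):
            self._mapping = dict(mapping)
            self.cache = None                # rebuilt lazily by the codec fast path

        def __getitem__(self, key):
            return self._mapping[key]

        def get(self, key, default=None):
            return self._mapping.get(key, default)

        def __setitem__(self, key, value):
            self.cache = None                # any write invalidates the cached table
            self._mapping[key] = value

        def __delitem__(self, key):
            self.cache = None
            del self._mapping[key]

        def __len__(self):
            return len(self._mapping)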

Walter Dörwald wrote:
Am 04.10.2005 um 04:25 schrieb jepler@unpythonic.net:
As the OP suggests, decoding with a codec like mac-roman or iso8859-1 is very slow compared to encoding or decoding with utf-8. Here I'm working with 53k of data instead of 53 megs. (Note: this is a laptop, so it's possible that thermal or battery management features affected these numbers a bit, but by a factor of 3 at most)
$ timeit.py -s "s='a'*53*1024; u=unicode(s)" "u.encode('utf-8')"
1000 loops, best of 3: 591 usec per loop
$ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('utf-8')"
1000 loops, best of 3: 1.25 msec per loop
$ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('mac-roman')"
100 loops, best of 3: 13.5 msec per loop
$ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('iso8859-1')"
100 loops, best of 3: 13.6 msec per loop
With utf-8 encoding as the baseline, we have:
    decode('utf-8')      2.1x as long
    decode('mac-roman')  22.8x as long
    decode('iso8859-1')  23.0x as long
Perhaps this is an area that is ripe for optimization.
For charmap decoding we might be able to use an array (e.g. a tuple, or an array.array?) of codepoints instead of a dictionary.
Or we could implement this array as a C array (i.e. gencodec.py would generate C code).
That would be a possibility, yes. Note that the charmap codec was meant as a faster replacement for the old string transpose function. Dictionaries are used for the mapping to avoid having to store huge (largely empty) mapping tables - it's a memory-speed tradeoff.

Of course, a C version could use the same approach as the unicodedatabase module: that of compressed lookup tables...

http://aggregate.org/TechPub/lcpc2002.pdf

genccodec.py anyone ?

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 04 2005)

On 10/5/05, M.-A. Lemburg <mal@egenix.com> wrote:
Of course, a C version could use the same approach as the unicodedatabase module: that of compressed lookup tables...
http://aggregate.org/TechPub/lcpc2002.pdf
genccodec.py anyone ?
I had written a test codec for single byte character sets to evaluate algorithms to use in CJKCodecs once before (it's not a direct implementation of what you've mentioned, though). I just ported it to unicodeobject (as attached). It showed a relatively better result than the charmap codecs:

% python ./Lib/timeit.py -s "s='a'*1024*1024; u=unicode(s)" "s.decode('iso8859-1')"
10 loops, best of 3: 96.7 msec per loop
% ./python ./Lib/timeit.py -s "s='a'*1024*1024; u=unicode(s)" "s.decode('iso8859_10_fc')"
10 loops, best of 3: 22.7 msec per loop
% ./python ./Lib/timeit.py -s "s='a'*1024*1024; u=unicode(s)" "s.decode('utf-8')"
100 loops, best of 3: 18.9 msec per loop

(Note that it doesn't contain any documentation nor good error handling yet. :-)

Hye-Shik

Hye-Shik Chang wrote:
On 10/5/05, M.-A. Lemburg <mal@egenix.com> wrote:
Of course, a C version could use the same approach as the unicodedatabase module: that of compressed lookup tables...
http://aggregate.org/TechPub/lcpc2002.pdf
genccodec.py anyone ?
I had written a test codec for single byte character sets to evaluate algorithms to use in CJKCodecs once before (it's not a direct implementation of what you've mentioned, though). I just ported it to unicodeobject (as attached).
Thanks. Please upload the patch to SF. Looks like we now have two competing patches: yours and the one written by Walter. So far you've only compared decoding strings into Unicode, and they seem to be similar in performance. Do they differ in encoding performance?
It showed a relatively better result than the charmap codecs:
% python ./Lib/timeit.py -s "s='a'*1024*1024; u=unicode(s)" "s.decode('iso8859-1')" 10 loops, best of 3: 96.7 msec per loop % ./python ./Lib/timeit.py -s "s='a'*1024*1024; u=unicode(s)" "s.decode('iso8859_10_fc')" 10 loops, best of 3: 22.7 msec per loop % ./python ./Lib/timeit.py -s "s='a'*1024*1024; u=unicode(s)" "s.decode('utf-8')" 100 loops, best of 3: 18.9 msec per loop
(Note that it doesn't contain any documentation nor good error handling yet. :-)
Hye-Shik
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 05 2005)

The function in the module below, xlate.xlate, doesn't quite do what "".decode does (mostly in that characters that don't exist are always mapped to U+FFFD, instead of having the various behaviors available to "".decode). It builds the fast decoding structure once per call, but when decoding 53kb of data that overhead is small enough to make it much faster than s.decode('mac-roman'). For smaller buffers (I tried 53 characters), s.decode is two times faster (43us vs 21us).

$ timeit.py -s "s='a'*53*1024; import xlate" "s.decode('mac-roman')"
100 loops, best of 3: 12.8 msec per loop
$ timeit.py -s "s='a'*53*1024; import xlate, encodings.mac_roman" \
    "xlate.xlate(s, encodings.mac_roman.decoding_map)"
1000 loops, best of 3: 573 usec per loop

Jeff
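[The xlate module itself isn't shown in the thread; a rough pure-Python model of the behaviour Jeff describes (build the table on every call, map unassigned bytes to U+FFFD) would be something like the sketch below. The real module is a C extension; this only mirrors its semantics.]

    # Rough pure-Python model of the described xlate.xlate behaviour: the
    # dense table is rebuilt on every call and unassigned bytes always
    # become U+FFFD, regardless of the usual codec error handlers.
    def xlate(s, decoding_map):
        table = [u"\ufffd"] * 256
        for byte, codepoint in decoding_map.items():
            if codepoint is not None:
                table[byte] = unichr(codepoint)
        return u"".join([table[ord(c)] for c in s])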