Re: [Python-Dev] Unicode charmap decoders slow
I have written my fastcharmap decoder and encoder. It's not meant to be better than the patch and other changes to come in a future version of Python, but it does work now with the current codecs. Using Hye-Shik Chang's benchmark, decoding is about 4.3x faster than the base, and encoding is about 2x faster than the base (that's comparing the base and the fast versions on my machine). If fastcharmap would be useful, please tell me where I should make it available, and any changes that are needed. I would also need to write an installer (distutils I guess).

<http://georgeanelson.com/fastcharmap.d.tar.gz>

Fastcharmap is written in Python and Pyrex 0.9.3, and the .pyx file will need to be compiled before use. I used:

    pyrexc _fastcharmap.pyx
    gcc -c -fPIC -I/usr/include/python2.3/ _fastcharmap.c
    gcc -shared _fastcharmap.o -o _fastcharmap.so

To use, hook each codec to be sped up:

    import fastcharmap
    help(fastcharmap)
    fastcharmap.hook('name_of_codec')
    u = unicode('some text', 'name_of_codec')
    s = u.encode('name_of_codec')

No codecs were rewritten. It took me a while to learn enough to do this (Pyrex, more Python, some Python C API), and there were some surprises. Hooking in is grosser than I would have liked. I've only used it on Python 2.3 on FC3. Still, it should work going forward, and, if the dicts are replaced by something else, fastcharmap will know to leave everything alone. There are still a few debugging print statements in it.
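The hook boils down to replacing codecs.charmap_decode (and charmap_encode) with a caching wrapper. A rough Python sketch of the decode side, purely for illustration -- the registry, table format, and error handling here are invented and much simplified:

    import codecs

    _orig_decode = codecs.charmap_decode
    _tables = {}    # id(decoding_map) -> flat 256-char lookup string (invented)

    def hook(decoding_map):
        # Precompute a flat table; U+FFFD stands in for unmapped bytes
        # (a real version must raise for 'strict' errors instead).
        chars = []
        for i in range(256):
            v = decoding_map.get(i)
            if v is not None:
                chars.append(unichr(v))
            else:
                chars.append(u'\ufffd')
        _tables[id(decoding_map)] = u''.join(chars)

    def _fast_decode(input, errors='strict', mapping=None):
        table = _tables.get(id(mapping))
        if table is None:       # codec not hooked: defer to the original
            return _orig_decode(input, errors, mapping)
        return u''.join([table[ord(c)] for c in input]), len(input)

    codecs.charmap_decode = _fast_decode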
At 8:36 AM +0200 10/5/05, Martin v. Löwis wrote:
Tony Nelson wrote: ...
Encoding can be made fast using a simple hash table with external chaining. There are max 256 codepoints to encode, and they will normally be well distributed in their lower 8 bits. Hash on the low 8 bits (just mask), and chain to an area with 256 entries. Modest storage, normally short chains, therefore fast encoding.
This is what is currently done: a hash map with 256 keys. You are complaining about the performance of that algorithm. The issue of external chaining is likely irrelevant: there likely are no collisions, even though Python uses open addressing.
I think I'm complaining about the implementation, though on decode, not encode.
In any case, there are likely to be collisions in my scheme. Over the next few days I will try to do it myself, but I will need to learn Pyrex, some of the Python C API, and more about Python to do it.
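For concreteness, the proposed structure might look like this (a toy sketch in Python; an actual implementation would use C arrays with external chain storage, and these names are invented):

    def build_buckets(encoding_map):
        # 256 buckets, hashed on the low 8 bits of the code point;
        # code points that collide share a (normally short) chain.
        buckets = [[] for i in range(256)]
        for codepoint, byte in encoding_map.items():
            buckets[codepoint & 0xFF].append((codepoint, byte))
        return buckets

    def encode_char(buckets, codepoint):
        # Walk the chain; for most charmaps it has exactly one entry.
        for cp, byte in buckets[codepoint & 0xFF]:
            if cp == codepoint:
                return byte
        raise UnicodeError("unmapped code point %#x" % codepoint)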
...I suggest instead just /caching/ the translation in C arrays stored with the codec object. The cache would be invalidated on any write to the codec's mapping dictionary, and rebuilt the next time anything was translated. This would maintain the present semantics, work with current codecs, and still provide the desired speed improvement.
That is not implementable. You cannot catch writes to the dictionary.
I should have been more clear. I am thinking about using a proxy object in the codec's 'encoding_map' and 'decoding_map' slots, one that will forward all the dictionary operations. The proxy will delete the cache on any call which changes the dictionary contents. There are proxy classes and dictproxy (I don't know how it's implemented yet), so it seems doable, at least as far as I've gotten.
Note that this caching is done by new code added to the existing C functions (which, if I have it right, are in unicodeobject.c). No architectural changes are made; no existing codecs need to be changed; everything will just work.
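A minimal sketch of such a proxy, assuming a dict subclass is an acceptable stand-in for a true proxy (only two mutators shown; update(), pop(), clear(), etc. would need the same treatment):

    class InvalidatingDict(dict):
        """A mapping that drops the codec's cached arrays on mutation."""
        def __init__(self, data, invalidate_cache):
            dict.__init__(self, data)
            self.invalidate_cache = invalidate_cache

        def __setitem__(self, key, value):
            self.invalidate_cache()     # cache is rebuilt on next translate
            dict.__setitem__(self, key, value)

        def __delitem__(self, key):
            self.invalidate_cache()
            dict.__delitem__(self, key)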
Please try to implement it. You will find that you cannot. I don't see how regenerating/editing the codecs could be avoided.
Will do!
TonyN.:' <mailto:tonynelson@georgeanelson.com> ' <http://www.georgeanelson.com/>
Tony Nelson wrote:
I have written my fastcharmap decoder and encoder. It's not meant to be better than the patch and other changes to come in a future version of Python, but it does work now with the current codecs.
It's an interesting solution.
To use, hook each codec to be sped up:
    import fastcharmap
    help(fastcharmap)
    fastcharmap.hook('name_of_codec')
    u = unicode('some text', 'name_of_codec')
    s = u.encode('name_of_codec')
No codecs were rewritten. It took me a while to learn enough to do this (Pyrex, more Python, some Python C API), and there were some surprises. Hooking in is grosser than I would have liked. I've only used it on Python 2.3 on FC3.
Indeed, and I would claim that you did not completely achieve your "no changes necessary" goal: you still have to install the hooks explicitly. I also think overriding codecs.charmap_{encode,decode} is really ugly.

Even if this could be simplified if you would modify the existing codecs, I still don't think supporting changes to the encoding dict is worthwhile. People will probably want to update the codecs in-place, but I don't think we need to guarantee that such an approach works independent of the Python version. People would be much better off writing their own codecs if they think the distributed ones are incorrect.

Regards, Martin
Martin v. Löwis wrote:
Tony Nelson wrote:
I have written my fastcharmap decoder and encoder. It's not meant to be better than the patch and other changes to come in a future version of Python, but it does work now with the current codecs.
It's an interesting solution.
I like the fact that encoding doesn't need a special data structure.
To use, hook each codec to be sped up:
    import fastcharmap
    help(fastcharmap)
    fastcharmap.hook('name_of_codec')
    u = unicode('some text', 'name_of_codec')
    s = u.encode('name_of_codec')
No codecs were rewritten. It took me a while to learn enough to do this (Pyrex, more Python, some Python C API), and there were some surprises. Hooking in is grosser than I would have liked. I've only used it on Python 2.3 on FC3.
Indeed, and I would claim that you did not completely achieve your "no changes necessary" goal: you still have to install the hooks explicitly. I also think overriding codecs.charmap_{encode,decode} is really ugly.
Even if this could be simplified if you would modify the existing codecs, I still don't think supporting changes to the encoding dict is worthwhile. People will probably want to update the codecs in-place, but I don't think we need to guarantee that such an approach works independent of the Python version. People would be much better off writing their own codecs if they think the distributed ones are incorrect.
Exactly. If you need another codec, write your own instead of patching an existing one on the fly!

Of course we can't accept Pyrex code in the Python core, so it would be great to rewrite the encoder as a patch to PyUnicode_EncodeCharmap(). This version must be able to cope with encoding tables that are random strings without crashing. We've already taken care of decoding. What we still need is a new gencodec.py and regenerated codecs.

Bye, Walter Dörwald
Walter Dörwald wrote:
We've already taken care of decoding. What we still need is a new gencodec.py and regenerated codecs.
I'll take care of that; just haven't gotten around to it yet.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Oct 14 2005)
Python/Zope Consulting and Support ...        http://www.egenix.com/
mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/

::: Try mxODBC.Zope.DA for Windows, Linux, Solaris, FreeBSD for free! :::
Walter Dörwald wrote:
Of course we can't accept Pyrex code in the Python core, so it would be great to rewrite the encoder as a patch to PyUnicode_EncodeCharmap(). This version must be able to cope with encoding tables that are random strings without crashing.
I don't think this will be necessary. I personally dislike the decoding tables, as I think a straightforward trie will do better than a hashtable. Regards, Martin
I have put up a new, packaged version of my fast charmap module at <http://georgeanelson.com/fastcharmap>. Hopefully it is packaged properly and works properly (it works on my FC3 Python 2.3.4 system). This version is over 5 times faster than the base codec according to Hye-Shik Chang's benchmark (mostly from compiling it with -O3).

I bring it up here mostly because I mention in its docs that improved faster charmap codecs are coming from the Python developers. Is it OK to say that, and have I said it right? I'll take that out if you folks want.

I understand that my module is not favored by Martin v. Löwis, and I don't advocate it becoming part of Python. My web page and docs say that it may be useful until Python has the faster codecs. It allows changing the mappings because that is part of the current semantics -- a new version of Python can certainly change those semantics.

I want to thank you all for so quickly going to work on the problem of making charmap codecs faster. It's to the benefit of Python users everywhere to have faster charmap codecs in Python. Your quickness impressed me.

BTW, Martin, if you care to, would you explain to me how a Trie would be used for charmap encoding? I know a couple of approaches, but I don't know how to do it fast. (I've never actually had the occasion to use a Trie.)

____________________________________________________________________
TonyN.:' <mailto:tonynelson@georgeanelson.com> ' <http://www.georgeanelson.com/>
Tony Nelson wrote:
BTW, Martin, if you care to, would you explain to me how a Trie would be used for charmap encoding? I know a couple of approaches, but I don't know how to do it fast. (I've never actually had the occasion to use a Trie.)
I currently envision a three-level trie, with 5, 4, and 7 bits. You take the Unicode character (only characters below U+FFFF supported), and take the uppermost 5 bits, as index in an array. There you find the base of a second array, to which you add the next 4 bits, which gives you an index into a third array, where you add the last 7 bits. This gives you the character, or 0 if it is unmappable.

    struct encoding_map {
        unsigned char level0[32];
        unsigned char *level1;
        unsigned char *level2;
    };

    struct encoding_map *table;
    Py_UNICODE character;
    int level1 = table->level0[character>>11];
    if (level1 == 0xFF) raise unmapped;
    int level2 = table->level1[16*level1 + ((character>>7) & 0xF)];
    if (level2 == 0xFF) raise unmapped;
    int mapped = table->level2[128*level2 + (character & 0x7F)];
    if (mapped == 0) raise unmapped;

Over a hashtable, this has the advantage of not having to deal with collisions. Instead, it guarantees you a lookup in constant time. It is also quite space-efficient: all tables use bytes as indices. As each level0 entry deals with 2048 characters, most character maps will only use one or two level1 blocks, meaning 16 or 32 bytes for level1. The number of level2 blocks required depends on the number of 128-character rows which the encoding spans; for most encodings, three or four such blocks will be sufficient (with ASCII typically spanning one such block), causing the entire memory consumption for many encodings to be less than 600 bytes.

It would be possible to remove one level of indirection (giving some more speed) at the expense of additional memory: for example, an 8-8 trie would use 256 bytes for level 0, and then 256 bytes for each Unicode row where the encoding has characters, likely resulting in 1 KiB for a typical encoding.

Hye-Shik Chang's version of the fast codec uses such an 8-8 trie, but conserves space by observing that most 256-char rows are only sparsely used by encodings (e.g. if you have only ASCII, you use only 128 characters from row 0). He therefore strips unused cells from the row, by also recording the first and last character per row. This brings some space back, at the expense of slow-down again.

Regards, Martin
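To make the construction concrete, here is one way the three tables might be built and consulted in Python (an untested illustration with invented names; note that byte values of 0 hit the NUL ambiguity discussed below):

    def build_trie(encoding_map):
        # encoding_map: Unicode ordinal -> byte value, as in a charmap codec.
        level0 = [0xFF] * 32
        level1 = []
        level2 = []
        codepoints = encoding_map.keys()
        codepoints.sort()
        for cp in codepoints:
            i0, i1, i2 = cp >> 11, (cp >> 7) & 0xF, cp & 0x7F
            if level0[i0] == 0xFF:              # allocate a level1 block
                level0[i0] = len(level1) // 16
                level1.extend([0xFF] * 16)
            j = level0[i0] * 16 + i1
            if level1[j] == 0xFF:               # allocate a level2 block
                level1[j] = len(level2) // 128
                level2.extend([0] * 128)
            level2[level1[j] * 128 + i2] = encoding_map[cp]
        return level0, level1, level2

    def lookup(tables, cp):
        level0, level1, level2 = tables
        b1 = level0[cp >> 11]
        if b1 == 0xFF:
            return None                         # unmapped
        b2 = level1[16 * b1 + ((cp >> 7) & 0xF)]
        if b2 == 0xFF:
            return None
        byte = level2[128 * b2 + (cp & 0x7F)]
        return byte or None                     # 0 also reads as "unmapped"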
At 11:56 AM +0200 10/16/05, Martin v. Löwis wrote:
Tony Nelson wrote:
BTW, Martin, if you care to, would you explain to me how a Trie would be used for charmap encoding? I know a couple of approaches, but I don't know how to do it fast. (I've never actually had the occasion to use a Trie.)
I currently envision a three-level trie, with 5, 4, and 7 bits. You take the Unicode character (only characters below U+FFFF supported), and take the uppermost 5 bits, as index in an array. There you find the base of a second array, to which you add the next 4 bits, which gives you an index into a third array, where you add the last 7 bits. This gives you the character, or 0 if it is unmappable.
Umm, 0 (NUL) is a valid output character in most of the 8-bit character sets. It could be handled by having a separate "exceptions" string of the unicode code points that actually map to the exception char. Usually "exceptions" would be a string of length 1. Suggested changes below.
    struct encoding_map {
        unsigned char level0[32];
        unsigned char *level1;
        unsigned char *level2;
        Py_UNICODE *exceptions;
    };
    struct encoding_map *table;
    Py_UNICODE character;
    int level1 = table->level0[character>>11];
    if (level1 == 0xFF) raise unmapped;
    int level2 = table->level1[16*level1 + ((character>>7) & 0xF)];
    if (level2 == 0xFF) raise unmapped;
    int mapped = table->level2[128*level2 + (character & 0x7F)];
change:

    if (mapped == 0) raise unmapped;

to:

    if (mapped == 0) {
        Py_UNICODE *ep;
        for (ep = table->exceptions; *ep; ep++)
            if (*ep == character) break;
        if (!*ep) raise unmapped;
    }
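In Python terms, the full lookup with the exceptions string might read (rough sketch, names invented):

    def encode_char(level0, level1, level2, exceptions, cp):
        # exceptions: the (usually length-1) unicode string of code points
        # that genuinely map to byte 0, resolving the NUL ambiguity.
        b1 = level0[cp >> 11]
        if b1 != 0xFF:
            b2 = level1[16 * b1 + ((cp >> 7) & 0xF)]
            if b2 != 0xFF:
                byte = level2[128 * b2 + (cp & 0x7F)]
                if byte != 0:
                    return byte
                if unichr(cp) in exceptions:
                    return 0            # this code point really is NUL
        raise UnicodeError("unmapped code point %#x" % cp)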
Over a hashtable, this has the advantage of not having to deal with collisions. Instead, it guarantees you a lookup in a constant time.
OK, I see the benefit. Your code is about the same amount of work as the hash table lookup in instructions, indirections, and branches, normally uses less of the data cache, and has a fixed running time. It may use one more branch, but its branches are easily predicted. Thank you for explaining it.
It is also quite space-efficient: all tables use bytes as indices. As each level0 entry deals with 2048 characters, most character maps will only use one or two level1 blocks, meaning 16 or 32 bytes for level1. The number of level2 blocks required depends on the number of 128-character rows which the encoding spans; for most encodings, three or four such blocks will be sufficient (with ASCII typically spanning one such block), causing the entire memory consumption for many encodings to be less than 600 bytes. ...
As you are concerned about pathological cases for hashing (that would make the hash chains long), it is worth noting that in such cases this data structure could take 64K bytes. Of course, no such case occurs in standard encodings, and 64K is affordable as long as it is rare.

____________________________________________________________________
TonyN.:' <mailto:tonynelson@georgeanelson.com> ' <http://www.georgeanelson.com/>
Tony Nelson wrote:
Umm, 0 (NUL) is a valid output character in most of the 8-bit character sets. It could be handled by having a separate "exceptions" string of the unicode code points that actually map to the exception char.
Yes. But only U+0000 should normally map to 0. It could be special-cased altogether.
As you are concerned about pathological cases for hashing (that would make the hash chains long), it is worth noting that in such cases this data structure could take 64K bytes. Of course, no such case occurs in standard encodings, and 64K is affordable as long as it is rare.
I'm not concerned with long hash chains, I dislike having collisions in the first place (even if they are only for two code points). Having to deal with collisions makes the code more complicated, and less predictable. It's primarily a matter of taste: avoid hashtables if you can :-) Regards, Martin
participants (4)
- "Martin v. Löwis"
- M.-A. Lemburg
- Tony Nelson
- Walter Dörwald