[Python-Dev] len(chr(i)) = 2?
Stephen J. Turnbull
stephen at xemacs.org
Wed Nov 24 03:44:40 CET 2010
James Y Knight writes:
> You put a smiley, but, in all seriousness, I think that's actually
> the right thing to do if anyone writes a new programming
> language. It is clearly the right thing if you don't have to be
> concerned with backwards-compatibility: nobody really needs to be
> able to access the Nth codepoint in a string in constant time, so
> there's not really any point in storing a vector of codepoints.
A sad commentary on the state of Emacs usage, "nobody".
The theory is that accessing the first character of a region in a
string often occurs as a primitive operation in O(N) or worse
algorithms, sometimes without enough locality at the "collection of
regions" level to give a reasonably small average access time.
In practice, any *Emacs user can tell you that yes, we do need to be
able to access the Nth codepoint in a buffer in constant time. The
O(N) behavior of current Emacs implementations means that people often
use a binary coding system on large files. Yes, some position caching
is done, but if you have a large file (eg, a mail file) which is
virtually segmented using pointers to regions, locality gets lost.
(This is not a design bug, this is a fundamental requirement: consider
fast switching between threaded view and author-sorted view.)
And of course an operation that sorts regions in a buffer using
character pointers will have the same problem. Working with memory
pointers, OTOH, sucks more than that; GNU Emacs recently bit the
bullet and got rid of their higher-level memory-oriented APIs, all of
the Lisp structures now work with pointers, and only the very
low-level structures know about character-to-memory pointer
translation.
This performance issue is perceptible even on 3GHz machines with not
so large (50MB) mbox files. It's *horrid* if you do something like
"occur" on a 1GB log file, then try randomly jumping to detected log
entries.
More information about the Python-Dev
mailing list