[Python-Dev] len(chr(i)) = 2?

Wed Nov 24 03:44:40 CET 2010

James Y Knight writes:

 > You put a smiley, but, in all seriousness, I think that's actually
 > the right thing to do if anyone writes a new programming
 > language. It is clearly the right thing if you don't have to be
 > concerned with backwards-compatibility: nobody really needs to be
 > able to access the Nth codepoint in a string in constant time, so
 > there's not really any point in storing a vector of codepoints.

A sad commentary on the state of Emacs usage, "nobody".

The theory is that accessing the first character of a region in a
string often occurs as a primitive operation in O(N) or worse
algorithms, sometimes without enough locality at the "collection of
regions" level to give a reasonably small average access time.
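To make the cost concrete, here is a minimal Python sketch of finding the Nth codepoint in a variable-width (UTF-8) buffer. The function name and the sample string are made up for illustration and are not taken from any Emacs or CPython source. The point is only that the lookup has to walk the buffer from the start.

```python
def utf8_offset_of_codepoint(buf: bytes, n: int) -> int:
    """Return the byte offset where codepoint number n (0-based) starts.

    The buffer is walked from the beginning, so the cost is O(n).
    Assumes buf holds well-formed UTF-8.
    """
    if n < 0:
        raise IndexError("codepoint index out of range")
    seen = 0
    for offset, byte in enumerate(buf):
        # Bytes of the form 0b10xxxxxx are continuation bytes; every
        # other byte starts a new codepoint.
        if byte & 0xC0 != 0x80:
            if seen == n:
                return offset
            seen += 1
    raise IndexError("codepoint index out of range")


# Hypothetical sample text mixing 1-, 2- and 3-byte characters.
text = "na\u00efve caf\u00e9, \u65e5\u672c\u8a9e"
buf = text.encode("utf-8")
offset = utf8_offset_of_codepoint(buf, 12)
print(offset, buf[offset:].decode("utf-8")[0])
```

If such a lookup is the primitive step inside an algorithm that is already O(N), the total becomes O(N^2) unless consecutive lookups happen to be close together.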

In practice, any *Emacs user can tell you that yes, we do need to be
able to access the Nth codepoint in a buffer in constant time.  The
O(N) behavior of current Emacs implementations means that people often
use a binary coding system on large files.  Yes, some position caching
is done, but if you have a large file (e.g., a mail file) which is
virtually segmented using pointers to regions, locality gets lost.
(This is not a design bug, this is a fundamental requirement: consider
fast switching between threaded view and author-sorted view.)
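The locality problem can be sketched with a toy model in Python. This is not how any Emacs actually implements its position cache. The class name, the single-entry cache, and the restart-from-zero rule are all simplifying assumptions, and `bytes_scanned` exists only to count work.

```python
class CachedUtf8Buffer:
    """UTF-8 bytes plus one remembered (codepoint index, byte offset) pair."""

    def __init__(self, data: bytes):
        self.data = data
        self.cache = (0, 0)      # (codepoint index, byte offset)
        self.bytes_scanned = 0   # work counter, for illustration only

    def offset_of(self, n: int) -> int:
        """Return the byte offset of codepoint n, scanning from the cache."""
        idx, off = self.cache
        if n < idx:
            # Simplification: no backward scan, restart from the beginning.
            idx, off = 0, 0
        while idx < n:
            if off >= len(self.data):
                raise IndexError("codepoint index out of range")
            off += 1
            self.bytes_scanned += 1
            # Skip continuation bytes (0b10xxxxxx) of the current codepoint.
            while off < len(self.data) and self.data[off] & 0xC0 == 0x80:
                off += 1
                self.bytes_scanned += 1
            idx += 1
        self.cache = (idx, off)
        return off


# 1000 two-byte characters stand in for a large non-ASCII file.
buf = CachedUtf8Buffer(("\u00e9" * 1000).encode("utf-8"))

for n in range(0, 1000, 10):            # in-order access: good locality
    buf.offset_of(n)
sequential_cost = buf.bytes_scanned

buf.bytes_scanned = 0
buf.cache = (0, 0)
for n in (900, 10, 800, 20, 700, 30):   # jumping between distant regions
    buf.offset_of(n)
scattered_cost = buf.bytes_scanned

print(sequential_cost / 100, scattered_cost / 6)  # bytes scanned per access
```

With in-order access each lookup only scans the short gap since the previous one. Once accesses follow a list of region pointers in some other order, as in a view sorted by author, nearly every lookup pays for a long scan, and the cache helps very little.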

And of course an operation that sorts regions in a buffer using
character pointers will have the same problem.  Working with memory
pointers, OTOH, sucks more than that; GNU Emacs recently bit the
bullet and got rid of their higher-level memory-oriented APIs.  All of
the Lisp structures now work with character pointers, and only the
very low-level structures know about character-to-memory pointer
translation.

This performance issue is perceptible even on 3GHz machines with not
so large (50MB) mbox files.  It's *horrid* if you do something like
"occur" on a 1GB log file, then try randomly jumping to detected log
entries.