[Python-Dev] len(chr(i)) = 2?

Wed Nov 24 10:03:29 CET 2010

James Y Knight writes:

 > a) You seem to be hung up on implementation details of emacs.

Hung up?  No.  It's the program whose text model I know best, and even
if its design could theoretically be a lot better for this purpose, I
can't say I've seen a real program whose model is obviously better for
the purpose of a language for implementing text editors.[1]  So it's not
obvious to me that its model can be ruled out on a priori grounds.  If
not, it would be nice if your new language could implement it
efficiently without contorted programming.

 >    But yes, positions should be stored as a byte offset into the
 >    utf8 string. NOT as number of codepoints since the beginning of
 >    the string. Probably you want it to be somewhat opaque, so that
 >    you actually have to specify whether you wanted to go to +1
 >    byte, codepoint, or grapheme.

Well, first of all, +1 byte should not be available to a text
iterator, at least not with the same iterator/position object that
implements character and/or grapheme movement.  (You seem to have
thought about this issue a lot, but mixing bytes with text units makes
me wonder how much practical implementation you've done.)

Second, incrementing to grapheme boundaries is relatively easy to do
efficiently, just as incrementing to a UTF-8 character boundary is
easy to do.  We already do the latter; the former is harder in
practice, but not a conceptual stretch.  That's not the question.  The
question is how do we identify an arbitrary position in the text?
Sometimes it's nice to have a numerical measure of size or location.
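
To make the distinction concrete, here is a minimal sketch (not taken
from any real editor) of stepping to the next UTF-8 character boundary
in a byte buffer, and of what it costs to turn a character count into
a byte offset.  The function names are hypothetical, and the buffer is
assumed to hold well-formed UTF-8.

```python
def next_char_boundary(buf: bytes, pos: int) -> int:
    """Return the byte offset of the next UTF-8 character boundary after pos.

    Assumes buf is well-formed UTF-8 and pos is itself on a boundary.
    """
    if pos >= len(buf):
        raise IndexError("already at end of buffer")
    pos += 1
    # Continuation bytes have the bit pattern 10xxxxxx; skip over them.
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


def char_offset_to_byte_offset(buf: bytes, n: int) -> int:
    """Find the byte offset of the n-th character by stepping from the start.

    Each step is cheap, but the whole walk is linear in n.
    """
    pos = 0
    for _ in range(n):
        pos = next_char_boundary(buf, pos)
    return pos


# 'a' is 1 byte, U+00E9 is 2, U+6F22 is 3, U+1F600 is 4, 'z' is 1.
buf = "a\u00e9\u6f22\U0001f600z".encode("utf-8")
print(char_offset_to_byte_offset(buf, 3))  # 6
```

Incrementing is a short loop over at most three continuation bytes;
locating an arbitrary numeric position is a scan from some known
point.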

It is not clear that position by grapheme count is going to be the
obvious way to determine position in a text.  E.g., for languages with
variable-metric characters, counting characters as a way of lining up
table columns is going the way of Tyrannosaurus.  In the Han-using
languages, yes, column counts within lines are going to be important
forever, because the characters are literally square for most
practical purposes ... but they don't use composing characters (all
the Japanese kana are precomposed, for example), so position by
grapheme is going to be very close to position by character, and fine
positioning will be done either by mouse or by incrementing the last
few characters.  Nor do I think operations like "advance 1,000,000
characters" will have less meaning than "advance 1,000,000 graphemes."
Both of them are just a way of saying "go way far away", end up in
about the same place, and where there's a bias, it will be pretty
consistent in a statistical sense for any given natural language (and
therefore, for 99% of users).
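
A rough illustration of how close the two counts are, using only the
standard unicodedata module.  The grapheme counter here is a
deliberately crude approximation (a base character plus any following
characters with a nonzero canonical combining class); it is an
assumption made for the sketch, not the full Unicode segmentation
algorithm, and the function name is hypothetical.

```python
import unicodedata


def approx_grapheme_count(text: str) -> int:
    """Crude grapheme count: every non-combining character starts a cluster.

    This ignores most of the real segmentation rules (Hangul jamo,
    ZWJ sequences, and so on); it only folds combining marks into the
    preceding base character.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# Three precomposed hiragana (GA, GI, GU): characters == graphemes.
kana = "\u304c\u304e\u3050"
print(len(kana), approx_grapheme_count(kana))  # 3 3

# The same text decomposed: each kana becomes base + voicing mark.
decomposed = unicodedata.normalize("NFD", kana)
print(len(decomposed), approx_grapheme_count(decomposed))  # 6 3

# Latin 'e' followed by a combining acute accent.
e_acute = "e\u0301"
print(len(e_acute), approx_grapheme_count(e_acute))  # 2 1
```

For precomposed text the two measures coincide; for decomposed text
they differ by a factor that depends on how many marks the script
uses.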

 > But once you [the language implementor] are providing correct
 > abstractions for grapheme movement, it's just as easy to also
 > provide an abstraction for codepoint movement, and make your
 > low-level implementation of the iterator object be a byte-offset
 > into a UTF8 buffer.

Sure, that's fine for something that just iterates over the text.  But
if you actually need to remember positions, or regions, to jump to
later or to communicate to other code that manipulates them, doing
this stuff the straightforward way (just copying the whole iterator
object to hang on to its state) becomes expensive.  You end up
proliferating types that all do the same kind of thing.  Judicious use
of inheritance helps, but getting the fundamental abstraction right is
hard.  Or at least, Emacs hasn't found it in 20 years of trying.

OTOH, all that stuff "just works" and just works efficiently, up to
the grapheme vs. character issue, with an array.

About that issue, to go back to tired old Emacs, *all* of the things I
can think of that I might want to do by grapheme (display, insert,
delete, move a few places) do fit the "increment until done" model.
These things already work quite well for the variable-width buffer
that "multilingual" Emacsen use, whether the old Mule encoding or
UTF-8.  So I can see how the UTF-8 model with appropriate iterators
for characters and graphemes can work well for lots of applications
and use cases.

But Emacs already has opaque "markers", yet nevertheless the use of
integer character positions in strings and buffers has survived.  That
*may* have to do with mutability, and the "all the world is a buffer"
design, as Glyph suggested, but I think it more likely that markers
are very expensive to create and use compared to integers.  Perhaps an
editor of power similar to Emacs could be implemented with string
operations on lines, or the like, and these issues would go away.  But
it's not obvious to me.
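
The cost difference can be sketched with a toy model.  This is not how
Emacs implements markers; the Buffer and Marker classes below are
hypothetical and assume the simplest possible design, in which the
buffer keeps a list of live markers and adjusts each one on every
insertion.

```python
class Marker:
    """A position that the buffer keeps up to date across edits."""

    def __init__(self, pos: int) -> None:
        self.pos = pos


class Buffer:
    """Toy text buffer that tracks markers (illustrative only)."""

    def __init__(self, text: str = "") -> None:
        self.text = text
        self.markers: list[Marker] = []

    def make_marker(self, pos: int) -> Marker:
        # Creating a marker allocates an object and registers it.
        m = Marker(pos)
        self.markers.append(m)
        return m

    def insert(self, pos: int, s: str) -> None:
        self.text = self.text[:pos] + s + self.text[pos:]
        # Every insertion pays for every live marker.
        for m in self.markers:
            if m.pos >= pos:
                m.pos += len(s)


buf = Buffer("hello world")
marker = buf.make_marker(6)  # tracked by the buffer
plain = 6                    # a bare integer: free to create, never updated

buf.insert(0, ">> ")
print(buf.text[marker.pos:])  # world
print(buf.text[plain:])       # lo world
```

The integer costs nothing to create or copy but goes stale after the
edit; the marker stays correct, at the price of bookkeeping on every
insertion.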

Footnotes: 
[1]  Yes, I know that not all programs are text editors.  So shoot
me.  It's still the text manipulation program I know best, and it's
not obvious to me that it's the unique class that would need these
features.