[Python-Dev] len(chr(i)) = 2?
Stephen J. Turnbull
stephen at xemacs.org
Wed Nov 24 06:07:52 CET 2010
Note that I'm not saying that there shouldn't be a UTF-8 string type;
I'm just saying that for some purposes it might be a good idea to keep
UTF-16 and UTF-32 string types around.
Glyph Lefkowitz writes:
> > The theory is that accessing the first character of a region in a
> > string often occurs as a primitive operation in O(N) or worse
> > algorithms, sometimes without enough locality at the "collection of
> > regions" level to give a reasonably small average access time.
>
> I'm not sure what you mean by "the theory is". Whose theory? About what?
Mine. About why somebody somewhere someday would need fast random
access to character positions. "Nobody ever needs that" is a strong
claim.
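
To put a number on "fast": here's a toy sketch (mine, nobody's real
implementation) of what finding the Nth codepoint costs in a
variable-width encoding like UTF-8, where a fixed-width array answers
the same question with a single multiply:

    def nth_codepoint_offset(data: bytes, n: int) -> int:
        """Byte offset of the n-th codepoint in well-formed UTF-8.

        Every lookup is O(n); a fixed-width (UCS-2/UCS-4) array
        would be O(1).
        """
        offset = 0
        for _ in range(n):
            lead = data[offset]
            if lead < 0x80:
                offset += 1      # ASCII
            elif lead < 0xE0:
                offset += 2      # 2-byte sequence
            elif lead < 0xF0:
                offset += 3      # 3-byte sequence
            else:
                offset += 4      # 4-byte sequence
        return offset

E.g., nth_codepoint_offset("héllo".encode("utf-8"), 2) is 3, because
the é takes two bytes.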
> > In practice, any *Emacs user can tell you that yes, we do need to be
> > able to access the Nth codepoint in a buffer in constant time. The
> > O(N) behavior of current Emacs implementations means that people often
> > use a binary coding system on large files. Yes, some position caching
> > is done, but if you have a large file (eg, a mail file) which is
> > virtually segmented using pointers to regions, locality gets lost.
> > (This is not a design bug, this is a fundamental requirement: consider
> > fast switching between threaded view and author-sorted view.)
>
> Sounds like a design bug to me. Personally, I'd implement "fast
> switching between threaded view and author-sorted view" the same
> way I'd address any other multiple-views-on-the-same-data problem.
> I'd retain data structures for both, and update them as the
> underlying model changed.
Um, that's precisely the design I'm talking about. But as you
recognize later, the message content is not part of those structures
because there's no real point in copying it *if you have fast access
to character positions*. In a character-addressed design built on
variable-width characters, there can be a perceptible delay in
accessing even the "next" message's content if you're in the wrong
view.
> These representations may need to maintain cursors into the
> underlying character data, if they must retain giant wads of
> character data as an underlying representation (arguably the _main_
> design bug in Emacs, that it encourages you to do that for
> everything, rather than imposing a sensible structure), but those
> cursors don't need to be code-point counters; they could be byte
> offsets, or opaque handles whose precise meaning varied with the
> potentially variable underlying storage.
Both byte offsets and opaque handles really, really suck to design,
implement, and maintain if Lisp- or Python-level users can use them.
They're hard enough to do when you can hide them behind internal APIs,
but if they're accessible to users they're an endless source of user
bugs. What was that you were saying about the difficulty of
remembering which argument is the fd? It's like that. Sure, you can
design APIs to help get that right, but it's not easy to provide one
that can be used for all the different applications out there.
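
To illustrate, here's the shape of the opaque-cursor API people end
up writing (a hypothetical sketch, not anybody's real API), with the
stale-handle trap already built in:

    class Cursor:
        """Opaque position in a UTF-8 buffer; hides the byte offset."""
        def __init__(self, buf, off=0):
            self._buf, self._off = buf, off   # users must not touch _off

        def advance(self):
            """Return a cursor one whole character further on."""
            lead = self._buf[self._off]
            width = 1 if lead < 0x80 else 2 if lead < 0xE0 else \
                    3 if lead < 0xF0 else 4
            return Cursor(self._buf, self._off + width)

    # The trap: in a real editor the buffer mutates underneath, so a
    # saved cursor's offset silently goes stale -- and nothing stops
    # users from keeping such cursors around.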
> Also, please remember that Emacs couldn't be implemented with giant
> Python strings anyway: crucially, all of this stuff is _mutable_ in
> Emacs.
No, that's a red herring. The use-cases where Emacs users complain
most are browsing giant logs and reading old mail; neither needs the
content to be mutable (although of course it's a convenience in the
mail case if you delete messages or fetch new mail, but that could be
done with transaction logs that get appended to the on-disk file).
> Case in point: "occur" needs to scan the buffer anyway; you can't
> do better than linear time there. So you're going to iterate
> through the buffer, using one of the techniques that James
> proposed, and remember some locations. Why not just have those
> locations be opaque cursors into your data?
They are. But unless you're willing to implement correct character
motion, they need to be character indices, and resolving those to
actual buffer locations is slow. We've implemented caches, as does
Emacs, but they don't always get hits. Finding an arbitrary position
once can involve a perceptible delay on machines up to about 1GHz;
doing it in a loop (which mail programs have a habit of doing) can be
very painful.
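
Roughly the shape of such a cache, as a sketch of the general idea
(not XEmacs's actual code):

    import bisect

    class PositionCache:
        """Remember (char_index, byte_offset) pairs in well-formed
        UTF-8 data, so later lookups scan from the nearest cached
        position instead of from the start of the buffer."""

        def __init__(self, data):
            self.data = data
            self.chars = [0]      # cached character indices, sorted
            self.offs = [0]       # matching byte offsets

        def byte_offset(self, char_index):
            i = bisect.bisect_right(self.chars, char_index) - 1
            n, off = self.chars[i], self.offs[i]
            while n < char_index:            # linear scan on a miss
                lead = self.data[off]
                off += 1 if lead < 0x80 else 2 if lead < 0xE0 else \
                       3 if lead < 0xF0 else 4
                n += 1
            if self.chars[i] != char_index:  # remember for next time
                self.chars.insert(i + 1, char_index)
                self.offs.insert(i + 1, off)
            return off

A loop over scattered positions that misses this cache is exactly
where the perceptible delays come from.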
> In summary: you're right, in that James missed a spot. You need
> bidirectional, *copyable* iterators that can traverse the string by
> byte, codepoint, grapheme, or decomposed glyph.
That's a good start, yes. But once you talk about "remembering some
locations", you're implicitly talking about random access. Either you
maintain position indexes, which, naively implemented, can easily
approach the size of the text buffer itself (indexes are going to be
at least 4 bytes, possibly 8, per position, and something like
"occur" can generate a lot of positions) -- in which case you might as
well just use a representation that is an array in the first place --
or you need to implement a position cache, which can be very hairy to
do well. Or you can give user programs memory indices, and enjoy the
fun as the poor developers do things like "pos += 1", which works fine
on the ASCII data they have lying around, then wonder why they get
Unicode errors when they take substrings.
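
That failure mode takes about five lines to reproduce:

    text = "naïve"                          # 5 characters
    data = text.encode("utf-8")             # 6 bytes: the ï takes two
    pos = data.index("ï".encode("utf-8"))   # byte offset 2
    pos += 1                                # "advance one character" -- oops
    try:
        data[pos:].decode("utf-8")
    except UnicodeDecodeError as err:
        print(err)    # invalid start byte: we landed inside the ï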
I'm sure it all can be done, but I don't think it will be done right
the first time around. You may be right that designs better adapted
to large data sets than Emacs's "everything is a big buffer" will
almost always be available with reasonable effort. But remember, a
lot of good applications start small, when a flat array might make
lots of sense as the underlying structure, and then need to scale. If
you need to scale for paying customers, well, "ouch!", but you can
afford it; for many volunteer or startup projects, though, that takes
the wind right out of your sails.
Note that if the user doesn't use the private use area, in a UCS-2
build you have about 1.5K code points available for compressing
non-BMP characters into a 2-byte, valid-Unicode representation (of
course you need to save the table somewhere if the data ever leaves
your program, but that's easy). I find it hard to imagine that there
will
be many use-cases that need more than that many non-BMP characters.
So probably you can tell those few users who care to use a UCS-4
build; most of the array use-cases can be served by UCS-2. Note that
in my Japanese corpora, UTF-8 averages just about 2 bytes per
character anyway, and those are mail files, where two lines of
Japanese may be preceded by 2KB of ASCII-only header. I suspect
Hebrew, Arabic, and Cyrillic users will have similar experiences.
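
For concreteness, a sketch of the remapping idea (hypothetical; the
table layout is whatever suits you):

    PUA_START = 0xE000    # BMP Private Use Area

    def compress_to_bmp(text):
        """Remap non-BMP characters onto unused private-use code
        points so the result fits a fixed 2-byte representation.
        The returned table must accompany the data if it ever
        leaves the program."""
        table = {}                    # real codepoint -> PUA stand-in
        out = []
        for ch in text:
            cp = ord(ch)
            if cp > 0xFFFF:           # non-BMP character
                if cp not in table:
                    table[cp] = PUA_START + len(table)
                out.append(chr(table[cp]))
            else:
                out.append(ch)
        return "".join(out), table

So compress_to_bmp("a\U0001F600b") gives ("a\ue000b", {0x1F600:
0xE000}), and everything stays addressable as a flat 2-byte array.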
By the way, to send the ball back into your court, I have this feeling
that the demand for UTF-8 is once again driven by native English
speakers who are very shortly going to find themselves, and the data
they are most familiar with, very much in the minority. Of course the
market that benefits from UTF-8 compression will remain very large for
the immediate future, but in the grand scheme of things, most of the
world is going to prefer UTF-16 by a substantial margin.
N.B. I'm not talking about persistent storage, where it's 6 of one
and half a dozen of the other; you can translate UTF-8 to UTF-16 way
faster than you can read content from disk, of course.
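
If you doubt that, it's easy enough to measure on your own corpus
(illustrative only; absolute numbers vary by machine):

    import timeit

    data = ("日本語テキストのサンプル行。" * 50000).encode("utf-8")
    secs = timeit.timeit(
        lambda: data.decode("utf-8").encode("utf-16"), number=10) / 10
    print("%.0f MB/s UTF-8 -> UTF-16" % (len(data) / secs / 1e6))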