On 8/24/2011 12:34 PM, Guido van Rossum wrote:
On Wed, Aug 24, 2011 at 11:52 AM, Glenn Linderman <v+python@g.nevcal.com> wrote:
On 8/24/2011 9:00 AM, Stefan Behnel wrote:

Nick Coghlan, 24.08.2011 15:06:

On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote:

In utf16.py, attached to http://bugs.python.org/issue12729
I propose for consideration a prototype of a different solution to the 'mostly
BMP chars, few non-BMP chars' case. Rather than expand every character from
2 bytes to 4, attach an array cpdex of character (i.e. code point, not code
unit) indexes. Then for indexing and slicing, the correction is simple,
simpler than I first expected:
  code-unit-index = char-index + bisect.bisect_left(cpdex, char_index)
where code-unit-index is the adjusted index into the full underlying
double-byte array. This adds a time penalty of log2(len(cpdex)), but avoids
most of the space penalty and the consequent time penalty of moving more
bytes around and increasing cache misses.
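
Roughly, in code (an untested sketch; the class and attribute names here are
illustrative, not necessarily what utf16.py uses, and it assumes a wide build
or a string type that yields whole code points when iterated):

import bisect

class UTF16String:
    """Store UTF-16 code units plus the character indexes of the
    non-BMP characters, and correct indexes with a binary search."""

    def __init__(self, s):
        self.units = []     # one UTF-16 code unit per entry
        self.cpdex = []     # character index of each non-BMP character
        for char_index, ch in enumerate(s):
            cp = ord(ch)
            if cp > 0xFFFF:                 # non-BMP: store a surrogate pair
                self.cpdex.append(char_index)
                cp -= 0x10000
                self.units.append(0xD800 + (cp >> 10))    # high surrogate
                self.units.append(0xDC00 + (cp & 0x3FF))  # low surrogate
            else:
                self.units.append(cp)

    def __len__(self):
        # number of characters (code points), not code units
        return len(self.units) - len(self.cpdex)

    def __getitem__(self, char_index):
        # every non-BMP character before char_index adds one extra code unit
        unit_index = char_index + bisect.bisect_left(self.cpdex, char_index)
        unit = self.units[unit_index]
        if 0xD800 <= unit < 0xDC00:         # high surrogate: recombine the pair
            low = self.units[unit_index + 1]
            return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
        return chr(unit)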

Interesting idea, but putting on my C programmer hat, I say -1.

Non-uniform cell size = not a C array = standard C array manipulation
idioms don't work = pain (no matter how simple the index correction
happens to be).

The nice thing about PEP 383 is that it gives us the smallest storage
array that is both an ordinary C array and has sufficiently large
individual elements to handle every character in the string.

+1

Yes, this sounds like a nice benefit, but the problem is that it is false.  The
correct statement would be:

  The nice thing about PEP 383 is that it gives us the smallest storage
  array that is both an ordinary C array and has sufficiently large
  individual elements to handle every Unicode codepoint in the string.
(PEP 393, I presume. :-)

This statement might yet be made true :)

As Tom eloquently describes in the referenced issue (is Tom ever
non-eloquent?), not all characters can be represented in a single codepoint.
But this is also beside the point (except insofar as we have to
remind ourselves not to confuse the two in docs).

In the docs, yes, and in programmer's minds (influenced by docs).

It seems there are three concepts in Unicode -- code units, codepoints, and
characters -- none of which are equivalent (and the first of which varies
according to the encoding). It also seems (to me) that Unicode has failed
in its original premise of being an easy way to handle "big char" for "all
languages" with fixed-size elements. But it is not clear that that premise
is achievable regardless of the size of "big char" once mixed directionality
is desired, and support for some single languages requires mixed
directionality, not to mention mixed-language support.
I see nothing wrong with having the language's fundamental data types
(i.e., the unicode object, and even the re module) defined in
terms of codepoints, not characters, and I see nothing wrong with
len() returning the number of codepoints (as long as it is advertised
as such). 

Me neither.

After all, UTF-8 also defines an encoding for a sequence of
codepoints. Characters that require two or more codepoints are not
represented specially in UTF-8 -- they are represented as two or more
encoded codepoints. The added requirement that UTF-8 must only be used
to represent valid characters is just that -- it doesn't affect how
strings are encoded, just what is considered valid at a higher level.
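
For example (Python 3; the byte values in the comments are the standard
UTF-8 encodings of the two codepoints):

s = 'e\u0301'             # 'e' plus COMBINING ACUTE ACCENT: displays as 'é',
                          # but is two codepoints and one character
len(s)                    # 2 -- len() counts codepoints
s.encode('utf-8')         # b'e\xcc\x81' -- just the two codepoints encoded
                          # back to back; UTF-8 gives the combined character
                          # no special representation of its own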

Yes, this is true.  In one sense, though, since UTF-8-supporting code already has to deal with variable-length codepoint encoding, support for variable-length character encoding seems like a minor extension; it doesn't upset any fixed-width optimizations, because none can be used anyway.

Given the required variability of character size in all presently defined
Unicode encodings, I tend to agree with Tom that UTF-8, together with some
technique of translating character index to code unit offset, may provide
the best overall space utilization and adequate CPU efficiency.
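
One such technique, as a rough sketch (nothing here is a concrete proposal;
the names and the sampling interval K are arbitrary): keep the text as UTF-8
bytes plus the byte offset of every K-th codepoint, so indexing costs an O(1)
table lookup plus at most K codepoints of scanning.

K = 32

def build_offsets(data):
    """Byte offsets of codepoints 0, K, 2K, ... in the UTF-8 bytes `data`."""
    offsets = []
    codepoint = 0
    for byte_pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:        # not a continuation byte: a codepoint starts here
            if codepoint % K == 0:
                offsets.append(byte_pos)
            codepoint += 1
    return offsets

def codepoint_at(data, offsets, index):
    """Return the codepoint at codepoint index `index`, as a 1-character str."""
    pos = offsets[index // K]          # jump to the nearest sampled offset
    remaining = index % K
    while remaining:                   # scan forward at most K-1 codepoints
        pos += 1
        if data[pos] & 0xC0 != 0x80:
            remaining -= 1
    end = pos + 1                      # find the end of this codepoint
    while end < len(data) and data[end] & 0xC0 == 0x80:
        end += 1
    return data[pos:end].decode('utf-8')

The space overhead is one offset per K codepoints, and for a fixed K the
lookup stays effectively O(1).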
There is no doubt that UTF-8 is the most space efficient. I just don't
think it is worth giving up O(1) indexing of codepoints -- it would
change programmers' expectations too much.

Programmers that have to deal with bidi or composed characters shouldn't have such expectations, of course.   But there are many programmers who do not, or at least who think they do not, and they can retain their O(1) expectations, I suppose, until it bites them.

OTOH I am sold on getting rid of the added complexities of "narrow
builds" where not even all codepoints can be represented without using
surrogate pairs (i.e. two code units per codepoint) and indexing uses
code units instead of codepoints. I think this is an area where PEP
393 has a huge advantage: users can get rid of their exceptions for
narrow builds.
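
For instance, this is exactly the kind of build-dependent difference I mean
(narrow-build results are shown in the comments; a wide build, or PEP 393,
gives the "obvious" answers):

s = '\U00010400'          # DESERET CAPITAL LETTER LONG I, one codepoint
len(s)                    # narrow build: 2 (two code units); wide build: 1
hex(ord(s[0]))            # narrow build: '0xd801' (a high surrogate);
                          # wide build: '0x10400'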

Yep, the only justification for narrow builds is interfacing to an underlying broken OS that happens to use that encoding... it might be slightly more efficient when making API calls to such an OS.  But most interesting programs do much more than I/O.

On the
other hand, there are large subsets of applications that simply do not
require support for bidirectional text or composed characters, and for
those, it remains to be seen whether the price paid to support those
features is too high. So far, we don't have implementations to benchmark
to figure that out!
I think you are saying that many apps can ignore the distinction
between codepoints and characters. Given the complexity of bidi
rendering and normalization (which will always remain an issue) I
agree; this is much less likely to be a burden than the narrow-build
issues with code units vs. codepoints.

What should the stdlib do? It should try to skirt the issue where it
can (using the garbage-in-garbage-out principle) and advertise what it
supports where there is a difference. I don't see why all the stdlib
should be made aware of multi-codepoint-characters and other bidi
requirements, but it should be clear to the user who has such
requirements which stdlib operations they can safely use.

It would seem helpful if the stdlib could have some support for efficient handling of Unicode characters in some representation.  It would help address the class of applications that does care.  Adding such support sooner rather than later could be a performance boost for applications that care about full character support, and I can only see the number of such applications increasing over time.  It could be built as a subtype of str, perhaps, but if done in Python, there would likely be a significant performance hit when going from str to "unicodeCharacterStr".
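
Something along these lines, perhaps (a crude sketch; the name is invented, and grouping by combining marks only approximates real grapheme clusters -- it ignores Hangul jamo, ZWJ sequences, and the rest of UAX #29):

import unicodedata

class CharacterString(str):
    """A str with (approximate) user-perceived-character operations."""

    def _clusters(self):
        cluster = ''
        for cp in self:
            if cluster and unicodedata.combining(cp) == 0:
                yield cluster          # a new base codepoint starts a new cluster
                cluster = cp
            else:
                cluster += cp          # combining marks stay with their base
        if cluster:
            yield cluster

    def char_len(self):
        return sum(1 for _ in self._clusters())

    def char_at(self, index):
        for i, cluster in enumerate(self._clusters()):
            if i == index:
                return cluster
        raise IndexError(index)

# s = CharacterString('e\u0301tude')   # 'étude' with a combining accent
# len(s) == 6 codepoints, s.char_len() == 5, s.char_at(0) == 'e\u0301'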

What does this mean for Python?  Well, if Python is willing to limit its
support for applications to the subset for which the "big char" solution is
sufficient, then PEP 393 provides a way to do that, and it looks to be pretty
effective at reducing memory consumption for applications that use short
strings, most of which can be classified by content into the 1-byte or
2-byte representations.  Applications that work with long strings are more
likely to be bitten by the occasional "outlier" character that is larger than
the average character, doubling or quadrupling the space needed to represent
such strings and eliminating a significant portion of the space savings the
PEP provides for other applications.
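
Concretely, the kind of effect I have in mind (illustrative only -- the PEP is not implemented yet, so the comments show the sizes it predicts, not measurements):

import sys
ascii_only   = 'a' * 10000
with_outlier = 'a' * 9999 + '\U0001F600'   # one astral character at the end
sys.getsizeof(ascii_only)      # roughly 10 KB: one byte per character
sys.getsizeof(with_outlier)    # roughly 40 KB: the single outlier forces
                               # four bytes per character for the whole string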
This seems more of an intuition than a fact. I could easily imagine
the facts being that even for large strings, usually either there are
no outliers, or there is a significant number of outliers. (E.g. Tom
Christiansen's OSCON preso falls in the latter category :-).

As long as it *works* I don't really mind that there are some extreme
cases that are slow. You'll always have that.

Yes, it is intuition, regarding memory consumption.  It is not at all clear how different the "occasional outlier character" case is from your "significant number of outliers" case.  Tom's presentation certainly covered bodies of text that varied from all-ASCII to fully non-ASCII.

The memory characteristics of long string handling would certainly be non-intuitive: you can process a file of size N with a particular program, but you can't process a smaller file because it has a funny character in it, and suddenly you are out of space.


Benchmarks may or may not fully
reflect the actual requirements of all applications, so conclusions based on
benchmarking can easily be blind-sided by the realities of other applications,
unless the benchmarks are carefully constructed.
Yeah, it's a learning process.

It is possible that the ideas in PEP 393, with its support for multiple
underlying representations, could be the basis for more complex
representations that would better support characters rather than only
code points, but Martin has stated he is not open to additional
representations, so the PEP itself cannot be that basis (although, depending
on the care taken in implementing the PEP, the implementation may still
provide that basis).
There is always the possibility of representations that are defined
purely by userland code and can only be manipulated by that specific
code. But expecting C extensions to support new representations that
haven't been defined yet sounds like a bad idea.

While they can and should be prototyped in Python for functional correctness, I would rather expect such representations to be significantly slower in Python than in C.  But that is just intuition also.  The PEP makes a nice extension to str representations, but I'm not sure it picks the most useful ones: the cases it picks are well understood and in use, but they may not be the most effective ones, given the strange memory-consumption characteristics that outliers can introduce.  My intuition says that a UTF-8 representation (or Tom's/Perl's looser utf8) would be a handy representation to have.  But maybe it should be a different type than str... str8?  I suppose that is -ideas land.