PEP 393 Summer of Code Project

Hello all,

I have implemented an initial version of PEP 393 [1] -- "Flexible String Representation" -- as part of my Google Summer of Code project. My patch is hosted as a repository on bitbucket and I created a related issue on the bug tracker [2]. I posted documentation for the current state of the development in the wiki [3].

Current tests show a potential reduction in memory of about 20% and in CPU time of 50% for a join micro benchmark. Starting a new interpreter still causes 3244 calls that create compatibility Py_UNICODE representations; 263 strings are created using the old API while 62719 are created using the new API. More measurements are on the wiki page [3].

If there is interest, I would like to continue working on the patch with the goal of getting it into Python 3.3. Any and all feedback is welcome.

Regards, Torsten

[1]: http://www.python.org/dev/peps/pep-0393
[2]: http://bugs.python.org/issue12819
[3]: http://wiki.python.org/moin/SummerOfCode/2011/PEP393
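A minimal sketch of the kind of join micro benchmark referred to above (the post does not include the actual benchmark code; the string contents and sizes here are illustrative):

    import timeit

    # Time joining many short ASCII strings; under PEP 393 these would be
    # stored in the compact one-byte representation.
    t = timeit.timeit('"".join(words)',
                      setup='words = ["spam"] * 1000',
                      number=10000)
    print("%.3f seconds" % t)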

Hello, On Mon, 22 Aug 2011 14:58:51 -0400 Torsten Becker <torsten.becker@gmail.com> wrote:
A couple of minor comments:

- "The UTF-8 decoding fast path for ASCII only characters was removed and replaced with a memcpy if the entire string is ASCII." The fast path would still be useful for mostly-ASCII strings, which are extremely common (unless UTF-8 has become a no-op?).
- You could trim the debug results from the benchmark results; this may make them more readable.
- You could try to run stringbench, which can be found at http://svn.python.org/projects/sandbox/trunk/stringbench (*), and there's iobench (the text mode benchmarks) in the Tools/iobench directory.

(*) (yes, apparently we forgot to convert this one to Mercurial)

Regards Antoine.

On 23.08.2011 11:46, Xavier Morel wrote:
I know - I still question whether it is "extremely common" (so much as to justify a special case). I.e., on what application with what dataset would you gain what speedup, at the expense of what amount of extra lines, and potential slow-down for other datasets?

For the record, the optimization in question is the one where it masks a long word with 0x80808080L to see whether it is completely ASCII, and then copies four characters in an unrolled fashion. It stops doing so when it sees a non-ASCII character, and returns to that mode when it gets to the next aligned memory address that stores only ASCII characters.

In the PEP 393 approach, if the string has a two-byte representation, each character needs to be widened to two bytes, and likewise for four bytes. So three separate copies of the unrolled loop would be needed, one for each target size.

Regards, Martin
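For illustration, a transliteration of the fast path Martin describes into Python (CPython does this in C over aligned long words; the four-byte word size and the helper name are assumptions made for the sketch):

    ASCII_MASK = 0x80808080  # the high bit of each of four bytes

    def is_all_ascii(data: bytes) -> bool:
        # Test four bytes at a time, as in the unrolled C loop: any set
        # high bit means a non-ASCII byte somewhere in the word.
        n = len(data) - len(data) % 4
        for i in range(0, n, 4):
            if int.from_bytes(data[i:i + 4], "little") & ASCII_MASK:
                return False
        # Check the trailing (unaligned) bytes one by one.
        return all(b < 0x80 for b in data[n:])

    print(is_all_ascii(b"plain ascii"))                  # True
    print(is_all_ascii("caf\u00e9".encode("utf-8")))     # False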

Well, it's:

- all natural languages based on a variant of the latin alphabet,
- but also XML, JSON, HTML documents...
- and log files...
- in short, any kind of parsable format which is structurally ASCII but can contain arbitrary unicode

So I would say *most* unicode data out there is mostly-ASCII, even when it has Japanese characters in it. The rationale is that most unicode data processed by computers is structured. This optimization was done when trying to improve the speed of text I/O.
Do you have three copies of the UTF-8 decoder already, or do you use a stringlib-like approach? Regards Antoine.

This optimization was done when trying to improve the speed of text I/O.
So what speedup did it achieve, for the kind of data you talked about?
Do you have three copies of the UTF-8 decoder already, or do you use a stringlib-like approach?
It's a single implementation - see for yourself. Regards, Martin

On Tuesday 23 August 2011 at 13:51 +0200, "Martin v. Löwis" wrote:
This optimization was done when trying to improve the speed of text I/O.
So what speedup did it achieve, for the kind of data you talked about?
Since I don't have the numbers anymore, I've just saved the contents of https://linuxfr.org/news/le-noyau-linux-est-disponible-en-version%C2%A030 as a "linuxfr.html" file and then did:

    $ ./python -m timeit "with open('linuxfr.html', encoding='utf8') as f: f.read()"
    1000 loops, best of 3: 859 usec per loop

After disabling the fast path, I ran the micro-benchmark again:

    $ ./python -m timeit "with open('linuxfr.html', encoding='utf8') as f: f.read()"
    1000 loops, best of 3: 1.09 msec per loop

so that's a 20% speedup.
So why would you need three separate implementation of the unrolled loop? You already have a macro named WRITE_FLEXIBLE_OR_WSTR. Even without taking into account the unrolled loop, I wonder how much slower UTF-8 decoding becomes with that approach, by the way. Instead of testing the "kind" variable at each loop iteration, using a stringlib-like approach may be a better deal IMO. Of course we would first need to have various benchmark numbers once the current PEP 393 implementation is complete. Regards Antoine.

So why would you need three separate implementation of the unrolled loop? You already have a macro named WRITE_FLEXIBLE_OR_WSTR.
Depending on where the speedup comes from in this optimization, it may well be that the overhead of figuring out where to store the result eats the gain from the fast test.
Even without taking into account the unrolled loop, I wonder how much slower UTF-8 decoding becomes with that approach, by the way.
In some cases, tests show that it gets faster, overall, compared to 3.2. This is probably because strings take less memory, which means less copying, more cache locality, etc. Of course, it still may be possible to apply micro-optimizations to the new implementation.
Well, things have to be done in order:

1. the PEP needs to be approved
2. the performance bottlenecks need to be identified
3. optimizations should be applied

I'm not sure what you mean by "stringlib-like" approach - if you are talking about templating, I'd rather avoid this for maintainability reasons, unless significant improvements can be demonstrated. Torsten had a version that used macros for that, and it was a pain to debug. So we put correctness and readability first.

Regards, Martin

Sure, but the whole point of the PEP is to improve performance (I am dumping "memory consumption" in the "performance" bucket). That is, I suppose it will get approval based on its demonstrated benefits.
The point of templating is precisely to avoid macros, so that the code is natural to read and write and the compiler gives you the right line number when it finds an error. Regards Antoine.

On Tue, Aug 23, 2011 at 11:21 PM, Victor Stinner <victor.stinner@haypocalc.com> wrote:
As Martin noted, cache misses hurt performance so much on modern processors that making things use less memory overall can actually be a speed optimisation as well. Guessing where the remaining bottlenecks are is unlikely to be effective - profiling of the preliminary implementation will be needed. However, the idea that reducing the size of pure ASCII strings (which include all the identifiers in most code) by a factor of 2 or 4 (or so) results in a net speed increase definitely sounds plausible to me, even for non-string processing code. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 8/23/2011 9:21 AM, Victor Stinner wrote:
The current UCS2 Unicode string implementation, by design, quickly gives WRONG answers for len(), iteration, indexing, and slicing if a string contains any non-BMP (surrogate pair) Unicode characters. That may have been excusable when there essentially were no such extended chars, and the few there were, were almost never used. But now there are many more, with more being added to each Unicode edition. They include cursive Math letters that are used in English documents today. The problem will slowly get worse and Python, at least on Windows, will become a language to avoid for dependable Unicode document processing.

3.x needs a proper Unicode implementation that works for all strings on all builds. utf16.py, attached to http://bugs.python.org/issue12729, prototypes a different solution than the PEP for the above problems for the 'mostly BMP' case. I will discuss it in a different post.

-- Terry Jan Reedy
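Terry's complaint, shown concretely (a sketch; the printed result depends on the build):

    s = "\U0001D400"  # MATHEMATICAL BOLD CAPITAL A, a non-BMP code point
    # Narrow (UCS-2) build, e.g. on Windows: len(s) == 2 and s[0] is the
    # lone surrogate '\ud835', so len(), iteration, indexing and slicing
    # all miscount. Wide (UCS-4) build: len(s) == 1 and s[0] == s.
    print(len(s), [hex(ord(c)) for c in s])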

Terry Reedy writes:
Well, no, it gives the right answer according to the design. unicode objects do not contain character strings. By design, they contain code point strings. Guido has made that absolutely clear on a number of occasions. And the reasons have very little to do with lack of non-BMP characters to trip up the implementation. Changing those semantics should have been done before the release of Python 3.

It is not clear to me that it is a good idea to try to decide on "the" correct implementation of Unicode strings in Python even today. There are a number of approaches that I can think of.

1. The "too bad if you can't take a joke" approach: do nothing and recommend UTF-32 to those who want len() to DTRT.

2. The "slope is slippery" approach: implement UTF-16 objects as built-ins, and then try to fend off requests for correct treatment of unnormalized composed characters, normalization, compatibility substitutions, bidi, etc etc.

3. The "are we not hackers?" approach: implement a transform that maps characters that are not represented by a single code point into Unicode private space, and then see if anybody really needs more than 6400 non-BMP characters. (Note that this would generalize to composed characters that don't have a one-code-point NFC form and similar non-standardized cases that nonstandard users might want handled.)

4. The "42" approach: sadly, I can't think deeply enough to explain it.

There are probably others.

It's true that Python is going to need good libraries to provide correct handling of Unicode strings (as opposed to unicode objects). But it's not clear to me, given the wide variety of implementations I can imagine, that there will be one best implementation, let alone which ones are good and Pythonic, and which not so.

On 8/24/2011 4:22 AM, Stephen J. Turnbull wrote:
Excuse me for believing the fine 3.2 manual that says "Strings contain Unicode characters." (And to a naive reader, that implies that string iteration and indexing should produce Unicode characters.)
By design, they contain code point strings.
For the purpose of my sentence, the same thing, in that code points correspond to characters, where 'character' includes ASCII control 'characters' and their Unicode analogs. The problem is that on narrow builds strings are NOT code point sequences. They are 2-byte code *unit* sequences. Single non-BMP code points are seen as 2 code units and hence given a length of 2, not 1. Strings iterate, index, and slice by 2-byte code units, not by code points.

Python floats try to follow the IEEE standard as interpreted for Python (Python has its software exceptions rather than signalling versus non-signalling hardware signals). Python decimals slavishly follow the IEEE decimal standard. Python narrow build unicode breaks the standard for non-BMP code points and consequently breaks the re module even when it works for wide builds. As sys.maxunicode more or less says, only the BMP subset is fully supported. Any narrow build string with even 1 non-BMP char violates the standard.
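The build difference Terry mentions can be checked directly (a small illustration):

    import sys

    # 0xFFFF on a narrow build, 0x10FFFF on a wide build (and, under
    # PEP 393, 0x10FFFF everywhere).
    print(hex(sys.maxunicode))
    # On a narrow build, slicing operates on code units and can cut a
    # surrogate pair in half: "\U00010000"[:1] == '\ud800'.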
Guido has made that absolutely clear on a number of occasions.
It is not clear what you mean, but recently on python-ideas he has reiterated that he intends bytes and strings to be conceptually different. Bytes are computer-oriented binary arrays; strings are supposedly human-oriented character/codepoint arrays. Except they are not for non-BMP characters/codepoints. Narrow build unicode is effectively an array of two-byte binary units.
The documentation was changed at least a bit for 3.0, and anyway, as indicated above, it is easy (especially for new users) to read the docs in a way that makes the current behavior buggy. I agree that the implementation should have been changed already. Currently, the meaning of Python code differs on narrow versus wide builds, and in a way that few users would expect or want. PEP 393 abolishes narrow builds as we now know them and changes semantics. I was answering a complaint about that change. If you do not like the PEP, fine. My separate proposal in my other post is for an alternative implementation but with, I presume, pretty much the same visible changes.
It is not clear to me that it is a good idea to try to decide on "the" correct implementation of Unicode strings in Python even today.
If the implementation is invisible to the Python user, as I believe it should be without special introspection, and mostly invisible in the C-API except for those who intentionally poke into the details, then the implementation can be changed as the consensus on the best implementation changes.
Given that 3.0 unicode (string) objects are defined as Unicode character strings, I do not see the opposition.
-- Terry Jan Reedy

Terry Reedy writes:
Excuse me for believing the fine 3.2 manual that says "Strings contain Unicode characters."
The manual is wrong, then, subject to a pronouncement to the contrary, of course. I was on your side of the fence when this was discussed, pre-release. I was wrong then. My bet is that we are still wrong, now.
For the purpose of my sentence, the same thing in that code points correspond to characters,
Not in Unicode, they do not. By definition, a small number of code points (eg, U+FFFF) *never* did and *never* will correspond to characters. Since about Unicode 3.0, the same is true of surrogate code points. Some restrictions have been placed on what can be done with composed characters, so even with the PEP (which gives us code point arrays) we do not really get arrays of Unicode characters that fully conform to the model.
strings are NOT code point sequences. They are 2-byte code *unit* sequences.
I stand corrected on Unicode terminology. "Code unit" is what I meant, and what I understand Guido to have defined unicode objects as arrays of.
Any narrow build string with even 1 non-BMP char violates the standard.
Yup. That's by design.
Sure. Nevertheless, practicality beat purity long ago, and that decision has never been rescinded AFAIK.
Bytes are computer-oriented binary arrays; strings are supposedly human-oriented character/codepoint arrays.
And indeed they are, in UCS-4 builds. But they are *not* in Unicode! Unicode violates the array model. Specifically, in handling composing characters, and in bidi, where arbitrary slicing of direction control characters will result in garbled display.

The thing is that 90% of applications are not really going to care about full conformance to the Unicode standard. Of the remaining 10%, 90% are not going to need both huge strings *and* ABI interoperability with C modules compiled for UCS-2, so UCS-4 is satisfactory. Of the remaining 1% of all applications, those that deal with huge strings *and* need full Unicode conformance, well, they need efficiency too, almost by definition. They probably are going to want something more efficient than either the UTF-16 or the UTF-32 representation can provide, and therefore will need trickier, possibly app-specific, algorithms that probably do not belong in an initial implementation.
I don't. I suspect Guido does not, even today.
Currently, the meaning of Python code differs on narrow versus wide build, and in a way that few users would expect or want.
Let them become developers, then, and show us how to do it better.
No, I do like the PEP. However, it is only a step, a rather conservative one in some ways, toward conformance to the Unicode character model. In particular, it does nothing to resolve the fact that len() will give different answers for character count depending on normalization, and that slicing and indexing will allow you to cut characters in half (even in NFC, since not all composed characters have fully composed forms).
A naive implementation of UTF-16 will be quite visible in terms of performance, I suspect, and performance-oriented applications will "go behind the API's back" to get it. We're already seeing that in the people who insist that bytes are characters too, and string APIs should work on them just as they do on (Unicode) strings.
I think they're not, I think they're defined as Unicode code unit arrays, and that the documentation is in error. If the documentation is correct, then Python 3.0 was released about 5 years too early, because correct handling of those objects as arrays of Unicode characters has never been implemented or even discussed in terms of proposed code that I know of. Martin has long claimed that the fact that I/O is done in terms of UTF-16 means that the internal representation is UTF-16, so I could be wrong. But when issues of slicing, len() values and so on have come up in the past, Guido has always said "no, there will be no change in semantics of builtins here".

I think what he means (and what I meant when I said something similar): I/O will consider surrogate pairs in the representation when converting to the output encoding. This is actually relevant only for UTF-8 (I think), which converts surrogate pairs "correctly". This can be taken as a proof that Python 3.2 is "UTF-16 aware" (in some places, but not in others). With Python's I/O architecture, it is of course not *actually* the I/O which considers UTF-16, but the codec. Regards, Martin
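A sketch of the behaviour Martin describes; the result is build-dependent, which is exactly the point:

    # On a narrow 3.2 build this string holds the surrogate pair for U+1D400:
    s = "\ud835\udc00"
    try:
        # Narrow build: the codec recombines the pair into the single
        # four-byte sequence for one code point, b'\xf0\x9d\x90\x80'.
        print(s.encode("utf-8"))
    except UnicodeEncodeError:
        # Wide build (and later versions): two lone surrogates, rejected.
        print("lone surrogates rejected")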

Antoine Pitrou writes:
But it's not "simple" at the level we're talking about! Specifically, *in-memory* surrogates are properly respected when doing the encoding, and therefore such I/O is not UCS-2 or "raw code units". This treatment is different from sizing and indexing of unicodes, where surrogates are not treated differently from other code points.

I'd like to point out that the improved compatibility is only a side effect, not the primary objective of the PEP. The primary objective is the reduction in memory usage. (any changes in runtime are also side effects, and it's not really clear yet whether you get speedups or slowdowns on average, or no effect).
That's just a description of the implementation, and not part of the language, though. My understanding is that the "abstract Python language definition" considers this aspect implementation-defined: PyPy, Jython, IronPython etc. would be free to do things differently (and I understand that there are plans to do PEP-393 style Unicode objects in PyPy).
Not with these words, though. As I recall, it's rather like (still with different words) "len() will stay O(1) forever, regardless of any perceived incorrectness of this choice". An attempt to change the builtins to introduce higher complexity for the sake of correctness is what he rejects. I think PEP 393 balances this well, keeping the O(1) operations in that complexity, while improving the cross- platform "correctness" of these functions. Regards, Martin

On 8/24/2011 1:50 PM, "Martin v. Löwis" wrote:
I'd like to point out that the improved compatibility is only a side effect, not the primary objective of the PEP.
Then why does the Rationale start with "on systems only supporting UTF-16, users complain that non-BMP characters are not properly supported."? A Windows user can only solve this problem by switching to *nix.
The primary objective is the reduction in memory usage.
On average (perhaps). As I understand the PEP, for some strings, Windows users will see a doubling of memory usage. Statistically, that doubling is probably more likely in longer texts. ASCII-only Python code and other limited-to-ASCII text will benefit. Typical English business documents will see no change, as they often have proper non-ASCII quotes and occasional accented characters, trademark symbols, and other such things. I think you have the objectives backwards. Adding memory is a lot easier than switching OSes. -- Terry Jan Reedy

On 8/24/2011 12:34 PM, Stephen J. Turnbull wrote:
Please suggest a re-wording then, as it is a bug for doc and behavior to disagree.
On computers, characters are represented by code points. What about the other way around? http://www.unicode.org/glossary/#C gives two definitions of "code point":

1) i in range(0x110000) <broad definition>
2) "A value, or position, for a character" <narrow definition>

(To muddy the waters more, 'character' has multiple definitions also.) You are using 1), I am using 2) ;-(.
I think you have it backwards. I see the current situation as the purity of the C code beating the practicality for the user of getting right answers.
The thing is, that 90% of applications are not really going to care about full conformance to the Unicode standard.
I remember when Intel argued that 99% of applications were not going to be affected when the math coprocessor in its then new chips occasionally gave 'non-standard' answers with certain divisors.
I posted a proposal with a link to a prototype implementation in Python. It pretty well solves the problem of narrow builds acting different from wide builds with respect to the basic operations of len(), iterations, indexing, and slicing.
I believe my scheme could be extended to solve that also. It would require more pre-processing and more knowledge than I currently have of normalization. I have the impression that the grapheme problem goes further than just normalization. -- Terry Jan Reedy

Terry Reedy writes:
Please suggest a re-wording then, as it is a bug for doc and behavior to disagree.
Strings contain Unicode code units, which for most purposes can be treated as Unicode characters. However, even as "simple" an operation as "s1[0] == s2[0]" cannot be relied upon to give Unicode-conforming results. The second sentence remains true under PEP 393.
No, you're not. You are claiming an isomorphism, which Unicode goes to great trouble to avoid.
Sophistry. "Always getting the right answer" is purity.
In the case of Intel, the people who demanded standard answers did so for efficiency reasons -- they needed the FPU to DTRT because implementing FP in software was always going to be too slow. CPython, IMO, can afford to trade off because the implementation will necessarily be in software, and can be added later as a Python or C module.
Yes and yes. But now you're talking about database lookups for every character (to determine if it's a composing character). Efficiency of a generic implementation isn't going to happen. Anyway, in Martin's rephrasing of my (imperfect) memory of Guido's pronouncement, "indexing is going to be O(1)". And Nick's point about non-uniform arrays is telling. I have 20 years of experience with an implementation of text as a non-uniform array which presents an array API, and *everything* needs to be special-cased for efficiency, and *any* small change can have show-stopping performance implications. Python can probably do better than Emacs has done due to much better leadership in this area, but I still think it's better to make full conformance optional.
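The per-character database lookup Stephen mentions is available in the stdlib; a small example:

    import unicodedata

    # Combining marks have a nonzero canonical combining class.
    print(unicodedata.combining("\u0301"))  # 230 for COMBINING ACUTE ACCENT
    print(unicodedata.combining("e"))       # 0 for a base character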

On Wed, Aug 24, 2011 at 5:31 PM, Stephen J. Turnbull <turnbull@sk.tsukuba.ac.jp> wrote:
Really? If strings contain code units, that expression compares code units. What is non-conforming about comparing two code points? They are just integers. Seriously, what does Unicode-conforming mean here? It would be better to specify chapter and verse (e.g. is it a specific thing defined by the dreaded TR18?)
I don't know that we will be able to educate our users to the point where they will use code unit, code point, character, glyph, character set, encoding, and other technical terms correctly. TBH even though less than two hours ago I composed a reply in this thread, I've already forgotten which is a code point and which is a code unit.
Eh? In most other areas Python is pretty careful not to promise to "always get the right answer" since what is right is entirely in the user's mind. We often go to great lengths of defining how things work so as to set the right expectations. For example, variables in Python work differently than in most other languages. Now I am happy to admit that for many Unicode issues the level at which we have currently defined things (code units, I think -- the thingies that encodings are made of) is confusing, and it would be better to switch to the others (code points, I think). But characters are right out.
It is not so easy to change expectations about O(1) vs. O(N) behavior of indexing however. IMO we shouldn't try and hence we're stuck with operations defined in terms of code thingies instead of (mostly mythical) characters.
Let's take small steps. Do the evolutionary thing. Let's get things right so users won't have to worry about code points vs. code units any more. A conforming library for all things at the character level can be developed later, once we understand things better at that level (again, most developers don't even understand most of the subtleties, so I claim we're not ready).
Anyway, in Martin's rephrasing of my (imperfect) memory of Guido's pronouncement, "indexing is going to be O(1)".
I still think that. It would be too big of a cultural upheaval to change it.
This I agree with (though if you were referring to me with "leadership" I consider myself woefully underinformed about Unicode subtleties). I also suspect that Unicode "conformance" (however defined) is more part of a political battle than an actual necessity. I'd much rather have us fix Tom Christiansen's specific bugs than chase the elusive "standard conforming". (Hey, I feel a QOTW coming. "Standards? We don't need no stinkin' standards." http://en.wikipedia.org/wiki/Stinking_badges :-) -- --Guido van Rossum (python.org/~guido)

On Thu, Aug 25, 2011 at 12:29 PM, Guido van Rossum <guido@python.org> wrote:
Indeed, code points are the abstract concept and code units are the specific byte sequences that are used for serialisation (FWIW, I'm going to try to keep this straight in the future by remembering that the Unicode character set is defined as abstract points on planes, just like geometry). With narrow builds, code units can currently come into play internally, but with PEP 393 everything internal will be working directly with code points. Normalisation, combining characters and bidi issues may still affect the correctness of unicode comparison and slicing (and other text manipulation), but there are limits to how much of the underlying complexity we can effectively hide without being misleading. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Wed, Aug 24, 2011 at 7:47 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Hm, code points still look pretty concrete to me (integers in the range 0 .. 2**21) and code units don't feel like byte sequences to me (at least not UTF-16 code units -- in Python at least you can think of them as integers in the range 0 .. 2**16).
Let's just define a Unicode string to be a sequence of code points and let libraries deal with the rest. Ok, methods like lower() should consider characters, but indexing/slicing should refer to code points. Same for '=='; we can have a library that compares by applying (or assuming?) certain normalizations. Tom C tells me that case-less comparison cannot use a.lower() == b.lower(); fine, we can add that operation to the library too. But this exceeds the scope of PEP 393, right? -- --Guido van Rossum (python.org/~guido)
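A sketch of the caseless-matching point. (str.casefold() only appeared later, in Python 3.3, so treat it here as standing in for the kind of library operation Guido means.)

    s1, s2 = "Stra\u00dfe", "STRASSE"
    print(s1.lower() == s2.lower())        # False: '\u00df'.lower() is still '\u00df'
    print(s1.casefold() == s2.casefold())  # True: casefold maps '\u00df' to 'ss'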

On Thu, Aug 25, 2011 at 1:11 PM, Guido van Rossum <guido@python.org> wrote:
Yep, I was agreeing with you on this point - I think you're right that if we provide a solid code point based core Unicode type (perhaps with some character based methods), then library support can fill the gap between handling code points and handling characters. In particular, a unicode character based string type would be significantly easier to write in Python than it would be in C (after skimming Tom's bug report at http://bugs.python.org/issue12729, I better understand the motivation and desire for that kind of interface, and it sounds like Terry's prototype is along those lines). Once those mappings are thrashed out outside the core, then there may be something to incorporate directly around the 3.4 timeframe (or potentially even in 3.3, since it should already be possible to develop such a wrapper based on UCS4 builds of 3.2).

However, there may be an important distinction to be made on the Python-the-language vs CPython-the-implementation front: is another implementation (e.g. PyPy) *allowed* to implement character based indexing instead of code point based for the 2.x unicode/3.x str type? Or is the code point indexing part of the language spec, and any character based indexing needs to be provided via a separate type or module?

Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Nick Coghlan writes:
GvR writes:
+1 I don't really see an alternative to this approach. The underlying array has to be exposed because there are too many applications that can take advantage of it, and analysis of decomposed characters requires it. Making that array be an array of code points is a really good idea, and Python already has that in the UCS-4 build. PEP 393 is "just" a space optimization that allows getting rid of the narrow build, with all its wartiness.
I agree that it's possible, but I estimate that it's not feasible for 3.3 because we don't yet know the requirements. This one really needs to ferment and mature in PyPI for a while because we just don't know how far the scope of user needs is going to extend. Bidi is a mudball[1], confusable character indexes sound like a cool idea for the web and email but is anybody really going to use them?, etc.
+1 for language spec. Remember, there are cases in Unicode where you'd like to access base characters and the like. So you need to be able to get at individual code points in an NFD string. You shouldn't need to use different code for that in different implementations of Python. Footnotes: [1] Sure, we can implement the UAX#9 bidi algorithm, but it's not good enough by itself: something as simple as "File name (default {0}): ".format(name) can produce disconcerting results if the whole resulting string is treated by the UBA. Specifically, using the usual convention of uppercase letters being an RTL script, name = "ABCD" will result in the prompt: File name (default :(DCBA _ (where _ denotes the position of the insertion cursor). The Hebrew speakers on emacs-devel agreed that an example using a real Hebrew string didn't look right to them, either.

Most certainly. In the PEP-393 representation, the surrogate characters can readily be represented (and would imply at least the two-byte form), but they will never take their UTF-16 function (i.e. the UTF-8 codec won't try to combine surrogate pairs), so they can be used for surrogateescape and other functions. Of course, in strict error mode, codecs will refuse to encode them (notice that surrogateescape is an error handler, not a codec). Regards, Martin
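A small sketch of the surrogateescape behaviour Martin describes:

    raw = b"abc\xff"  # not valid UTF-8
    s = raw.decode("utf-8", errors="surrogateescape")
    print(ascii(s))   # 'abc\udcff': the bad byte became a lone surrogate
    print(s.encode("utf-8", "surrogateescape") == raw)  # True: round-trips
    # In strict error mode the codec refuses the lone surrogate:
    # s.encode("utf-8") raises UnicodeEncodeError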

On Wed, Aug 24, 2011 at 8:34 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
I would think that it should still be possible to explicitly put surrogates into a string, using the appropriate \uxxxx escape or chr(i) or some such approach; the basic string operations IMO shouldn't bother with checking for well-formed character sequences (just as they shouldn't care about normal forms). But decoding bytes from UTF-16 should not leave any surrogate pairs in, since interpreting those is part of the decoding. I'm not sure what should happen with UTF-8 when it (in flagrant violation of the standard, I presume) contains two separately-encoded surrogates forming a valid surrogate pair; probably whatever the UTF-8 codec does on a wide build today should be good enough. Similarly for encoding to UTF-8 on a wide build if one managed to create a string containing a surrogate pair. Basically, I'm for a garbage-in-garbage-out approach (with separate library functions to detect garbage if the app is worried about it). -- --Guido van Rossum (python.org/~guido)
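What Guido describes, in a few lines (a sketch):

    lone = "\ud800"           # a surrogate via an escape...
    other = chr(0xDC00)       # ...or via chr()
    print(len(lone + other))  # 2: basic operations neither validate nor recombine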

On Thu, 25 Aug 2011, Guido van Rossum wrote:
If it's called UTF-8, there is no decision to be taken as to decoder behaviour - any byte sequence not permitted by the Unicode standard must result in an error (although, of course, *how* the error is to be reported could legitimately be the subject of endless discussion). There are security implications to violating the standard so this isn't just legalistic purity. Hmmm, doesn't look good:

    Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
    [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> '\xed\xb0\x80'.decode('utf-8')
    u'\udc00'
Incorrect! Although this is a narrow build - I can't say what the wide build would do. For reasons of practicality, it may be appropriate to provide easy access to a CESU-8 decoder in addition to the normal UTF-8 decoder, but it must not be called UTF-8. Other variations may also find use if provided. See UTF-8 RFC: http://www.ietf.org/rfc/rfc3629.txt And CESU-8 technical report: http://www.unicode.org/reports/tr26/ Isaac Morland CSCF Web Guru DC 2554C, x36650 WWW Software Specialist

On Thu, Aug 25, 2011 at 7:28 PM, Isaac Morland <ijmorlan@uwaterloo.ca> wrote:
You have a point. The security issues cannot be seen separate from all the other issues. The folks inside Google who care about Unicode often harp on this. So I stand corrected. I am fine with codecs treating code points or code point sequences that the Unicode standard doesn't like (e.g. lone surrogates) the same way as more severe errors in the encoded bytes (lots of byte sequences already aren't valid UTF-8). I just hope this doesn't require normal forms or other expensive operations; I hope it's limited to rejecting invalid use of surrogates or other values that are not valid code points (e.g. 0, or >= 2**21).
Thanks for the links! I also like the term "supplementary character" (a code point >= 2**16). And I note that they talk about characters where we've just agreed that we should say code points... -- --Guido van Rossum (python.org/~guido)

On Fri, Aug 26, 2011 at 5:59 AM, Guido van Rossum <guido@python.org> wrote:
Surrogates are used and valid only in UTF-16. In UTF-8/32 they are invalid, even if they appear in a pair (see http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf ). Of course Python can/should be able to represent them internally regardless of the build type.
What do you mean? We use the "strict" error handler by default and we can specify other handlers already.
Codecs that use the official names should stick to the standards. For example s.encode('utf-32') should either produce a valid utf-32 byte string or raise an error if 's' contains invalid characters (e.g. surrogates). We can have other internal codecs that are based on the UTF-* encodings but allow the representation of lone surrogates and even expose them if we want, but they should have a different name (even 'utf-*-something' should be ok, see http://bugs.python.org/issue12729#msg142053 from "Unicode says you can't put surrogates or noncharacters in a UTF-anything stream.").
I think there shouldn't be any normalization done automatically by the codecs.
The UTF-8 codec used to follow RFC 2279 and only recently has been updated to RFC 3629 (see http://bugs.python.org/issue8271#msg107074 ). On Python 2.x it still produces invalid UTF-8 because changing it is backward incompatible. In Python 2 UTF-8 can be used to encode every codepoint from 0 to 10FFFF, and it always works. If we change it now it might start raising errors for an operation that never raised them before (see http://bugs.python.org/issue12729#msg142047 ). Luckily this is fixed in Python 3.x. I think there are more codepoints/byte sequences that should be rejected while encoding/decoding though, in both UTF-8 and UTF-16/32, but I haven't looked at them yet (I would be happy to fix these for 3.3 or even 2.7/3.2 (if applicable), so if you find mismatches with the Unicode standard and report an issue, feel free to assign it to me). Best Regards, Ezio Melotti

Isaac Morland, 26.08.2011 04:28:
Works the same for me in a wide Py2.7 build, but gives me this in Py3:

    Python 3.1.2 (r312:79147, Sep 27 2010, 09:57:50)
    [GCC 4.4.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> b'\xed\xb0\x80'.decode('utf-8')
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: ...
Same for current Py3.3 and the PEP393 build (although both have a better exception message now: "UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid continuation byte"). Stefan

Stefan Behnel wrote:
The reason for this is that the UTF-8 codec in Python 2.x has never rejected lone surrogates, and it was used to store Unicode literals in pyc files (using marshal) and also by pickle for transferring Unicode strings, so we could not simply reject lone surrogates: that would have caused compatibility problems. The change was made in Python 3.x, together with a special error handler, surrogatepass, which allows the UTF-8 codec to process lone surrogates as well.

BTW: I'd love to join the discussion about PEP 393, but unfortunately I'm swamped with work, so these are just a few comments...

What I'm missing in the discussion is statistics of the effects of the patch (both memory and performance) and the effect on 3rd party extensions. I'm not convinced that the memory/speed tradeoff is worth the breakage, or whether the patch actually saves memory in real world applications, and I'm unsure whether the needed code changes to the binary Python Unicode API can be done in a minor Python release.

Note that in the worst case, a PEP 393 Unicode object will store three versions of the same string, e.g. on Windows with sizeof(wchar_t)==2: a UCS4 version in str, a UTF-8 version in utf8 (this gets built whenever Python needs a UTF-8 version of the object) and a wchar_t version in wstr (which gets built whenever Python codecs or extensions need Py_UNICODE or a wchar_t representation). On all platforms, in the case where you store a Latin-1 non-ASCII string: str holds the Latin-1 string, utf8 the UTF-8 version and wstr the 2- or 4-byte wchar_t version.

* A note on terminology: Python stores Unicode as code points. A Unicode "code point" refers to any value in the Unicode code range, which is 0 - 0x10FFFF. Lone surrogates, unassigned and illegal code points are all still code points - this is a detail people often forget. Various code points in Unicode have special meanings and some are not allowed to be used in encodings, but that does not rule them out from being stored and processed as code points. Code units are only used in encoded versions of Unicode, e.g. UTF-8, -16, -32. Mixing code units and code points can cause much confusion, so it's better to talk only about code points when referring to Python Unicode objects, since you only ever meet code units when looking at the bytes output of the codecs. This is important to know, since Python is not only meant to process Unicode, but also to build Unicode strings, so a careful distinction has to be made when considering what is correct and what not: codecs have to follow much stricter rules than Python itself.

* A note on surrogates: These are just one particular problem where you run into the situation where splitting a Unicode string potentially breaks a combination of code points. There are a few other types of code points that cause similar problems, e.g. combining code points. Simply going with UCS-4 does not solve the problem, since even with UCS-4 storage, you can still have surrogates in your Python Unicode string. As with many things, it is important to be aware of the potential problem, but there's no automatic fix to get rid of it. What we can do is make the best of it, and this has happened already in many areas, e.g. codecs joining surrogates automatically, chr() creating surrogates, etc.

-- Marc-Andre Lemburg, eGenix.com

Guido van Rossum writes:
On Wed, Aug 24, 2011 at 5:31 PM, Stephen J. Turnbull <turnbull@sk.tsukuba.ac.jp> wrote:
That's true out of context, but in context it's "which for most purposes can be treated as Unicode characters", and this is what Terry is concerned with, as well.
What is non-conforming about comparing two code points?
Unicode conformance means treating characters correctly. In particular, s1 and s2 might be NFC and NFD forms of the same string with a combining character at s2[1], or s1[1] and s2[1] might be a non-combining character and a combining character respectively.
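The concrete case Stephen has in mind, as a minimal example:

    import unicodedata

    s1 = unicodedata.normalize("NFC", "e\u0301")  # U+00E9, one code point
    s2 = unicodedata.normalize("NFD", "\u00e9")   # 'e' + U+0301, two code points
    print(s1 == s2)        # False, although both render as 'é'
    print(s1[0] == s2[0])  # False: 'é' vs 'e'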
Seriously, what does Unicode-conforming mean here?
Chapter 3, all verses. Here, specifically C6, p. 60. One would have to define the process executing "s1[0] == s2[0]" to be sure that even in the cases cited in the previous paragraph non-conformance is occurring, but one example of a process where that is non-conforming (without additional code to check for trailing combining characters) is in comparison of Vietnamese filenames generated on a Mac vs. those generated on a Linux host.
Sure. I got it wrong myself earlier. I think that the right thing to do is to provide a conformant implementation of Unicode text in the stdlib (a long run goal, see below), and call that "Unicode", while we call strings "strings".
Yes, and AFAICT (I'm better at reading standards than I am at reading Python implementation) PEP 393 allows that.
But characters are right out.
+1
Well, O(N) is not really the question. It's really O(log N), as Terry says. Is that out, too? I can verify that it's possible to do it in practice in the long term. In my experience with Emacs, even with 250 MB files, O(log N) mostly gives acceptable performance in an interactive editor, as well as many scripted textual applications. The problems that I see are (1) It's very easy to write algorithms that would be O(N) for a true array, but then become O(N log N) or worse (and the coefficient on the O(log N) algorithm is way higher to start). I guess this would kill the idea, but. (2) Maintenance is fragile; it's easy to break the necessary caches with feature additions and bug fixes. (However, I don't think this would be as big a problem for Python, due to its more disciplined process, as it has been for XEmacs.) You might think space for the caches would be a problem, but that has turned out not to be the case for Emacsen.
I don't think anybody does. That's one reason there's a new version of Unicode every few years.
<wink/> MvL and MAL are not, however, and there are plenty of others who make contributions -- in an orderly fashion.
Well, I would advocate specifying which parts of the standard we target and which not (for any given version). The goal of full "Chapter 3" conformance should be left up to a library on PyPI for the nonce IMO. I agree that fixing specific bugs should be given precedence over "conformance chasing," but implementation should conform to the appropriate parts of the standard.
(Hey, I feel a QOTW coming. "Standards? We don't need no stinkin' standards." http://en.wikipedia.org/wiki/Stinking_badges :-)
RMS beat you to that. Not good company to be in, in this case: he specifically disclaims the goal of portability to non-GNU-System systems.

What is non-conforming about comparing two code points?
Unicode conformance means treating characters correctly.
Re-read the text. You are interpreting something that isn't there.
No, that's explicitly *not* what C6 says. Instead, it says that a process that treats s1 and s2 differently shall not assume that others will do the same, i.e. that it is ok to treat them the same even though they have different code points. Treating them differently is also conforming. Regards, Martin

"Martin v. Löwis" writes:
Then what requirement does C6 impose, in your opinion? It sounds like you don't think it imposes any, in practice.

Note that in the discussion of C6, the standard says,

- Ideally, an implementation would *always* interpret two canonical-equivalent sequences *identically*. There are practical circumstances under which implementations may reasonably distinguish them.

(Emphasis mine.) The examples given are things like "inspecting memory representation structure" (which properly speaking is really outside of Unicode conformance) and "ignoring collation behavior of combining sequences outside the repertoire of a specified language." That sounds like "Special cases aren't special enough to break the rules. Although practicality beats purity." to me. Treating things differently is an exceptional case that requires sufficient justification.

My understanding is that if those strings are exchanged with another process, then whether or not treating them differently is allowed depends on whether the results will be output to another process, and what the definition of our process is. Sometimes it will be allowed, but mostly it won't.

Take file names as an example. If our process is working with an external process (the OS's file system driver) whose definition includes the statement that "File names are sequences of Unicode characters", then C6 says our process must compare canonically equivalent sequences that it takes to be file names as the same, whether or not they are in the same normalized form, or normalized at all, because we can't assume the file system will treat them as different. If we do treat them as different, our users will get very upset (eg, if we don't signal a duplicate file name input by the user, and then the OS proceeds to overwrite an existing file).

Dually, having made the statement that file names are Unicode, C6 says that the OS driver must return the same file given two canonically equivalent strings that happen to have different code points in them, because it may not assume that *we* will treat those strings as different names of different files. *Users* will certainly take the viewpoint that two strings that display the same on their monitor should identify the same file when they use them as file names.

Now, I'm *not* saying that Python's strings *should* conform to the Unicode standard in this respect yet (or ever, for that matter; I'm with Guido on that). I'm simply saying that the current implementation of strings, as improved by PEP 393, can not be said to be conforming. I would like to see something much more conformant done as a separate library (the Python Components for Unicode, say), intended to support users who need character-based behavior, Unicode-ly correct collation, etc., more than efficiency. Applications that need both will have to make their own way at first, either by contributing improvements to the library or by using application-specific algorithms.

On 25.08.2011 11:39, Stephen J. Turnbull wrote:
In IETF terminology, it's a weak SHOULD requirement. Unless there are reasons not to, equivalent strings should be treated identically. It's a weak requirement because the reasons for treating them differently are wide-spread.
Ok, so let me put emphasis on *ideally*. They acknowledge that for practical reasons, the equivalent strings may need to be distinguished.
And the common justification is efficiency, along with the desire to support the representation of unnormalized strings (else there would be an efficient implementation).
It may well happen that this requirement is met in a plain Python application. If the file system and GUI libraries always return NFD strings, then the Python process *will* compare equivalent sequences correctly (since it won't ever get any other representations).
Yes, but that's the operating system's choice first of all. Some operating systems do allow file names in a single directory that are equivalent yet use different code points. Python then needs to support this operating system, despite the permission of the Unicode standard to ignore the difference.
I continue to disagree. The Unicode standard deliberately allows Python's behavior as conforming.
Wrt. normalization, I think all that's needed is already there. Applications just need to normalize all strings to a normal form of their liking, and be done. That's easier than using a separate library throughout the code base (let alone using yet another string type). Regards, Martin

On Thu, Aug 25, 2011 at 7:57 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
I'd actually put it slightly differently: it seems to me that Python, in and of itself, can neither conform to nor violate that part of the standard, since conformance depends on how the *application* processes the data. However, we can make it harder or easier for applications to be conformant. UCS2 builds make it harder, since some code points have to be represented as code units internally. UCS4 builds and future PEP 393 builds (which should exhibit current UCS4 build semantics at the Python layer) make it easier, since the internal representation consistently uses code points, with code units only appearing as part of the encoding and decoding process. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

"Martin v. Löwis" writes:
There are no "weak SHOULDs" and no "wide-spread reasons" in RFC 2119. RFC 2119 specifies "particular circumstances" and "full implications" that are "carefully weighed" before varying from SHOULD behavior. IMHO the Unicode Standard intends a full RFC 2119 "SHOULD" here.
Sure, and that's one of several such reasons why I think the PEP's implementation of unicodes as arrays of code points is an optimal balance. But the Unicode standard does not "permit" ignoring the difference here, except in the sense that *the Unicode standard doesn't apply at all* and therefore doesn't forbid it. The OSes in question are not conforming processes, and presumably don't claim to be. Because most of the processes Python interacts with won't be conforming processes (not even the majority of textual applications, for a while), Python does not need to be, and *should not* be, a conforming Unicode process for most of what it does. Not even for much of its text processing. Also, to the extent that Python is a general-purpose language, I see nothing wrong and lots of good in having a non-conformant code point array type as the platform for implementing conforming Unicode library(ies). But this is not user/developer-friendly at all:
But many users have never heard of normalization. And that's *just* normalization. There is a whole raft of other requirements for conformance (collation, case, etc).

The point is that with such a library and string type, various aspects of conformance to Unicode, as well as conformance to associated standards (eg, the dreaded UTS #18 ;-), can be added to the library over time, and most users (those who don't need to squeeze every ounce of performance out of Python) can be blissfully unaware of what, if anything, they're conforming to. Just upgrade the library to get the best Unicode support (in terms of conformance) that Python has to offer. But for the reasons you (and Guido and Nick and ...) give, it's not reasonable to put all that into core Python, not anytime soon. Not to mention that as a work-in-progress, it can hardly be considered stable enough for the stdlib.

That is what Terry Reedy is getting at, AIUI. "Batteries included" should mean as much Unicode conformance as we can reasonably provide should be *conveniently* available. The ideal (given the caveat about efficiency) would be *one* import statement and a ConformingUnicode type that acts "just like a string" in all ways, except that (1) it indexes and counts on characters (preferably "grapheme clusters" :-), (2) does collation, regexps, and the like conformant to the Unicode standard, and (3) may be quite inefficient from the point of view of bit-shoveling net applications and the like. Of course most of (2) is going to take quite a while, but (1) and (3) should not be that hard to accomplish (especially (3) ;-).
That's up to you. I doubt very many users or application developers will see it that way, though. I think they would prefer that we be conservative about what we call "conformant", and tell them precisely what they need to do to get what they consider conformant behavior from Python. That's easier if we share definitions of conformant with them. And surely there would be great joy on the battlements if there were a one-import way to spell "all the Unicode conformance you can give me, please". The problem with your legalistic approach, as I see it, is that if our definition is looser than the users', all their surprises will be unpleasant. That's not good.

On Thu, Aug 25, 2011 at 4:58 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
I see no alternative to explicitly spelling out what all operations do and let the user figure out whether that meets their needs. E.g. we needn't say that the str type or its == operator conforms to the Unicode standard. We just need to say that the string type is a sequence of code points, that string operations don't do validation or normalization, and that to do a comparison that takes the Unicode std's definition of equivalence (or collation, etc.) into account you must call a certain library method. -- --Guido van Rossum (python.org/~guido)

On Thu, Aug 25, 2011 at 2:39 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Does any OS actually say that? Don't they usually say "in a specific normal form" or "they're just bytes"?
The solution here is to let the OS do the check, e.g. with os.path.exists() or os.stat(). It would be wrong to write an app that checked for file existence by doing naive lookups in os.listdir() output. -- --Guido van Rossum (python.org/~guido)
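Guido's suggestion in code (the file name here is hypothetical):

    import os.path

    name = "re\u0301sume\u0301.txt"  # NFD spelling of 'résumé.txt'
    # Let the filesystem apply its own equivalence rules rather than
    # comparing os.listdir() entries yourself.
    print(os.path.exists(name))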

On 25/08/2011 06:12, Stephen J. Turnbull wrote:
It took some weeks (months?) to write the PEP, and months to implement it. This PEP is only a minor change of the implementation of Unicode in Python. A larger change will take much more time (and maybe change/break the C and/or Python API a little bit more). If you are able to implement your specification (a Unicode type with a "real" character API), please write a PEP and implement it. You may begin with a prototype in Python, and then rewrite it in C. But I don't think that any core developer will do that for you. It's not how free software works. At least, I don't think that anyone will do that for free :-) (I bet that many developers would agree to implement it for money :-)) Victor

On 8/24/2011 7:29 PM, Guido van Rossum wrote:
(Hey, I feel a QOTW coming. "Standards? We don't need no stinkin' standards."http://en.wikipedia.org/wiki/Stinking_badges :-)
Which deserves an appropriate, follow-on, misquote: Guido says the Unicode standard stinks. ˚͜˚ <- and a Unicode smiley to go with it!

I think he's referring to combining characters and normal forms. Section 2.12 starts with "In cases involving two or more sequences considered to be equivalent, the Unicode Standard does not prescribe one particular sequence as being the correct one; instead, each sequence is merely equivalent to the others". That could be read to imply that the == operator should determine whether two strings are equivalent.

However, the Unicode standard clearly leaves API design to the programming environment, and has the notion of conformance only for processes. So saying that Python is or is not Unicode-conforming is, strictly speaking, meaningless. The closest conformance requirement in that respect is C6: "A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct." However, that explicitly does *not* support the conformance statement that Stephen made. They elaborate: "Ideally, an implementation would always interpret two canonical-equivalent character sequences identically. There are practical circumstances under which implementations may reasonably distinguish them."

So practicality beats purity even in Unicode conformance: the == operator of Python can reasonably treat equivalent strings as unequal (and there is a good reason for that, indeed). Processes should not expect that other applications make the same distinction, so they need to cope if it matters to them. There are different ways to do that:

- normalize all strings on input, and then use ==
- use a different comparison operation that always normalizes its input first
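A minimal sketch of those two options:

    import unicodedata

    # Option 1: normalize all strings on input, then plain == works.
    def on_input(s):
        return unicodedata.normalize("NFC", s)

    # Option 2: a comparison that normalizes both operands first.
    def canon_eq(a, b):
        return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

    print("e\u0301" == "\u00e9")                        # False with plain ==
    print(on_input("e\u0301") == on_input("\u00e9"))    # True: option 1
    print(canon_eq("e\u0301", "\u00e9"))                # True: option 2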
Fortunately, it's much better than that. Unicode has had very clear conformance requirements for a long time, and they aren't hard to meet. Wrt. C6, Python could certainly improve, e.g. by caching whether a string has been determined to be in normal form, so that applications can more reasonably apply normalization to all strings they ever want to compare. Regards, Martin

On Wed, Aug 24, 2011 at 3:06 AM, Terry Reedy <tjreedy@udel.edu> wrote:
The naive reader also doesn't know the difference between characters, code points and code units. It's the advanced, Unicode-aware reader who is confused by this phrase in the docs. It should say code units; or perhaps code units for narrow builds and code points for wide builds. With PEP 393 we can unconditionally say code points, which is much better. We should try to remove our use of "characters" -- or else we should *define* our use of the term "characters" as "what the Unicode standard calls code points". -- --Guido van Rossum (python.org/~guido)

On Fri, Aug 26, 2011 at 1:54 AM, Guido van Rossum <guido@python.org> wrote:
For UTF-16/32 (i.e. narrow/wide), talking about "code units"[0] should be correct. Also note that:
* for both, every "code unit" has a specific "codepoint" (including lone surrogates), so it might be OK to talk about "codepoints" too, but
* only for wide builds is every "codepoint" represented by a single, 32-bit "code unit". In narrow builds, non-BMP chars are represented by a "code unit sequence" of two elements (i.e. a "surrogate pair").
Since "code unit" refers to the *minimal* bit combination, in UTF-8 characters that need 2/3/4 bytes are represented with a "code unit sequence" made of 2/3/4 "code units" (so in UTF-8 "code units" and "code points" overlap only for the ASCII range).
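For example (illustrative; the actual output depends on the build):

    # A non-BMP character on a narrow (16-bit) build is stored as a
    # surrogate pair, so len() counts two code units:
    s = "\U0001D11E"   # MUSICAL SYMBOL G CLEF, outside the BMP
    print(len(s), [hex(ord(c)) for c in s])
    # narrow build: 2 ['0xd834', '0xdd1e']
    # wide build (and post-PEP-393): 1 ['0x1d11e']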
Character usually works fine, especially for naive readers. Even Unicode-aware readers often confuse the several terms, so using a simple term and pointing to a more accurate description sounds like a better idea to me. Note that there's also another important term[1]:
"""
*Unicode Scalar Value*. Any Unicode code point <http://unicode.org/glossary/#code_point> except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive.
"""
For example the UTF codecs produce sequences of "code units" (of 8, 16, 32 bits) that represent "scalar values"[2][3]. Chapter 3 [4] says:
"""
3.9 Unicode Encoding Forms
The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8. Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences. [...]
D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.
• As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF and E000 to 10FFFF, inclusive.
D77 Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange. [...]
D79 A Unicode encoding form assigns each Unicode scalar value to a unique code unit sequence.
"""
On the other hand, Python Unicode strings are not limited to scalar values, because they can also contain lone surrogates. I hope this helps clarify the terminology a bit and doesn't add more confusion, but if we want to use the Unicode terms we should get them right. (Also note that I might have misunderstood something, even if I've been careful with the terms and I double-checked and quoted the relevant parts of the Unicode standard.) Best Regards, Ezio Melotti
[0]: From chapter 3 [4]: D77 Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange.
• Code units are particular units of computer storage. Other character encoding standards typically use code units defined as 8-bit units—that is, octets. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form.
[1]: http://unicode.org/glossary/#unicode_scalar_value
[2]: Apparently Python 3 raises an error while encoding lone surrogates in UTF-8, but it doesn't for UTF-16 and UTF-32.

On Thu, Aug 25, 2011 at 6:40 PM, Ezio Melotti <ezio.melotti@gmail.com> wrote:
The more I think about it the more it seems to me that the biggest problem is that in narrow builds it is ambiguous whether (unicode) strings contain code units, i.e. are *encoded* code points, or whether they contain (decoded) code points. In a sense this is repeating the ambiguity of 8-bit strings in Python 2, which are sometimes assumed to contain ASCII or Latin-1 (i.e., code points with a limited range) or UTF-8 (i.e., code units). I know that by now I am repeating myself, but I think it would be really good if we could get rid of this ambiguity. PEP 393 seems the best way forward, even if it doesn't directly address what to do for IronPython or Jython, both of which have to deal with a pervasive native string type that contains UTF-16. IIUC, CPython on Windows will work just fine with PEP 393, even if it means that there is a bit more translation between Python strings and the OS native wchar_t[] type. I assume that the data volume going through the OS APIs is relatively constrained, since data actually written to or read from a file will still be bytes, possibly run through a codec (if it's a text file), and not go through one of the wchar_t[] APIs -- the latter are used for things like filenames, which are much smaller.
Actually I think UTF-8 is best thought of as an encoding for code points, not characters -- the subtle difference between these two should be of no concern to the UTF-8 codec (unless it is a validating codec).
We may well have no choice -- there is just too much documentation that naively refers to characters while really referring to code units or code points.
This seems to involve validation. I think all validation should be sequestered to specific APIs (e.g. certain codecs) and the string type should not care about it. Depending on what they are doing, applications may have to be aware of many subtleties in order to always avoid generating "invalid" (or not well-formed -- what's the difference?) strings.
I really don't mind whether our codecs actually make exceptions for surrogates (lone or otherwise). The only requirement I care about is that surrogate-free strings round-trip correctly. Again, apps that want to conform to the requirements regarding surrogates can implement their own validation, and certainly at some point we should offer a validation library as part of the stdlib -- but it should be up to the app whether and when to use it.
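E.g. a minimal sketch of such a validation helper (hypothetical, not an existing stdlib API):

    def has_lone_surrogate(s):
        # On a wide (or post-PEP-393) build, any code point in the
        # surrogate range U+D800..U+DFFF is by definition a lone
        # surrogate, since pairs are never combined implicitly.
        return any(0xD800 <= ord(c) <= 0xDFFF for c in s)

    assert not has_lone_surrogate("abc\u00e9")
    assert has_lone_surrogate("abc\ud800def")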
Right.
I'm not more confused than I was, but I think we should reduce the number of Unicode terms we care about rather than increase them. If we only ever had to talk about code points and encoded byte sequences I'd be happy -- although in practice we also need to acknowledge the existence of characters that may be represented by multiple code points, since islower(), lower() etc. may need these (and also the re module). Other concepts we may have to at least acknowledge include various normal forms, equivalence, and collation sequences (which are language-dependent?). It would be lovely if someone wrote up an informational PEP so that we don't all have to lug around a copy of the Unicode standard.
-- --Guido van Rossum (python.org/~guido)

On 26 August 2011 03:52, Guido van Rossum <guido@python.org> wrote:
Hmm, I'm completely naive in this area, but from reading the thread, would a possible approach be to say that Python (the language definition) is defined in terms of code points (as we already do, even if the wording might benefit from some clarification). Then, under PEP 393, and currently in wide builds, CPython conforms to that definition (and retains the property of basic operations being O(1), which is not in the language definition but is a user expectation and your expressed requirement). IronPython and Jython can retain UTF-16 as their native form if that makes interop cleaner, but in doing so they need to ensure that basic operations like indexing and len work in terms of code points, not code units, if they are to conform. Presumably this will be easier than moving to a UCS-4 representation, as they can defer to runtime support routines via interop (which presumably get this right - or at the very least can be blamed for any errors :-)) They lose the O(1) guarantee, but that's easily defensible as a tradeoff to conform to underlying runtime semantics. Does this make sense, or have I completely misunderstood things? Paul. PS Thanks to all for the discussion in general, I'm learning a lot about Unicode from all of this!

That means that they won't conform, period. There is no efficient maintainable implementation strategy to achieve that property, and it may well take years until somebody provides an efficient unmaintainable implementation.
Does this make sense, or have I completely misunderstood things?
You seem to assume it is ok for Jython/IronPython to provide indexing in O(n). It is not. However, non-conformance may not be that much of an issue. They do not conform in many other aspects either (such as not supporting Python 3, for example, or not supporting the C API), so they may well choose to ignore such a minor requirement if there was one. For BMP strings, they conform fine, and it may well be that Jython users either don't have non-BMP strings, or don't care whether len() or indexing of their non-BMP strings is "correct". Regards, Martin

"Martin v. Löwis", 26.08.2011 11:29:
You seem to assume it is ok for Jython/IronPython to provide indexing in O(n). It is not.
I think we can leave this discussion aside. Jython and IronPython have their own platform specific constraints to which they need to adapt their implementation. For a Jython user, it means a lot to be able to efficiently pass strings (and other data) back and forth between Jython and other JVM code, and it's not hard to guess that the same is true for IronPython/.NET users. After all, the platform integration is the very *reason* for most users to select one of these implementations. Besides, what if these implementations provided indexing in, say, O(log N) instead of O(1) or O(N), e.g. by building a tree index into each string? You could have an index that simply marks runs of surrogate pairs and BMP substrings, thus providing a likely-to-be-somewhat-compact index. That index would obviously have to be built, but so do the different string representations in post-PEP-393 CPython, especially on Windows, as I have learned. Would such a less severe violation of the strict O(1) rule still be "not ok"? I think this is not such a clear black-and-white issue. Both implementations have notably different performance characteristics than CPython in some more or less important areas, as does PyPy. At some point, the language compliance label has to account for that. Stefan

On Fri, Aug 26, 2011 at 3:29 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
(And yet, you keep arguing. :-)
Right.
Eek. No, please. Those platforms' native string types have length and slicing operations that are O(1) and work in terms of 16-bit code points. Python should use those. It would be awful if Java and Python code doing the same manipulations on the same string would come to different conclusions because Python tried to paper over surrogates. I dug up some evidence for Java, at least: http://download.oracle.com/javase/1.5.0/docs/api/java/lang/CharSequence.html...
"""
length
int length()
Returns the length of this character sequence. The length is the number of 16-bit chars in the sequence.
Returns: the number of chars in this sequence
"""
This is quite explicit about counting 16-bit code units. I've found similar info about .NET, which defines "char" as a 16-bit quantity and string length in terms of the number of "char" items.
Since you had to ask, I have to declare that, indeed, non-O(1) behavior would not be okay for those platforms. All in all, I don't think we should legislate Python strings to be able to support 21-bit code points using O(1) indexing. PEP 393 makes this possible for CPython, and it's been said that PyPy can follow suit. But it'll be a "quality-of-implementation" issue, not built into the language spec. -- --Guido van Rossum (python.org/~guido)

On 26 August 2011 17:51, Guido van Rossum <guido@python.org> wrote:
On Fri, Aug 26, 2011 at 2:29 AM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
(Regarding my comments on code point semantics)
On 26 August 2011 18:02, Guido van Rossum <guido@python.org> wrote:
*That* is actually the erroneous assumption I had made - that the Java and .NET native string type had code point semantics (i.e., took surrogates into account). As that isn't the case, my comments aren't valid - and I agree that having common semantics (and hence exposing surrogates) is too important to lose. On the other hand, that pretty much establishes that whatever PEP 393 achieves in terms of allowing all builds of CPython to offer code point semantics, the language definition can't mandate it. Thanks for the clarification. Paul.

On Fri, Aug 26, 2011 at 10:13 AM, Paul Moore <p.f.moore@gmail.com> wrote:
Those platforms probably *also* have libraries of operations to support writing apps that conform to the Unicode standard. But those apps will have to be aware of the difference between the "naive" length of a string and the number of code points or characters in it.
The most severe consequence to me seems that the stdlib (which is reused by those other platforms) cannot assume CPython's ideal world -- even if specific apps sometimes can. -- --Guido van Rossum (python.org/~guido)

Guido van Rossum, 26.08.2011 19:02:
I was mostly just confabulating. My main point was that this isn't a black-and-white thing - O(1) xor O(N) - and thus is orthogonal to the PEP. You can achieve compliant/acceptable behaviour at the code point level, the performance guarantees level or the platform integration level - choose any two. CPython is just lucky that there isn't really a platform integration level to take into account (if we leave the Windows environment aside for a moment).
I fully agree.
I take it that you say that because you want strings to perform in the 'normal' platform specific way here (i.e. like Java/.NET strings), and not so much because you want to require the exact same (performance) characteristics across Python implementations. So your choice is platform integration over code points, leaving the identical performance as a side-effect of the platform integration.
Makes sense to me. Most likely, Unicode heavy Python code will have to take platform specifics into account anyway, so there are limits as to what is suitable for a language spec. Stefan

On Fri, Aug 26, 2011 at 2:29 AM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Indeed.
I think this is fine. I had been hoping that all Python implementations claiming compatibility with version 3.3 of the language reference would be free of worries about surrogates, but it simply doesn't make sense. And yes, I'm well aware that PEP 393 is only for CPython. It's just that I had hoped that it would get rid of some of Tom C's specific complaints for all Python implementations; but it really seems impossible to do so. One consequence may be that the standard library, to the extent it is shared by other implementations, may still have to worry about surrogates and other issues inherent in narrow builds or other 16-bit-based string types. We'll cross that bridge when we get to it. -- --Guido van Rossum (python.org/~guido)

On 8/26/2011 5:29 AM, "Martin v. Löwis" wrote:
My impression is that a UTF-16 implementation, to be properly called such, must do len and [] in terms of code points, which is why Python's narrow builds are called UCS-2 and not UTF-16.
That means that they won't conform, period. There is no efficient maintainable implementation strategy to achieve that property,
Given that both 'efficient' and 'maintainable' are relative terms, that is your pessimistic opinion, not really a fact.
Why do you keep saying that O(n) is the alternative? I have already given a simple solution that is O(log k), where k is the number of non-BMP characters/codepoints/surrogate_pairs if there are any, and O(1) otherwise (for all BMP chars). It uses O(k) space. I think that is pretty efficient. I suspect that is the most time-efficient possible without using at least as much space as a UCS-4 solution. The fact that you and others do not want this for CPython should not preclude other implementations that are more tied to UTF-16 from exploring the idea.

Maintainability partly depends on whether all-codepoint support is built in or bolted onto a BMP-only implementation burdened with back compatibility for a code unit API. Maintainability is probably harder with a separate UTF-32 type, which CPython has but which I gather Jython and IronPython do not. It might or might not be easier if there were a separate internal character type containing a 32-bit code point value, so that iteration and indexing (and single char slicing) always returned the same type of object regardless of whether the character was in the BMP or not. This certainly would help all the unicode database functions.

Tom Christiansen appears to have said that Perl is or will use UTF-8 plus auxiliary arrays. If so, we will find out if they can maintain it.
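A minimal sketch of the idea, assuming the string is held as a sequence of 16-bit code-unit integers (the class and its names are merely illustrative):

    from bisect import bisect_right

    class CodePointIndex:
        # O(log k) code point indexing over UTF-16 code units, where
        # k is the number of surrogate pairs; O(1) when k == 0.
        def __init__(self, units):
            self.units = units  # sequence of 16-bit integers
            # Code-unit offsets at which surrogate pairs start.
            self.pairs = [i for i, u in enumerate(units)
                          if 0xD800 <= u <= 0xDBFF]

        def __len__(self):
            return len(self.units) - len(self.pairs)

        def __getitem__(self, idx):
            # The unit offset is idx plus the number of pairs that
            # start before it; iterate to a fixed point (the shift
            # can only grow, so this terminates quickly).
            shift = 0
            while True:
                new = bisect_right(self.pairs, idx + shift - 1)
                if new == shift:
                    break
                shift = new
            u = self.units[idx + shift]
            if 0xD800 <= u <= 0xDBFF:  # recombine the surrogate pair
                lo = self.units[idx + shift + 1]
                return 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00)
            return u

    # e.g. U+1D11E followed by 'A': len() is 2, and [1] is ord('A')
    assert len(CodePointIndex([0xD834, 0xDD1E, 0x41])) == 2
    assert CodePointIndex([0xD834, 0xDD1E, 0x41])[1] == 0x41

--- Terry Jan Reedy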

On Fri, Aug 26, 2011 at 3:57 PM, Terry Reedy <tjreedy@udel.edu> wrote:
I don't think anyone else has that impression. Please cite chapter and verse if you really think this is important. IIUC, UCS-2 does not allow surrogate pairs, whereas Python (and Java, and .NET, and Windows) 16-bit strings all do support surrogate pairs. And they all have a len or length function that counts code units, not code points.
Their API style is completely different from ours. What Perl can maintain has little bearing on what Python can. -- --Guido van Rossum (python.org/~guido)

On 8/26/2011 8:42 PM, Guido van Rossum wrote:
On Fri, Aug 26, 2011 at 3:57 PM, Terry Reedy<tjreedy@udel.edu> wrote:
For that reason, I think UTF-16 is a better term than UCS-2 for narrow builds (whether or not the above impression is true). But Marc Lemburg disagrees. http://mail.python.org/pipermail/python-dev/2010-November/105751.html The 2.7 docs still refer to ucs2 builds, as is his wish. --- Terry Jan Reedy

On Aug 26, 2011, at 8:51 PM, Terry Reedy wrote:
I agree. It's weird to call something UCS-2 if code points above 65535 are representable. The naming convention for codecs is that the UTF prefix is used for lossless encodings that cover the entire range of Unicode. "The first amendment to the original edition of the UCS defined UTF-16, an extension of UCS-2, to represent code points outside the BMP." Raymond

Raymond Hettinger writes:
The naming convention for codecs is that the UTF prefix is used for lossless encodings that cover the entire range of Unicode.
Sure. The operative word here is "codec", not "str", though.
Since when can s[0] represent a code point outside the BMP, for s a Unicode string in a narrow build? Remember, the UCS-2/narrow vs. UCS-4/wide distinction is *not* about what Python supports vs. the outside world. It's about what the str/unicode type is an array of.

Antoine Pitrou writes:
Because what the outside world sees is produced by codecs, not by str. The outside world can't see whether you have narrow or wide unless it uses indexing ... ie, experiments to determine what the str type is an array of. The problem with a narrow build (whether for space efficiency in CPython or for platform compatibility in Jython and IronPython) is not that we have no UTF-16 codecs. It's that array ops aren't UTF-16 conformant.

Antoine Pitrou writes:
Sorry, what is a conformant UTF-16 array op?
For starters, one that doesn't ever return lone surrogates, but rather interprets surrogate pairs as Unicode code points as in UTF-16. (This is not a Unicode standard definition, it's intended to be suggestive of why many app writers will be distressed if they must use Python unicode/str in a narrow build without a fairly comprehensive library that wraps the arrays in operations that treat unicode/str as an array of code points.)
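A sketch of what such an operation looks like, phrased in Python over a sequence of 16-bit code-unit integers (illustrative only, not a proposed API):

    def iter_code_points(units):
        # Interpret UTF-16 code units as code points, refusing to
        # return lone or wrongly ordered surrogates, as a conformant
        # UTF-16 reader must.
        it = iter(units)
        for u in it:
            if 0xD800 <= u <= 0xDBFF:        # high surrogate
                lo = next(it, None)
                if lo is None or not (0xDC00 <= lo <= 0xDFFF):
                    raise ValueError("lone high surrogate")
                yield 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00)
            elif 0xDC00 <= u <= 0xDFFF:      # stray low surrogate
                raise ValueError("lone low surrogate")
            else:
                yield u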

Guido van Rossum writes:
On Tue, Aug 30, 2011 at 7:55 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Well, that's why I wrote "intended to be suggestive". The Unicode Standard does not specify at all what the internal representation of characters may be, it only specifies what their external behavior must be when two processes communicate. (For "process" as used in the standard, think "Python modules" here, since we are concerned with the problems of folks who develop in Python.)

When observing the behavior of a Unicode process, there are no UTF-16 arrays or UTF-8 arrays or even UTF-32 arrays; only arrays of characters. Thus, according to the rules of handling a UTF-16 stream, it is an error to observe a lone surrogate or a surrogate pair that isn't a high-low pair (Unicode 6.0, Ch. 3 "Conformance", requirements C1 and C8-C10). That's what I mean by "can't tell it's UTF-16". And I understand those requirements to mean that operations on UTF-16 streams should produce UTF-16 streams, or raise an error.

Without that closure property for basic operations on str, I think it's a bad idea to say that the representation of text in a str in a pre-PEP-393 "narrow" build is UTF-16. For many users and app developers, it creates expectations that are not fulfilled.

It's true that common usage is that an array of code units that usually conforms to UTF-16 may be called "UTF-16" without the closure properties. I just disagree with that usage, because there are two camps that interpret "UTF-16" differently. One side says, "we have an array representation in UTF-16 that can handle all Unicode code points efficiently, and if you think you need more, think again", while the other says "it's too painful to have to check every result for valid UTF-16, and we need a UTF-16 type that supports the usual array operations on *characters* via the usual operators; if you think otherwise, think again."

Note that despite the (presumed) resolution of the UTF-16 issue for CPython by PEP 393, at some point a very similar discussion will take place over "characters" anyway, because users and app developers are going to want a type that handles composition sequences and/or grapheme clusters for them, as well as comparison that respects canonical equivalence, even if it is inefficient compared to str. That's why I insisted on use of "array of code points" to describe the PEP 393 str type, rather than "array of characters".

On 8/30/2011 11:03 PM, Stephen J. Turnbull wrote:
On topic: So from reading all this discussion, I think this point is rather a key one... and it has been made repeatedly in different ways: Arrays are not suitable for manipulating Unicode character sequences, and the str type is an array with a veneer of text manipulation operations, which do not, and cannot, by themselves, efficiently implement Unicode character sequences.

Python wants to, should, and can implement UTF-16 streams, UTF-8 streams, and UTF-32 streams. It should, and can, implement streams using other encodings as well, and also binary streams. Python wants to, should, and can implement 8-bit, 16-bit, 32-bit, and 64-bit arrays. These are efficient to access, index, and slice.

Python implements a veneer on some 8-bit, 16-bit, and 32-bit arrays called str (this will be more true post-PEP 393, although it is true with caveats presently), which interprets array elements as code units (currently) or codepoints (post-PEP), and implements operations that are interesting for text processing, with caveats. There is presently no support for arrays of Unicode grapheme clusters or composed characters. The Python type called str may or may not be properly documented (to the extent that there is confusion between the actual contents of the elements of the type, and the concept of character as defined by Unicode).

From comments Guido has made, he is not interested in changing the efficiency or access methods of the str type to raise the level of support of Unicode to the composed character, or grapheme cluster concepts. The str type itself can presently be used to process other character encodings: if they are fixed-width < 32-bit elements those encodings might be considered Unicode encodings, but there is no requirement that they are, and some operations on str may operate with knowledge of some Unicode semantics, so there are caveats.

So it seems that any semantics in support of composed characters, grapheme clusters, or codepoints-stored-as-<32-bit-code-units, must be created as either an add-on Python package (in Python) or C extension, or a combination. It could be based on extensions to the existing str type, or it could be based on the array type, or it could be based on the bytes type. It could use an internal format of 32-bit codepoints, PEP 393 variable-size codepoints, or 8- or 16-bit codeunits. In addition to the expected stream operations, character length, indexing, and slicing operations, additional more complex operations would be expected on Unicode string values: regular expressions, comparisons, collations, case-shifting, and perhaps more. RTL and LTR awareness would add complexity to all operations, or at least variants of all operations.

The questions are:
1) Is anyone interested in writing a PEP for such a thing?
2) Is anyone interested in writing an implementation for such a thing?
3) How many conflicting opinions and arguments will be spawned, making the poor person or persons above lose interest?

Brainstorming ideas (which may wander off-topic in some regards, but were all inspired by this discussion):

BI-0: Tom's analysis makes me think that the UTF-8 encoding, since it is smallest for the average language, and an implementation based on a foundation type of bytes or 'B' arrays, plus some side indexes of some sort, could be an appropriate starting point. UTF-8 is variable length, but so are composed characters and grapheme clusters.
Building an array, each of whose units could hold the largest grapheme cluster, would seem extremely inefficient, just like 32-bit Unicode is extremely inefficient for dealing with ASCII, so variable length units seem to be an imperative part of a solution. At least until one thinks up BI-2.

BI-1: Perhaps a 32-bit base, with the upper 11 bits used to cache character characteristics from various character attribute database lookups, could be an effective alternative, but wouldn't eliminate the need for dealing with variable length units for length, indexing, and slicing operations.

BI-2: Maybe a 32-bit base would be useful so that one high bit could be used to flag that this character position actually holds an index to a multi-codepoint character, and the index would then hold the actual codes for that character. This would allow for at most 2^31 (and memory limited) different multi-codepoint characters in a string (or perhaps per application, if the multi-codepoint characters are shared between strings), but would suddenly allow array indexing of grapheme clusters and composed characters... with double-indexing required for multi-codepoint character access. [This idea seems similar to one that was mentioned elsewhere in this thread, suggesting that private use characters could be used to represent multi-codepoint characters, but (a) doesn't infringe on private uses, and (b) allows for a greater number of multi-codepoint characters to be used.]

BI-3: both BI-1 and BI-2 would also allow themselves to be built on top of PEP 393 str... allowing multi-codepoint-character-supporting applications to benefit from the space efficiencies of PEP 393 when no multi-codepoint characters are fed into the application.

BI-4: Since Unicode has 21-bit codepoints, one wonders if 24-bit array elements might be appropriate, rather than 32-bit. BI-2 could still operate, with a theoretical reduction to 2^23 possible multi-codepoint characters in an application. Access would be less efficient, but still O(1), and 25% of the space would be saved. This idea could be applied to PEP 393 independently of multi-codepoint character support.

BI-5: I'm pretty sure there are inappropriate or illegal sequences of combining characters that should not stand alone. One example of this is lone surrogates. Such characters out of an appropriate sequence could be flagged with a high bit so that they could be quickly recognized as illegal Unicode, but codecs could be provided to allow them to round-trip, and applications could recognize immediately that they should be handled as "binary gibberish" in an otherwise Unicode stream. This idea could be applied to PEP 393 independently of additional multi-codepoint character support.

BI-6: Maybe another high bit could be used with a different codec error handler instead of using lone surrogates when decoding not-quite-conformant byte streams (such as OS filenames). Sad we didn't think of this one before doing all the lone surrogate stuff. Of course, this solution wouldn't work on narrow builds, because not even surrogates can represent high bits above Unicode codepoints! But once we have PEP 393, we _could_ replace inappropriate use of lone surrogates with use of out-of-the-Unicode-codepoint range integers, without introducing ambiguity in the interpretation of lone surrogates. This idea could be applied to PEP 393 independently of multi-codepoint character support. Glenn

Glenn Linderman writes:
IMO, that would be a bad idea, as higher-level Unicode support should either be a wrapper around full implementations such as ICU (or platform support in .NET or Java), or be written in pure Python at first. Thus there is a need for an efficient array of code units type. PEP 393 allows this to go to the level of code points, but evidently that is inappropriate for Jython and IronPython.
The str type itself can presently be used to process other character encodings:
Not really. Remember, on input codecs always decode to Unicode and on output they always encode from Unicode. How do you propose to get other encodings into the array of code units?
In theory yes, but in practice all of the string methods and libraries like re operate on str (and often but not always bytes; in particular, codecs always decode from bytes and encode to bytes). Why bother with anything except arrays of code points at the start? PEP 393 makes that time-efficient and reasonably space-efficient as a starting point and allows starting with re or MRAB's regex to get basic RE functionality or good UTS #18 functionality respectively. Plus str already has all the usual string operations (.startswith(), .join(), etc), and we have modules for dealing with the Unicode Character Database. Why waste effort reintegrating with all that, until we have common use cases that need more efficient representation? There would be some issue in coming up with an appropriate UTF-16 to code point API for Jython and IronPython, but Terry Reedy has a rather efficient library for that already. So this discussion of alternative representations, including use of high bits to represent properties, is premature optimization ... especially since we don't even have a proto-PEP specifying how much conformance we want of this new "true Unicode" type in the first place. We need to focus on that before optimizing anything.

On 8/31/2011 5:21 AM, Stephen J. Turnbull wrote:
OK you agree with Guido.
Here are two ways (there may be more): custom codecs, direct assignment
String methods could be reimplemented on any appropriate type, of course. Rejecting alternatives too soon might make one miss the best design.
Yes, Terry's implementation is interesting, and inspiring, and that concept could be extended to a variety of interesting techniques: codepoint access of code unit representations, and multi-codepoint character access on top of either code unit or codepoint representations.
You may call it premature optimization if you like, or you can ignore the concepts and emails altogether. I call it brainstorming for ideas, looking for non-obvious solutions to the problem of representation of Unicode. I found your discussion of streams versus arrays, as separate concepts related to Unicode, along with Terry's bisect indexing implementation, rather inspiring.

Just because Unicode defines streams of codeunits of various sizes (UTF-8, UTF-16, UTF-32) to represent characters when processes communicate and for storage (which is one way processes communicate), that doesn't imply that the internal representation of character strings in a programming language must use exactly that representation. While there are efficiencies in using the same representation as is used by the communications streams, there are also inefficiencies.

I'm unaware of any current Python implementation that has chosen to use UTF-8 as the internal representation of character strings (I am aware that Perl has made that choice), yet UTF-8 is one of the commonly recommended character representations on the Linux platform, from what I read. So in that sense, Python has rejected the idea of using the "native" or "OS-configured" representation as its internal representation. So why, then, must one choose from a repertoire of Unicode-defined stream representations if they don't meet the goal of efficient length, indexing, or slicing operations on actual characters?

Glenn Linderman writes:
That is true, and Unicode is *very* careful to define its requirements so that is true. That doesn't mean using an alternative representation is an improvement, though.
There are two reasons for that. First, widechar representations are right out for anything related to the file system or OS, unless you are prepared to translate before passing to the OS. If you use UTF-8, then asking the user to use a UTF-8 locale to communicate with your app is a plausible way to eliminate any translation in your app. (The original moniker for UTF-8 was UTF-FSS, where FSS stands for "file system safe.") Second, much text processing is stream-oriented and one-pass. In those cases, the variable-width nature of UTF-8 doesn't cost you anything. Eg, this is why the common GUIs for Unix (X.org, GTK+, and Qt) either provide or require UTF-8 coding for their text. It costs *them* nothing and is file-system-safe.
I can't agree with that characterization. POSIX defines the concept of *locale* precisely because the "native" representation of text in Unix is ASCII. Obviously that won't fly, so they solved the problem in the worst possible way<wink/>: they made the representation variable! It is the *variability* of text representation that Python rejects, just as Emacs and Perl do. They happen to have chosen six different representations.[1]
One need not. But why do anything else? It's not like the authors of that standard paid no attention to various concerns about efficiency and backward compatibility! That's the question that you have not answered, and I am presently lacking in any data that suggests I'll ever need the facilities you propose. Footnotes: [1] Emacs recently changed its mind. Originally it used the so-called MULE encoding, and now a different extension of UTF-8 from Perl. Of course, Python beats that, with narrow, wide, and now PEP-393 representations!<wink />

Stephen J. Turnbull:
... Eg, this is why the common GUIs for Unix (X.org, GTK+, and Qt) either provide or require UTF-8 coding for their text.
Qt uses UTF-16 for its basic QString type. While QString is mostly treated as a black box which you can create from input buffers in any encoding, the only encoding allowed for a contents-by-reference QString (QString::fromRawData) is UTF-16. http://doc.qt.nokia.com/latest/qstring.html#fromRawData Neil

On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman <v+python@g.nevcal.com> wrote:
I think this is too strong. The str type is indeed an array, and you can build useful Unicode manipulation APIs on top of it. Just like bytes are not UTF-8, but can be used to represent UTF-8 and a fully-compliant UTF-8 codec can be implemented on top of it. -- --Guido van Rossum (python.org/~guido)

On 8/31/2011 10:12 AM, Guido van Rossum wrote:
This statement is a logical conclusion of arguments presented in this thread.
1) Applications that wish to do grapheme access, wish to do it by grapheme array indexing, because that is the efficient way to do it.
2) As long as str is restricted to holding Unicode code units or code points, then it cannot support grapheme array indexing efficiently.
I have not declared that useful Unicode manipulation APIs cannot be built on top of str, only that efficiency will suffer.

On Wed, Aug 31, 2011 at 11:51 AM, Glenn Linderman <v+python@g.nevcal.com> wrote:
I don't believe that should be taken as gospel. In Perl, they don't do array indexing on strings at all, and use regex matching instead. An API that uses some kind of cursor on a string might work fine in Python too (for grapheme matching).
2) As long as str is restricted to holding Unicode code units or code points, then it cannot support grapheme array indexing efficiently.
But you have not proven it. -- --Guido van Rossum (python.org/~guido)

On 8/31/2011 11:56 AM, Guido van Rossum wrote:
The last benchmark I saw, regexp in Perl is faster than regexp in Python; that was some years back, before regexp in Perl supported quite as much Unicode as it does now; not sure if anyone has done recent performance benchmarks; Tom's survey indicates that the functionality presently differs, so it is not clear if performance benchmarks are presently appropriate to attempt to measure Unicode operations in regexp between the two languages. That said, regexp, or some sort of cursor on a string, might be a workable solution. Will it have adequate performance? Perhaps, at least for some applications. Will it be as conceptually simple as indexing an array of graphemes? No. Will it ever reach the efficiency of indexing an array of graphemes? No. Does that matter? Depends on the application.
Do you disagree that indexing an array is more efficient than manipulating strings with regex or binary trees? I think not, because you are insistent that array indexing of str be preserved as O(1). I agree that I have not proven it; it largely depends on whether or not indexing by grapheme cluster is a useful operation in applications. Yet Stephen (I think) has commented that emacs performance goes down as soon as multi-byte characters are introduced into an edit buffer. So I think he has proven that efficiency can suffer, in some implementations/applications. Terry's O(k) implementation requires data beyond strings, and isn't O(1).

Glenn Linderman:
Using an iterator for cluster access is a common technique currently. For example, with the Pango text layout and drawing library, you may create a PangoLayoutIter over a text layout object (which contains a UTF-8 string along with formatting information) and iterate by clusters by calling pango_layout_iter_next_cluster. Direct access to clusters by index is not as useful in this domain as access by pixel positions - for example to examine the portion of a layout visible in a window. http://developer.gnome.org/pango/stable/pango-Layout-Objects.html#pango-layo... In this API, 'index' is used to refer to a byte index into UTF-8, not a character or cluster index.

Rather than discuss functionality in the abstract, we need some use cases involving different levels of character and cluster access to see whether providing indexed access is worthwhile. I'll start with an example: some text drawing engines draw decomposed characters ("o" followed by " ̈" -> "ö") differently compared to their composite equivalents ("ö") and this may be perceived as better or worse. I'd like to offer an option to replace some decomposed characters with their composite equivalent before drawing but since other characters may look worse, I don't want to do a full normalization.

The API style that appears most useful for this example is an iterator over the input string that yields composed and decomposed character strings (that is, it will yield both "ö" and "ö"), each character string is then converted if in a substitution dictionary and written to an output string. This is similar to an iterator over grapheme clusters although, since it is only aimed at composing sequences, the iterator could be simpler than a full grapheme cluster iterator. One of the benefits of iterator access to text is that many different iterators can be built without burdening the implementation object with extra memory costs as would be likely with techniques that build indexes into the representation.
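A sketch of that iterator in Python (simplified; a real implementation would follow UAX #29 more closely, and the substitution table here is illustrative):

    import unicodedata

    def iter_combining_sequences(s):
        # Yield maximal "base character + combining marks" substrings.
        buf = ""
        for ch in s:
            if buf and unicodedata.combining(ch):
                buf += ch
            else:
                if buf:
                    yield buf
                buf = ch
        if buf:
            yield buf

    substitutions = {"o\u0308": "\u00f6"}   # 'o' + diaeresis -> precomposed
    text = "Zo\u0308e"
    out = "".join(substitutions.get(seq, seq)
                  for seq in iter_combining_sequences(text))
    assert out == "Z\u00f6e"

Neil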

Guido van Rossum:
No, since normalization of all cases may actually lead to worse visuals in some situations. A potential reason for drawing decomposed characters differently is that more room may be allocated for the generic condition where a character may be combined with a wide variety of accents compared with combining it with a specific accent. Here is an example on Windows drawing composite and decomposed forms to show the types of difference often encountered. http://scintilla.org/Composite.png Now, this particular example displays both forms quite reasonably so would not justify special processing but I have seen on other platforms and earlier versions of Windows where the umlaut in the decomposed form is displaced to the right even to the extent of disappearing under the next character. In the example, the decomposed 'o' is shorter and lighter and the umlauts are round instead of square. Neil

On Wed, Aug 31, 2011 at 6:29 PM, Neil Hodgson <nyamatongwe@gmail.com> wrote:
Ok, I thought there was also a form normalized (denormalized?) to decomposed form. But I'll take your word.
I'm not sure it's a good idea to try and improve on the font using such a hack. But I won't deny you have the right. :-) -- --Guido van Rossum (python.org/~guido)

Ok, I thought there was also a form normalized (denormalized?) to decomposed form. But I'll take your word.
If I understood the example correctly, he needs a mixed form, with some characters decomposed and some composed (depending on which one looks better in the given font). I agree that this sounds more like a font problem, but it's a widespread font problem and it may be necessary to address it in an application. But this is only one example of why an application-specific concept of graphemes different from the Unicode-defined normalized forms can be useful.

I think the very concept of a grapheme is context, language, and culture specific. For example, in Chinese Pinyin it would be very natural to write tone marks with composing diacritics (i.e. in decomposed form). But then you have the vowel "ü" and it would be strange to decompose it into a "u" and combining diaeresis. So conceptually the most sensible representation of "lǜ" would be neither the composed nor the decomposed normal form, and depending on its needs an application might want to represent it in the mixed form (composing the diaeresis with the "u", but leaving the grave accent separate). There must be many more examples where the conceptual context determines the right composition, like for "ñ", which in Spanish is certainly a grapheme, but in mathematics might be better represented as n-tilde.

The bottom line is that, while an array of Unicode code points is certainly a generally useful data type (and PEP 393 is a great improvement in this regard), an array of graphemes carries many subtleties and may not be nearly as universal. Support in the spirit of unicodedata's normalization function etc. is certainly a good thing, but we shouldn't assume that everyone will want Python to do their graphemes for them.
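To make the point concrete with unicodedata (a quick sketch; U+01DC is "ǜ", U+00FC is "ü", U+0300 is the combining grave accent):

    import unicodedata

    nfc   = "l\u01dc"                           # fully composed "lǜ"
    nfd   = unicodedata.normalize("NFD", nfc)   # 'l' 'u' U+0308 U+0300
    mixed = "l\u00fc\u0300"                     # composed 'ü' + combining grave

    # All three are canonically equivalent...
    assert unicodedata.normalize("NFC", mixed) == nfc
    assert unicodedata.normalize("NFD", mixed) == nfd
    # ...but neither normal form yields the mixed representation:
    assert mixed != nfc and mixed != nfd

- Hagen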

On 8/31/2011 5:58 PM, Neil Hodgson wrote:
I agree that different applications may have different needs for different types of indexes to various starting points in a large string. Where a custom index is required, a standard index may not be needed.
How many different iterators into the same text would be concurrently needed by an application? And why? Seems like if it is dealing with text at the level of grapheme clusters, it needs that type of iterator. Of course, if it does I/O it needs codec access, but that is by nature sequential from the starting point to the end point.

Glenn Linderman:
I would expect that there would mostly be a single iterator into a string but can imagine scenarios in which multiple iterators may be concurrently active and that these could be of different types. For example, say we wanted to search for each code point in a text that fails some test (such as being a member of a set of unwanted vowel diacritics) and then display that failure in context with its surrounding text of up to 30 graphemes either side. Neil

Glenn Linderman writes:
How many different iterators into the same text would be concurrently needed by an application? And why?
A WYSIWYG editor for structured text (TeX, HTML) might want two (at least), one for the "source" window and one for the "rendered" window. One might want to save the state of the iterators (if that's possible) and cache it as one moves the "window" forward to make short backward motion fast, giving you two (or four, etc) more.
`save-region' ? `save-text-remove-markup' ?

On 9/1/2011 2:15 AM, Stephen J. Turnbull wrote:
Sure. But those are probably all the same type of iterator, one that (since they are WYSIWYG) deals with multi-codepoint characters (Guido's recent definition of grapheme, which seems to subsume both grapheme clusters and composed characters). Hence all of them would be using/requiring the same sort of representation, index, analysis, or some combination of those.
Yes, save-region sounds like exactly what I was speaking of. save-text-remove-markup I would infer needs to process the text to remove the markup characters... since you used TeX and HTML as examples, markup is text, not binary (which would be a different problem). Since the TeX and HTML markup is mostly ASCII, markup removal (or more likely, text extraction) could be performed via either a grapheme iterator, or a codepoint iterator, or even a code unit iterator.

On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman <v+python@g.nevcal.com> wrote:
Actually, the str type in Python 3 and the unicode type in Python 2 are constrained everywhere to either 16-bit or 21-bit "characters". (Except when writing C code, which can do any number of invalid things so is the equivalent of assuming 1 == 0.) In particular, on a wide build, there is no way to get a code point >= 2**21, and I don't want PEP 393 to change this. So at best we can use these types to represent arrays of 21-bit unsigned ints. But I think it is more useful to think of them as always representing "some form of Unicode", whether that is UTF-16 (on narrow builds) or 21-bit code points or perhaps some vaguely similar superset -- but for those code units/code points that are representable *and* valid (either code points or code units) according to the (supported version of) the Unicode standard, the meaning of those code points/units matches that of the standard. Note that this is different from the bytes type, where the meaning of a byte is entirely determined by what it means in the programmer's head. -- --Guido van Rossum (python.org/~guido)

On 8/31/2011 10:20 AM, Guido van Rossum wrote:
Sorry, my Perl background is leaking through. I didn't double check that str constrains the values of each element to the range below 0x110000, but I see now by testing that it does. For some of my ideas, then, either a subtype of str would have to be able to relax that constraint, or str would not be the appropriate base type to use (but there are other base types that could be used, so this is not a serious issue for the ideas). I have no problem with thinking of str as representing "some form of Unicode". None of my proposals change that, although they may change other things, and may invent new forms of Unicode representations. You have stated that it is better to document what str actually does, rather than attempt to adhere slavishly to Unicode standard concepts. The Unicode Consortium may well define legal, conforming bytestreams for communicating processes, but languages and applications are free to use other representations internally. We can either artificially constrain ourselves to minor tweaks of the legal conforming bytestreams, or we can invent a representation (whether called str or something else) that is useful and efficient in practice.

Glenn Linderman writes:
We can either artificially constrain ourselves to minor tweaks of the legal conforming bytestreams,
It's not artificial. Having the internal representation be the same as a standard encoding is very useful for a large number of minor usages (urgently saving buffers in a text editor that knows its internal state is inconsistent, viewing strings in the debugger, PEP 393-style space optimization is simpler if text properties are out-of-band, etc).
or we can invent a representation (whether called str or something else) that is useful and efficient in practice.
Bring on the practice, then. You say that a bit to identify lone surrogates might be useful or efficient. In what application? How much time or space does it save? You say that a bit to cache a property might be useful or efficient. In what application? Which properties? Are those properties a set fixed by the language, or would some bits be available for application-specific property caching? How much time or space does that save? What are the costs to applications that don't want the cache? How is the bit-cache affected by PEP 393?

I know of no answers (none!) to those questions that favor introduction of a bit-cache representation now. And those bits aren't going anywhere; it will always be possible to use a "wide" build and change the representation later, if the optimization is valuable enough.

Now, I'm aware that my experience is limited to the implementations of one general-purpose language (Emacs Lisp) of restricted applicability. But its primary use *is* in text processing, so I'm moderately expert. *Moderately*. Always interested in learning more, though. If you know of relevant use cases, I'm listening! Even if Guido doesn't find them convincing for Python, we might find them interesting at XEmacs.

On 9/1/2011 12:59 AM, Stephen J. Turnbull wrote:
saving buffers urgently when the internal state is inconsistent sounds like carefully preserving a bug. Windows 7 64-bit on one of my computers happily crashes several times a day when it detects inconsistent internal state... under the theory, I guess, that losing work is better than saving bad work. You sound the opposite. I'm actually very grateful that Firefox and emacs recover gracefully from Windows crashes, and I lose very little data from the crashes, but cannot recommend Windows 7 (this machine being my only experience with it) for stability. In any case, the operations you mention still require the data to be processed, if ever so slightly, and I'll admit that a more complex representation would require a bit more processing. Not clear that it would be huge or problematical for these cases. Except, I'm not sure how PEP 393 space optimization fits with the other operations. It may even be that an application-wide complex-grapheme cache would save significant space, although if it uses high-bits in a string representation to reference the cache, PEP 393 would jump immediately to something > 16 bits per grapheme... but likely would anyway, if complex-graphemes are in the data stream.
I didn't attribute any efficiency to flagging lone surrogates (BI-5). Since Windows uses a non-validated UCS-2 or UTF-16 character type, any Python program that obtains data from Windows APIs may be confronted with lone surrogates or inappropriate combining characters at any time. Round-tripping that data seems useful, even though the data itself may not be as useful as validated Unicode characters would be. Accidentally combining the characters due to slicing and dicing the data, and doing normalizations, or what not, would not likely be appropriate. However, returning modified forms of it to Windows as UCS-2 or UTF-16 data may still cause other applications to later accidentally combine the characters, if the modifications juxtaposed things to make them look reasonably, even if accidentally. If intentionally, of course, the bit could be turned off. This exact sort of problem with non-validated UTF-8 bytes was addressed already in Python, mostly for Linux, allowing round-tripping of the byte stream, even though it is not valid. BI-6 suggests a different scheme for that, without introducing lone surrogates (which might accidentally get combined with other lone surrogates).
The brainstorming ideas I presented were just that... ideas. And they were independent. And the use of many high-order bits for properties was one of the independent ones. When I wrote that one, I was assuming a UTF-32 representation (which wastes 11 bits of each 32). One thing I did have in mind, with the high-order bits, for that representation, was to flag the start or end or middle of the codes that are included in a grapheme. That would be redundant with some of the Unicode codepoint property databases, if I understand them properly... whether it would make iterators enough more efficient to be worth the complexity would have to be benchmarked. After writing all those ideas down, I actually preferred some of the others, that achieved O(1) real grapheme indexing, rather than caching character properties.
What are the costs to applications that don't want the cache? How is the bit-cache affected by PEP 393?
If it is a separate type from str, then it costs nothing except the extra code space to implement the cache for those applications that do want it... most of which wouldn't be loaded for applications that don't, if done as a module or C extension.
OK... ignore the bit-cache idea (BI-1), and reread the others without having your mind clogged with that one, and see if any of them make sense to you then. But you may be too biased by the "minor" needs of keeping the internal representation similar to the stream representation to see any value in them. I rather like BI-2, since it allow O(1) indexing of graphemes.

Glenn Linderman writes:
Definitely. Windows apps habitually overwrite existing work; saving when inconsistent would be a bad idea. The apps I work on dump their unsaved buffers to new files, and give you a chance to look at them before instating them as the current version when you restart.
The only language I know of that uses thousands of complex graphemes is Korean ... and the precomposed forms are already in the BMP. I don't know how many accented forms you're likely to see in Vietnamese, but I suspect it's less than 6400 (the number of characters in private space in the BMP). So for most applications, I believe that mapping both non-BMP code points and grapheme clusters into that private space should be feasible. The only potential counterexample I can think of is display of Arabic, which I have heard has thousands of glyphs in good fonts because of the various ways ligatures form in that script. However AFAIK no apps encode these as characters; I'm just admitting that it *might* be useful. This will require some care in registering such characters and clusters because input text may already use private space according to some convention, which would need to be respected. Still, 6400 characters is a lot, even for the Japanese (IIRC the combined repertoire of "corporate characters" that for some reason never made it into the JIS sets is about 600, but almost all of them are already in the BMP). I believe the total number of Japanese emoticons is about 200, but I doubt that any given text is likely to use more than a few. So I think there's plenty of space there. This has a few advantages: (1) since these are real characters, all Unicode algorithms will apply as long as the appropriate properties are applied to the character in the database, and (2) it works with a narrow code unit (specifically, UCS-2, but it could also be used with UTF-8). If you really need more than 6400 grapheme clusters, promote to UTF-32, and get two more whole planes full (about 130,000 code points).
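A rough sketch of the registration scheme (all names are illustrative, the cluster segmentation is simplistic, and real code would need to respect private-use characters already present in the input):

    import unicodedata

    PUA_START, PUA_END = 0xE000, 0xF8FF   # BMP private use area
    _to_pua = {}    # cluster -> private-use code point
    _from_pua = {}  # private-use code point -> cluster

    def intern_clusters(s):
        # Replace each multi-code-point combining sequence with a
        # registered private-use character, after which len() and
        # s[i] work per user-perceived character.
        out, buf = [], ""
        def flush():
            if len(buf) > 1:
                cp = _to_pua.get(buf)
                if cp is None:
                    cp = PUA_START + len(_to_pua)
                    if cp > PUA_END:
                        raise OverflowError("private space exhausted")
                    _to_pua[buf] = cp
                    _from_pua[cp] = buf
                out.append(chr(cp))
            else:
                out.append(buf)
        for ch in s:
            if buf and unicodedata.combining(ch):
                buf += ch
            else:
                if buf:
                    flush()
                buf = ch
        if buf:
            flush()
        return "".join(out)

    def expand_clusters(s):
        # Inverse mapping back to standard Unicode.
        return "".join(_from_pua.get(ord(c), c) for c in s)

    t = intern_clusters("Zo\u0308e")    # 'o' + diaeresis -> one PUA char
    assert len(t) == 3 and expand_clusters(t) == "Zo\u0308e"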
I don't think so. AFAIK all that data must pass through a codec, which will validate it unless you specifically tell it not to.
Round-tripping that data seems useful,
The standard doesn't forbid that. (ISTR it did so in the past, but what is required in 6.0 is a specific algorithm for identifying well-formed portions of the text, basically "if you're currently in an invalid region, read individual code units and attempt to assemble a valid sequence -- as soon as you do, that is a valid code point, and you switch into valid state and return to the normal algorithm".) Specifically, since surrogates are not characters, leaving them in the data does not constitute "interpreting them as characters." I don't recall if any of the error handlers allow this, though.
In CPython AFAIK (I don't do Windows) this can only happen if you use a non-default error setting in the output codec.
If you need O(1) grapheme indexing, use of private space seems a winner to me. It's just defining private precombined characters, and they won't bother any Unicode application, even if they leak out.
I'm talking about the bit-cache (which all of your BI-N referred to, at least indirectly). Many applications will want to work with fully composed characters, whether they're represented in a single code point or not. But they may not care about any of the bit-cache ideas.
No, I'm biased by the fact that I already have good ways to do them without leaving the set of representations provided by Unicode (often ways which provide additional advantages), and by the fact that I myself don't know any use cases for the bit-cache yet.
I rather like BI-2, since it allows O(1) indexing of graphemes.
I do too (without suggesting a non-standard representation, ie, using private space), but I'm sure that wheel has been reinvented quite frequently. It's a very common trick in text processing, although I don't know of other applications where it's specifically used to turn data that "fails to be an array just a little bit" into a true array (although I suppose you could view fixed-width EUC encodings that way).

On Tue, Aug 30, 2011 at 11:03 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote: [me]
Hm, that's not how I would read "process". IMO that is an intentionally vague term, and we are free to decide how to interpret it. I don't think it will work very well to define a process as a Python module; what about Python modules that agree about passing along arrays of code units (or streams of UTF-8, for that matter)? This is why I find the issue of Python, the language (and stdlib), as a whole "conforming to the Unicode standard" such a troublesome concept -- I think it is something that an application may claim, but the language should make much more modest claims, such as "the regular expression syntax supports features X, Y and Z from the Unicode recommendation XXX", or "the UTF-8 codec will never emit a sequence of bytes that is invalid according to Unicode specification YYY". (As long as the Unicode references are also versioned or dated.) I'm fine with saying "it is hard to write Unicode-conforming application code for reason ZZZ" and proposing a fix (e.g. PEP 393 fixes a specific complaint about code units being inferior to code points for most types of processing). I'm not fine with saying "the string datatype should conform to the Unicode standard".
But if you can observe (valid) surrogate pairs it is still UTF-16.
Ok, I dig this, to some extent. However saying it is UCS-2 is equally bad. I guess this is why Java and .NET just say their string types contain arrays of "16-bit characters", with essentially no semantics attached to the word "character" besides "16-bit unsigned integer". At the same time I think it would be useful if certain string operations like .lower() worked in such a way that *if* the input were valid UTF-16, *then* the output would also be, while *if* the input contained an invalid surrogate, the result would simply be something that is no worse (in particular, those are all mapped to themselves). We could even go further and have .lower() and friends look at graphemes (multi-code-point characters) if the Unicode std has a useful definition of e.g. lowercasing graphemes that differed from lowercasing code points. An analogy is actually found in .lower() on 8-bit strings in Python 2: it assumes the string contains ASCII, and non-ASCII characters are mapped to themselves. If your string contains Latin-1 or EBCDIC or UTF-8 it will not do the right thing. But that doesn't mean strings cannot contain those encodings, it just means that the .lower() method is not useful if they do. (Why ASCII? Because that is the system encoding in Python 2.)
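The garbage-in-garbage-out contract described here is still directly observable in Python 3's bytes type, which kept the ASCII-only behavior:

>>> b'HELLO'.lower()
b'hello'
>>> b'\xc9COLE'.lower()   # Latin-1 'É': non-ASCII bytes map to themselves
b'\xc9cole'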
I think we should just document how it behaves and not get hung up on what it is called. Mentioning UTF-16 is still useful because it indicates that some operations may act properly on surrogate pairs. (Also because of course character properties for BMP characters are respected, etc.)
Let's call those things graphemes (Tom C's term, I quite like leaving "character" ambiguous) -- they are sequences of multiple code points that represent a single "visual squiggle" (the kind of thing that you'd want to be swappable in vim with "xp" :-). I agree that APIs are needed to manipulate (match, generate, validate, mutilate, etc.) things at the grapheme level. I don't agree that this means a separate data type is required. There are ever-larger units of information encoded in text strings, with ever farther-reaching (and more vague) requirements on valid sequences. Do you want to have a data type that can represent (only valid) words in a language? Sentences? Novels? I think that at this point in time the best we can do is claim that Python (the language standard) uses either 16-bit code units or 21-bit code points in its string datatype, and that, thanks to PEP 393, CPython 3.3 and further will always use 21-bit code points (but Jython and IronPython may forever use their platform's native 16-bit code-unit string type). And then we add APIs that can be used everywhere to look for code points (even if the string contains code units), graphemes, or larger constructs. I'd like those APIs to be designed using a garbage-in-garbage-out principle, where if the input conforms to some Unicode requirement, the output does too, but if the input doesn't, the output does what makes most sense. Validation is then limited to codecs, and optional calls. If you index or slice a string, or create a string from chr() of a surrogate or from some other value that the Unicode standard considers an illegal code point, you better know what you are doing. I want chr(i) to be valid for all values of i in range(2**21), so it can be used to create a lone surrogate, or (on systems with 16-bit "characters") a surrogate pair. And also ord(chr(i)) == i for all i in range(2**21). I'm not sure about ord() on a 2-character string containing a surrogate pair on systems where strings contain 21-bit code points; I think it should be an error there, just as ord() on other strings of length != 1. But on systems with 16-bit "characters", ord() of strings of length 2 containing a valid surrogate pair should work. -- --Guido van Rossum (python.org/~guido)
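For concreteness, this is what that contract looks like on a wide (or PEP 393) interpreter; a sketch of expected behavior rather than a captured session, and on a 16-bit build the second call would instead return 0x12345:

>>> hex(ord(chr(0xD800)))      # a lone surrogate round-trips
'0xd800'
>>> ord('\ud808\udf45')        # a length-2 string is an error here
Traceback (most recent call last):
  ...
TypeError: ord() expected a character, but string of length 2 found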

On 8/31/2011 10:10 AM, Guido van Rossum wrote:
So if Python 3.3+ uses Unicode codepoints as its str representation, the analogy to ASCII and Python 2 would imply that it should permit out-of-range codepoints, if they can be represented in the underlying data values. Valid codecs would not create such on input, and valid codecs would not accept such on output. Operations on codepoints, like .lower(), should use the identity operation when applied to non-codepoints.
Interesting ideas. Once you break the idea that every code point must be directly indexed, and accept that higher-level concepts can be abstracted, appropriate codecs could produce a sequence of words instead of characters. It depends on the purpose of the application whether such is interesting or not. I've been working a bit with ebook searching algorithms lately, and one idea is to extract from the text a list of words, and represent the words with codes. Do the same for the search string. Then the search, instead of hunting for characters and character strings and skipping over punctuation, etc., can simply look for the appropriate sequence of word codes. In this case, part of the usefulness of the abstraction is the elimination of punctuation, so it is more of an index to the character text rather than an encoding of it... but if the encoding of the text extracted words, the creation of the index would then be extremely simple. I don't have applications in mind where representing sentences or novels would be particularly useful, but representing words could be extremely useful. Valid words? Given a language (or languages) and dictionary (or dictionaries), words could be flagged as valid or invalid for that dictionary. Representing invalid words could be similar to the idea of representing invalid UTF-8 bytes using the lone-surrogate error handler... possible when the application requests such.
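A minimal sketch of that word-coding idea (all names hypothetical): build the word-to-code table while extracting words, then search the code sequence directly, punctuation already gone:

import re

def word_codes(text, table):
    # Assign each newly seen word the next small integer code.
    return [table.setdefault(w, len(table))
            for w in re.findall(r"\w+", text.lower())]

table = {}
book = word_codes("To be, or not to be: that is the question.", table)
query = word_codes("not to be", table)
n = len(query)
hits = [i for i in range(len(book) - n + 1) if book[i:i + n] == query]
# hits == [3]: the match starts at the 4th word; punctuation never mattered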
So limiting the code point values to 21-bits (wasting 11 bits) only serves to prevent applications from using those 11 bits when they have extra-Unicode values to represent. There is no shortage of 32-bit datatypes to draw from, though, but it seems an unnecessary constraint if exact conformance to Unicode is not provided... conforming codecs wouldn't create such values on input nor accept them on output, so the constraint only serves to restrict applications from using all 32-bits of the underlying storage.
Yep. So str != Unicode. You keep saying that :) And others point out how some applications would benefit from encapsulating the complexities of Unicode semantics at various higher levels of abstraction. Sure, it can be tacked on by adding complex access methods to a subtype of str, but O(1) indexing of those higher abstractions is lost when that route is chosen.

On 8/31/2011 1:10 PM, Guido van Rossum wrote:
This will be a great improvement. It was both embarrassing and frustrating to have to respond to Tom C.'s (and other's) issue with "Our unicode type is too vaguely documented to tell whether you are reporting a bug or making a feature request."
As I said on the tracker, our narrow builds are in-between (while moving closer to UTF-16), and both terms are deceptive, at least to some.
Good analogy.
I presume by 'separate data type' you mean a base level builtin class like int or str and that you would allow for wrapper classes built on top of str, as such are not really 'separate'. For grapheme level and higher, we should certainly start with wrappers and probably with alternate versions based on different strategies.
Actually, it is range(0x110000) == range(1114112), so that UTF-8 uses at most 4 bytes per codepoint. 21 bits is 20.1 bits rounded up.
for i in range(0x110000):  # 1114112
    if ord(chr(i)) != i:
        print(i)
# prints nothing (on Windows)
And now does, thanks to whoever fixed this (within the last year, I think). -- Terry Jan Reedy

On Thu, Sep 1, 2011 at 8:02 AM, Terry Reedy <tjreedy@udel.edu> wrote:
We should probably just explicitly document that the internal representation in narrow builds is a UCS-2/UTF-16 hybrid - like UTF-16, it can handle the full code point space, but, like UCS-2, it allows code unit sequences (such as lone surrogates) that strict UTF-16 would reject. Perhaps we should also finally split strings out to a dedicated section on the same tier as Sequence types in the library reference. Yes, they're sequences, but they're also so much more than that (try as you might, you're unlikely to be successful in ducktyping strings the way you can sequences, mappings, files, numbers and other interfaces. Needing a "real string" is even more common than needing a "real dict", especially after the efforts to make most parts of the interpreter that previously cared about the latter distinction accept arbitrary mapping objects). I've created http://bugs.python.org/issue12874, suggesting that the "Sequence Types" and "memoryview type" sections could be usefully rearranged as:

Sequence Types - list, tuple, range
Text Data - str
Binary Data - bytes, bytearray, memoryview

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Where I cut your words, we are in 100% agreement. (FWIW :-) Guido van Rossum writes:
On Tue, Aug 30, 2011 at 11:03 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
I agree. I'm sorry that I didn't make myself clear. The reason I read "process" as "module" is that some modules of Python, and therefore Python as a whole, cannot conform to the Unicode standard. Eg, anything that inputs or outputs bytes. Therefore only "modules" and "types" can be asked to conform. (I don't think it makes sense to ask anything lower level to conform. See below where I comment on your .lower() example.) What I am advocating (for the long term) is provision of *one* module (or type) such that if the text processing done by the application is done entirely in terms of this module (type), it will conform (to some specified degree, chosen to balance user wants with implementation and support costs). It may be desirable to provide others for sufficiently important particular use cases, but at present I see a clear need for *one*. Unicode conformance is going to be a common requirement for apps used by global enterprises. I oppose trying to make str into that type. We need str, just as it is, for many reasons.
Certainly a group of cooperating modules could form a conforming process, just as you describe it for one example. The "one module" mentioned above need not implement everything internally, but it would take responsibility for providing guarantees (eg, unit tests) of whatever conformance claims it makes.
In the concrete implementation I have in mind, surrogate pairs are represented by a str containing 2 code units. But in that case s[i][1] is an error, and s[i][0] == s[i]. print(s[i][0]) and print(s[i]) will print the same character to the screen. If you decode it to bytes, well, it's not a str any more so what have you proved? Ie, what you will see is *code points* not in the BMP. You don't have to agree that such "surrogate containment" behavior is so valuable as I think it is, but that's what I have in mind as one requirement for a "conforming implementation of UTF-16".
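A sketch of that "surrogate containment" behavior as a toy wrapper (hypothetical class; it scans in O(n) for brevity where a real implementation would index in O(1)):

class U16Str:
    # A sequence of UTF-16 code units whose indexing never splits a
    # valid surrogate pair: s[i] is always a whole code point.
    def __init__(self, units):
        self._u = units

    def __eq__(self, other):
        return isinstance(other, U16Str) and self._u == other._u

    def __getitem__(self, i):
        k = 0
        for _ in range(i):    # skip i whole code points
            k += 2 if '\ud800' <= self._u[k] <= '\udbff' else 1
        if '\ud800' <= self._u[k] <= '\udbff':
            return U16Str(self._u[k:k + 2])    # return the whole pair
        return U16Str(self._u[k])

So for s = U16Str('a\ud808\udf45'), s[1][0] == s[1] holds while s[1][1] raises IndexError, matching the behavior described above.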
I don't think that it's a good idea to go for conformance at the method level. It would be a feature for apps that don't claim full conformance because they nevertheless give good results in more cases. The downside will be Python apps using str that will pass conformance tests written for, say, Western Europe, but end users in Kuwait and Kuala Lumpur will report bugs.
Sure. I think that approach is fine for str, too, except that I would hope it looks up BMP base characters in the case-mapping database. The fact is that with very few exceptions non-BMP characters are going to be symbols (mathematical operators and emoticons, for example). This is good enough, except when it's not---but when it's not, only 100% conformance is really a reasonable target. IMO, of course.
I think we should just document how it behaves and not get hung up on what it is called. Mentioning UTF-16
If you also say, "this type can represent all characters in Unicode, as well as certain non-characters", why mention UTF-16 at all?
Let's call those things graphemes (Tom C's term, I quite like leaving "character" ambiguous)
OK, but those definitions need to be made clear, as "grapheme cluster" and "combined character" are defined in the Unicode standard, and in fact mean slightly different things from each other.
Clear enough.
No, and I can tell you why! The difference between characters and words is much more important than that between code point and grapheme cluster for most users and the developers who serve them. Even small children recognize typographical ligatures as being composite objects, while at least this Spanish-as-a-second-language learner was taught that `ñ' is an atomic character represented by a discontiguous glyph, like `i', and it is no more related to `n' than `m' is. Users really believe that characters are atomic. Even in the cases of Han characters and Hangul, users think of the characters as being "atomic," but in the sense of Bohr rather than that of Democritus. I think the situation for text processing is analogous to chemistry where the atom, with a few fairly gross properties (the outer electron orbitals) is the fundamental unit, not the elementary particles like electrons and protons and structures like inner orbitals. Sure, there are higher order structures like molecules, phases, and crystals, but it is elements that have the most regular and simply described behavior for the chemist, and it does not become any simpler for the chemist if you decompose the atom. The composed character or grapheme cluster is the analogue of the atom for most processing at the level of "text". The only real exceptions I can imagine are in the domain of linguistics.
Clear enough. I disagree that that will be enough for constructing large-scale Unicode-conformant applications. Somebody is going to have to produce batteries for those applications, and I think they should be included in Python. I agree that it's proper that I and those who think the same way take responsibility for writing and implementing a PEP.
I think that's like asking a toddler to know that the stove is hot. The consequences for the toddler of her ignorance are much greater, but the informational requirement is equally stringent. Of course application writers are adults who could be asked to learn, but economically I think it makes a lot more sense to include those batteries. IMHO YMMV, obviously.
I want chr(i) to be valid for all values of i in range(2**21),
I quite agree (ie, for str). Thus I perceive a need for another type.

On Thu, Sep 1, 2011 at 12:13 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Where I cut your words, we are in 100% agreement. (FWIW :-)
Not quite the same here, but I don't feel the need to have the last word. Most of what you say makes sense, in some cases we'll quibble later, but there are a few points where I have something to add:
True -- in fact I didn't know that ff and ffl ligatures *existed* until I learned about Unix troff.
Ah, I think this may very well be culture-dependent. In Holland there are no Dutch words that use accented letters, but the accents are known because there are a lot of words borrowed from French or German. We (the Dutch) think of these as letters with accents and in fact we think of the accents as modifiers that can be added to any letter (at least I know that's how I thought about it -- perhaps I was also influenced by the way one had to type those on a mechanical typewriter). Dutch does have one native use of the umlaut (though it has a different name, I forget which, maybe trema :-), when there are two consecutive vowels that would normally be read as a special sound (diphthong?). E.g. in "koe" (cow) the oe is two letters (not a single letter formed of two distinct shapes!) that mean a special sound (roughly KOO). But in a word like "coëxistentie" (coexistence) the o and e do not form the oe-sound, and to emphasize this to Dutch readers (who believe their spelling is very logical :-), the official spelling puts the umlaut on the e. This is definitely thought of as a separate mark added to the e; ë is not a new letter. I have a feeling it's the same way for the French and Germans, but I really don't know. (Antoine? Georg?) Finally, my guess is that the Spanish emphasis on ñ as a separate letter has to do with teaching how it has a separate position in the localized collation sequence, doesn't it? I'm also curious if ñ occurs as a separate character on Spanish keyboards. -- --Guido van Rossum (python.org/~guido)

Le jeudi 01 septembre 2011 à 08:45 -0700, Guido van Rossum a écrit :
Indeed, they are not separate "letters" (they are considered the same in lexicographic order, and the French alphabet has 26 letters). But I'm not sure how it's relevant, because you can't remove an accent without most likely making a spelling error, or at least changing the meaning. Accents are very much part of the language (while ligatures like "ff" are not, they are a rendering detail). So I would consider "é", "ê", "ù", etc. atomic characters for the purpose of processing French text. And I don't see how a decomposed form could help an application. Regards Antoine.

On Thu, Sep 1, 2011 at 9:03 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
The example given was someone who didn't agree with how a particular font rendered those accented characters. I agree that's obscure though. I recall long ago that when the French wrote words in all caps they would drop the accents, e.g. ECOLE. I even recall (through the mists of time) observing this in Paris on public signs. Is this still the convention? Maybe it only was a compromise in the time of Morse code? -- --Guido van Rossum (python.org/~guido)

I think it is tolerated, partly because typing support (on computers and typewriters) has been weak. On a French keyboard, you have an "é" key, but shifting it gives you "2", not "É". The latter can be obtained using the Caps Lock key under Linux, but not under Windows. (so you could also write Éric's name "Eric", for example) That said, most typography nowadays seems careful to keep the accents on uppercase letters (e.g. on book covers; AFAIR, road signs also keep the accents, but I'm no driver). Regards Antoine.

Antoine Pitrou, 01.09.2011 18:46:
AFAIR, road signs also keep the accents, but I'm no driver
Right, I noticed that, too. That's certainly not uncommon. I think it's mostly because of local pride (after all, the road signs are all that many drivers ever see of a city), but sometimes also because it can't be helped when the name gets a different meaning without accents. People just cause too many accidents when they burst out laughing while entering a city by car. Stefan

Guido van Rossum, 01.09.2011 18:31:
So does the German alphabet, even though that does not include "ß", which basically descended from a ligature of the old German way of writing "sz", where "s" looked similar to an "f" and "z" had a low hanging tail. IIRC, German Umlaut letters are lexicographically sorted according to their emergency replacement spelling ("ä" -> "ae"), which is also sometimes used in all upper case words ("Glück" -> "GLUECK"). I guess that's because Umlaut dots are harder to see on top of upper case letters. So, Latin-1 byte value sorting always yields totally wrong results. That aside, Umlaut letters are commonly considered separate letters, different from the undotted letters and also different from the replacement spellings. I, for one, always found the replacements rather weird and never got used to using them in upper case words. In any case, it's wrong to always use them, and it makes text harder to read.
Yes, and it's a huge problem when trying to pronounce last names. In French, you'd commonly write LASTNAME, Firstname and if LASTNAME happens to have accented letters, you'd miss them when reading that. I know a couple of French people who severely suffer from this, because the pronunciation of their name gets a totally different meaning without accents. Stefan

Guido van Rossum wrote:
This page features a number of French street signs in all-caps, and some of them have accents: http://www.happymall.com/france/paris_street_signs.htm -- Greg

Antoine Pitrou wrote:
On the other hand, the same doesn't necessarily apply to other languages. (At least according to Wikipedia.) http://en.wikipedia.org/wiki/Diacritic#Languages_with_letters_containing_dia... -- Steven

On Sep 1, 2011, at 9:30 PM, Steven D'Aprano wrote:
For example, in Serbo-Croatian (Serbian, Croatian, Bosnian, Montenegrin, if you want), each of the following letters represents one distinct sound of the language. In the Serbian Cyrillic alphabet, they are distinct symbols. In the Latin alphabet, the corresponding letters are formed with diacritics because the alphabet is shorter.

Letter  Approximate pronunciation        Cyrillic
------  -------------------------        --------
č       tch in butcher                   ч
ć       ch in chapter, but softer        ћ
dž      j in jump                        џ
đ       j in juice                       ђ
š       sh in ship                       ш
ž       s in pleasure, measure, ...      ж

The language has 30 sounds and the corresponding 30 letters. See the count of the letters in these tables:
- http://hr.wikipedia.org/wiki/Hrvatska_abeceda
- http://sr.wikipedia.org/wiki/Азбука

Diacritics are used in grammar books and in print (occasionally) to distinguish between four different accents of the language:
- long rising: á,
- short rising: à,
- long falling: ȃ (inverted breve, *not* a circumflex â), and
- short falling: ȁ,
especially when the words that use the same sounds -- thus, spelled with the same letters -- are next to each other. The accents are used to change the intonation of the whole word, not to change the sound of the letter. For example: "Ja sam sȃm." -- "I am alone." Both words "sam" contain the "a" sound, but the first one is pronounced short. As a form of the verb "to be" it's an enclitic that takes the accent of the preceding word "I". The second one is pronounced with a long falling accent. The macron can be used to indicate the length of a *non-stressed* vowel, e.g. ā, but is usually unnecessary in standard print.

Many languages use alphabets that are not suitable to their sound system. The speakers of these languages adapted alphabets to their sounds either by using letters with distinct shapes (Cyrillic letters above), or adding diacritics to an existing shape (Latin letters above). The new combined form is a distinct letter. These letters have separate sections in dictionaries and a sorting order. The diacritics that indicate an accent or length are used only above vowels and do *not* represent distinct letters.

Best regards, Zvezdan Petković

P.S. Since I live in the USA, the last letter of my surname is *wrongly* spelled (ć -> c) and pronounced (ch -> k) most of the time. :-)

On 9/1/2011 11:45 AM, Guido van Rossum wrote:
typewriter). Dutch does have one native use of the umlaut (though it has a different name, I forget which, maybe trema :-),
You remember correctly. According to https://secure.wikimedia.org/wikipedia/en/wiki/Trema_%28diacritic%29 'trema' (Greek 'hole') is the generic name of the double-dot vowel diacritic. It was originally used for 'diaeresis' (Greek, 'taking apart') when it shows "that a vowel letter is not part of a digraph or diphthong". (Note that 'ae' in diaeresis *is* a digraph ;-). Germans later used it to indicate umlaut, 'changed sound'.
So the above is trema-diaeresis. "Dutch, French, and Spanish make regular use of the diaeresis." English uses such as 'coöperate' have become rare or archaic, perhaps because we cannot type them. Too bad, since people sometimes use '-' to serve the same purpose. -- Terry Jan Reedy

Terry Reedy wrote:
Too bad, since people sometimes use '-' to serve the same purpose.
Which actually seems more logical to me -- a separating symbol is better placed between the things being separated, rather than over the top of one of them! Maybe we could compromise by turning the diaeresis on its side: co:operate -- Greg

Guido van Rossum writes:
On Thu, Sep 1, 2011 at 12:13 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
I'm not an expert, but I'm fairly sure it is. Specifically, I heard from a TeX-ie friend that the same accented letter is typeset (and collated) differently in different European languages because in some of them the accent is considered part of the letter (making a different character), while in others accents modify a single underlying character. The ones that consider the letter and accent to constitute a single character also prefer to leave less space, he said.
American English has the same usage, but it's optional (in particular, you'll see naive, naif, and words like coordinate typeset that way occasionally, for the same reason I suppose). As Hagen Fürstenau points out, with multiple combining characters, there are even more complex possibilities than "the accent is part of the character" and "it's really not", and they may be application-dependent.
You'd have to ask Mr. Gonzalez. I suspect he may have taught that way less because of his Castellano upbringing, and more because of the infamous lack of sympathy of American high school students for the fine points of usage in foreign languages.
I'm also curious if ñ occurs as a separate character on Spanish keyboards.
If I'm reading /usr/share/X11/xkb/symbols/es correctly, it does in X.org: the key that for English users would map to ASCII tilde.

If you look at Wikipedia, it says: “El alfabeto español consta de 27 letras”. The Ñ is separate from the N (and so is it in my French-Spanish dictionary). The accented letters, however, are not considered separately. http://es.wikipedia.org/wiki/Alfabeto_espa%C3%B1ol (I can't tell you how annoying "ñ" is to type when the tilde is accessed using AltGr + 2 and you have to combine that with the Compose key and N to obtain the full character. I'm sure Spanish keyboards have a better way than that :-)) Regards Antoine.

On 09/01/2011 02:54 PM, Antoine Pitrou wrote:
FWIW, I was taught that Spanish had 30 letters in the alfabeto: the 'ñ', plus 'ch', 'll', and 'rr' were all considered distinct characters. Kids-these-days'ly, Tres. -- Tres Seaver +1 540-429-0999 tseaver@palladion.com Palladion Software "Excellence by Design" http://palladion.com

On Thu, 01 Sep 2011 12:38:07 -0700 Ethan Furman <ethan@stoneleaf.us> wrote:
That Wikipedia article also says: “Los dígrafos Ch y Ll tienen valores fonéticos específicos, y durante los siglos XIX y XX se ordenaron separadamente de C y L, aunque la práctica se abandonó en 1994 para homogeneizar el sistema con otras lenguas.” -> roughly: “the "Ch" and "Ll" digraphs have specific phonetic values, and during the 19th and 20th centuries they were ordered separately from C and L, but this practice was abandoned in 1994 in order to make the system consistent with other languages.” And about "rr": “El dígrafo rr (llamado erre, /'ere/, y pronunciado /r/) nunca se consideró por separado, probablemente por no aparecer nunca en posición inicial.” -> “the "rr" digraph was never considered separate, probably because it never appears at the very beginning of a word.” Regards Antoine.

Tres Seaver writes:
FWIW, I was taught that Spanish had 30 letters in the alfabeto: the 'ñ', plus 'ch', 'll', and 'rr' were all considered distinct characters.
That was always a Castellano vs. Americano issue, IIRC. As I wrote, Mr. Gonzalez was Castellano. I believe that the deprecation of the digraphs as separate letters occurred as the telephone became widely used in Spain, and the telephone company demanded an official proclamation from whatever Ministry is responsible for culture that it was OK to treat the digraphs as two letters (specifically, to collate them that way), so that they could use the programs that came with the OS. So this stuff is not merely variant by culture, but also by economics and politics. :-/

On 9/1/2011 11:59 PM, Stephen J. Turnbull wrote:
The main 'standards body' for Spanish is the Real Academia Española in Madrid, which works with the 21 other members of the Asociación de Academias de la Lengua Española. https://secure.wikimedia.org/wikipedia/en/wiki/Real_Academia_Española https://secure.wikimedia.org/wikipedia/en/wiki/Association_of_Spanish_Language_Academies While it has apparently been criticized as 'conservative' (which it well ought to be), it has been rather progressive in promoting changes such as 'ph' to 'f' (fisica, fone) and dropping silent 'p' in leading 'psi' (sicologia) and silent 's' in leading 'sci' (ciencia). -- Terry Jan Reedy

Terry Reedy wrote:
I find it curious that pronunciation always seems to take precedence over spelling in campaigns like this. Nowadays, especially with the internet increasingly taking over from personal interaction, we probably see words written a lot more often than we hear them spoken. Why shouldn't we change the pronunciation to match the spelling rather than the other way around? -- Greg

On 09/01/2011 11:59 PM, Stephen J. Turnbull wrote:
From a casual web search, it looks as though the RAE didn't legislate "letterness" away from the digraphs (as I learned them) until 1994 (about 25 years after I learned the 30-letter alfabeto).
Lovely. :) Tres. -- Tres Seaver +1 540-429-0999 tseaver@palladion.com Palladion Software "Excellence by Design" http://palladion.com

Guido van Rossum writes:
In the original definition of UCS-2 in draft ISO 10646 (1990), everything in the BMP except for 0xFFFF and 0xFFFE was a character, and there was no concept of "surrogate" at all. Later in ISO 10646 (1993)[1], the Surrogate Area was carved out of the Private Area, but UCS-2 implementations simply treat them as (single) characters with special properties. This was more or less backward compatible as all corporate uses of the private area used the lower code points and didn't conflict with the surrogates. Finally (in 2000 or 2003) the definition of UCS-2 in ISO 10646 was revised in a backward- incompatible way to exclude surrogates entirely, ie, nowadays it is a range-restricted version of UTF-16. Footnotes: [1] IIRC, strictly speaking this was done slightly later (1993 or 1994) in an official Amendment to ISO 10646; the Amendment was incorporated into the standard in 2000.

On Sat, 27 Aug 2011 12:17:18 +1200 Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
It also depends a lot on *actual* measured performance. As someone mentioned in the tracker, the index you use on a string usually comes from a previous string operation (like a search), perhaps with a small offset. So a caching scheme may actually give very good results with a rather small overhead (you could cache, say, the 4 most recent indices and choose the nearest when an indexing operation is done; with utf-8, scanning backward and forward is equally simple). Regards Antoine.
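A sketch of that caching scheme over a UTF-8 buffer (hypothetical class, no error handling; it relies only on the fact that UTF-8 continuation bytes look like 0b10xxxxxx, so forward and backward scanning are equally easy):

class CachedUTF8:
    # Keep the 4 most recent (char_index, byte_offset) pairs and scan
    # from whichever cached position is nearest to the requested index.
    def __init__(self, data):
        self.data = data          # bytes, assumed to be valid UTF-8
        self.cache = [(0, 0)]

    def _byte_offset(self, index):
        char, byte = min(self.cache, key=lambda p: abs(p[0] - index))
        step = 1 if index >= char else -1
        while char != index:
            byte += step
            while 0 <= byte < len(self.data) and \
                    self.data[byte] & 0xC0 == 0x80:
                byte += step    # skip continuation bytes
            char += step
        self.cache = (self.cache + [(index, byte)])[-4:]
        return byte

    def __getitem__(self, index):
        start = self._byte_offset(index)
        end = start + 1
        while end < len(self.data) and self.data[end] & 0xC0 == 0x80:
            end += 1
        return self.data[start:end].decode('utf-8')

After s = CachedUTF8('naïve café'.encode()), s[3] scans once from the start, and the follow-up s[4] moves a single character from the cached offset.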

On 8/26/2011 8:23 PM, Antoine Pitrou wrote:
Amen. Some regard O(n*n) sorts to be, by definition, 'worse' than O(n*log n). I even read that in an otherwise good book by a university professor. Fortunately for Python users, Tim Peters ignored that 'wisdom', coded the best O(n*n) sort he could, and then *measured* to find out what was better for what types and lengths of arrays. So now we have a list.sort that sometimes beats the pure O(n*log n) quicksort of C libraries. -- Terry Jan Reedy

Terry Reedy wrote:
A nice story, but Quicksort's worst case is O(n*n) too. http://en.wikipedia.org/wiki/Quicksort timsort is O(n) in the best case (all items already in order). You are right though about Tim Peters doing extensive measurements: http://bugs.python.org/file4451/timsort.txt If you haven't read the whole thing, do so. I am in awe -- not just because he came up with the algorithm, but because of the discipline Tim demonstrated in such detailed testing. A far cry from a couple of timeit runs on short-ish lists. -- Steven

Paul Moore writes:
[...]
They lose the O(1) guarantee, but that's easily defensible as a tradeoff to conform to underlying runtime semantics.
Unfortunately, I don't think it's all that easy to defend. Absent PEP 393 or a restriction to the characters in the BMP, this is a very expensive change, easily visible to interactive users, let alone performance-hungry applications. I personally do advocate the "array of code points" definition, but I don't use IronPython or Jython so PEP 393 is as close to heaven as I expect to get. OTOH, I also use Emacsen with Mule, and I have to admit that there is a perceptible performance hit in any large (>1 MB) buffer containing non-ASCII characters vs. pure ASCII (the code unit in Mule is 1 byte). I expect that if IronPython and Jython really want to retain native, code-unit-based representations, it's going to be painful to conform to an "array of code points" specification. There may need to be a compromise of the form "Implementations SHOULD provide an implementation of str that is both O(1) in indexing and an array of code points. Code that is Unicode-ly correct in Python implementing PEP 393 will need to be ported with some effort to implementations that do not satisfy this requirement, perhaps using different algorithms or extra libraries."

On Wed, Aug 24, 2011 at 1:22 AM, Stephen J. Turnbull <turnbull@sk.tsukuba.ac.jp> wrote:
Actually, the situation is that in narrow builds, they contain code units (which may have surrogates); in wide builds they contain code points. I think this is the crux of Tom Christiansen's complaints about narrow builds. Here's proof that narrow builds contain code units, not code points (i.e. use UTF-16, not UCS-2):

$ ./python
Python 2.7.2+ (2.7:498b03a55297, Aug 25 2011, 15:14:01)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
It's pretty clear that the interpreter is surrogate-aware, which to me indicates the use of UTF-16. Now in the PEP 393 branch:

./python
Python 3.3.0a0 (pep-393:c60556059719, Aug 25 2011, 15:31:05)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
And some proof that this branch does not care about surrogates:
However:

a = '\ud808\udf45'
Which to me merely shows it is smart when parsing string literals. (I expect that regular 3.3 narrow builds behave similarly to the 2.7 narrow build, and 3.3 wide builds behave similarly to the pep-393 build; I didn't have those lying around.) -- --Guido van Rossum (python.org/~guido)
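The quoted interpreter output did not survive the list archive; the contrast being drawn looks like the following (a reconstruction assuming U+12345 as the test character, not the original transcript):

# 2.7 narrow build: the code units are visible
>>> s = u'\U00012345'
>>> len(s)
2
>>> s[0]
u'\ud808'

# pep-393 branch: one code point, no visible surrogates
>>> s = '\U00012345'
>>> len(s)
1
>>> hex(ord(s))
'0x12345'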

Le 24/08/2011 02:46, Terry Reedy a écrit :
I don't think that using UTF-16 with surrogate pairs is really a big problem. A lot of work has been done to hide this. For example, repr(chr(0x10ffff)) now displays '\U0010ffff' instead of two characters. Ezio recently fixed the str.is*() methods in Python 3.2+. For len(str): it's a known problem, but if you really care about the number of *characters* and not the number of UTF-16 units, it's easy to implement your own character_length() function. len(str) gives the number of UTF-16 units instead of the number of characters for a simple reason: it's faster: O(1), whereas character_length() is O(n).
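Such a character_length() is only a few lines on a narrow build (a sketch; the name is Victor's hypothetical helper, not a stdlib function, and it assumes well-formed UTF-16):

def character_length(s):
    # Count code points by not counting the low (trailing) surrogate
    # of each pair -- O(n) in the number of code units.
    return sum(1 for u in s if not '\udc00' <= u <= '\udfff')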
Yeah, you can work around UTF-16 limits using O(n) algorithms. PEP 393 provides support for the full Unicode charset (U+0000-U+10FFFF) on all platforms with a small memory footprint and only O(1) functions. Note: Java and the Qt library also use UTF-16 strings and have exactly the same "limitations" for str[n] and len(str). Victor

On 8/24/2011 1:45 PM, Victor Stinner wrote:
Le 24/08/2011 02:46, Terry Reedy a écrit :
I greatly appreciate that he did. The str.* (lower, upper, title) methods apparently are not fixed yet, as the corresponding new tests are currently skipped for narrow builds.
It is O(1) after a one-time O(n) preprocessing, which is the same time order as creating the string in the first place. Anyway, I think the most important deficiency is with iteration:
LATIN SMALL LETTER A
LATIN SMALL LETTER B
LATIN SMALL LETTER C
Traceback (most recent call last):
  File "<pyshell#9>", line 2, in <module>
    print(name(c))
ValueError: no such name

This would work on wide builds but does not here (win7) because narrow build iteration produces a naked non-character surrogate code unit that has no specific entry in the Unicode Character Database. I believe that most new people who read "Strings contain Unicode characters." would expect string iteration to always produce the Unicode characters that they put in the string. The extra time per char needed to produce the surrogate pair that represents the character entered is O(1).
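The quoted loop was trimmed by the archive; a sketch of the kind of code that produces output like the above (assuming U+12345 as the fourth character):

from unicodedata import name
for c in 'abc\U00012345':
    print(name(c))
# On a narrow build, iteration yields the lone surrogate u'\ud808',
# which has no entry in the database, hence "ValueError: no such name".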
I presented O(log(number of non-BMP chars)) algorithms for indexing and slicing. For the mostly BMP case, that is hugely better than O(n).
PEP 393 provides support for the full Unicode charset (U+0000-U+10FFFF) on all platforms with a small memory footprint and only O(1) functions.
For Windows users, I believe it will nearly double the memory footprint if there are any non-BMP chars. On my new machine, I should not mind that in exchange for correct behavior. -- Terry Jan Reedy

On Wed, Aug 24, 2011 at 11:37 PM, Terry Reedy <tjreedy@udel.edu> wrote:
There are two reasons for this: 1) the str.is* methods get the string and return True/False, so it's enough to iterate on the string, combine the surrogates, and check if the result islower/upper/etc. Methods like lower/upper/etc., afaiu, currently get only a copy of the string, and modify that in place. The current macros advance to the next char during reading and writing, so it's not possible to use them to read/write from/to the same string. We could either change the macros to not advance the pointer [0] (and do it manually in the other functions like is*) or change the function to get the original string too. 2) I'm on vacation. Best Regards, Ezio Melotti [0]: for lower/upper/title it should be possible to modify the string in place, because these operations never convert a non-BMP char to a BMP one (and vice versa), so if two surrogates are read, two surrogates will be written after the transformation. I'm not sure this will work with all the methods though (e.g. str.translate).

[Apologies for sending out a long stream of pointed responses, written before I have fully digested this entire mega-thread. I don't have the patience today to collect them all into a single mega-response.] On Wed, Aug 24, 2011 at 10:45 AM, Victor Stinner <victor.stinner@haypocalc.com> wrote:
Note: Java and the Qt library use also UTF-16 strings and have exactly the same "limitations" for str[n] and len(str).
Which reminds me. The PEP does not say what other Python implementations besides CPython should do. Presumably Jython and IronPython will continue to use UTF-16, so presumably the language reference will still have to document that strings contain code units (not code points) and the objections Tom Christiansen raised against this will remain true for those versions of Python. (I don't know about PyPy, they can presumably decide when they start their Py3k port.) OTOH perhaps IronPython 3.3 and Jython 3.3 can use a similar approach and we can lay the narrow build issues to rest? Can someone here speak for them? -- --Guido van Rossum (python.org/~guido)

Guido wrote:
The biggest difficulty for IronPython here would be dealing w/ .NET interop. We can certainly introduce either an IronPython specific string class which is similar to CPython's PyUnicodeObject or we could have multiple distinct .NET types (IronPython.Runtime.AsciiString, System.String, and IronPython.Runtime.Ucs4String) which all appear as the same type to Python. But when Python is calling a .NET API it's always going to return a System.String which is UTF-16. If we had to check and convert all of those strings when they cross into Python it would be very bad for performance. Presumably we could have a 4th type of "interop" string which lazily computes this but if we start wrapping .NET strings we could also get into object identity issues. We could stop using System.String in IronPython altogether and say when working w/ .NET strings you get the .NET behavior and when working w/ Python strings you get the Python behavior. I'm not sure how weird and confusing that would be but conversion from an Ipy string to a .NET string could remain cheap if both were UTF-16, and conversions from .NET strings to Ipy strings would only happen if the user did so explicitly. But it's a huge change - it'll almost certainly touch every single source file in IronPython. I would think we'd get 3.2 done first and then think about what to do here.

Antoine Pitrou, 26.08.2011 12:51:
Why would PEP 393 apply to other implementations than CPython?
Not the PEP itself, just the implications of the result. The question was whether the language specification in a post PEP-393 can (and if so, should) be changed into requiring unicode objects to be defined based on code points. Narrow builds, as well as Jython and IronPython, currently deviate from this as they use UTF-16 as their native string encoding, which, for one, prevents O(1) indexing into characters as well as a direct match between length and character count (minus combining characters etc.). I think this discussion can safely be considered off-topic for this thread (which isn't exactly short enough to keep adding more topics to it). Stefan

Le vendredi 26 août 2011 02:01:42, Dino Viehland a écrit :
Python 3 encodes all Unicode strings to the OS encoding (and the result is decoded) for all syscalls and calls to libraries: to the locale encoding on UNIX, to UTF-16 on Windows. Currently, Py_UNICODE is wchar_t, which is 16 bits. So Py_UNICODE* is already a UTF-16 string. I don't know if the overhead of the PEP 393 (encode to UTF-16 on Windows) for these calls is important or not. But on UNIX, pure ASCII strings don't have to be encoded anymore if the locale encoding is UTF-8 or ASCII. IronPython can wait to see how CPython+PEP 393 handles these problems, and how much slower it is.
But it's a huge change - it'll almost certainly touch every single source file in IronPython.
With the PEP 393, it's transparent: PyUnicode_AS_UNICODE encodes the string to UTF-16 (allocating memory, etc.). Except that applications should now check if an error occurred (check for NULL).
I would think we'd get 3.2 done first and then think about what to do here.
I don't think that IronPython needs to support non-BMP characters without using surrogates. Bug reports about non-BMP characters usually don't have use cases, but just want to make Python perfect. There is no need to hurry. PEP 393 tries to reduce the memory footprint. The effect on non-BMP characters is just a *nice* side effect. Or was the PEP designed to solve narrow build issues? Victor

I have a different question about IronPython and Jython now. Do their regular expression libraries support Unicode better than CPython's? E.g. does "." match a surrogate pair? Tom C suggests that Java's regex libraries get this and many other details right despite Java's use of UTF-16 to represent strings. So hopefully Jython's re library is built on top of Java's? PS. Is there a better contact for Jython? -- --Guido van Rossum (python.org/~guido)

On Fri, Aug 26, 2011 at 3:00 PM, Guido van Rossum <guido@python.org> wrote: [...] I'll do my best to answer though: Java 5 added a bunch of methods for dealing with Unicode that doesn't fit into 2 bytes - and looking at our code for our Unicode object, I see that we are using methods like the codePointCount method off of java.lang.String to compute length[1] and using similar methods all through that code to make sure we deal in code points when dealing with unicode. So it looks pretty good for us as far as I can tell. [1] http://download.oracle.com/javase/6/docs/api/java/lang/String.html#codePoint..., int) -Frank Wierzbicki

Oops, forgot to add the link for the gory details for Java and > 2 byte unicode: http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

On 9/8/2011 6:15 PM, fwierzbicki@gmail.com wrote:
Oops, forgot to add the link for the gory details for Java and > 2 byte unicode:
http://java.sun.com/developer/technicalArticles/Intl/Supplementary/
This is dated 2004. Basically, they considered several options, tried out 4, and ended up sticking with char[] (sequences) as UTF-16 with char = 16-bit code unit, and added a 32-bit Character(int) class for low-level manipulation of code points. I did not see the indexing problem mentioned. I get the impression that they encourage sequence forward-backward iteration (cursor-based access) rather than random-access indexing. -- Terry Jan Reedy

On Fri, Sep 9, 2011 at 10:16 AM, Terry Reedy <tjreedy@udel.edu> wrote:
There aren't docs, but the code is here: https://bitbucket.org/jython/jython/src/8a8642e45433/src/org/python/core/PyU... Here are (I think) the most relevant bits for random access -- note that getString() returns the internal representation of the PyUnicode which is a java.lang.String

@Override
protected PyObject pyget(int i) {
    if (isBasicPlane()) {
        return Py.makeCharacter(getString().charAt(i), true);
    }
    int k = 0;
    while (i > 0) {
        int W1 = getString().charAt(k);
        if (W1 >= 0xD800 && W1 < 0xDC00) {
            k += 2;
        } else {
            k += 1;
        }
        i--;
    }
    int codepoint = getString().codePointAt(k);
    return Py.makeCharacter(codepoint, true);
}

public boolean isBasicPlane() {
    if (plane == Plane.BASIC) {
        return true;
    } else if (plane == Plane.UNKNOWN) {
        plane = (getString().length() == getCodePointCount()) ? Plane.BASIC : Plane.ASTRAL;
    }
    return plane == Plane.BASIC;
}

public int getCodePointCount() {
    if (codePointCount >= 0) {
        return codePointCount;
    }
    codePointCount = getString().codePointCount(0, getString().length());
    return codePointCount;
}

-Frank

I, for one, am very interested. It sounds like the 'unicode' datatype in Jython does not in fact have O(1) indexing characteristics if the string contains any characters in the astral plane. Interesting. I wonder if you have heard from anyone about this affecting their app's performance? --Guido On Fri, Sep 9, 2011 at 12:58 PM, fwierzbicki@gmail.com <fwierzbicki@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido)

Well, I'd be interested in how it goes, since if Jython users find this acceptable then maybe we shouldn't be quite so concerned about it for CPython... On the third hand we don't have working code for this approach in CPython, while we do have working code for the PEP 393 solution... --Guido On Fri, Sep 9, 2011 at 3:38 PM, fwierzbicki@gmail.com <fwierzbicki@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido)

On 9/9/2011 5:21 PM, Guido van Rossum wrote:
The question is whether or how often any Jython users are yet indexing/slicing long strings with astral chars. If a utf-8 xml file is directly parsed into a DOM, then the longest decoded strings will be 'paragraphs' that are seldom more than 1000 chars.
This is O(1)
This is an O(n) linear scan.
Near the beginning of this thread, I described and gave a link to my O(log k) algorithm, where k is the number of supplementary ('astral') chars. It uses bisect.bisect_left on an int array of length k constructed with a linear scan much like the one above, with one added line. The basic idea is to do the linear scan just once and save the locations (code point indexes) of the astral chars instead of repeating the scan on every access. That could be done as the string is constructed. The same array search works for slicing too. Jython is welcome to use it if you ever decide you need it. I have in mind to someday do some timing tests with the Python version. I just do not know how close the results would be to those for compiled C or Java. -- Terry Jan Reedy

On Tue, Aug 23, 2011 at 08:15, Antoine Pitrou <solipsis@pitrou.net> wrote:
So why would you need three separate implementations of the unrolled loop? You already have a macro named WRITE_FLEXIBLE_OR_WSTR.
The WRITE_FLEXIBLE_OR_WSTR macro does a check for kind and then writes. Using this macro for the fast path would be inefficient; to have a real fast path, you would need an outer if to check for kind and then in each condition body the matching access to the string (1, 2, or 4 bytes), and for each body also write 4 or 8 times (guarded by #ifdef, depending on platform). As all these cases bloated up the C code, we went for the simple solution with the goal of profiling the code again afterwards to see where the new performance bottlenecks would be.
To me this feels like it would complicate the C source code and decrease readability. For each function you would need a wrapper which does the kind-checking logic and then, in a separate file, the implementation of the function, which then gets included three times, once for each character width. Regards, Torsten

On 8/23/2011 6:20 AM, "Martin v. Löwis" wrote:
I fully support the declared purpose of the PEP, which I understand to be to have a full, correct Unicode implementation on all new Python releases without paying unnecessary space (and consequent time) penalties. I think the erroneous length, iteration, indexing, and slicing for strings with non-BMP chars in narrow builds needs to be fixed for future versions. I think we should at least consider alternatives to the PEP 393 solution of doubling or quadrupling space if needed for at least one char.

In utf16.py, attached to http://bugs.python.org/issue12729 I propose for consideration a prototype of a different solution to the 'mostly BMP chars, few non-BMP chars' case. Rather than expand every character from 2 bytes to 4, attach an array cpdex of character (i.e. code point, not code unit) indexes. Then for indexing and slicing, the correction is simple, simpler than I first expected:

code-unit-index = char-index + bisect.bisect_left(cpdex, char-index)

where code-unit-index is the adjusted index into the full underlying double-byte array. This adds a time penalty of log2(len(cpdex)), but avoids most of the space penalty and the consequent time penalty of moving more bytes around and increasing cache misses.

I believe the same idea would work for utf8 and the mostly-ascii case. The main difference is that non-ascii chars have various byte sizes rather than the 1 extra double-byte of non-BMP chars in UCS2 builds. So the offset correction would not simply be the bisect-left return but would require another lookup:

byte-index = char-index + offsets[bisect-left(cpdex, char-index)]

If possible, I would have the with-index-array versions be separate subtypes, as in utf16.py. I believe either index-array implementation might benefit from a subtype for single multi-unit chars, as a single non-ASCII or non-BMP char does not need an auxiliary [0] array and a senseless lookup therein but does need its length fixed at 1 instead of the number of base array units. -- Terry Jan Reedy
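The whole correction fits in a few lines (a sketch following utf16.py; the names are illustrative):

import bisect

def build_cpdex(chars):
    # Code point indexes of the astral (non-BMP) characters, collected
    # in the same O(n) pass that could build the string itself.
    return [i for i, c in enumerate(chars) if ord(c) > 0xFFFF]

def unit_index(cpdex, char_index):
    # Each earlier astral char occupies one extra UTF-16 code unit.
    return char_index + bisect.bisect_left(cpdex, char_index)

cpdex = build_cpdex('a\U00012345bc')   # -> [1]
unit_index(cpdex, 2)                   # -> 3: 'b' is the 4th code unit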

On 8/23/2011 5:46 PM, Terry Reedy wrote:
So am I correctly reading between the lines when, after reading this thread so far and the complete issue discussion so far, I see a PEP 393 revision or replacement with the following characteristics?

1) Narrow builds are dropped. The conceptual idea of PEP 393 eliminates the need for narrow builds, as the internal string data structures adjust to the actuality of the data. If you want a narrow build, just don't use code points > 65535.

2) There are more, or different, internal kinds of strings, which affect the processing patterns. Here is an enumeration of the ones I can think of, as complete as possible, with recognition that benchmarking and clever algorithms may eliminate the need for some of them.

a) all ASCII
b) latin-1 (8-bit codepoints, the first 256 Unicode codepoints) This kind may not be able to support a "mostly" variation, and may be no more efficient than case a). But it might also be popular in parts of Europe :) And appropriate benchmarks may discover whether or not it has worth.
c) mostly ASCII (utf8) with clever indexing/caching to be efficient
d) UTF-8 with clever indexing/caching to be efficient
e) 16-bit codepoints
f) UTF-16 with clever indexing/caching to be efficient
g) 32-bit codepoints
h) UTF-32

When instantiating a str, a new parameter or subtype would restrict the implementation to using only a), b), d), f), and h) when fully conformant Unicode behavior is desired. No lone surrogates, no out-of-range code points, no illegal codepoints. A default str would prefer a), b), c), e), and g) for efficiency and flexibility. When manipulations outside of Unicode are necessary [Windows seems to use e) for example, suffering from the same sorts of backward compatibility problems as Python, in some ways], the default str type would permit them, using e) and g) kinds of representations. Although the surrogate escape codec only uses prefix surrogates (or is it only suffix ones?) which would never match up, note that a conversion from 16-bit codepoints to other formats may produce matches between the results of the surrogate escape codec and other unchecked data introduced by the user/program.

A method should be provided to validate and promote a string from the default, unchecked str type to the subtype or variation that enforces Unicode, if it qualifies; if it doesn't qualify, an exception would be raised by the method. (This could generally be done in place if the value is bound to only a single variable, but would generate a copy and rebind the variable to the promoted copy if it is multiply referenced?)

Another parameter or subtype of the conformant str would add grapheme support, which has a different set of rules for the clever indexing/caching, but could be applied to any of a)†, c)†, d), f), or h).

† It is unnecessary to apply clever indexing/caching to a) and c) kinds of string internals, because there is a one-to-one mapping between bytes, codepoints, and graphemes in these ranges. So plain array indexing can be used in the implementation of these kinds.

PEP 393 already drops narrow builds.
2) There are more, or different, internal kinds of strings, which affect the processing patterns.
This is the basic idea of PEP 393.
These two cases are already in PEP 393.
c) mostly ASCII (utf8) with clever indexing/caching to be efficient d) UTF-8 with clever indexing/caching to be efficient
I see neither a need nor a means to consider these.
e) 16-bit codepoints
These are in PEP 393.
f) UTF-16 with clever indexing/caching to be efficient
Again, -1.
g) 32-bit codepoints
This is in PEP 393.
h) UTF-32
What's that, as opposed to g)? I'm not open to revising PEP 393 in the direction of adding more representations. Regards, Martin

On 8/24/2011 1:18 AM, "Martin v. Löwis" wrote:
I'd forgotten that.
Agreed.
Sure. Wanted to enumerate all, rather than just add-ons.
The discussion about "mostly ASCII" strings seems convincing: there could be significant space savings if such were implemented.
This is probably the one I would pick as least likely to be useful if the rest were implemented.
g) would permit codes greater than U+10FFFF and would permit the illegal codepoints and lone surrogates. h) would be strict Unicode conformance. Sorry that the 4 paragraphs of explanation that you didn't quote didn't make that clear.
I'm not open to revising PEP 393 in the direction of adding more representations.
It's your PEP.

Le 24/08/2011 11:22, Glenn Linderman a écrit :
Antoine's optimization in the UTF-8 decoder has been removed. It doesn't change the memory footprint; it just makes creating the Unicode object slower. When you decode a UTF-8 string:

- the "abc" string uses "latin1" (8-bit) units
- the "aé" string uses "latin1" (8-bit) units <= cool!
- the "a€" string uses UCS2 (16-bit) units
- the "a\U0010FFFF" string uses UCS4 (32-bit) units

Victor
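One way to observe the chosen representation indirectly is the per-character object size (a sketch, assuming an interpreter that implements PEP 393, i.e. the pep-393 branch or any later CPython starting with 3.3):

    import sys

    # The marginal cost per character exposes the unit width.
    for ch in ["a", "\xe9", "\u20ac", "\U0010FFFF"]:
        one, many = sys.getsizeof(ch), sys.getsizeof(ch * 1000)
        print(hex(ord(ch)), (many - one) / 999)  # ~1.0, ~1.0, ~2.0, ~4.0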

On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy <tjreedy@udel.edu> wrote:
Interesting idea, but putting on my C programmer hat, I say -1. Non-uniform cell size = not a C array = standard C array manipulation idioms don't work = pain (no matter how simple the index correction happens to be). The nice thing about PEP 383 is that it gives us the smallest storage array that is both an ordinary C array and has sufficiently large individual elements to handle every character in the string. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 8/24/2011 9:00 AM, Stefan Behnel wrote:
Yes, this sounds like a nice benefit, but the problem is it is false. The correct statement would be: The nice thing about PEP 383 is that it gives us the smallest storage array that is both an ordinary C array and has sufficiently large individual elements to handle every Unicode codepoint in the string.

As Tom eloquently describes in the referenced issue (is Tom ever non-eloquent?), not all characters can be represented in a single codepoint. It seems there are three concepts in Unicode -- code units, codepoints, and characters -- none of which are equivalent (and the first of which varies according to the encoding). It also seems (to me) that Unicode has failed in its original premise, of being an easy way to handle "big char" for "all languages" with fixed-size elements, but it is not clear that its original premise is achievable regardless of the size of "big char", when mixed directionality is desired, and it seems that support of some single languages requires mixed directionality, not to mention mixed language support.

Given the required variability of character size in all presently defined Unicode encodings, I tend to agree with Tom that UTF-8, together with some technique of translating character index to code unit offset, may provide the best overall space utilization, and adequate CPU efficiency. On the other hand, there are large subsets of applications that simply do not require support for bidirectional text or composed characters, and for those that do not, it remains to be seen if the price to be paid for supporting those features is too high a price for such applications. So far, we don't have implementations to benchmark to figure that out!

What does this mean for Python? Well, if Python is willing to limit its support for applications to the subset for which the "big char" solution is sufficient, then PEP 393 provides a way to do that, which looks to be pretty effective for reducing memory consumption for those applications that use short strings, most of which can be classified by content into the 1-byte or 2-byte representations. Applications that support long strings are more likely to be bitten by the occasional "outlier" character that is longer than the average character, doubling or quadrupling the space needed to represent such strings, and eliminating a significant portion of the space savings the PEP is providing for other applications. Benchmarks may or may not fully reflect the actual requirements of all applications, so conclusions based on benchmarking can easily be blind-sided by the realities of other applications, unless the benchmarks are carefully constructed.

It is possible that the ideas in PEP 393, with its support for multiple underlying representations, could be the basis for some more complex representations that would better support characters rather than only supporting code points, but Martin has stated he is not open to additional representations, so the PEP itself cannot be that basis (although with care which may or may not be taken in the implementation of the PEP, the implementation may still provide that basis).

On Wed, Aug 24, 2011 at 11:52 AM, Glenn Linderman <v+python@g.nevcal.com> wrote:
(PEP 393, I presume. :-)
As Tom eloquently describes in the referenced issue (is Tom ever non-eloquent?), not all characters can be represented in a single codepoint.
But this is also besides the point (except insofar where we have to remind ourselves not to confuse the two in docs).
I see nothing wrong with having the language's fundamental data types (i.e., the unicode object, and even the re module) be defined in terms of codepoints, not characters, and I see nothing wrong with len() returning the number of codepoints (as long as it is advertised as such). After all, UTF-8 also defines an encoding for a sequence of code points. Characters that require two or more codepoints are not represented specially in UTF-8 -- they are represented as two or more encoded codepoints. The added requirement that UTF-8 must only be used to represent valid characters is just that -- it doesn't affect how strings are encoded, just what is considered valid at a higher level.
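For instance (an illustrative sketch, using a character with a two-code-point decomposed form):

    import unicodedata

    nfc = "\u00f1"                           # 'ñ' as a single code point
    nfd = unicodedata.normalize("NFD", nfc)  # 'n' + U+0303 combining tilde
    print(len(nfc), len(nfd))                # 1 2 -- len() counts code points
    print(nfc.encode("utf-8"))               # b'\xc3\xb1'
    print(nfd.encode("utf-8"))               # b'n\xcc\x83' -- each code point encoded separately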
There is no doubt that UTF-8 is the most space efficient. I just don't think it is worth giving up O(1) indexing of codepoints -- it would change programmers' expectations too much. OTOH I am sold on getting rid of the added complexities of "narrow builds" where not even all codepoints can be represented without using surrogate pairs (i.e. two code units per codepoint) and indexing uses code units instead of codepoints. I think this is an area where PEP 393 has a huge advantage: users can get rid of their exceptions for narrow builds.
I think you are saying that many apps can ignore the distinction between codepoints and characters. Given the complexity of bidi rendering and normalization (which will always remain an issue) I agree; this is much less likely to be a burden than the narrow-build issues with code units vs. codepoints. What should the stdlib do? It should try to skirt the issue where it can (using the garbage-in-garbage-out principle) and advertise what it supports where there is a difference. I don't see why all the stdlib should be made aware of multi-codepoint-characters and other bidi requirements, but it should be clear to the user who has such requirements which stdlib operations they can safely use.
This seems more of an intuition than a fact. I could easily imagine the facts being that even for large strings, usually either there are no outliers, or there is a significant number of outliers. (E.g. Tom Christiansen's OSCON preso falls in the latter category :-). As long as it *works* I don't really mind that there are some extreme cases that are slow. You'll always have that.
Yeah, it's a learning process.
There is always the possibility of representations that are defined purely by userland code and can only be manipulated by that specific code. But expecting C extensions to support new representations that haven't been defined yet sounds like a bad idea. -- --Guido van Rossum (python.org/~guido)

On 8/24/2011 12:34 PM, Guido van Rossum wrote:
This statement might yet be made true :)
In the docs, yes, and in programmer's minds (influenced by docs).
Me neither.
Yes, this is true. In one sense, though, since UTF-8-supporting code already has to deal with variable length codepoint encoding, support for variable length character encoding seems like a minor extension, not upsetting any concept of fixed-width optimizations, because such cannot be used.
Programmers that have to deal with bidi or composed characters shouldn't have such expectations, of course. But there are many programmers who do not, or at least who think they do not, and they can retain their O(1) expectations, I suppose, until it bites them.
Yep, the only justification for narrow builds is in interfacing to an underlying broken OS that happens to use that encoding... it might be slightly more efficient when doing API calls to such an OS. But most interesting programs do much more than I/O.
It would seem helpful if the stdlib could have some support for efficient handling of Unicode characters in some representation. It would help address the class of applications that does care. Adding extra support for Unicode character handling sooner rather than later could be a performance boost to applications that do care about full character support, and I can only see the number of such applications increasing over time. Such could be built as a subtype of str, perhaps, but if done in Python, there would likely be a significant performance hit when going from str to "unicodeCharacterStr".
Yes, it is intuition, regarding memory consumption. It is not at all clear how different the "occasional outlier character" is from your "significant number of outliers". Tom's presentation certainly was regarding bodies of text which varied from ASCII to fully non-ASCII. The memory characteristics of long string handling would certainly be non-intuitive, when you can process a file of size N with a particular program, but can't process a smaller file because it has a funny character in it, and suddenly you are out of space.
While they can and should be prototyped in Python for functional correctness, I would rather expect such representations to be significantly slower in Python than in C. But that is just intuition also. The PEP makes a nice extension to str representations, but I'm not sure it picks the most useful ones, in that while it is picking cases that are well understood and are in use, they may not be the most effective ones (due to the strange memory consumption characteristics that outliers can introduce). My intuition says that a UTF-8 representation (or Tom's/Perl's looser utf8) would be a handy representation to have. But maybe it should be a different type than str... str8? I suppose that is -ideas land.

On Wed, Aug 24, 2011 at 3:29 PM, Glenn Linderman <v+python@g.nevcal.com> wrote:
I claim that we have insufficient understanding of their needs to put anything in the stdlib. Wait and see is a good strategy here.
Sounds like overengineering to me. The right time to add something to the stdlib is when a large number of apps *currently* need something, not when you expect that they might need it in the future. (There just are too many possible futures to plan for them all. YAGNI rules.) -- --Guido van Rossum (python.org/~guido)

Guido van Rossum writes:
In fact, the Unicode Standard, Version 6, goes farther (to code units):

  2.7 Unicode Strings

  A Unicode string data type is simply an ordered sequence of code units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit code units, a Unicode 16-bit string is an ordered sequence of 16-bit code units, and a Unicode 32-bit string is an ordered sequence of 32-bit code units. Depending on the programming environment, a Unicode string may or may not be required to be in the corresponding Unicode encoding form. For example, strings in Java, C#, or ECMAScript are Unicode 16-bit strings, but are not necessarily well-formed UTF-16 sequences. (p. 32)

On Wed, Aug 24, 2011 at 5:36 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
I am assuming that that definition only applies to use of the term "unicode string" within the standard and has no bearing on how programming languages are allowed to use the term, as that would be preposterous. (They can define what they mean by terms like well-formed and conforming etc., and I won't try to go against that. But limiting what can be called a unicode string feels like unproductive coddling.) -- --Guido van Rossum (python.org/~guido)

Le mercredi 24 août 2011 20:52:51, Glenn Linderman a écrit :
UTF-8 can use more space than latin1 or UCS2:
UTF-8 uses less space than PEP 393 only if you have few non-ASCII characters (or few non-BMP characters). About speed, I guess that O(n) (UTF-8 indexing) is slower than O(1) (PEP 393 indexing).
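A rough illustration of the space comparison (counting only the raw character data, not object overhead):

    text = "\xe9" * 100                  # 100 'é' characters
    print(len(text.encode("latin-1")))   # 100 bytes -- PEP 393 stores this in 1-byte units
    print(len(text.encode("utf-8")))     # 200 bytes -- UTF-8 needs 2 bytes per 'é'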
In these worst cases, PEP 393 is not worse than the current implementation: it uses just as much memory as Python in wide mode (the mode used on Linux and Mac OS X, because wchar_t is 32 bits there). But it uses double the memory of Python in narrow mode (Windows). I agree that UTF-8 is better in these corner cases, but I also bet that most Python programs will use less memory and will be faster with PEP 393. You can already try the pep-393 branch on your own programs.
I used stringbench and "./python -m test test_unicode". I plan to try iobench. Which other benchmark tool should be used? Should we write a new one?
I don't think that the *default* Unicode type is the best place for this. The base Unicode type has to be *very* efficient. If you have unusual needs, write your own type. Maybe based on the base type? Victor

On 25 August 2011 07:10, Victor Stinner <victor.stinner@haypocalc.com>wrote:
I think that the PyPy benchmarks (or at least selected tests such as slowspitfire) would probably exercise things quite well. http://speed.pypy.org/about/ Tim Delaney

Le mardi 23 août 2011 00:14:40, Antoine Pitrou a écrit :
I posted a patch to re-add it: http://bugs.python.org/issue12819#msg142867 Victor

On Tue, Aug 23, 2011 at 18:27, Victor Stinner <victor.stinner@haypocalc.com> wrote:
I posted a patch to re-add it: http://bugs.python.org/issue12819#msg142867
Thank you for the patch! Note that this patch adds the fast path only to the helper function which determines the length of the string and the maximum character. The decoding part is still without a fast path for ASCII runs. Regards, Torsten

Le 24/08/2011 04:41, Torsten Becker a écrit :
Ah? If utf8_max_char_size_and_has_errors() returns no error and maxchar=127, memcpy() is used. You mean that memcpy() is too slow? :-)

    maxchar = utf8_max_char_size_and_has_errors(s, size, &unicode_size,
                                                &has_errors);
    if (has_errors) {
        ...
    }
    else {
        unicode = (PyUnicodeObject *)PyUnicode_New(unicode_size, maxchar);
        if (!unicode)
            return NULL;
        /* When the string is ASCII only, just use memcpy and return. */
        if (maxchar < 128) {
            assert(unicode_size == size);
            Py_MEMCPY(PyUnicode_1BYTE_DATA(unicode), s, unicode_size);
            return (PyObject *)unicode;
        }
        ...
    }

But yes, my patch only optimizes ASCII-only strings, not "mostly-ASCII" strings (e.g. 100 ASCII + 1 latin1 character). It can be optimized later. I didn't benchmark my patch. Victor

Le mardi 23 août 2011 00:14:40, Antoine Pitrou a écrit :
Some raw numbers.

stringbench:
  147.07 203.07  72.4 TOTAL for the PEP 393
  146.81 140.39 104.6 TOTAL for default
  => PEP is 45% slower

run test_unicode 50 times:
  0m19.487s for the PEP
  0m17.187s for default
  => PEP is 13% slower

time ./python -m test -j4 ("real" time):
  3m16.886s (334 tests) for the PEP
  3m21.984s (335 tests) for default
  ... default has 1 more test!

Only 13% slower on test_unicode is *good*. There is still a lot of code using the legacy API in unicode.c, so it can be much better. stringbench only shows the overhead of the conversion from compact unicode to Py_UNICODE* (wchar_t*). stringlib still uses the legacy API. Victor

On 8/23/2011 6:38 PM, Victor Stinner wrote:
I ran the same benchmark and couldn't make a distinction in performance between them:

  pep-393.txt                 182.17 175.47 103.8 TOTAL
  cpython.txt                 183.26 177.97 103.0 TOTAL
  pep-393-wide-unicode.txt    181.61 198.69  91.4 TOTAL
  cpython-wide-unicode.txt    181.27 195.58  92.7 TOTAL

I ran it a couple times and have seen either default or pep-393 being up to +/- 10 sec slower on the unicode tests. The results of the 8-bit string tests seem to have less variance on my test machine.
$ time ./python -m test `python -c 'print "test_unicode " * 50'`

  pep-393-wide-unicode.txt    real 0m33.409s
  cpython-wide-unicode.txt    real 0m33.489s

Nothing in it for me.. except your system is obviously faster, in general. -- Scott Dial scott@scottdial.com

On 8/24/2011 4:11 AM, Victor Stinner wrote:
You are right. I used the "Get Source" link on bitbucket to save pulling the whole clone, but the "Get Source" link seems to fetch whatever branch has the latest revision (maybe?) even if you switch branches on the webpage. To correct my previous post:

  cpython.txt                 183.26 177.97 103.0 TOTAL
  cpython-wide-unicode.txt    181.27 195.58  92.7 TOTAL
  pep-393.txt                 181.40 270.34  67.1 TOTAL

And,

  cpython.txt                 real 0m32.493s
  cpython-wide-unicode.txt    real 0m33.489s
  pep-393.txt                 real 0m36.206s

-- Scott Dial scott@scottdial.com

On Mon, Aug 22, 2011 at 18:14, Antoine Pitrou <solipsis@pitrou.net> wrote:
- You could trim the debug results from the benchmark results, this may make them more readable.
Good point, I removed them from the wiki page. On Tue, Aug 23, 2011 at 18:38, Victor Stinner <victor.stinner@haypocalc.com> wrote:
Thank you Victor for running stringbench, I did not get to it in time. Regards, Torsten

Torsten Becker, 22.08.2011 20:58:
Very cool! I've started fixing up Cython for it. One thing I noticed: on platforms where wchar_t is signed, the comparison to "128U" in the Py_UNICODE_ISSPACE() macro may issue a warning when applied to a Py_UNICODE value (which it previously was officially defined on). For the sake of portability of existing code, this may be worth a work-around. Personally, I wouldn't really mind getting this warning, given that it's better to use Py_UCS4 instead of Py_UNICODE. But it may turn out to be an annoyance for users, because their code that does this isn't actually broken in the new world. And one thing that I find unfortunate is that we need a new (unexpected) _GET_LENGTH() next to the existing (and obvious) _GET_SIZE(), but I guess that's a somewhat small price to pay for backwards compatibility... Stefan

Torsten Becker, 22.08.2011 20:58:
One thing that occurred to me regarding the object struct:

    typedef struct {
        PyObject_HEAD
        Py_ssize_t length;       /* Number of code points in the string */
        void *str;               /* Canonical, smallest-form Unicode buffer */
        Py_hash_t hash;          /* Hash value; -1 if not set */
        int state;               /* != 0 if interned. In this case the two
                                  * references from the dictionary to this
                                  * object are *not* counted in ob_refcnt.
                                  * See SSTATE_KIND_* for other bits */
        Py_ssize_t utf8_length;  /* Number of bytes in utf8, excluding the
                                  * terminating \0. */
        char *utf8;              /* UTF-8 representation (null-terminated) */
        Py_ssize_t wstr_length;  /* Number of code points in wstr, possible
                                  * surrogates count as two code points. */
        wchar_t *wstr;           /* wchar_t representation (null-terminated) */
    } PyUnicodeObject;

Wouldn't the "normal" approach be to use a union for the str field? I.e.

    union str {
        unsigned char *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    };

Given that they're all pointers, all fields have the same size, but I find it more readable to write

    u.str.latin1

than

    ((const unsigned char*)u.str)

Plus, the three types would be given by the struct, rather than by a per-usage cast. Has this been considered before? Was there a reason to decide against it? Stefan

Has this been considered before? Was there a reason to decide against it?
I think we simply didn't consider it. An early version of the PEP used the lower bits for the pointer to encode the kind, in which case it even stopped being a pointer. Modules are not expected to access this pointer except through the macros, so it may not matter that much. OTOH, it's certainly not too late to change it. Regards, Martin

"Martin v. Löwis", 23.08.2011 15:17:
The difference is that you *could* access them directly in a safe way, if it was a union. So, for an efficient character loop, replicated for performance reasons or for character range handling reasons or whatever, you could just check the string kind and then jump to the loop implementation that handles that type, without using any further macros. Stefan

On Tue, 23 Aug 2011 16:02:54 +0200 Stefan Behnel <stefan_ml@behnel.de> wrote:
Macros are useful to shield the abstraction from the implementation. If you access the members directly, and the unicode object is represented differently in some future version of Python (say e.g. with tagged pointers), your code doesn't compile anymore. Regards Antoine.

Antoine Pitrou, 23.08.2011 16:08:
Even with tagged pointers, you could just provide a macro that unpacks the pointer to the buffer for a given string kind. I don't think there's much more to be done to keep up the abstraction. I don't see a reason to prevent users from accessing the memory buffer directly, especially not by (accidental, as I understand it) obfuscation through a void*. Stefan

Even with tagged pointers, you could just provide a macro that unpacks the pointer to the buffer for a given string kind.
These macros are indeed available.
It's not about preventing them from accessing the representation. It's an "internal public" structure just as all other object layouts (i.e. feel free to use them, but expect them to change with the next release). However, I still think that people rarely will:

- most code treats strings as opaque, just as any other PyObject*
- code that is aware of strings typically wants them in an encoded form, often UTF-8, or whatever the underlying C library expects
- code that does need to look at individual characters should be fine with the accessor macros

That said, I can readily believe that Cython would have a use for direct access to the structure. I just wouldn't want people to rewrite their code in four versions (three for the different 3.3 representations, plus one for 3.2 and earlier). Regards, Martin

On Tue, Aug 23, 2011 at 10:08, Antoine Pitrou <solipsis@pitrou.net> wrote:
I agree with Antoine: from the experience of porting C code from 3.2 to the PEP 393 unicode API, the additional encapsulation by macros made it much easier to change the implementation of what is a field, what is a field's actual name, and what needs to be calculated through a function. So, I would like to keep primary access as a macro, but I see the point that it would make the struct clearer to access, and I would not mind changing the struct to use a union. But then most access currently is through macros, so I am not sure how much benefit the union would bring, as it mostly complicates the struct definition. Also, common, now simple, checks for "unicode->str == NULL" would look more ambiguous with a union ("unicode->str.latin1 == NULL"). Regards, Torsten

Torsten Becker, 24.08.2011 04:41:
Also, common, now simple, checks for "unicode->str == NULL" would look more ambiguous with a union ("unicode->str.latin1 == NULL").
You could just add yet another field "any", i.e.

    union {
        unsigned char *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
        void *any;
    } str;

That way, the above test becomes

    if (!unicode->str.any)

or

    if (unicode->str.any == NULL)

Or maybe even call it "initialised" to match the intended purpose:

    if (!unicode->str.initialised)

That being said, I don't mind "unicode->str.latin1 == NULL" either, given that it will (as mentioned by others) be hidden behind a macro most of the time anyway. Stefan

Le 24/08/2011 04:41, Torsten Becker a écrit :
A union helps debugging in gdb: you don't have to cast manually to unsigned char*/Py_UCS2*/Py_UCS4*.
Also, common, now simple, checks for "unicode->str == NULL" would look more ambiguous with a union ("unicode->str.latin1 == NULL").
We can rename "str" to something else, to "data" for example. Victor

On Tue, Aug 23, 2011 at 7:41 PM, Torsten Becker <torsten.becker@gmail.com> wrote:
+1
Also, common, now simple, checks for "unicode->str == NULL" would look more ambiguous with a union ("unicode->str.latin1 == NULL").
You could add an extra union field for that: unicode->str.voidptr == NULL -- --Guido van Rossum (python.org/~guido)

Le lundi 22 août 2011 20:58:51, Torsten Becker a écrit :
state:
  lowest 2 bits (mask 0x03) - interned-state (SSTATE_*) as in 3.2
  next 2 bits (mask 0x0C) - form of str:
    00 => reserved
    01 => 1 byte (Latin-1)
    10 => 2 byte (UCS-2)
    11 => 4 byte (UCS-4)
  next bit (mask 0x10): 1 if str memory follows PyUnicodeObject

kind=0 is used and public, it's PyUnicode_WCHAR_KIND. Is it still necessary? It looks to be only used in PyUnicode_DecodeUnicodeEscape(). Victor
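For clarity, a hedged Python sketch decoding that layout (the function name and return shape here are illustrative only, not the C macros):

    def unpack_state(state):
        interned = state & 0x03       # SSTATE_* interned-state, as in 3.2
        kind = (state & 0x0C) >> 2    # 0=reserved, 1=Latin-1, 2=UCS-2, 3=UCS-4
        compact = bool(state & 0x10)  # str memory follows the PyUnicodeObject
        return interned, kind, compact

    # e.g. a non-interned, compact Latin-1 string:
    assert unpack_state(0x14) == (0, 1, True)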

Le mercredi 24 août 2011 00:46:16, Victor Stinner a écrit :
If it can be removed, it would be nice to have kind in [0; 2] instead of kind in [1; 3], to be able to have a list (of 3 items) => callback or label. I suppose that compilers prefer a switch with all cases defined, 0 as the first item, and contiguous values. We may need an enum. Victor

On Tue, Aug 23, 2011 at 18:56, Victor Stinner <victor.stinner@haypocalc.com> wrote:
It is also used in PyUnicode_DecodeUTF8Stateful() and there might be some cases which I missed converting checks for 0 when I introduced the macro. The question was more if this should be written as 0 or as a named constant. I preferred the named constant for readability. An alternative would be to have kind values be the same as the number of bytes for the string representation so it would be 0 (wstr), 1 (1-byte), 2 (2-byte), or 4 (4-byte). I think the value for wstr/uninitialized/reserved should not be removed. The wstr representation is still used in the error case in the utf8 decoder because these strings can be resized. Also having one designated value for "uninitialized" limits comparisons in the affected functions to the kind value, otherwise they would need to check the str field for NULL to determine in which buffer to write a character.
I suppose that compilers prefer a switch with all cases defined, 0 a first item and contiguous values. We may need an enum.
During the Summer of Code, Martin and I did an experiment with GCC and it did not seem to produce a jump table as an optimization for three cases but generated comparison instructions anyway. I am not sure how much we should optimize for potential compiler optimizations here. Regards, Torsten

Le 24/08/2011 04:56, Torsten Becker a écrit :
Please don't do that: it's more common to need contiguous values (for a jump table/callback list) than to need to know the character size. You can use an array giving the character size: CHARACTER_SIZE[kind], which is the array {0, 1, 2, 4} (or maybe sizeof(wchar_t) instead of 0?).
In Python, you can resize an object if it has only one reference. Why is it not possible in your branch? Oh, I missed the UTF-8 decoder because you wrote "kind = 0": please, use PyUnicode_WCHAR_KIND instead! I don't like the "reserved" value, especially if its value is 0, the first value. See Microsoft file formats: they waste a lot of space because most fields are reserved, and 10 years later, these fields are still unused. Can't we add the value 4 when we need a new kind?
I have to read the code more carefully, I don't know this "uninitialized" state. For kind=0: "wstr" means that str is NULL but wstr is set? I didn't understand that str can be NULL for an initialized string. I should read the PEP again :-)
You mean with a switch with a case for each possible value? I don't think that GCC knows that all cases are defined if you don't use an enum.
I am not sure how much we should optimize for potential compiler optimizations here.
Oh, it was just a suggestion. Sure, it's not the best moment to care about micro-optimizations. Victor

If you use the new API to create a string (knowing how many characters you have, and what the maximum character is), the Unicode object is allocated as a single memory block. It can then not be resized. If you allocate in the old style (i.e. giving NULL as the data pointer, and a length), it still creates a second memory block for the Py_UNICODE[], and allows resizing. When you then call PyUnicode_Ready, the object gets frozen.
I don't get the analogy, or the relationship with the value 0. "Reserving" the value 0 is entirely different from reserving a field. In a field, it wastes space; the value 0 however fills the same space as the values 1,2,3. It's just used to denote an object where the str pointer is not filled out yet, i.e. which can still be resized.
No, a computed jump on the assembler level. Consider this code:

    enum kind {null, ucs1, ucs2, ucs4};
    void foo(void *d, enum kind k, int i, int v)
    {
        switch (k) {
        case ucs1: ((unsigned char*)d)[i] = v; break;
        case ucs2: ((unsigned short*)d)[i] = v; break;
        case ucs4: ((unsigned int*)d)[i] = v; break;
        }
    }

gcc 4.6.1 compiles this to

    foo:
    .LFB0:
            .cfi_startproc
            cmpl    $2, %esi
            je      .L4
            cmpl    $3, %esi
            je      .L5
            cmpl    $1, %esi
            je      .L7
            .p2align 4,,5
            rep ret
            .p2align 4,,10
            .p2align 3
    .L7:
            movslq  %edx, %rdx
            movb    %cl, (%rdi,%rdx)
            ret
            .p2align 4,,10
            .p2align 3
    .L5:
            movslq  %edx, %rdx
            movl    %ecx, (%rdi,%rdx,4)
            ret
            .p2align 4,,10
            .p2align 3
    .L4:
            movslq  %edx, %rdx
            movw    %cx, (%rdi,%rdx,2)
            ret
            .cfi_endproc

As you can see, it generates a chain of compares, rather than an indirect jump through a jump table. Regards, Martin

Le mardi 23 août 2011 à 13:51 +0200, "Martin v. Löwis" a écrit :
This optimization was done when trying to improve the speed of text I/O.
So what speedup did it achieve, for the kind of data you talked about?
Since I don't have the number anymore, I've just saved the contents of https://linuxfr.org/news/le-noyau-linux-est-disponible-en-version%C2%A030 as a "linuxfr.html" file and then did:

    $ ./python -m timeit "with open('linuxfr.html', encoding='utf8') as f: f.read()"
    1000 loops, best of 3: 859 usec per loop

After disabling the fast path, I ran the micro-benchmark again:

    $ ./python -m timeit "with open('linuxfr.html', encoding='utf8') as f: f.read()"
    1000 loops, best of 3: 1.09 msec per loop

so that's a 20% speedup.
So why would you need three separate implementations of the unrolled loop? You already have a macro named WRITE_FLEXIBLE_OR_WSTR. Even without taking into account the unrolled loop, I wonder how much slower UTF-8 decoding becomes with that approach, by the way. Instead of testing the "kind" variable at each loop iteration, using a stringlib-like approach may be a better deal IMO. Of course we would first need to have various benchmark numbers once the current PEP 393 implementation is complete. Regards Antoine.

So why would you need three separate implementation of the unrolled loop? You already have a macro named WRITE_FLEXIBLE_OR_WSTR.
Depending on where the speedup comes from in this optimization, it may well be that the overhead of figuring out where to store the result eats the gain from the fast test.
Even without taking into account the unrolled loop, I wonder how much slower UTF-8 decoding becomes with that approach, by the way.
In some cases, tests show that it gets faster, overall, compared to 3.2. This is probably because strings take less memory, which means less copying, more cache locality, etc. Of course, it still may be possible to apply micro-optimizations to the new implementation.
Well, things have to be done in order:

1. the PEP needs to be approved
2. the performance bottlenecks need to be identified
3. optimizations should be applied

I'm not sure what you mean by "stringlib-like" approach - if you are talking about templating, I'd rather avoid this for maintainability reasons, unless significant improvements can be demonstrated. Torsten had a version that used macros for that, and it was a pain to debug. So we put correctness and readability first. Regards, Martin

Sure, but the whole point of the PEP is to improve performance (I am dumping "memory consumption" in the "performance" bucket). That is, I suppose it will get approval based on its demonstrated benefits.
The point of templating is precisely to avoid macros, so that the code is natural to read and write and the compiler gives you the right line number when it finds an error. Regards Antoine.

On Tue, Aug 23, 2011 at 11:21 PM, Victor Stinner <victor.stinner@haypocalc.com> wrote:
As Martin noted, cache misses hurt performance so much on modern processors that making things use less memory overall can actually be a speed optimisation as well. Guessing where the remaining bottlenecks are is unlikely to be effective - profiling of the preliminary implementation will be needed. However, the idea that reducing the size of pure ASCII strings (which include all the identifiers in most code) by a factor of 2 or 4 (or so) results in a net speed increase definitely sounds plausible to me, even for non-string processing code. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 8/23/2011 9:21 AM, Victor Stinner wrote:
The current UCS2 Unicode string implementation, by design, quickly gives WRONG answers for len(), iteration, indexing, and slicing if a string contains any non-BMP (surrogate pair) Unicode characters. That may have been excusable when there essentially were no such extended chars, and the few there were were almost never used. But now there are many more, with more being added to each Unicode edition. They include cursive Math letters that are used in English documents today. The problem will slowly get worse and Python, at least on Windows, will become a language to avoid for dependable Unicode document processing. 3.x needs a proper Unicode implementation that works for all strings on all builds. utf16.py, attached to http://bugs.python.org/issue12729 prototypes a different solution than the PEP for the above problems for the 'mostly BMP' case. I will discuss it in a different post. -- Terry Jan Reedy
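Concretely, on a narrow build (e.g. the python.org Windows installers; a wide build prints 1 and the full character):

    s = "\U00010000"   # one character outside the BMP
    print(len(s))      # 2 -- the surrogate pair counts as two code units
    print(repr(s[0]))  # '\ud800' -- indexing yields a lone surrogate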

Terry Reedy writes:
Well, no, it gives the right answer according to the design. unicode objects do not contain character strings. By design, they contain code point strings. Guido has made that absolutely clear on a number of occasions. And the reasons have very little to do with lack of non-BMP characters to trip up the implementation. Changing those semantics should have been done before the release of Python 3.

It is not clear to me that it is a good idea to try to decide on "the" correct implementation of Unicode strings in Python even today. There are a number of approaches that I can think of.

1. The "too bad if you can't take a joke" approach: do nothing and recommend UTF-32 to those who want len() to DTRT.

2. The "slope is slippery" approach: Implement UTF-16 objects as built-ins, and then try to fend off requests for correct treatment of unnormalized composed characters, normalization, compatibility substitutions, bidi, etc etc.

3. The "are we not hackers?" approach: Implement a transform that maps characters that are not represented by a single code point into Unicode private space, and then see if anybody really needs more than 6400 non-BMP characters. (Note that this would generalize to composed characters that don't have a one-code-point NFC form and similar non-standardized cases that nonstandard users might want handled.)

4. The "42" approach: sadly, I can't think deeply enough to explain it.

There are probably others. It's true that Python is going to need good libraries to provide correct handling of Unicode strings (as opposed to unicode objects). But it's not clear to me, given the wide variety of implementations I can imagine, that there will be one best implementation, let alone which ones are good and Pythonic, and which not so.

On 8/24/2011 4:22 AM, Stephen J. Turnbull wrote:
Excuse me for believing the fine 3.2 manual that says "Strings contain Unicode characters." (And to a naive reader, that implies that string iteration and indexing should produce Unicode characters.)
By design, they contain code point strings.
For the purpose of my sentence, the same thing in that code points correspond to characters, where 'character' includes ascii control 'characters' and unicode analogs. The problem is that on narrow builds strings are NOT code point sequences. They are 2-byte code *unit* sequences. Single non-BMP code points are seen as 2 code units and hence given a length of 2, not 1. Strings iterate, index, and slice by 2-byte code units, not by code points. Python floats try to follow the IEEE standard as interpreted for Python (Python has its software exceptions rather than signalling versus non-signalling hardware signals). Python decimals slavishly follow the IEEE decimal standard. Python narrow build unicode breaks the standard for non-BMP code points and consequently breaks the re module even when it works for wide builds. As sys.maxunicode more or less says, only the BMP subset is fully supported. Any narrow build string with even 1 non-BMP char violates the standard.
Guido has made that absolutely clear on a number of occasions.
It is not clear what you mean, but recently on python-ideas he has reiterated that he intends bytes and strings to be conceptually different. Bytes are computer-oriented binary arrays; strings are supposedly human-oriented character/codepoint arrays. Except they are not for non-BMP characters/codepoints. Narrow build unicode is effectively an array of two-byte binary units.
The documentation was changed at least a bit for 3.0, and anyway, as indicated above, it is easy (especially for new users) to read the docs in a way that makes the current behavior buggy. I agree that the implementation should have been changed already. Currently, the meaning of Python code differs on narrow versus wide builds, and in a way that few users would expect or want. PEP 393 abolishes narrow builds as we now know them and changes semantics. I was answering a complaint about that change. If you do not like the PEP, fine. My separate proposal in my other post is for an alternative implementation but with, I presume, pretty much the same visible changes.
It is not clear to me that it is a good idea to try to decide on "the" correct implementation of Unicode strings in Python even today.
If the implementation is invisible to the Python user, as I believe it should be without special introspection, and mostly invisible in the C-API except for those who intentionally poke into the details, then the implementation can be changed as the consensus on best implementation changes.
Given that 3.0 unicode (string) objects are defined as Unicode character strings, I do not see the opposition.
-- Terry Jan Reedy

Terry Reedy writes:
Excuse me for believing the fine 3.2 manual that says "Strings contain Unicode characters."
The manual is wrong, then, subject to a pronouncement to the contrary, of course. I was on your side of the fence when this was discussed, pre-release. I was wrong then. My bet is that we are still wrong, now.
For the purpose of my sentence, the same thing in that code points correspond to characters,
Not in Unicode, they do not. By definition, a small number of code points (eg, U+FFFF) *never* did and *never* will correspond to characters. Since about Unicode 3.0, the same is true of surrogate code points. Some restrictions have been placed on what can be done with composed characters, so even with the PEP (which gives us code point arrays) we do not really get arrays of Unicode characters that fully conform to the model.
strings are NOT code point sequences. They are 2-byte code *unit* sequences.
I stand corrected on Unicode terminology. "Code unit" is what I meant, and what I understand Guido to have defined unicode objects as arrays of.
Any narrow build string with even 1 non-BMP char violates the standard.
Yup. That's by design.
Sure. Nevertheless, practicality beat purity long ago, and that decision has never been rescinded AFAIK.
Bytes are computer-oriented binary arrays; strings are supposedly human-oriented character/codepoint arrays.
And indeed they are, in UCS-4 builds. But they are *not* in Unicode! Unicode violates the array model. Specifically, in handling composing characters, and in bidi, where arbitrary slicing of direction control characters will result in garbled display. The thing is, that 90% of applications are not really going to care about full conformance to the Unicode standard. Of the remaining 10%, 90% are not going to need both huge strings *and* ABI interoperability with C modules compiled for UCS-2, so UCS-4 is satisfactory. Of the remaining 1% of all applications, those that deal with huge strings *and* need full Unicode conformance, well, they need efficiency too almost by definition. They probably are going to want something more efficient than either the UTF-16 or the UTF-32 representation can provide, and therefore will need trickier, possibly app-specific, algorithms that probably do not belong in an initial implementation.
I don't. I suspect Guido does not, even today.
Currently, the meaning of Python code differs on narrow versus wide build, and in a way that few users would expect or want.
Let them become developers, then, and show us how to do it better.
No, I do like the PEP. However, it is only a step, a rather conservative one in some ways, toward conformance to the Unicode character model. In particular, it does nothing to resolve the fact that len() will give different answers for character count depending on normalization, and that slicing and indexing will allow you to cut characters in half (even in NFC, since not all composed characters have fully composed forms).
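For example (an illustrative sketch; U+1E69 is a single character carrying two combining marks):

    import unicodedata

    s = "\u1e69"                                 # 'ṩ': s with dot below and dot above
    print(len(unicodedata.normalize("NFC", s)))  # 1
    print(len(unicodedata.normalize("NFD", s)))  # 3 -- same character, three code points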
A naive implementation of UTF-16 will be quite visible in terms of performance, I suspect, and performance-oriented applications will "go behind the API's back" to get it. We're already seeing that in the people who insist that bytes are characters too, and string APIs should work on them just as they do on (Unicode) strings.
I think they're not, I think they're defined as Unicode code unit arrays, and that the documentation is in error. If the documentation is correct, then Python 3.0 was released about 5 years too early, because correct handling of those objects as arrays of Unicode characters has never been implemented or even discussed in terms of proposed code that I know of. Martin has long claimed that the fact that I/O is done in terms of UTF-16 means that the internal representation is UTF-16, so I could be wrong. But when issues of slicing, len() values and so on have come up in the past, Guido has always said "no, there will be no change in semantics of builtins here".

I think what he means (and what I meant when I said something similar): I/O will consider surrogate pairs in the representation when converting to the output encoding. This is actually relevant only for UTF-8 (I think), which converts surrogate pairs "correctly". This can be taken as a proof that Python 3.2 is "UTF-16 aware" (in some places, but not in others). With Python's I/O architecture, it is of course not *actually* the I/O which considers UTF-16, but the codec. Regards, Martin

Antoine Pitrou writes:
But it's not "simple" at the level we're talking about! Specifically, *in-memory* surrogates are properly respected when doing the encoding, and therefore such I/O is not UCS-2 or "raw code units". This treatment is different from sizing and indexing of unicodes, where surrogates are not treated differently from other code points.

I'd like to point out that the improved compatibility is only a side effect, not the primary objective of the PEP. The primary objective is the reduction in memory usage. (any changes in runtime are also side effects, and it's not really clear yet whether you get speedups or slowdowns on average, or no effect).
That's just a description of the implementation, and not part of the language, though. My understanding is that the "abstract Python language definition" considers this aspect implementation-defined: PyPy, Jython, IronPython etc. would be free to do things differently (and I understand that there are plans to do PEP-393 style Unicode objects in PyPy).
Not with these words, though. As I recall, it's rather like (still with different words) "len() will stay O(1) forever, regardless of any perceived incorrectness of this choice". An attempt to change the builtins to introduce higher complexity for the sake of correctness is what he rejects. I think PEP 393 balances this well, keeping the O(1) operations in that complexity, while improving the cross- platform "correctness" of these functions. Regards, Martin

On 8/24/2011 1:50 PM, "Martin v. Löwis" wrote:
I'd like to point out that the improved compatibility is only a side effect, not the primary objective of the PEP.
Then why does the Rationale start with "on systems only supporting UTF-16, users complain that non-BMP characters are not properly supported."? A Windows user can only solve this problem by switching to *nix.
The primary objective is the reduction in memory usage.
On average (perhaps). As I understand the PEP, for some strings, Windows users will see a doubling of memory usage. Statistically, that doubling is probably more likely in longer texts. Ascii-only Python code and other limited-to-ascii text will benefit. Typical English business documents will see no change as they often have proper non-ascii quotes and occasional accented characters, trademark symbols, and other things. I think you have the objectives backwards. Adding memory is a lot easier than switching OSes. -- Terry Jan Reedy

On 8/24/2011 12:34 PM, Stephen J. Turnbull wrote:
Please suggest a re-wording then, as it is a bug for doc and behavior to disagree.
On computers, characters are represented by code points. What about the other way around? http://www.unicode.org/glossary/#C says code point: 1) i in range(0x110000) <broad definition> 2) "A value, or position, for a character" <narrow definition> (To muddy the waters more, 'character' has multiple definitions also.) You are using 1), I am using 2) ;-(.
I think you have it backwards. I see the current situation as the purity of the C code beating the practicality for the user of getting right answers.
The thing is, that 90% of applications are not really going to care about full conformance to the Unicode standard.
I remember when Intel argued that 99% of applications were not going to be affected when the math coprocessor in its then new chips occasionally gave 'non-standard' answers with certain divisors.
I posted a proposal with a link to a prototype implementation in Python. It pretty well solves the problem of narrow builds acting differently from wide builds with respect to the basic operations of len(), iteration, indexing, and slicing.
I believe my scheme could be extended to solve that also. It would require more pre-processing and more knowledge than I currently have of normalization. I have the impression that the grapheme problem goes further than just normalization. -- Terry Jan Reedy

Terry Reedy writes:
Please suggest a re-wording then, as it is a bug for doc and behavior to disagree.
Strings contain Unicode code units, which for most purposes can be treated as Unicode characters. However, even as "simple" an operation as "s1[0] == s2[0]" cannot be relied upon to give Unicode-conforming results. The second sentence remains true under PEP 393.
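(A concrete case, assuming the conformance in question is canonical equivalence:)

    import unicodedata

    s1 = "\u00e9"    # 'é', precomposed
    s2 = "e\u0301"   # 'e' + combining acute, canonically equivalent to s1
    print(unicodedata.normalize("NFC", s1) == unicodedata.normalize("NFC", s2))  # True
    print(s1[0] == s2[0])  # False -- position 0 compares code units, not characters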
No, you're not. You are claiming an isomorphism, which Unicode goes to great trouble to avoid.
Sophistry. "Always getting the right answer" is purity.
In the case of Intel, the people who demanded standard answers did so for efficiency reasons -- they needed the FPU to DTRT because implementing FP in software was always going to be too slow. CPython, IMO, can afford to trade off because the implementation will necessarily be in software, and can be added later as a Python or C module.
Yes and yes. But now you're talking about database lookups for every character (to determine if it's a composing character). Efficiency of a generic implementation isn't going to happen. Anyway, in Martin's rephrasing of my (imperfect) memory of Guido's pronouncement, "indexing is going to be O(1)". And Nick's point about non-uniform arrays is telling. I have 20 years of experience with an implementation of text as a non-uniform array which presents an array API, and *everything* needs to be special-cased for efficiency, and *any* small change can have show-stopping performance implications. Python can probably do better than Emacs has done due to much better leadership in this area, but I still think it's better to make full conformance optional.

On Wed, Aug 24, 2011 at 5:31 PM, Stephen J. Turnbull <turnbull@sk.tsukuba.ac.jp> wrote:
Really? If strings contain code units, that expression compares code units. What is non-conforming about comparing two code points? They are just integers. Seriously, what does Unicode-conforming mean here? It would be better to specify chapter and verse (e.g. is it a specific thing defined by the dreaded TR18?)
I don't know that we will be able to educate our users to the point where they will use code unit, code point, character, glyph, character set, encoding, and other technical terms correctly. TBH even though less than two hours ago I composed a reply in this thread, I've already forgotten which is a code point and which is a code unit.
Eh? In most other areas Python is pretty careful not to promise to "always get the right answer" since what is right is entirely in the user's mind. We often go to great lengths of defining how things work so as to set the right expectations. For example, variables in Python work differently than in most other languages. Now I am happy to admit that for many Unicode issues the level at which we have currently defined things (code units, I think -- the thingies that encodings are made of) is confusing, and it would be better to switch to the others (code points, I think). But characters are right out.
It is not so easy to change expectations about O(1) vs. O(N) behavior of indexing however. IMO we shouldn't try and hence we're stuck with operations defined in terms of code thingies instead of (mostly mythical) characters.
Let's take small steps. Do the evolutionary thing. Let's get things right so users won't have to worry about code points vs. code units any more. A conforming library for all things at the character level can be developed later, once we understand things better at that level (again, most developers don't even understand most of the subtleties, so I claim we're not ready).
Anyway, in Martin's rephrasing of my (imperfect) memory of Guido's pronouncement, "indexing is going to be O(1)".
I still think that. It would be too big of a cultural upheaval to change it.
This I agree with (though if you were referring to me with "leadership" I consider myself woefully underinformed about Unicode subtleties). I also suspect that Unicode "conformance" (however defined) is more part of a political battle than an actual necessity. I'd much rather have us fix Tom Christiansen's specific bugs than chase the elusive "standard conforming". (Hey, I feel a QOTW coming. "Standards? We don't need no stinkin' standards." http://en.wikipedia.org/wiki/Stinking_badges :-) -- --Guido van Rossum (python.org/~guido)

On Thu, Aug 25, 2011 at 12:29 PM, Guido van Rossum <guido@python.org> wrote:
Indeed, code points are the abstract concept and code units are the specific byte sequences that are used for serialisation (FWIW, I'm going to try to keep this straight in the future by remembering that the Unicode character set is defined as abstract points on planes, just like geometry). With narrow builds, code units can currently come into play internally, but with PEP 393 everything internal will be working directly with code points. Normalisation, combining characters and bidi issues may still affect the correctness of unicode comparison and slicing (and other text manipulation), but there are limits to how much of the underlying complexity we can effectively hide without being misleading. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Wed, Aug 24, 2011 at 7:47 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Hm, code points still look pretty concrete to me (integers in the range 0 .. 2**21) and code units don't feel like byte sequences to me (at least not UTF-16 code units -- in Python at least you can think of them as integers in the range 0 .. 2**16).
Let's just define a Unicode string to be a sequence of code points and let libraries deal with the rest. Ok, methods like lower() should consider characters, but indexing/slicing should refer to code points. Same for '=='; we can have a library that compares by applying (or assuming?) certain normalizations. Tom C tells me that case-less comparison cannot use a.lower() == b.lower(); fine, we can add that operation to the library too. But this exceeds the scope of PEP 393, right? -- --Guido van Rossum (python.org/~guido)
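(Tom's caseless-comparison example, sketched with str.casefold(), which did land later, in 3.3, as exactly this kind of library-level operation:)

    s1 = "straße"
    s2 = "STRASSE"
    print(s1.lower() == s2.lower())        # False -- 'ß'.lower() is still 'ß'
    print(s1.casefold() == s2.casefold())  # True -- casefold maps 'ß' to 'ss'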

On Thu, Aug 25, 2011 at 1:11 PM, Guido van Rossum <guido@python.org> wrote:
Yep, I was agreeing with you on this point - I think you're right that if we provide a solid code point based core Unicode type (perhaps with some character based methods), then library support can fill the gap between handling code points and handling characters. In particular, a unicode character based string type would be significantly easier to write in Python than it would be in C (after skimming Tom's bug report at http://bugs.python.org/issue12729, I better understand the motivation and desire for that kind of interface and it sounds like Terry's prototype is along those lines). Once those mappings are thrashed out outside the core, then there may be something to incorporate directly around the 3.4 timeframe (or potentially even in 3.3, since it should already be possible to develop such a wrapper based on UCS4 builds of 3.2) However, there may an important distinction to be made on the Python-the-language vs CPython-the-implementation front: is another implementation (e.g. PyPy) *allowed* to implement character based indexing instead of code point based for 2.x unicode/3.x str type? Or is the code point indexing part of the language spec, and any character based indexing needs to be provided via a separate type or module? Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Nick Coghlan writes:
GvR writes:
+1 I don't really see an alternative to this approach. The underlying array has to be exposed because there are too many applications that can take advantage of it, and analysis of decomposed characters requires it. Making that array be an array of code points is a really good idea, and Python already has that in the UCS-4 build. PEP 393 is "just" a space optimization that allows getting rid of the narrow build, with all its wartiness.
I agree that it's possible, but I estimate that it's not feasible for 3.3 because we don't yet know the requirements. This one really needs to ferment and mature in PyPI for a while because we just don't know how far the scope of user needs is going to extend. Bidi is a mudball[1], confusable character indexes sound like a cool idea for the web and email but is anybody really going to use them? etc.
+1 for language spec. Remember, there are cases in Unicode where you'd like to access base characters and the like. So you need to be able to get at individual code points in an NFD string. You shouldn't need to use different code for that in different implementations of Python. Footnotes: [1] Sure, we can implement the UAX#9 bidi algorithm, but it's not good enough by itself: something as simple as "File name (default {0}): ".format(name) can produce disconcerting results if the whole resulting string is treated by the UBA. Specifically, using the usual convention of uppercase letters being an RTL script, name = "ABCD" will result in the prompt: File name (default :(DCBA _ (where _ denotes the position of the insertion cursor). The Hebrew speakers on emacs-devel agreed that an example using a real Hebrew string didn't look right to them, either.

Most certainly. In the PEP-393 representation, the surrogate characters can readily be represented (and would imply at least the two-byte form), but they will never take on their UTF-16 function (i.e. the UTF-8 codec won't try to combine surrogate pairs), so they can be used for surrogateescape and other functions. Of course, in strict error mode, codecs will refuse to encode them (notice that surrogateescape is an error handler, not a codec). Regards, Martin
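[A minimal sketch of the surrogateescape error handler Martin refers to, assuming a Python 3 interpreter: undecodable bytes are smuggled through the str type as lone surrogates in the U+DC80..U+DCFF range and restored on encoding:]

    raw = b"abc\xff"                            # not valid UTF-8
    s = raw.decode("utf-8", "surrogateescape")  # 'abc\udcff', a lone surrogate
    assert s.encode("utf-8", "surrogateescape") == raw  # round-trips exactly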

On Wed, Aug 24, 2011 at 8:34 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
I would think that it should still be possible to explicitly put surrogates into a string, using the appropriate \uxxxx escape or chr(i) or some such approach; the basic string operations IMO shouldn't bother with checking for well-formed character sequences (just as they shouldn't care about normal forms). But decoding bytes from UTF-16 should not leave any surrogate pairs in, since interpreting those is part of the decoding. I'm not sure what should happen with UTF-8 when it (in flagrant violation of the standard, I presume) contains two separately-encoded surrogates forming a valid surrogate pair; probably whatever the UTF-8 codec does on a wide build today should be good enough. Similarly for encoding to UTF-8 on a wide build if one managed to create a string containing a surrogate pair. Basically, I'm for a garbage-in-garbage-out approach (with separate library functions to detect garbage if the app is worried about it). -- --Guido van Rossum (python.org/~guido)

On Thu, 25 Aug 2011, Guido van Rossum wrote:
If it's called UTF-8, there is no decision to be taken as to decoder behaviour - any byte sequence not permitted by the Unicode standard must result in an error (although, of course, *how* the error is to be reported could legitimately be the subject of endless discussion). There are security implications to violating the standard so this isn't just legalistic purity. Hmmm, doesn't look good: Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49) [GCC 4.2.1 (Apple Inc. build 5646)] on darwin Type "help", "copyright", "credits" or "license" for more information.
>>> '\xed\xb0\x80'.decode('utf-8')
u'\udc00'
Incorrect! Although this is a narrow build - I can't say what the wide build would do. For reasons of practicality, it may be appropriate to provide easy access to a CESU-8 decoder in addition to the normal UTF-8 decoder, but it must not be called UTF-8. Other variations may also find use if provided. See UTF-8 RFC: http://www.ietf.org/rfc/rfc3629.txt And CESU-8 technical report: http://www.unicode.org/reports/tr26/ Isaac Morland CSCF Web Guru DC 2554C, x36650 WWW Software Specialist

On Thu, Aug 25, 2011 at 7:28 PM, Isaac Morland <ijmorlan@uwaterloo.ca> wrote:
You have a point. The security issues cannot be seen separate from all the other issues. The folks inside Google who care about Unicode often harp on this. So I stand corrected. I am fine with codecs treating code points or code point sequences that the Unicode standard doesn't like (e.g. lone surrogates) the same way as more severe errors in the encoded bytes (lots of byte sequences already aren't valid UTF-8). I just hope this doesn't require normal forms or other expensive operations; I hope it's limited to rejecting invalid use of surrogates or other values that are not valid code points (e.g. 0, or >= 2**21).
Thanks for the links! I also like the term "supplemental character" (a code point >= 2**16). And I note that they talk about characters where we've just agreed that we should say code points... -- --Guido van Rossum (python.org/~guido)
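[The limited validation Guido hopes for is indeed cheap; a minimal sketch (the helper name is mine) of checking for Unicode scalar values, i.e. code points that are in range and not surrogates:]

    def is_scalar_value(cp):
        # Within the Unicode code space, and outside the surrogate block.
        return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)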

On Fri, Aug 26, 2011 at 5:59 AM, Guido van Rossum <guido@python.org> wrote:
Surrogates are used and valid only in UTF-16. In UTF-8/32 they are invalid, even if they are paired (see http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf ). Of course Python can/should be able to represent them internally regardless of the build type.
What do you mean? We use the "strict" error handler by default and we can specify other handlers already.
Codecs that use the official names should stick to the standards. For example s.encode('utf-32') should either produce a valid utf-32 byte string or raise an error if 's' contains invalid characters (e.g. surrogates). We can have other internal codecs that are based on the UTF-* encodings but allow the representation of lone surrogates and even expose them if we want, but they should have a different name (even 'utf-*-something' should be ok, see http://bugs.python.org/issue12729#msg142053 from "Unicode says you can't put surrogates or noncharacters in a UTF-anything stream.").
I think there shouldn't be any normalization done automatically by the codecs.
The UTF-8 codec used to follow RFC 2279 and only recently has been updated to RFC 3629 (see http://bugs.python.org/issue8271#msg107074 ). On Python 2.x it still produces invalid UTF-8 because changing it is backward incompatible. In Python 2 UTF-8 can be used to encode every codepoint from 0 to 10FFFF, and it always works. If we change it now it might start raising errors for an operation that never raised them before (see http://bugs.python.org/issue12729#msg142047 ). Luckily this is fixed in Python 3.x. I think there are more codepoints/byte sequences that should be rejected while encoding/decoding though, in both UTF-8 and UTF-16/32, but I haven't looked at them yet (I would be happy to fix these for 3.3 or even 2.7/3.2 (if applicable), so if you find mismatches with the Unicode standard and report an issue, feel free to assign it to me). Best Regards, Ezio Melotti
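[The Python 3 behaviour Ezio describes can be checked interactively (a sketch; the exact error message wording varies between versions):]

    >>> "\udc00".encode("utf-8")          # strict mode rejects lone surrogates
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: ...
    >>> "\udc00".encode("utf-8", "surrogatepass")  # explicit opt-out
    b'\xed\xb0\x80'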

Isaac Morland, 26.08.2011 04:28:
Works the same for me in a wide Py2.7 build, but gives me this in Py3: Python 3.1.2 (r312:79147, Sep 27 2010, 09:57:50) [GCC 4.4.3] on linux2 Type "help", "copyright", "credits" or "license" for more information.
Same for current Py3.3 and the PEP393 build (although both have a better exception message now: "UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid continuation byte"). Stefan

Stefan Behnel wrote:
The reason for this is that the UTF-8 codec in Python 2.x has never rejected lone surrogates and it was used to store Unicode literals in pyc files (using marshal) and also by pickle for transferring Unicode strings, so we could not simply reject lone surrogates, since this would have caused compatibility problems. That change was made in Python 3.x by having a special error handler, surrogatepass, which allows the UTF-8 codec to process lone surrogates as well. BTW: I'd love to join the discussion about PEP 393, but unfortunately I'm swamped with work, so these are just a few comments... What I'm missing in the discussion is statistics of the effects of the patch (both memory and performance) and the effect on 3rd party extensions. I'm not convinced that the memory/speed tradeoff is worth the breakage or whether the patch actually saves memory in real world applications and I'm unsure whether the needed code changes to the binary Python Unicode API can be done in a minor Python release. Note that in the worst case, a PEP 393 Unicode object will store three versions of the same string, e.g. on Windows with sizeof(wchar_t)==2: a UCS4 version in str, a UTF-8 version in utf8 (this gets built whenever Python needs a UTF-8 version of the object) and a wchar_t version in wstr (which gets built whenever Python codecs or extensions need Py_UNICODE or a wchar_t representation). On all platforms, in the case where you store a Latin-1 non-ASCII string: str holds the Latin-1 string, utf8 the UTF-8 version and wstr the 2- or 4-byte wchar_t version.
* A note on terminology: Python stores Unicode as code points. A Unicode "code point" refers to any value in the Unicode code range, which is 0 - 0x10FFFF. Lone surrogates, unassigned and illegal code points are all still code points - this is a detail people often forget. Various code points in Unicode have special meanings and some are not allowed to be used in encodings, but that does not rule them out from being stored and processed as code points. Code units are only used in encoded versions of Unicode, e.g. UTF-8, -16, -32. Mixing code units and code points can cause much confusion, so it's better to talk only about code points when referring to Python Unicode objects, since you only ever meet code units when looking at the bytes output of the codecs. This is important to know, since Python is not only meant to process Unicode, but also to build Unicode strings, so a careful distinction has to be made when considering what is correct and what is not: codecs have to follow much stricter rules than Python itself.
* A note on surrogates: These are just one particular problem where you run into the situation where splitting a Unicode string potentially breaks a combination of code points. There are a few other types of code points that cause similar problems, e.g. combining code points. Simply going with UCS-4 does not solve the problem, since even with UCS-4 storage, you can still have surrogates in your Python Unicode string. As with many things, it is important to be aware of the potential problem, but there's no automatic fix to get rid of it. What we can do is make the best of it, and this has happened already in many areas, e.g. codecs joining surrogates automatically, chr() creating surrogates, etc. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 26 2011)

Guido van Rossum writes:
On Wed, Aug 24, 2011 at 5:31 PM, Stephen J. Turnbull <turnbull@sk.tsukuba.ac.jp> wrote:
That's true out of context, but in context it's "which for most purposes can be treated as Unicode characters", and this is what Terry is concerned with, as well.
What is non-conforming about comparing two code points?
Unicode conformance means treating characters correctly. In particular, s1 and s2 might be NFC and NFD forms of the same string with a combining character at s2[1], or s1[1] and s2[1] might be a non-combining character and a combining character respectively.
Seriously, what does Unicode-conforming mean here?
Chapter 3, all verses. Here, specifically C6, p. 60. One would have to define the process executing "s1[0] == s2[0]" to be sure that even in the cases cited in the previous paragraph non-conformance is occurring, but one example of a process where that is non-conforming (without additional code to check for trailing combining characters) is in comparison of Vietnamese filenames generated on a Mac vs. those generated on a Linux host.
Sure. I got it wrong myself earlier. I think that the right thing to do is to provide a conformant implementation of Unicode text in the stdlib (a long run goal, see below), and call that "Unicode", while we call strings "strings".
Yes, and AFAICT (I'm better at reading standards than I am at reading Python implementation) PEP 393 allows that.
But characters are right out.
+1
Well, O(N) is not really the question. It's really O(log N), as Terry says. Is that out, too? I can verify that it's possible to do it in practice in the long term. In my experience with Emacs, even with 250 MB files, O(log N) mostly gives acceptable performance in an interactive editor, as well as many scripted textual applications. The problems that I see are (1) It's very easy to write algorithms that would be O(N) for a true array, but then become O(N log N) or worse (and the coefficient on the O(log N) algorithm is way higher to start). I guess this would kill the idea, but. (2) Maintenance is fragile; it's easy to break the necessary caches with feature additions and bug fixes. (However, I don't think this would be as big a problem for Python, due to its more disciplined process, as it has been for XEmacs.) You might think space for the caches would be a problem, but that has turned out not to be the case for Emacsen.
I don't think anybody does. That's one reason there's a new version of Unicode every few years.
<wink/> MvL and MAL are not, however, and there are plenty of others who make contributions -- in an orderly fashion.
Well, I would advocate specifying which parts of the standard we target and which not (for any given version). The goal of full "Chapter 3" conformance should be left up to a library on PyPI for the nonce IMO. I agree that fixing specific bugs should be given precedence over "conformance chasing," but implementation should conform to the appropriate parts of the standard.
(Hey, I feel a QOTW coming. "Standards? We don't need no stinkin' standards." http://en.wikipedia.org/wiki/Stinking_badges :-)
RMS beat you to that. Not good company to be in, in this case: he specifically disclaims the goal of portability to non-GNU-System systems.

What is non-conforming about comparing two code points?
Unicode conformance means treating characters correctly.
Re-read the text. You are interpreting something that isn't there.
No, that's explicitly *not* what C6 says. Instead, it says that a process that treats s1 and s2 differently shall not assume that others will do the same, i.e. that it is ok to treat them the same even though they have different code points. Treating them differently is also conforming. Regards, Martin

"Martin v. Löwis" writes:
Then what requirement does C6 impose, in your opinion? It sounds like you don't think it imposes any, in practice. Note that in the discussion of C6, the standard says, - Ideally, an implementation would *always* interpret two canonical-equivalent sequences *identically*. There are practical circumstances under which implementations may reasonably distinguish them. (Emphasis mine.) The examples given are things like "inspecting memory representation structure" (which properly speaking is really outside of Unicode conformance) and "ignoring collation behavior of combining sequences outside the repertoire of a specified language." That sounds like "Special cases aren't special enough to break the rules. Although practicality beats purity." to me. Treating things differently is an exceptional case that requires sufficient justification. My understanding is that if those strings are exchanged with another process, then whether or not treating them differently is allowed depends on whether the results will be output to another process, and what the definition of our process is. Sometimes it will be allowed, but mostly it won't. Take file names as an example. If our process is working with an external process (the OS's file system driver) whose definition includes the statement that "File names are sequences of Unicode characters", then C6 says our process must compare canonically equivalent sequences that it takes to be file names as the same, whether or not they are in the same normalized form, or normalized at all, because we can't assume the file system will treat them as different. If we do treat them as different, our users will get very upset (eg, if we don't signal a duplicate file name input by the user, and then the OS proceeds to overwrite an existing file). Dually, having made the statement that file names are Unicode, C6 says that the OS driver must return the same file given two canonically equivalent strings that happen to have different code points in them, because it may not assume that *we* will treat those strings as different names of different files. *Users* will certainly take the viewpoint that two strings that display the same on their monitor should identify the same file when they use them as file names. Now, I'm *not* saying that Python's strings *should* conform to the Unicode standard in this respect yet (or ever, for that matter; I'm with Guido on that). I'm simply saying that the current implementation of strings, as improved by PEP 393, cannot be said to be conforming. I would like to see something much more conformant done as a separate library (the Python Components for Unicode, say), intended to support users who need character-based behavior, Unicode-ly correct collation, etc., more than efficiency. Applications that need both will have to make their own way at first, either by contributing improvements to the library or by using application-specific algorithms.

Am 25.08.2011 11:39, schrieb Stephen J. Turnbull:
In IETF terminology, it's a weak SHOULD requirement. Unless there are reasons not to, equivalent strings should be treated identically. It's a weak requirement because the reasons not to treat them as equivalent are widespread.
Ok, so let me put emphasis on *ideally*. They acknowledge that for practical reasons, the equivalent strings may need to be distinguished.
And the common justification is efficiency, along with the desire to support the representation of unnormalized strings (else there would be an efficient implementation).
It may well happen that this requirement is met in a plain Python application. If the file system and GUI libraries always return NFD strings, then the Python process *will* compare equivalent sequences correctly (since it won't ever get any other representations).
Yes, but that's the operating system's choice first of all. Some operating systems do allow file names in a single directory that are equivalent yet use different code points. Python then needs to support this operating system, despite the permission of the Unicode standard to ignore the difference.
I continue to disagree. The Unicode standard deliberately allows Python's behavior as conforming.
Wrt. normalization, I think all that's needed is already there. Applications just need to normalize all strings to a normal form of their liking, and be done. That's easier than using a separate library throughout the code base (let alone using yet another string type). Regards, Martin
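[A minimal example of the normalize-then-compare approach Martin describes, using the stdlib's unicodedata module:]

    import unicodedata

    s1 = "\u00e9"    # 'é' as one precomposed code point (NFC form)
    s2 = "e\u0301"   # 'e' plus U+0301 COMBINING ACUTE ACCENT (NFD form)
    print(s1 == s2)  # False: different code point sequences
    print(unicodedata.normalize("NFC", s1) ==
          unicodedata.normalize("NFC", s2))  # True: canonically equivalent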

On Thu, Aug 25, 2011 at 7:57 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
I'd actually put it slightly differently: it seems to me that Python, in and of itself, can neither conform to nor violate that part of the standard, since conformance depends on how the *application* processes the data. However, we can make it harder or easier for applications to be conformant. UCS2 builds make it harder, since some code points have to be represented as code units internally. UCS4 builds and future PEP 393 builds (which should exhibit current UCS4 build semantics at the Python layer) make it easier, since the internal representation consistently uses code points, with code units only appearing as part of the encoding and decoding process. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

"Martin v. Löwis" writes:
There are no "weak SHOULDs" and no "wide-spread reasons" in RFC 2119. RFC 2119 specifies "particular circumstances" and "full implications" that are "carefully weighed" before varying from SHOULD behavior. IMHO the Unicode Standard intends a full RFC 2119 "SHOULD" here.
Sure, and that's one of several such reasons why I think the PEP's implementation of unicodes as arrays of code points is an optimal balance. But the Unicode standard does not "permit" ignoring the difference here, except in the sense that *the Unicode standard doesn't apply at all* and therefore doesn't forbid it. The OSes in question are not conforming processes, and presumably don't claim to be. Because most of the processes Python interacts with won't be conforming processes (not even the majority of textual applications, for a while), Python does not need to be, and *should not* be, a conforming Unicode process for most of what it does. Not even for much of its text processing. Also, to the extent that Python is a general-purpose language, I see nothing wrong and lots of good in having a non-conformant code point array type as the platform for implementing conforming Unicode library(ies). But this is not user/developer-friendly at all:
But many users have never heard of normalization. And that's *just* normalization. There is a whole raft of other requirements for conformance (collation, case, etc). The point is that with such a library and string type, various aspects of conformance to Unicode, as well as conformance to associated standards (eg, the dreaded UTS #18 ;-) can be added to the library over time, and most users (those who don't need to squeeze every ounce of performance out of Python) can be blissfully unaware of what, if anything, they're conforming to. Just upgrade the library to get the best Unicode support (in terms of conformance) that Python has to offer. But for the reasons you (and Guido and Nick and ...) give, it's not reasonable to put all that into core Python, not anytime soon. Not to mention that as a work-in-progress, it can hardly be considered stable enough for the stdlib. That is what Terry Reedy is getting at, AIUI. "Batteries included" should mean as much Unicode conformance as we can reasonably provide should be *conveniently* available. The ideal (given the caveat about efficiency) would be *one* import statement and a ConformingUnicode type that acts "just like a string" in all ways, except that (1) it indexes and counts on characters (preferably "grapheme clusters" :-), (2) does collation, regexps, and the like conformant to the Unicode standard, and (3) may be quite inefficient from the point of view of bit-shoveling net applications and the like. Of course most of (2) is going to take quite a while, but (1) and (3) should not be that hard to accomplish (especially (3) ;-).
That's up to you. I doubt very many users or application developers will see it that way, though. I think they would prefer that we be conservative about what we call "conformant", and tell them precisely what they need to do to get what they consider conformant behavior from Python. That's easier if we share definitions of conformant with them. And surely there would be great joy on the battlements if there were a one-import way to spell "all the Unicode conformance you can give me, please". The problem with your legalistic approach, as I see it, is that if our definition is looser than the users', all their surprises will be unpleasant. That's not good.
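[As a hint of what point (1) of such a ConformingUnicode type involves, a deliberately rough sketch (mine, and far short of the full UAX #29 grapheme cluster rules) that groups combining marks with their base character:]

    import unicodedata

    def rough_clusters(s):
        # Attach each combining mark to the preceding base character.
        out = []
        for ch in s:
            if out and unicodedata.combining(ch):
                out[-1] += ch
            else:
                out.append(ch)
        return out

    print(rough_clusters("e\u0301tude"))  # 5 units for 6 code points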

On Thu, Aug 25, 2011 at 4:58 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
I see no alternative to explicitly spelling out what all operations do and let the user figure out whether that meets their needs. E.g. we needn't say that the str type or its == operator conforms to the Unicode standard. We just need to say that the string type is a sequence of code points, that string operations don't do validation or normalization, and that to do a comparison that takes the Unicode std's definition of equivalence (or collation, etc.) into account you must call a certain library method. -- --Guido van Rossum (python.org/~guido)

On Thu, Aug 25, 2011 at 2:39 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Does any OS actually say that? Don't they usually say "in a specific normal form" or "they're just bytes"?
The solution here is to let the OS do the check, e.g. with os.path.exists() or os.stat(). It would be wrong to write an app that checked for file existence by doing naive lookups in os.listdir() output. -- --Guido van Rossum (python.org/~guido)
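[In code, the contrast Guido describes might look like this (the snippet is illustrative only):]

    import os
    import unicodedata

    name = "re\u0301sume\u0301.txt"  # decomposed spelling of 'résumé.txt'

    # Wrong: a naive listing lookup compares raw code point sequences,
    # so an NFC-named file will not match this NFD query.
    naive = name in os.listdir(".")

    # Better: let the OS resolve the name itself...
    direct = os.path.exists(name)

    # ...or, if a listing must be scanned, normalize both sides first.
    nfc = lambda s: unicodedata.normalize("NFC", s)
    scanned = any(nfc(entry) == nfc(name) for entry in os.listdir("."))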

Le 25/08/2011 06:12, Stephen J. Turnbull a écrit :
It took some weeks (months?) to write the PEP, and months to implement it. This PEP is only a minor change of the implementation of Unicode in Python. A larger change will take much more time (and maybe change/break the C and/or Python API a little bit more). If you are able to implement your specification (a Unicode type with a "real" character API), please write a PEP and implement it. You may begin with a prototype in Python, and then rewrite it in C. But I don't think that any core developer will do that for you. It's not how free software works. At least, I don't think that anyone will do that for free :-) (I bet that many developers would agree to implement that for money :-)) Victor

On 8/24/2011 7:29 PM, Guido van Rossum wrote:
(Hey, I feel a QOTW coming. "Standards? We don't need no stinkin' standards."http://en.wikipedia.org/wiki/Stinking_badges :-)
Which deserves an appropriate, follow-on, misquote: Guido says the Unicode standard stinks. ˚͜˚ <- and a Unicode smiley to go with it!

I think he's referring to combining characters and normal forms. 2.12 starts with "In cases involving two or more sequences considered to be equivalent, the Unicode Standard does not prescribe one particular sequence as being the correct one; instead, each sequence is merely equivalent to the others". That could be read to imply that the == operator should determine whether two strings are equivalent. However, the Unicode standard clearly leaves API design to the programming environment, and has the notion of conformance only for processes. So saying that Python is or is not unicode-conforming is, strictly speaking, meaningless. The closest conformance requirement in that respect is C6: "A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct." However, that explicitly does *not* support the conformance statement that Stephen made. They elaborate: "Ideally, an implementation would always interpret two canonical-equivalent character sequences identically. There are practical circumstances under which implementations may reasonably distinguish them." So practicality beats purity even in Unicode conformance: the == operator of Python can reasonably treat equivalent strings as unequal (and there is a good reason for that, indeed). Processes should not expect that other applications make the same distinction, so they need to cope if it matters to them. There are different ways to do that:
- normalize all strings on input, and then use ==
- use a different comparison operation that always normalizes its input first
Fortunately, it's much better than that. Unicode has had very clear conformance requirements for a long time, and they aren't hard to meet. Wrt. C6, Python could certainly improve, e.g. by caching whether a string has been determined to be in normal form, so that applications can more reasonably apply normalization to all strings they ever want to compare. Regards, Martin
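[The second option Martin lists above might look like this minimal sketch (the function name is mine); the caching he suggests would amortize the normalization cost over repeated comparisons:]

    import unicodedata

    def eq_canonical(a, b):
        # Plain == compares code point sequences; normalize both
        # operands first to compare under canonical equivalence.
        return (unicodedata.normalize("NFC", a) ==
                unicodedata.normalize("NFC", b))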

On Wed, Aug 24, 2011 at 3:06 AM, Terry Reedy <tjreedy@udel.edu> wrote:
The naive reader also doesn't know the difference between characters, code points and code units. It's the advanced, Unicode-aware reader who is confused by this phrase in the docs. It should say code units; or perhaps code units for narrow builds and code points for wide builds. With PEP 393 we can unconditionally say code points, which is much better. We should try to remove our use of "characters" -- or else we should *define* our use of the term "characters" as "what the Unicode standard calls code points". -- --Guido van Rossum (python.org/~guido)

On Fri, Aug 26, 2011 at 1:54 AM, Guido van Rossum <guido@python.org> wrote:
For UTF-16/32 (i.e. narrow/wide), talking about "code units"[0] should be correct. Also note that:
* for both, every "code unit" has a specific "codepoint" (including lone surrogates), so it might be OK to talk about "codepoints" too, but
* only for wide builds is every "codepoint" represented by a single, 32-bit "code unit". In narrow builds, non-BMP chars are represented by a "code unit sequence" of two elements (i.e. a "surrogate pair"). Since "code unit" refers to the *minimal* bit combination, in UTF-8 characters that need 2/3/4 bytes are represented with a "code unit sequence" made of 2/3/4 "code units" (so in UTF-8 "code units" and "code points" overlap only for the ASCII range).
Character usually works fine, especially for naive readers. Even Unicode-aware readers often confuse the several terms, so using a simple term and pointing to a more accurate description sounds like a better idea to me. Note that there's also another important term[1]: """ *Unicode Scalar Value*. Any Unicode code point (http://unicode.org/glossary/#code_point) except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF and E000 to 10FFFF (hex), inclusive. """ For example the UTF codecs produce sequences of "code units" (of 8, 16, 32 bits) that represent "scalar values"[2][3]: Chapter 3 [4] says: """ 3.9 Unicode Encoding Forms The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8. Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences. [...] D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points. • As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF and E000 to 10FFFF, inclusive. D77 Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange. [...] D79 A Unicode encoding form assigns each Unicode scalar value to a unique code unit sequence. """ On the other hand, Python Unicode strings are not limited to scalar values, because they can also contain lone surrogates. I hope this helps clarify the terminology a bit and doesn't add more confusion, but if we want to use the Unicode terms we should get them right. (Also note that I might have misunderstood something, even if I've been careful with the terms and I double-checked and quoted the relevant parts of the Unicode standard.) Best Regards, Ezio Melotti [0]: From chapter 3 [4], D77 Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange. • Code units are particular units of computer storage. Other character encoding standards typically use code units defined as 8-bit units—that is, octets. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form. [1]: http://unicode.org/glossary/#unicode_scalar_value [2]: Apparently Python 3 raises an error while encoding lone surrogates in UTF-8, but it doesn't for UTF-16 and UTF-32.
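[The code point vs. code unit relationship is easy to observe from the codecs; e.g. for the non-BMP code point U+1F40D:]

    c = "\U0001F40D"
    print(len(c.encode("utf-8")))           # 4: four 8-bit code units
    print(len(c.encode("utf-16-be")) // 2)  # 2: a surrogate pair of 16-bit units
    print(len(c.encode("utf-32-be")) // 4)  # 1: one 32-bit code unit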

On Thu, Aug 25, 2011 at 6:40 PM, Ezio Melotti <ezio.melotti@gmail.com> wrote:
The more I think about it, the more it seems to me that the biggest problem is that in narrow builds it is ambiguous whether (unicode) strings contain code units, i.e. are *encoded* code points, or whether they contain (decoded) code points. In a sense this is repeating the ambiguity of 8-bit strings in Python 2, which are sometimes assumed to contain ASCII or Latin-1 (i.e., code points with a limited range) or UTF-8 (i.e., code units). I know that by now I am repeating myself, but I think it would be really good if we could get rid of this ambiguity. PEP 393 seems the best way forward, even if it doesn't directly address what to do for IronPython or Jython, both of which have to deal with a pervasive native string type that contains UTF-16. IIUC, CPython on Windows will work just fine with PEP 393, even if it means that there is a bit more translation between Python strings and the OS native wchar_t[] type. I assume that the data volumes going through the OS APIs are relatively constrained, since data actually written to or read from a file will still be bytes, possibly run through a codec (if it's a text file), and not go through one of the wchar_t[] APIs -- the latter are used for things like filenames, which are much smaller.
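[The ambiguity is easy to demonstrate: the same literal answers differently depending on the build (shown for illustration; the narrow-build results apply to, e.g., a 3.2 narrow build):]

    s = "\U00010000"  # one code point outside the BMP
    # narrow build: len(s) == 2 and s[0] == '\ud800' (a code unit)
    # wide build (and PEP 393): len(s) == 1 and s[0] == s (a code point)
    print(len(s), repr(s[0]))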
Actually I think UTF-8 is best thought of as an encoding for code points, not characters -- the subtle difference between these two should be of no concern to the UTF-8 codec (unless it is a validating codec).
We may well have no choice -- there is just too much documentation that naively refers to characters while really referring to code units or code points.
This seems to involve validation. I think all validation should be sequestered to specific APIs (e.g. certain codecs) and the string type should not care about it. Depending on what they are doing, applications may have to be aware of many subtleties in order to always avoid generating "invalid" (or not well-formed -- what's the difference?) strings.
I really don't mind whether our codecs actually make exceptions for surrogates (lone or otherwise). The only requirement I care about is that surrogate-free strings round-trip correctly. Again, apps that want to conform to the requirements regarding surrogates can implement their own validation, and certainly at some point we should offer a validation library as part of the stdlib -- but it should be up to the app whether and when to use it.
Right.
I'm not more confused than I was, but I think we should reduce the number of Unicode terms we care about rather than increase them. If we only ever had to talk about code points and encoded byte sequences I'd be happy -- although in practice we also need to acknowledge the existence of characters that may be represented by multiple code points, since islower(), lower() etc. may need these (and also the re module). Other concepts we may have to at least acknowledge include various normal forms, equivalence, and collation sequences (which are language-dependent?). It would be lovely if someone wrote up an informational PEP so that we don't all have to lug around a copy of the Unicode standard.
-- --Guido van Rossum (python.org/~guido)

On 26 August 2011 03:52, Guido van Rossum <guido@python.org> wrote:
Hmm, I'm completely naive in this area, but from reading the thread, would a possible approach be to say that Python (the language definition) is defined in terms of code points (as we already do, even if the wording might benefit from some clarification). Then, under PEP 393, and currently in wide builds, CPython conforms to that definition (and retains the property of basic operations being O(1), which is not in the language definition but is a user expectation and your expressed requirement). IronPython and Jython can retain UTF-16 as their native form if that makes interop cleaner, but in doing so they need to ensure that basic operations like indexing and len work in terms of code points, not code units, if they are to conform. Presumably this will be easier than moving to a UCS-4 representation, as they can defer to runtime support routines via interop (which presumably get this right - or at the very least can be blamed for any errors :-)) They lose the O(1) guarantee, but that's easily defensible as a tradeoff to conform to underlying runtime semantics. Does this make sense, or have I completely misunderstood things? Paul. PS Thanks to all for the discussion in general, I'm learning a lot about Unicode from all of this!

That means that they won't conform, period. There is no efficient maintainable implementation strategy to achieve that property, and it may take well years until somebody provides an efficient unmaintainable implementation.
Does this make sense, or have I completely misunderstood things?
You seem to assume it is ok for Jython/IronPython to provide indexing in O(n). It is not. However, non-conformance may not be that much of an issue. They do not conform in many other aspects, either (such as not supporting Python 3, for example, or not supporting the C API), so they may well choose to ignore such a minor requirement if there was one. For BMP strings, they conform fine, and it may well be that Jython users either don't have non-BMP strings, or don't care whether len() or indexing of their non-BMP strings is "correct". Regards, Martin

"Martin v. Löwis", 26.08.2011 11:29:
You seem to assume it is ok for Jython/IronPython to provide indexing in O(n). It is not.
I think we can leave this discussion aside. Jython and IronPython have their own platform specific constraints to which they need to adapt their implementation. For a Jython user, it means a lot to be able to efficiently pass strings (and other data) back and forth between Jython and other JVM code, and it's not hard to guess that the same is true for IronPython/.NET users. After all, the platform integration is the very *reason* for most users to select one of these implementations. Besides, what if these implementations provided indexing in, say, O(log N) instead of O(1) or O(N), e.g. by building a tree index into each string? You could have an index that simply marks runs of surrogate pairs and BMP substrings, thus providing a likely-to-be-somewhat-compact index. That index would obviously have to be built, but so do the different string representations in post-PEP-393 CPython, especially on Windows, as I have learned. Would such a less severe violation of the strict O(1) rule still be "not ok"? I think this is not such a clear black-and-white issue. Both implementations have notably different performance characteristics than CPython in some more or less important areas, as does PyPy. At some point, the language compliance label has to account for that. Stefan

On Fri, Aug 26, 2011 at 3:29 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
(And yet, you keep arguing. :-)
Right.
Eek. No, please. Those platforms' native string types have length and slicing operations that are O(1) and work in terms of 16-bit code units. Python should use those. It would be awful if Java and Python code doing the same manipulations on the same string would come to different conclusions because Python tried to paper over surrogates. I dug up some evidence for Java, at least: http://download.oracle.com/javase/1.5.0/docs/api/java/lang/CharSequence.html... """ length int length() Returns the length of this character sequence. The length is the number of 16-bit chars in the sequence. Returns: the number of chars in this sequence """ This is quite explicit about counting 16-bit code units. I've found similar info about .NET, which defines "char" as a 16-bit quantity and string length in terms of the number of "char" items.
Since you had to ask, I have to declare that, indeed, non-O(1) behavior would not be okay for those platforms. All in all, I don't think we should legislate Python strings to be able to support 21-bit code points using O(1) indexing. PEP 393 makes this possible for CPython, and it's been said that PyPy can follow suit. But it'll be a "quality-of-implementation" issue, not built into the language spec. -- --Guido van Rossum (python.org/~guido)

On 26 August 2011 17:51, Guido van Rossum <guido@python.org> wrote:
On Fri, Aug 26, 2011 at 2:29 AM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
(Regarding my comments on code point semantics)
On 26 August 2011 18:02, Guido van Rossum <guido@python.org> wrote:
*That* is actually the erroneous assumption I had made - that the Java and .NET native string type had code point semantics (i.e., took surrogates into account). As that isn't the case, my comments aren't valid - and I agree that having common semantics (and hence exposing surrogates) is too important to lose. On the other hand, that pretty much establishes that whatever PEP 393 achieves in terms of allowing all builds of CPython to offer code point semantics, the language definition can't mandate it. Thanks for the clarification. Paul.

On Fri, Aug 26, 2011 at 10:13 AM, Paul Moore <p.f.moore@gmail.com> wrote:
Those platforms probably *also* have libraries of operations to support writing apps that conform to the Unicode standard. But those apps will have to be aware of the difference between the "naive" length of a string and the number of code points of characters in it.
The most severe consequence to me seems that the stdlib (which is reused by those other platforms) cannot assume CPython's ideal world -- even if specific apps sometimes can. -- --Guido van Rossum (python.org/~guido)

Guido van Rossum, 26.08.2011 19:02:
I was mostly just confabulating. My main point was that this isn't a black-and-white thing - O(1) xor O(N) - and thus is orthogonal to the PEP. You can achieve compliant/acceptable behaviour at the code point level, the performance guarantees level or the platform integration level - choose any two. CPython is just lucky that there isn't really a platform integration level to take into account (if we leave the Windows environment aside for a moment).
I fully agree.
I take it that you say that because you want strings to perform in the 'normal' platform specific way here (i.e. like Java/.NET strings), and not so much because you want to require the exact same (performance) characteristics across Python implementations. So your choice is platform integration over code points, leaving the identical performance as a side-effect of the platform integration.
Makes sense to me. Most likely, Unicode heavy Python code will have to take platform specifics into account anyway, so there are limits as to what is suitable for a language spec. Stefan

On Fri, Aug 26, 2011 at 2:29 AM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Indeed.
I think this is fine. I had been hoping that all Python implementations claiming compatibility with version 3.3 of the language reference would be free of worries about surrogates, but it simply doesn't make sense. And yes, I'm well aware that PEP 393 is only for CPython. It's just that I had hoped that it would get rid of some of Tom C's specific complaints for all Python implementations; but it really seems impossible to do so. One consequence may be that the standard library, to the extent it is shared by other implementations, may still have to worry about surrogates and other issues inherent in narrow builds or other 16-bit-based string types. We'll cross that bridge when we get to it. -- --Guido van Rossum (python.org/~guido)

On 8/26/2011 5:29 AM, "Martin v. Löwis" wrote:
My impression is that a UTF-16 implementation, to be properly called such, must do len and [] in terms of code points, which is why Python's narrow builds are called UCS-2 and not UTF-16.
That means that they won't conform, period. There is no efficient maintainable implementation strategy to achieve that property,
Given that both 'efficient' and 'maintainable' are relative terms, that is you pessimistic opinion, not really a fact.
Why do you keep saying that O(n) is the alternative? I have already given a simple solution that is O(log k), where k is the number of non-BMP characters/codepoints/surrogate_pairs if there are any, and O(1) otherwise (for all BMP chars). It uses O(k) space. I think that is pretty efficient. I suspect that is the most time-efficient possible without using at least as much space as a UCS-4 solution. The fact that you and others do not want this for CPython should not preclude other implementations that are more tied to UTF-16 from exploring the idea. Maintainability partly depends on whether all-codepoint support is built in or bolted on to a BMP-only implementation burdened with back compatibility for a code unit API. Maintainability is probably harder with a separate UTF-32 type, which CPython has but which I gather Jython and IronPython do not. It might or might not be easier if there were a separate internal character type containing a 32-bit code point value, so that iteration and indexing (and single char slicing) always returned the same type of object regardless of whether the character was in the BMP or not. This certainly would help all the unicode database functions. Tom Christiansen appears to have said that Perl is or will use UTF-8 plus auxiliary arrays. If so, we will find out if they can maintain it. --- Terry Jan Reedy
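[For concreteness, one way such an index can work (my sketch based on Terry's description, not his actual implementation): record the code point index at which each surrogate pair starts, then bisect to translate code point indexes into code unit indexes:]

    import bisect

    class CodePointView:
        """Code point indexing over 16-bit code units (given as ints):
        O(log k) per lookup, k = number of surrogate pairs; O(1) when
        the string is pure BMP (pair_starts is empty)."""

        def __init__(self, units):
            self.units = units
            self.pair_starts = []  # code point index where each pair begins
            pairs = 0
            for u, unit in enumerate(units):
                if 0xD800 <= unit <= 0xDBFF:
                    self.pair_starts.append(u - pairs)
                    pairs += 1

        def __len__(self):
            return len(self.units) - len(self.pair_starts)

        def __getitem__(self, i):
            p = bisect.bisect_left(self.pair_starts, i)  # pairs before i
            unit = self.units[i + p]
            if 0xD800 <= unit <= 0xDBFF:  # recombine the surrogate pair
                low = self.units[i + p + 1]
                return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            return unit

    v = CodePointView([0x41, 0xD800, 0xDC00, 0x42])  # 'A', U+10000, 'B'
    print(len(v), hex(v[1]))  # 3 0x10000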

On Fri, Aug 26, 2011 at 3:57 PM, Terry Reedy <tjreedy@udel.edu> wrote:
I don't think anyone else has that impression. Please cite chapter and verse if you really think this is important. IIUC, UCS-2 does not allow surrogate pairs, whereas Python (and Java, and .NET, and Windows) 16-bit strings all do support surrogate pairs. And they all have a len or length function that counts code units, not code points.
Their API style is completely different from ours. What Perl can maintain has little bearing on what Python can. -- --Guido van Rossum (python.org/~guido)

On 8/26/2011 8:42 PM, Guido van Rossum wrote:
On Fri, Aug 26, 2011 at 3:57 PM, Terry Reedy<tjreedy@udel.edu> wrote:
For that reason, I think UTF-16 is a better term than UCS-2 for narrow builds (whether or not the above impression is true). But Marc Lemburg disagrees. http://mail.python.org/pipermail/python-dev/2010-November/105751.html The 2.7 docs still refer to ucs2 builds, as is his wish. --- Terry Jan Reedy

On Aug 26, 2011, at 8:51 PM, Terry Reedy wrote:
I agree. It's weird to call something UCS-2 if code points above 65535 are representable. The naming convention for codecs is that the UTF prefix is used for lossless encodings that cover the entire range of Unicode. "The first amendment to the original edition of the UCS defined UTF-16, an extension of UCS-2, to represent code points outside the BMP." Raymond

Raymond Hettinger writes:
The naming convention for codecs is that the UTF prefix is used for lossless encodings that cover the entire range of Unicode.
Sure. The operative word here is "codec", not "str", though.
Since when can s[0] represent a code point outside the BMP, for s a Unicode string in a narrow build? Remember, the UCS-2/narrow vs. UCS-4/wide distinction is *not* about what Python supports vs. the outside world. It's about what the str/unicode type is an array of.

Antoine Pitrou writes:
Because what the outside world sees is produced by codecs, not by str. The outside world can't see whether you have narrow or wide unless it uses indexing ... ie, experiments to determine what the str type is an array of. The problem with a narrow build (whether for space efficiency in CPython or for platform compatibility in Jython and IronPython) is not that we have no UTF-16 codecs. It's that array ops aren't UTF-16 conformant.

Antoine Pitrou writes:
Sorry, what is a conformant UTF-16 array op?
For starters, one that doesn't ever return lone surrogates, but rather interprets surrogate pairs as Unicode code points as in UTF-16. (This is not a Unicode standard definition, it's intended to be suggestive of why many app writers will be distressed if they must use Python unicode/str in a narrow build without a fairly comprehensive library that wraps the arrays in operations that treat unicode/str as an array of code points.)

Guido van Rossum writes:
On Tue, Aug 30, 2011 at 7:55 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Well, that's why I wrote "intended to be suggestive". The Unicode Standard does not specify at all what the internal representation of characters may be, it only specifies what their external behavior must be when two processes communicate. (For "process" as used in the standard, think "Python modules" here, since we are concerned with the problems of folks who develop in Python.) When observing the behavior of a Unicode process, there are no UTF-16 arrays or UTF-8 arrays or even UTF-32 arrays; only arrays of characters. Thus, according to the rules of handling a UTF-16 stream, it is an error to observe a lone surrogate or a surrogate pair that isn't a high-low pair (Unicode 6.0, Ch. 3 "Conformance", requirements C1 and C8-C10). That's what I mean by "can't tell it's UTF-16". And I understand those requirements to mean that operations on UTF-16 streams should produce UTF-16 streams, or raise an error. Without that closure property for basic operations on str, I think it's a bad idea to say that the representation of text in a str in a pre-PEP-393 "narrow" build is UTF-16. For many users and app developers, it creates expectations that are not fulfilled. It's true that common usage is that an array of code units that usually conforms to UTF-16 may be called "UTF-16" without the closure properties. I just disagree with that usage, because there are two camps that interpret "UTF-16" differently. One side says, "we have an array representation in UTF-16 that can handle all Unicode code points efficiently, and if you think you need more, think again", while the other says "it's too painful to have to check every result for valid UTF-16, and we need a UTF-16 type that supports the usual array operations on *characters* via the usual operators; if you think otherwise, think again." Note that despite the (presumed) resolution of the UTF-16 issue for CPython by PEP 393, at some point a very similar discussion will take place over "characters" anyway, because users and app developers are going to want a type that handles composition sequences and/or grapheme clusters for them, as well as comparison that respects canonical equivalence, even if it is inefficient compared to str. That's why I insisted on use of "array of code points" to describe the PEP 393 str type, rather than "array of characters".

On 8/30/2011 11:03 PM, Stephen J. Turnbull wrote:
On topic: So from reading all this discussion, I think this point is rather a key one... and it has been made repeatedly in different ways: Arrays are not suitable for manipulating Unicode character sequences, and the str type is an array with a veneer of text manipulation operations, which do not, and cannot, by themselves, efficiently implement Unicode character sequences. Python wants to, should, and can implement UTF-16 streams, UTF-8 streams, and UTF-32 streams. It should, and can implement streams using other encodings as well, and also binary streams. Python wants to, should, and can implement 8-bit, 16-bit, 32-bit, and 64-bit arrays. These are efficient to access, index, and slice. Python implements a veneer on some 8-bit, 16-bit, and 32-bit arrays called str (this will be more true post-PEP 393, although it is true with caveats presently), which interpret array elements as code units (currently) or codepoints (post-PEP), and implements operations that are interesting for text processing, with caveats. There is presently no support for arrays of Unicode grapheme clusters or composed characters. The Python type called str may or may not be properly documented (to the extent that there is confusion between the actual contents of the elements of the type, and the concept of character as defined by Unicode). From comments Guido has made, he is not interested in changing the efficiency or access methods of the str type to raise the level of support of Unicode to the composed character, or grapheme cluster concepts. The str type itself can presently be used to process other character encodings: if they are fixed width < 32-bit elements those encodings might be considered Unicode encodings, but there is no requirement that they are, and some operations on str may operate with knowledge of some Unicode semantics, so there are caveats. So it seems that any semantics in support of composed characters, grapheme clusters, or codepoints-stored-as-<32-bit-code-units, must be created as either an add-on Python package (in Python) or C extension, or a combination. It could be based on extensions to the existing str type, or it could be based on the array type, or it could based on the bytes type. It could use an internal format of 32-bit codepoints, PEP 393 variable-size codepoints, or 8- or 16-bit codeunits. In addition to the expected stream operations, character length, indexing, and slicing operations, additional more complex operations would be expected on Unicode string values: regular expressions, comparisons, collations, case-shifting, and perhaps more. RTL and LTR awareness would add complexity to all operations, or at least variants of all operations. The questions are: 1) Is anyone interested in writing a PEP for such a thing? 2) Is anyone interested in writing an implementation for such a thing? 3) How many conflicting opinions and arguments will be spawned, making the poor person or persons above lose interest? Brainstorming ideas (which may wander off-topic in some regards, but were all inspired by this discussion): BI-0: Tom's analysis makes me think that the UTF-8 encoding, since it is smallest on the average language, and an implementation based on a foundation type of bytes or 'B' arrays, plus some side indexes of some sort, could be an appropriate starting point. UTF-8 is variable length, but so are composed characters and grapheme clusters. 
Building an array, each of whose units could hold the largest grapheme cluster would seem extremely inefficient, just like 32-bit Unicode is extremely inefficient for dealing with ASCII, so variable length units seem to be an imperative part of a solution. At least until one thinks up BI-2. BI-1: Perhaps a 32-bit base, with the upper 11 bits used to cache character characteristics from various character attribute database lookups could be an effective alternative, but wouldn't eliminate the need for dealing with variable length units for length, indexing, and slicing operations. BI-2: Maybe a 32-bit base would be useful so that one high bit could be used to flag that this character position actually holds an index to a multi-codepoint character, and the index would then hold the actual codes for that character. This would allow for at most 2^31 (and memory limited) different multi-codepoint characters in a string (or perhaps per application, if the multi-codepoint characters are shared between strings), but would suddenly allow array indexing of grapheme clusters and composed characters... with double-indexing required for multi-codepoint character access. [This idea seems similar to one that was mentioned elsewhere in this thread, suggesting that private use characters could be used to represent multi-codepoint characters, but (a) doesn't infringe on private uses, and (b) allows for a greater number of multi-codepoint characters to be used.] BI-3: both BI-1 and BI-2 would also allow themselves to be built on top of PEP 393 str... allowing multi-codepoint-character-supporting applications to benefit from the space efficiencies of PEP 393 when no multi-codepoint characters are fed into the application. BI-4: Since Unicode has 21-bit codepoints, one wonders if 24-bit array elements might be appropriate, rather than 32-bit. BI-2 could still operate, with a theoretical reduction to 2^23 possible multi-codepoint characters in an application. Access would be less efficient, but still O(1), and 25% of the space would be saved. This idea could be applied to PEP 393 independently of multi-codepoint character support. BI-5: I'm pretty sure there are inappropriate or illegal sequences of combining characters that should not stand alone. One example of this is lone surrogates. Such characters out of an appropriate sequence could be flagged with a high-bit so that they could be quickly recognized as illegal Unicode, but codecs could be provided to allow them to round-trip, and applications could recognize immediately that they should be handled as "binary gibberish" in an otherwise Unicode stream. This idea could be applied to PEP 393 independently of additional multi-codepoint character support. BI-6: Maybe another high bit could be used with a different codec error handler instead of using lone surrogates when decoding not-quite-conformant byte streams (such as OS filenames). Sad we didn't think of this one before doing all the lone surrogate stuff. Of course, this solution wouldn't work on narrow builds, because not even surrogates can represent high bits above Unicode codepoints! But once we have PEP 393, we _could_ replace inappropriate use of lone surrogates, with use of out-of-the-Unicode-codepoint range integers, without introducing ambiguity in the interpretation of lone surrogates. This idea could be applied to PEP 393 independently of multi-codepoint character support. Glenn

Glenn Linderman writes:
IMO, that would be a bad idea, as higher-level Unicode support should either be a wrapper around full implementations such as ICU (or platform support in .NET or Java), or written in pure Python at first. Thus there is a need for an efficient array of code units type. PEP 393 allows this to go to the level of code points, but evidently that is inappropriate for Jython and IronPython.
The str type itself can presently be used to process other character encodings:
Not really. Remember, on input codecs always decode to Unicode and on output they always encode from Unicode. How do you propose to get other encodings into the array of code units?
In theory yes, but in practice all of the string methods and libraries like re operate on str (and often but not always bytes; in particular, codecs always decode from bytes and encode to bytes). Why bother with anything except arrays of code points at the start? PEP 393 makes that time-efficient and reasonably space-efficient as a starting point, and allows starting with re or MRAB's regex to get basic RE functionality or good UTS #18 functionality respectively. Plus str already has all the usual string operations (.startswith(), .join(), etc.), and we have modules for dealing with the Unicode Character Database. Why waste effort reintegrating with all that, until we have common use cases that need more efficient representation? There would be some issue in coming up with an appropriate UTF-16 to code point API for Jython and IronPython, but Terry Reedy has a rather efficient library for that already. So this discussion of alternative representations, including use of high bits to represent properties, is premature optimization ... especially since we don't even have a proto-PEP specifying how much conformance we want of this new "true Unicode" type in the first place. We need to focus on that before optimizing anything.

On 8/31/2011 5:21 AM, Stephen J. Turnbull wrote:
OK you agree with Guido.
Here are two ways (there may be more): custom codecs, direct assignment.
String methods could be reimplemented on any appropriate type, of course. Rejecting alternatives too soon might make one miss the best design.
Yes, Terry's implementation is interesting, and inspiring, and that concept could be extended to a variety of interesting techniques: codepoint access of code unit representations, and multi-codepoint character access on top of either code unit or codepoint representations.
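For example, codepoint access over a UTF-16 code unit representation could look something like this (a rough sketch of the bisect technique, not Terry's actual code):

from bisect import bisect_left

class U16String:
    """Code point indexing over UTF-16 code units in O(log k),
    where k is the number of non-BMP characters."""

    def __init__(self, units):            # units: 16-bit code units
        self.units = units
        self.astral = []                  # code point positions of pairs
        cp = i = 0
        while i < len(units):
            if 0xD800 <= units[i] < 0xDC00:   # lead surrogate
                self.astral.append(cp)
                i += 2
            else:
                i += 1
            cp += 1
        self.length = cp

    def __len__(self):
        return self.length

    def __getitem__(self, cp_index):
        # each astral character before cp_index adds one extra code unit
        u = cp_index + bisect_left(self.astral, cp_index)
        unit = self.units[u]
        if 0xD800 <= unit < 0xDC00:
            trail = self.units[u + 1]
            return chr(0x10000 + ((unit - 0xD800) << 10) + (trail - 0xDC00))
        return chr(unit)

s = U16String([0x61, 0xD834, 0xDD1E, 0x62])   # "a", U+1D11E, "b"
print(s[1], s[2], len(s))                     # U+1D11E, "b", 3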
You may call it premature optimization if you like, or you can ignore the concepts and emails altogether. I call it brainstorming for ideas, looking for non-obvious solutions to the problem of representation of Unicode. I found your discussion of streams versus arrays, as separate concepts related to Unicode, along with Terry's bisect indexing implementation, rather inspiring.

Just because Unicode defines streams of code units of various sizes (UTF-8, UTF-16, UTF-32) to represent characters when processes communicate and for storage (which is one way processes communicate), that doesn't imply that the internal representation of character strings in a programming language must use exactly that representation. While there are efficiencies in using the same representation as is used by the communications streams, there are also inefficiencies. I'm unaware of any current Python implementation that has chosen to use UTF-8 as the internal representation of character strings (though I am aware that Perl has made that choice), yet UTF-8 is one of the commonly recommended character representations on the Linux platform, from what I read. So in that sense, Python has rejected the idea of using the "native" or "OS-configured" representation as its internal representation. So why, then, must one choose from a repertoire of Unicode-defined stream representations if they don't meet the goal of efficient length, indexing, or slicing operations on actual characters?

Glenn Linderman writes:
That is true, and Unicode is *very* careful to define its requirements so that is true. That doesn't mean using an alternative representation is an improvement, though.
There are two reasons for that. First, widechar representations are right out for anything related to the file system or OS, unless you are prepared to translate before passing to the OS. If you use UTF-8, then asking the user to use a UTF-8 locale to communicate with your app is a plausible way to eliminate any translation in your app. (The original moniker for UTF-8 was UTF-FSS, where FSS stands for "file system safe.") Second, much text processing is stream-oriented and one-pass. In those cases, the variable-width nature of UTF-8 doesn't cost you anything. Eg, this is why the common GUIs for Unix (X.org, GTK+, and Qt) either provide or require UTF-8 coding for their text. It costs *them* nothing and is file-system-safe.
I can't agree with that characterization. POSIX defines the concept of *locale* precisely because the "native" representation of text in Unix is ASCII. Obviously that won't fly, so they solved the problem in the worst possible way<wink/>: they made the representation variable! It is the *variability* of text representation that Python rejects, just as Emacs and Perl do. They happen to have chosen six different representations.[1]
One need not. But why do anything else? It's not like the authors of that standard paid no attention to various concerns about efficiency and backward compatibility! That's the question that you have not answered, and I am presently lacking in any data that suggests I'll ever need the facilities you propose. Footnotes: [1] Emacs recently changed its mind. Originally it used the so-called MULE encoding, and now a different extension of UTF-8 from Perl. Of course, Python beats that, with narrow, wide, and now PEP-393 representations!<wink />

Stephen J. Turnbull:
... Eg, this is why the common GUIs for Unix (X.org, GTK+, and Qt) either provide or require UTF-8 coding for their text.
Qt uses UTF-16 for its basic QString type. While QString is mostly treated as a black box which you can create from input buffers in any encoding, the only encoding allowed for a contents-by-reference QString (QString::fromRawData) is UTF-16. http://doc.qt.nokia.com/latest/qstring.html#fromRawData Neil

On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman <v+python@g.nevcal.com> wrote:
I think this is too strong. The str type is indeed an array, and you can build useful Unicode manipulation APIs on top of it. Just like bytes are not UTF-8, but can be used to represent UTF-8 and a fully-compliant UTF-8 codec can be implemented on top of it. -- --Guido van Rossum (python.org/~guido)

On 8/31/2011 10:12 AM, Guido van Rossum wrote:
This statement is a logical conclusion of arguments presented in this thread. 1) Applications that wish to do grapheme access wish to do it by grapheme array indexing, because that is the efficient way to do it. 2) As long as str is restricted to holding Unicode code units or code points, it cannot support grapheme array indexing efficiently. I have not declared that useful Unicode manipulation APIs cannot be built on top of str, only that efficiency will suffer.

On Wed, Aug 31, 2011 at 11:51 AM, Glenn Linderman <v+python@g.nevcal.com>wrote:
I don't believe that should be taken as gospel. In Perl, they don't do array indexing on strings at all, and use regex matching instead. An API that uses some kind of cursor on a string might work fine in Python too (for grapheme matching). 2) As long as str is restricted to holding Unicode code units or code
But you have not proven it. -- --Guido van Rossum (python.org/~guido)

On 8/31/2011 11:56 AM, Guido van Rossum wrote:
In the last benchmark I saw, regexps in Perl were faster than regexps in Python; that was some years back, before regexps in Perl supported quite as much Unicode as they do now. I'm not sure if anyone has done recent performance benchmarks; Tom's survey indicates that the functionality presently differs, so it is not clear whether performance benchmarks are presently an appropriate way to compare Unicode operations in regexps between the two languages. That said, regexps, or some sort of cursor on a string, might be a workable solution. Will it have adequate performance? Perhaps, at least for some applications. Will it be as conceptually simple as indexing an array of graphemes? No. Will it ever reach the efficiency of indexing an array of graphemes? No. Does that matter? Depends on the application.
Do you disagree that indexing an array is more efficient than manipulating strings with regex or binary trees? I think not, because you are insistent that array indexing of str be preserved as O(1). I agree that I have not proven it; it largely depends on whether or not indexing by grapheme cluster is a useful operation in applications. Yet Stephen (I think) has commented that emacs performance goes down as soon as multi-byte characters are introduced into an edit buffer. So I think he has proven that efficiency can suffer, in some implementations/applications. Terry's O(k) implementation requires data beyond strings, and isn't O(1).

Glenn Linderman:
Using an iterator for cluster access is a common technique currently. For example, with the Pango text layout and drawing library, you may create a PangoLayoutIter over a text layout object (which contains a UTF-8 string along with formatting information) and iterate by clusters by calling pango_layout_iter_next_cluster. Direct access to clusters by index is not as useful in this domain as access by pixel positions - for example to examine the portion of a layout visible in a window. http://developer.gnome.org/pango/stable/pango-Layout-Objects.html#pango-layo... In this API, 'index' is used to refer to a byte index into UTF-8, not a character or cluster index.

Rather than discuss functionality in the abstract, we need some use cases involving different levels of character and cluster access to see whether providing indexed access is worthwhile. I'll start with an example: some text drawing engines draw decomposed characters ("o" followed by " ̈" -> "ö") differently compared to their composite equivalents ("ö") and this may be perceived as better or worse. I'd like to offer an option to replace some decomposed characters with their composite equivalent before drawing but since other characters may look worse, I don't want to do a full normalization.

The API style that appears most useful for this example is an iterator over the input string that yields composed and decomposed character strings (that is, it will yield both "ö" and "ö"), each character string is then converted if in a substitution dictionary and written to an output string. This is similar to an iterator over grapheme clusters although, since it is only aimed at composing sequences, the iterator could be simpler than a full grapheme cluster iterator.

One of the benefits of iterator access to text is that many different iterators can be built without burdening the implementation object with extra memory costs as would be likely with techniques that build indexes into the representation. Neil
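P.S. A rough sketch of that iterator style, assuming a simplified notion of a combining sequence (illustrative only; the substitution dictionary contents are made up, and this is not Pango's algorithm):

import unicodedata

def combining_sequences(text):
    """Yield each base character together with any combining marks."""
    cluster = ""
    for ch in text:
        if cluster and unicodedata.combining(ch):
            cluster += ch          # attach the mark to its base
        else:
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster

# Substitute composed equivalents only for selected sequences.
SUBSTITUTIONS = {"o\u0308": "\u00f6"}   # "o" + combining diaeresis -> "ö"

def selectively_compose(text):
    return "".join(SUBSTITUTIONS.get(seq, seq)
                   for seq in combining_sequences(text))

print(selectively_compose("Zo\u0308e\u0301"))  # composes only the "ö"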

Guido van Rossum:
No, since normalization of all cases may actually lead to worse visuals in some situations. A potential reason for drawing decomposed characters differently is that more room may be allocated for the generic condition where a character may be combined with a wide variety of accents compared with combining it with a specific accent. Here is an example on Windows drawing composite and decomposed forms to show the types of difference often encountered. http://scintilla.org/Composite.png Now, this particular example displays both forms quite reasonably so would not justify special processing but I have seen on other platforms and earlier versions of Windows where the umlaut in the decomposed form is displaced to the right even to the extent of disappearing under the next character. In the example, the decomposed 'o' is shorter and lighter and the umlauts are round instead of square. Neil

On Wed, Aug 31, 2011 at 6:29 PM, Neil Hodgson <nyamatongwe@gmail.com> wrote:
Ok, I thought there was also a form normalized (denormalized?) to decomposed form. But I'll take your word.
I'm not sure it's a good idea to try and improve on the font using such a hack. But I won't deny you have the right. :-) -- --Guido van Rossum (python.org/~guido)

Ok, I thought there was also a form normalized (denormalized?) to decomposed form. But I'll take your word.
If I understood the example correctly, he needs a mixed form, with some characters decomposed and some composed (depending on which one looks better in the given font). I agree that this sounds more like a font problem, but it's a widespread font problem and it may be necessary to address it in an application. But this is only one example of why an application-specific concept of graphemes different from the Unicode-defined normalized forms can be useful. I think the very concept of a grapheme is context, language, and culture specific. For example, in Chinese Pinyin it would be very natural to write tone marks with combining diacritics (i.e. in decomposed form). But then you have the vowel "ü" and it would be strange to decompose it into a "u" and a combining diaeresis. So conceptually the most sensible representation of "lǜ" would be neither the composed nor the decomposed normal form, and depending on its needs an application might want to represent it in the mixed form (composing the diaeresis with the "u", but leaving the grave accent separate). There must be many more examples where the conceptual context determines the right composition, like "ñ", which in Spanish is certainly a grapheme, but in mathematics might be better represented as n-tilde. The bottom line is that, while an array of Unicode code points is certainly a generally useful data type (and PEP 393 is a great improvement in this regard), an array of graphemes carries many subtleties and may not be nearly as universal. Support in the spirit of unicodedata's normalization function etc. is certainly a good thing, but we shouldn't assume that everyone will want Python to do their graphemes for them. - Hagen
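P.S. The three representations of "lǜ" are easy to compare with unicodedata (an illustrative snippet):

import unicodedata

composed = "l\u01dc"                    # "lǜ" fully composed (U+01DC)
decomposed = unicodedata.normalize("NFD", composed)  # l u U+0308 U+0300
mixed = "l\u00fc\u0300"                 # "ü" composed, grave separate

# Three different code point sequences for the same visual text...
assert composed != decomposed and composed != mixed and decomposed != mixed
# ...which all collapse to the same string under NFC:
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFC", mixed) == composed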

On 8/31/2011 5:58 PM, Neil Hodgson wrote:
I agree that different applications may have different needs for different types of indexes to various starting points in a large string. Where a custom index is required, a standard index may not be needed.
How many different iterators into the same text would be concurrently needed by an application? And why? Seems like if it is dealing with text at the level of grapheme clusters, it needs that type of iterator. Of course, if it does I/O it needs codec access, but that is by nature sequential from the starting point to the end point.

Glenn Linderman:
I would expect that there would mostly be a single iterator into a string but can imagine scenarios in which multiple iterators may be concurrently active and that these could be of different types. For example, say we wanted to search for each code point in a text that fails some test (such as being a member of a set of unwanted vowel diacritics) and then display that failure in context with its surrounding text of up to 30 graphemes either side. Neil

Glenn Linderman writes:
How many different iterators into the same text would be concurrently needed by an application? And why?
A WYSIWYG editor for structured text (TeX, HTML) might want two (at least), one for the "source" window and one for the "rendered" window. One might want to save the state of the iterators (if that's possible) and cache it as one moves the "window" forward to make short backward motion fast, giving you two (or four, etc) more.
`save-region' ? `save-text-remove-markup' ?

On 9/1/2011 2:15 AM, Stephen J. Turnbull wrote:
Sure. But those are probably all the same type of iterators — probably (since they are WYSIWYG) dealing with multi-codepoint characters (Guido's recent definition of grapheme, which seems to subsume both grapheme clusters and composed characters). Hence all of them would be using/requiring the same sort of representation, index, analysis, or some combination of those.
Yes, save-region sounds like exactly what I was speaking of. save-text-remove-markup I would infer needs to process the text to remove the markup characters... since you used TeX and HTML as examples, markup is text, not binary (which would be a different problem). Since the TeX and HTML markup is mostly ASCII, markup removal (or more likely, text extraction) could be performed via either a grapheme iterator, or a codepoint iterator, or even a code unit iterator.

On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman <v+python@g.nevcal.com> wrote:
Actually, the str type in Python 3 and the unicode type in Python 2 are constrained everywhere to either 16-bit or 21-bit "characters". (Except when writing C code, which can do any number of invalid things so is the equivalent of assuming 1 == 0.) In particular, on a wide build, there is no way to get a code point >= 2**21, and I don't want PEP 393 to change this. So at best we can use these types to represent arrays of 21-bit unsigned ints. But I think it is more useful to think of them as always representing "some form of Unicode", whether that is UTF-16 (on narrow builds) or 21-bit code points or perhaps some vaguely similar superset -- but for those code units/code points that are representable *and* valid (either code points or code units) according to the (supported version of) the Unicode standard, the meaning of those code points/units matches that of the standard. Note that this is different from the bytes type, where the meaning of a byte is entirely determined by what it means in the programmer's head. -- --Guido van Rossum (python.org/~guido)

On 8/31/2011 10:20 AM, Guido van Rossum wrote:
Sorry, my Perl background is leaking through. I didn't double check that str constrains the values of each element to range 0x110000 but I see now by testing that it does. For some of my ideas, then, either a subtype of str would have to be able to relax that constraint, or str would not be the appropriate base type to use (but there are other base types that could be used, so this is not a serious issue for the ideas). I have no problem with thinking of str as representing "some form of Unicode". None of my proposals change that, although they may change other things, and may invent new forms of Unicode representations. You have stated that it is better to document what str actually does, rather than attempt to adhere slavishly to Unicode standard concepts. The Unicode Consortium may well define legal, conforming bytestreams for communicating processes, but languages and applications are free to use other representations internally. We can either artificially constrain ourselves to minor tweaks of the legal conforming bytestreams, or we can invent a representation (whether called str or something else) that is useful and efficient in practice.

Glenn Linderman writes:
We can either artificially constrain ourselves to minor tweaks of the legal conforming bytestreams,
It's not artificial. Having the internal representation be the same as a standard encoding is very useful for a large number of minor usages (urgently saving buffers in a text editor that knows its internal state is inconsistent, viewing strings in the debugger, PEP 393-style space optimization is simpler if text properties are out-of-band, etc).
or we can invent a representation (whether called str or something else) that is useful and efficient in practice.
Bring on the practice, then. You say that a bit to identify lone surrogates might be useful or efficient. In what application? How much time or space does it save? You say that a bit to cache a property might be useful or efficient. In what application? Which properties? Are those properties a set fixed by the language, or would some bits be available for application-specific property caching? How much time or space does that save? What are the costs to applications that don't want the cache? How is the bit-cache affected by PEP 393? I know of no answers (none!) to those questions that favor introduction of a bit-cache representation now. And those bits aren't going anywhere; it will always be possible to use a "wide" build and change the representation later, if the optimization is valuable enough. Now, I'm aware that my experience is limited to the implementations of one general-purpose language (Emacs Lisp) of restricted applicability. But its primary use *is* in text processing, so I'm moderately expert. *Moderately*. Always interested in learning more, though. If you know of relevant use cases, I'm listening! Even if Guido doesn't find them convincing for Python, we might find them interesting at XEmacs.

On 9/1/2011 12:59 AM, Stephen J. Turnbull wrote:
saving buffers urgently when the internal state is inconsistent sounds like carefully preserving a bug. Windows 7 64-bit on one of my computers happily crashes several times a day when it detects inconsistent internal state... under the theory, I guess, that losing work is better than saving bad work. You sound the opposite. I'm actually very grateful that Firefox and emacs recover gracefully from Windows crashes, and I lose very little data from the crashes, but cannot recommend Windows 7 (this machine being my only experience with it) for stability. In any case, the operations you mention still require the data to be processed, if ever so slightly, and I'll admit that a more complex representation would require a bit more processing. Not clear that it would be huge or problematical for these cases. Except, I'm not sure how PEP 393 space optimization fits with the other operations. It may even be that an application-wide complex-grapheme cache would save significant space, although if it uses high-bits in a string representation to reference the cache, PEP 393 would jump immediately to something > 16 bits per grapheme... but likely would anyway, if complex-graphemes are in the data stream.
I didn't attribute any efficiency to flagging lone surrogates (BI-5). Since Windows uses a non-validated UCS-2 or UTF-16 character type, any Python program that obtains data from Windows APIs may be confronted with lone surrogates or inappropriate combining characters at any time. Round-tripping that data seems useful, even though the data itself may not be as useful as validated Unicode characters would be. Accidentally combining the characters due to slicing and dicing the data, and doing normalizations, or what not, would not likely be appropriate. However, returning modified forms of it to Windows as UCS-2 or UTF-16 data may still cause other applications to later accidentally combine the characters, if the modifications juxtaposed things to make them look reasonably, even if accidentally. If intentionally, of course, the bit could be turned off. This exact sort of problem with non-validated UTF-8 bytes was addressed already in Python, mostly for Linux, allowing round-tripping of the byte stream, even though it is not valid. BI-6 suggests a different scheme for that, without introducing lone surrogates (which might accidentally get combined with other lone surrogates).
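For reference, the existing round-trip machinery behaves like this (an illustrative session): undecodable bytes become lone surrogates in the U+DC80..U+DCFF range, and encode back to the original bytes.

raw = b"caf\xe9"                            # not valid UTF-8
s = raw.decode("utf-8", "surrogateescape")  # -> 'caf\udce9'
assert s.encode("utf-8", "surrogateescape") == raw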
The brainstorming ideas I presented were just that... ideas. And they were independent. And the use of many high-order bits for properties was one of the independent ones. When I wrote that one, I was assuming a UTF-32 representation (which wastes 11 bits of each 32). One thing I did have in mind, with the high-order bits, for that representation, was to flag the start or end or middle of the codes that are included in a grapheme. That would be redundant with some of the Unicode codepoint property databases, if I understand them properly... whether it would make iterators enough more efficient to be worth the complexity would have to be benchmarked. After writing all those ideas down, I actually preferred some of the others, that achieved O(1) real grapheme indexing, rather than caching character properties.
What are the costs to applications that don't want the cache? How is the bit-cache affected by PEP 393?
If it is a separate type from str, then it costs nothing except the extra code space to implement the cache for those applications that do want it... most of which wouldn't be loaded for applications that don't, if done as a module or C extension.
OK... ignore the bit-cache idea (BI-1), and reread the others without having your mind clogged with that one, and see if any of them make sense to you then. But you may be too biased by the "minor" needs of keeping the internal representation similar to the stream representation to see any value in them. I rather like BI-2, since it allows O(1) indexing of graphemes.

Glenn Linderman writes:
Definitely. Windows apps habitually overwrite existing work; saving when inconsistent would be a bad idea. The apps I work on dump their unsaved buffers to new files, and give you a chance to look at them before instating them as the current version when you restart.
The only language I know of that uses thousands of complex graphemes is Korean ... and the precomposed forms are already in the BMP. I don't know how many accented forms you're likely to see in Vietnamese, but I suspect it's less than 6400 (the number of characters in private space in the BMP). So for most applications, I believe that mapping both non-BMP code points and grapheme clusters into that private space should be feasible. The only potential counterexample I can think of is display of Arabic, which I have heard has thousands of glyphs in good fonts because of the various ways ligatures form in that script. However AFAIK no apps encode these as characters; I'm just admitting that it *might* be useful. This will require some care in registering such characters and clusters because input text may already use private space according to some convention, which would need to be respected. Still, 6400 characters is a lot, even for the Japanese (IIRC the combined repertoire of "corporate characters" that for some reason never made it into the JIS sets is about 600, but almost all of them are already in the BMP). I believe the total number of Japanese emoticons is about 200, but I doubt that any given text is likely to use more than a few. So I think there's plenty of space there. This has a few advantages: (1) since these are real characters, all Unicode algorithms will apply as long as the appropriate properties are applied to the character in the database, and (2) it works with a narrow code unit (specifically, UCS-2, but it could also be used with UTF-8). If you really need more than 6400 grapheme clusters, promote to UTF-32, and get two more whole planes full (about 130,000 code points).
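A minimal sketch of that registration idea (hypothetical names; note the BMP private use area provides exactly the 6400 slots mentioned above):

PUA_START, PUA_END = 0xE000, 0xF8FF    # 6400 BMP private use code points

class ClusterRegistry:
    def __init__(self):
        self.to_pua = {}               # cluster -> private code point
        self.from_pua = {}

    def intern(self, cluster):
        """Map a multi-code-point cluster to a single PUA character."""
        cp = self.to_pua.get(cluster)
        if cp is None:
            cp = PUA_START + len(self.to_pua)
            if cp > PUA_END:
                raise OverflowError("private use area exhausted")
            self.to_pua[cluster] = cp
            self.from_pua[cp] = cluster
        return chr(cp)

    def externalize(self, s):
        """Expand registered PUA characters back to their clusters."""
        return "".join(self.from_pua.get(ord(c), c) for c in s)

reg = ClusterRegistry()
s = "Zo\u0308e"                        # "Zöe" with a decomposed "ö"
packed = s.replace("o\u0308", reg.intern("o\u0308"))
assert len(packed) == 3                # O(1) indexing by grapheme now
assert reg.externalize(packed) == s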
I don't think so. AFAIK all that data must pass through a codec, which will validate it unless you specifically tell it not to.
Round-tripping that data seems useful,
The standard doesn't forbid that. (ISTR it did so in the past, but what is required in 6.0 is a specific algorithm for identifying well-formed portions of the text, basically "if you're currently in an invalid region, read individual code units and attempt to assemble a valid sequence -- as soon as you do, that is a valid code point, and you switch into valid state and return to the normal algorithm".) Specifically, since surrogates are not characters, leaving them in the data does not constitute "interpreting them as characters." I don't recall if any of the error handlers allow this, though.
In CPython AFAIK (I don't do Windows) this can only happen if you use a non-default error setting in the output codec.
If you need O(1) grapheme indexing, use of private space seems a winner to me. It's just defining private precombined characters, and they won't bother any Unicode application, even if they leak out.
I'm talking about the bit-cache (which all of your BI-N referred to, at least indirectly). Many applications will want to work with fully composed characters, whether they're represented in a single code point or not. But they may not care about any of the bit-cache ideas.
No, I'm biased by the fact that I already have good ways to do them without leaving the set of representations provided by Unicode (often ways which provide additional advantages), and by the fact that I myself don't know any use cases for the bit-cache yet.
I rather like BI-2, since it allow O(1) indexing of graphemes.
I do too (without suggesting a non-standard representation, ie, using private space), but I'm sure that wheel has been reinvented quite frequently. It's a very common trick in text processing, although I don't know of other applications where it's specifically used to turn data that "fails to be an array just a little bit" into a true array (although I suppose you could view fixed-width EUC encodings that way).

On Tue, Aug 30, 2011 at 11:03 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote: [me]
Hm, that's not how I would read "process". IMO that is an intentionally vague term, and we are free to decide how to interpret it. I don't think it will work very well to define a process as a Python module; what about Python modules that agree about passing along arrays of code units (or streams of UTF-8, for that matter)? This is why I find the issue of Python, the language (and stdlib), as a whole "conforming to the Unicode standard" such a troublesome concept -- I think it is something that an application may claim, but the language should make much more modest claims, such as "the regular expression syntax supports features X, Y and Z from the Unicode recommendation XXX", or "the UTF-8 codec will never emit a sequence of bytes that is invalid according to Unicode specification YYY". (As long as the Unicode references are also versioned or dated.) I'm fine with saying "it is hard to write Unicode-conforming application code for reason ZZZ" and proposing a fix (e.g. PEP 393 fixes a specific complaint about code units being inferior to code points for most types of processing). I'm not fine with saying "the string datatype should conform to the Unicode standard".
But if you can observe (valid) surrogate pairs it is still UTF-16.
Ok, I dig this, to some extent. However saying it is UCS-2 is equally bad. I guess this is why Java and .NET just say their string types contain arrays of "16-bit characters", with essentially no semantics attached to the word "character" besides "16-bit unsigned integer". At the same time I think it would be useful if certain string operations like .lower() worked in such a way that *if* the input were valid UTF-16, *then* the output would also be, while *if* the input contained an invalid surrogate, the result would simply be something that is no worse (in particular, those are all mapped to themselves). We could even go further and have .lower() and friends look at graphemes (multi-code-point characters) if the Unicode std has a useful definition of e.g. lowercasing graphemes that differed from lowercasing code points. An analogy is actually found in .lower() on 8-bit strings in Python 2: it assumes the string contains ASCII, and non-ASCII characters are mapped to themselves. If your string contains Latin-1 or EBCDIC or UTF-8 it will not do the right thing. But that doesn't mean strings cannot contain those encodings, it just means that the .lower() method is not useful if they do. (Why ASCII? Because that is the system encoding in Python 2.)
I think we should just document how it behaves and not get hung up on what it is called. Mentioning UTF-16 is still useful because it indicates that some operations may act properly on surrogate pairs. (Also because of course character properties for BMP characters are respected, etc.)
Let's call those things graphemes (Tom C's term, I quite like leaving "character" ambiguous) -- they are sequences of multiple code points that represent a single "visual squiggle" (the kind of thing that you'd want to be swappable in vim with "xp" :-). I agree that APIs are needed to manipulate (match, generate, validate, mutilate, etc.) things at the grapheme level. I don't agree that this means a separate data type is required. There are ever-larger units of information encoded in text strings, with ever farther-reaching (and more vague) requirements on valid sequences. Do you want to have a data type that can represent (only valid) words in a language? Sentences? Novels?

I think that at this point in time the best we can do is claim that Python (the language standard) uses either 16-bit code units or 21-bit code points in its string datatype, and that, thanks to PEP 393, CPython 3.3 and further will always use 21-bit code points (but Jython and IronPython may forever use their platform's native 16-bit code unit representing string type). And then we add APIs that can be used everywhere to look for code points (even if the string contains code units), graphemes, or larger constructs. I'd like those APIs to be designed using a garbage-in-garbage-out principle, where if the input conforms to some Unicode requirement, the output does too, but if the input doesn't, the output does what makes most sense. Validation is then limited to codecs, and optional calls.

If you index or slice a string, or create a string from chr() of a surrogate or from some other value that the Unicode standard considers an illegal code point, you better know what you are doing. I want chr(i) to be valid for all values of i in range(2**21), so it can be used to create a lone surrogate, or (on systems with 16-bit "characters") a surrogate pair. And also ord(chr(i)) == i for all i in range(2**21). I'm not sure about ord() on a 2-character string containing a surrogate pair on systems where strings contain 21-bit code points; I think it should be an error there, just as ord() on other strings of length != 1. But on systems with 16-bit "characters", ord() of strings of length 2 containing a valid surrogate pair should work. -- --Guido van Rossum (python.org/~guido)
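P.S. An illustrative session showing how CPython already behaves in this garbage-in-garbage-out style (note that chr() is actually capped at 0x110000, not 2**21):

lone = chr(0xD800)                     # creating a lone surrogate is fine
assert ord(lone) == 0xD800             # chr/ord round-trip holds
assert ("A" + lone).lower() == "a" + lone   # .lower() maps it to itself

try:
    lone.encode("utf-8")               # validation lives in the codec
except UnicodeEncodeError:
    pass
assert lone.encode("utf-8", "surrogatepass") == b"\xed\xa0\x80"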

On 8/31/2011 10:10 AM, Guido van Rossum wrote:
So if Python 3.3+ uses Unicode codepoints as its str representation, the analogy to ASCII and Python 2 would imply that it should permit out-of-range codepoints, if they can be represented in the underlying data values. Valid codecs would not create such codepoints on input, and valid codecs would not accept them on output. Operations on codepoints should, like .lower(), use the identity operation when applied to non-codepoints.
Interesting ideas. Once you break the idea that every code point must be directly indexed, and accept that higher-level concepts can be abstracted, appropriate codecs could produce a sequence of words, instead of characters. It depends on the purpose of the application whether such is interesting or not. I have been working a bit with ebook searching algorithms lately, and one idea is to extract from the text a list of words, and represent the words with codes. Do the same for the search string. Then the search, instead of looking for characters and character strings and skipping over punctuation, etc., can simply look for the appropriate sequence of word codes. In this case, part of the usefulness of the abstraction is the elimination of punctuation, so it is more of an index to the character text than an encoding of it... but if the encoding of the text extracted words, the creation of the index would then be extremely simple. I don't have applications in mind where representing sentences or novels would be particularly useful, but representing words could be extremely useful. Valid words? Given a language (or languages) and dictionary (or dictionaries), words could be flagged as valid or invalid for that dictionary. Representing invalid words could be similar to the idea of representing invalid UTF-8 bytes using the lone-surrogate error handler... possible when the application requests such.
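A rough sketch of that word-coding idea (crude tokenization, purely illustrative):

import re

def encode_words(text, codes):
    """Intern each word as an integer code; punctuation disappears."""
    return [codes.setdefault(w, len(codes))
            for w in re.findall(r"\w+", text.lower())]

def find(book_codes, query_codes):
    n = len(query_codes)
    for i in range(len(book_codes) - n + 1):
        if book_codes[i:i + n] == query_codes:
            return i                   # match at word position i
    return -1

codes = {}
book = encode_words("The quick brown fox; the quick fox!", codes)
query = encode_words("quick fox", codes)
print(find(book, query))               # -> 5, punctuation ignored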
So limiting the code point values to 21-bits (wasting 11 bits) only serves to prevent applications from using those 11 bits when they have extra-Unicode values to represent. There is no shortage of 32-bit datatypes to draw from, though, but it seems an unnecessary constraint if exact conformance to Unicode is not provided... conforming codecs wouldn't create such values on input nor accept them on output, so the constraint only serves to restrict applications from using all 32-bits of the underlying storage.
Yep. So str != Unicode. You keep saying that :) And others point out how some applications would benefit from encapsulating the complexities of Unicode semantics at various higher levels of abstractions. Sure, it can be tacked on, by adding complex access methods to a subtype of str, but losing O(1) indexing of those higher abstractions when that route is chosen.

On 8/31/2011 1:10 PM, Guido van Rossum wrote:
This will be a great improvement. It was both embarrassing and frustrating to have to respond to Tom C.'s (and others') issue with "Our unicode type is too vaguely documented to tell whether you are reporting a bug or making a feature request."
As I said on the tracker, our narrow builds are in-between (while moving closer to UTF-16), and both terms are deceptive, at least to some.
Good analogy.
I presume by 'separate data type' you mean a base-level builtin class like int or str, and that you would allow for wrapper classes built on top of str, as such are not really 'separate'. For the grapheme level and higher, we should certainly start with wrappers, and probably with alternate versions based on different strategies.
Actually, it is range(0x110000) == range(1114112), so that UTF-8 uses at most 4 bytes per codepoint. 21 bits is 20.1 bits rounded up.
for i in range(0x110000):  # 1114112
    if ord(chr(i)) != i:
        print(i)
# prints nothing (on Windows)
And now does, thanks to whoever fixed this (within the last year, I think). -- Terry Jan Reedy

On Thu, Sep 1, 2011 at 8:02 AM, Terry Reedy <tjreedy@udel.edu> wrote:
We should probably just explicitly document that the internal representation in narrow builds is a UCS-2/UTF-16 hybrid - like UTF-16, it can handle the full code point space, but, like UCS-2, it allows code unit sequences (such as lone surrogates) that strict UTF-16 would reject. Perhaps we should also finally split strings out to a dedicated section on the same tier as Sequence types in the library reference. Yes, they're sequences, but they're also so much more than that (try as you might, you're unlikely to be successful in ducktyping strings the way you can sequences, mappings, files, numbers and other interfaces. Needing a "real string" is even more common than needing a "real dict", especially after the efforts to make most parts of the interpreter that previously cared about the latter distinction accept arbitrary mapping objects). I've created http://bugs.python.org/issue12874, suggesting that the "Sequence Types" and "memoryview type" sections could be usefully rearranged as:

Sequence Types - list, tuple, range
Text Data - str
Binary Data - bytes, bytearray, memoryview

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Where I cut your words, we are in 100% agreement. (FWIW :-) Guido van Rossum writes:
On Tue, Aug 30, 2011 at 11:03 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
I agree. I'm sorry that I didn't make myself clear. The reason I read "process" as "module" is that some modules of Python, and therefore Python as a whole, cannot conform to the Unicode standard. Eg, anything that inputs or outputs bytes. Therefore only "modules" and "types" can be asked to conform. (I don't think it makes sense to ask anything lower level to conform. See below where I comment on your .lower() example.) What I am advocating (for the long term) is provision of *one* module (or type) such that if the text processing done by the application is done entirely in terms of this module (type), it will conform (to some specified degree, chosen to balance user wants with implementation and support costs). It may be desirable to provide others for sufficiently important particular use cases, but at present I see a clear need for *one*. Unicode conformance is going to be a common requirement for apps used by global enterprises. I oppose trying to make str into that type. We need str, just as it is, for many reasons.
Certainly a group of cooperating modules could form a conforming process, just as you describe it for one example. The "one module" mentioned above need not implement everything internally, but it would take responsibility for providing guarantees (eg, unit tests) of whatever conformance claims it makes.
In the concrete implementation I have in mind, surrogate pairs are represented by a str containing 2 code units. But in that case s[i][1] is an error, and s[i][0] == s[i]. print(s[i][0]) and print(s[i]) will print the same character to the screen. If you decode it to bytes, well, it's not a str any more so what have you proved? Ie, what you will see is *code points* not in the BMP. You don't have to agree that such "surrogate containment" behavior is so valuable as I think it is, but that's what I have in mind as one requirement for a "conforming implementation of UTF-16".
I don't think that it's a good idea to go for conformance at the method level. It would be a feature for apps that don't claim full conformance because they nevertheless give good results in more cases. The downside will be Python apps using str that will pass conformance tests written for, say Western Europe, but end users in Kuwait and Kuala Lumpur will report bugs.
Sure. I think that approach is fine for str, too, except that I would hope it looks up BMP base characters in the case-mapping database. The fact is that with very few exceptions non-BMP characters are going to be symbols (mathematical operators and emoticons, for example). This is good enough, except when it's not---but when it's not, only 100% conformance is really a reasonable target. IMO, of course.
I think we should just document how it behaves and not get hung up on what it is called. Mentioning UTF-16
If you also say, "this type can represent all characters in Unicode, as well as certain non-characters", why mention UTF-16 at all?
Let's call those things graphemes (Tom C's term, I quite like leaving "character" ambiguous)
OK, but those definitions need to be made clear, as "grapheme cluster" and "combined character" are defined in the Unicode standard, and in fact mean slightly different things from each other.
Clear enough.
No, and I can tell you why! The difference between characters and words is much more important than that between code point and grapheme cluster for most users and the developers who serve them. Even small children recognize typographical ligatures as being composite objects, while at least this Spanish-as-a-second-language learner was taught that `ñ' is an atomic character represented by a discontiguous glyph, like `i', and it is no more related to `n' than `m' is. Users really believe that characters are atomic. Even in the cases of Han characters and Hangul, users think of the characters as being "atomic," but in the sense of Bohr rather than that of Democritus. I think the situation for text processing is analogous to chemistry where the atom, with a few fairly gross properties (the outer electron orbitals) is the fundamental unit, not the elementary particles like electrons and protons and structures like inner orbitals. Sure, there are higher order structures like molecules, phases, and crystals, but it is elements that have the most regular and simply described behavior for the chemist, and it does not become any simpler for the chemist if you decompose the atom. The composed character or grapheme cluster is the analogue of the atom for most processing at the level of "text". The only real exceptions I can imagine are in the domain of linguistics.
Clear enough. I disagree that that will be enough for constructing large-scale Unicode-conformant applications. Somebody is going to have to produce batteries for those applications, and I think they should be included in Python. I agree that it's proper that I and those who think the same way take responsibility for writing and implementing a PEP.
I think that's like asking a toddler to know that the stove is hot. The consequences for the toddler of her ignorance are much greater, but the informational requirement is equally stringent. Of course application writers are adults who could be asked to learn, but economically I think it make a lot more sense to include those batteries. IMHO YMMV, obviously.
I want chr(i) to be valid for all values of i in range(2**21),
I quite agree (ie, for str). Thus I perceive a need for another type.

On Thu, Sep 1, 2011 at 12:13 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Where I cut your words, we are in 100% agreement. (FWIW :-)
Not quite the same here, but I don't feel the need to have the last word. Most of what you say makes sense, in some cases we'll quibble later, but there are a few points where I have something to add:
True -- in fact I didn't know that ff and ffl ligatures *existed* until I learned about Unix troff.
Ah, I think this may very well be culture-dependent. In Holland there are no Dutch words that use accented letters, but the accents are known because there are a lot of words borrowed from French or German. We (the Dutch) think of these as letters with accents and in fact we think of the accents as modifiers that can be added to any letter (at least I know that's how I thought about it -- perhaps I was also influenced by the way one had to type those on a mechanical typewriter). Dutch does have one native use of the umlaut (though it has a different name, I forget which, maybe trema :-), when there are two consecutive vowels that would normally be read as a special sound (diphthong?). E.g. in "koe" (cow) the oe is two letters (not a single letter formed of two distinct shapes!) that mean a special sound (roughly KOO). But in a word like "coëxistentie" (coexistence) the o and e do not form the oe-sound, and to emphasize this to Dutch readers (who believe their spelling is very logical :-), the official spelling puts the umlaut on the e. This is definitely thought of as a separate mark added to the e; ë is not a new letter. I have a feeling it's the same way for the French and Germans, but I really don't know. (Antoine? Georg?) Finally, my guess is that the Spanish emphasis on ñ as a separate letter has to do with teaching how it has a separate position in the localized collation sequence, doesn't it? I'm also curious if ñ occurs as a separate character on Spanish keyboards. -- --Guido van Rossum (python.org/~guido)

Le jeudi 01 septembre 2011 à 08:45 -0700, Guido van Rossum a écrit :
Indeed, they are not separate "letters" (they are considered the same in lexicographic order, and the French alphabet has 26 letters). But I'm not sure how it's relevant, because you can't remove an accent without most likely making a spelling error, or at least changing the meaning. Accents are very much part of the language (while ligatures like "ff" are not, they are a rendering detail). So I would consider "é", "ê", "ù", etc. atomic characters for the purpose of processing French text. And I don't see how a decomposed form could help an application. Regards Antoine.

On Thu, Sep 1, 2011 at 9:03 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
The example given was someone who didn't agree with how a particular font rendered those accented characters. I agree that's obscure though. I recall long ago that when the French wrote words in all caps they would drop the accents, e.g. ECOLE. I even recall (through the mists of time) observing this in Paris on public signs. Is this still the convention? Maybe it was only a compromise in the time of Morse code? -- --Guido van Rossum (python.org/~guido)

I think it is tolerated, partly because typing support (on computers and typewriters) has been weak. On a French keyboard, you have an "é" key, but shifting it gives you "2", not "É". The latter can be obtained using the Caps Lock key under Linux, but not under Windows. (so you could also write Éric's name "Eric", for example) That said, most typographies nowadays seem careful to keep the accents on uppercase letters (e.g. on book covers; AFAIR, road signs also keep the accents, but I'm no driver). Regards Antoine.

Antoine Pitrou, 01.09.2011 18:46:
AFAIR, road signs also keep the accents, but I'm no driver
Right, I noticed that, too. That's certainly not uncommon. I think it's mostly because of local pride (after all, the road signs are all that many drivers ever see of a city), but sometimes also because it can't be helped when the name gets a different meaning without accents. People just cause too many accidents when they burst out laughing while entering a city by car. Stefan

Guido van Rossum, 01.09.2011 18:31:
So does the German alphabet, even though that does not include "ß", which basically descended from a ligature of the old German way of writing "sz", where "s" looked similar to an "f" and "z" had a low hanging tail. IIRC, German Umlaut letters are lexicographically sorted according to their emergency replacement spelling ("ä" -> "ae"), which is also sometimes used in all upper case words ("Glück" -> "GLUECK"). I guess that's because Umlaut dots are harder to see on top of upper case letters. So, Latin-1 byte value sorting always yields totally wrong results. That aside, Umlaut letters are commonly considered separate letters, different from the undotted letters and also different from the replacement spellings. I, for one, always found the replacements rather weird and never got used to using them in upper case words. In any case, it's wrong to always use them, and it makes text harder to read.
Yes, and it's a huge problem when trying to pronounce last names. In French, you'd commonly write LASTNAME, Firstname and if LASTNAME happens to have accented letters, you'd miss them when reading that. I know a couple of French people who severely suffer from this, because the pronunciation of their name gets a totally different meaning without accents. Stefan

Guido van Rossum wrote:
This page features a number of French street signs in all-caps, and some of them have accents: http://www.happymall.com/france/paris_street_signs.htm -- Greg

Antoine Pitrou wrote:
On the other hand, the same doesn't necessarily apply to other languages. (At least according to Wikipedia.) http://en.wikipedia.org/wiki/Diacritic#Languages_with_letters_containing_dia... -- Steven

On Sep 1, 2011, at 9:30 PM, Steven D'Aprano wrote:
For example, in Serbo-Croatian (Serbian, Croatian, Bosnian, Montenegrin, if you want), each of the following letters represents one distinct sound of the language. In the Serbian Cyrillic alphabet, they are distinct symbols. In the Latin alphabet, the corresponding letters are formed with diacritics because the alphabet is shorter.

Letter  Approximate pronunciation    Cyrillic
------  -------------------------    --------
č       tch in butcher               ч
ć       ch in chapter, but softer    ћ
dž      j in jump                    џ
đ       j in juice                   ђ
š       sh in ship                   ш
ž       s in pleasure, measure, ...  ж

The language has 30 sounds and the corresponding 30 letters. See the count of the letters in these tables:
- http://hr.wikipedia.org/wiki/Hrvatska_abeceda
- http://sr.wikipedia.org/wiki/Азбука

Diacritics are used in grammar books and in print (occasionally) to distinguish between four different accents of the language:
- long rising: á,
- short rising: à,
- long falling: ȃ (inverted breve, *not* a circumflex â), and
- short falling: ȁ,
especially when words that use the same sounds -- thus, spelled with the same letters -- are next to each other. The accents are used to change the intonation of the whole word, not to change the sound of the letter. For example: "Ja sam sȃm." -- "I am alone." Both words "sam" contain the "a" sound, but the first one is pronounced short. As a form of the verb "to be" it's an enclitic that takes the accent of the preceding word "I". The second one is pronounced with a long falling accent. The macron can be used to indicate the length of a *non-stressed* vowel, e.g. ā, but is usually unnecessary in standard print.

Many languages use alphabets that are not suitable to their sound system. The speakers of these languages adapted alphabets to their sounds either by using letters with distinct shapes (Cyrillic letters above), or by adding diacritics to an existing shape (Latin letters above). The new combined form is a distinct letter. These letters have separate sections in dictionaries and a sorting order. The diacritics that indicate an accent or length are used only above vowels and do *not* represent distinct letters.

Best regards, Zvezdan Petković

P.S. Since I live in the USA, the last letter of my surname is *wrongly* spelled (ć -> c) and pronounced (ch -> k) most of the time. :-)

On 9/1/2011 11:45 AM, Guido van Rossum wrote:
typewriter). Dutch does have one native use of the umlaut (though it has a different name, I forget which, maybe trema :-),
You remember correctly. According to https://secure.wikimedia.org/wikipedia/en/wiki/Trema_%28diacritic%29 'trema' (Greek 'hole') is the generic name of the double-dot vowel diacritic. It was originally used for 'diaeresis' (Greek, 'taking apart') when it shows "that a vowel letter is not part of a digraph or diphthong". (Note that 'ae' in diaeresis *is* a digraph ;-). Germans later used it to indicate umlaut, 'changed sound'.
So the above is a trema-diaeresis. "Dutch, French, and Spanish make regular use of the diaeresis." English uses such as 'coöperate' have become rare or archaic, perhaps because we cannot type them. Too bad, since people sometimes use '-' to serve the same purpose. -- Terry Jan Reedy

Terry Reedy wrote:
Too bad, since people sometimes use '-' to serve the same purpose.
Which actually seems more logical to me -- a separating symbol is better placed between the things being separated, rather than over the top of one of them! Maybe we could compromise by turning the diaeresis on its side: co:operate -- Greg

Guido van Rossum writes:
On Thu, Sep 1, 2011 at 12:13 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
I'm not an expert, but I'm fairly sure it is. Specifically, I heard from a TeX-ie friend that the same accented letter is typeset (and collated) differently in different European languages because in some of them the accent is considered part of the letter (making a different character), while in others accents modify a single underlying character. The ones that consider the letter and accent to constitute a single character also prefer to leave less space, he said.
American English has the same usage, but it's optional (in particular, you'll see naive, naif, and words like coordinate typeset that way occasionally, for the same reason I suppose). As Hagen Fürstenau points out, with multiple combining characters, there are even more complex possibilities than "the accent is part of the character" and "it's really not", and they may be application- dependent.
You'd have to ask Mr. Gonzalez. I suspect he may have taught that way less because of his Castellano upbringing, and more because of the infamous lack of sympathy of American high school students for the fine points of usage in foreign languages.
I'm also curious if ñ occurs as a separate character on Spanish keyboards.
If I'm reading /usr/share/X11/xkb/symbols/es correctly, it does in X.org: the key that for English users would map to ASCII tilde.

If you look at Wikipedia, it says: “El alfabeto español consta de 27 letras” (roughly: “the Spanish alphabet consists of 27 letters”). The Ñ is separate from the N (and so is it in my French-Spanish dictionary). The accented letters, however, are not considered separately. http://es.wikipedia.org/wiki/Alfabeto_espa%C3%B1ol (I can't tell you how annoying to type "ñ" is when the tilde is accessed using AltGr + 2 and you have to combine that with the Compose key and N to obtain the full character. I'm sure Spanish keyboards have a better way than that :-)) Regards Antoine.

On 09/01/2011 02:54 PM, Antoine Pitrou wrote:
FWIW, I was taught that Spanish had 30 letters in the alfabeto: the 'ñ', plus 'ch', 'll', and 'rr' were all considered distinct characters. Kids-these-days'ly, Tres. -- Tres Seaver +1 540-429-0999 tseaver@palladion.com Palladion Software "Excellence by Design" http://palladion.com

On Thu, 01 Sep 2011 12:38:07 -0700 Ethan Furman <ethan@stoneleaf.us> wrote:
That Wikipedia article also says: “Los dígrafos Ch y Ll tienen valores fonéticos específicos, y durante los siglos XIX y XX se ordenaron separadamente de C y L, aunque la práctica se abandonó en 1994 para homogeneizar el sistema con otras lenguas.” -> roughly: “the "Ch" and "Ll" digraphs have specific phonetic values, and during the 19th and 20th centuries they were ordered separately from C and L, but this practice was abandoned in 1994 in order to make the system consistent with other languages.” And about "rr": “El dígrafo rr (llamado erre, /'ere/, y pronunciado /r/) nunca se consideró por separado, probablemente por no aparecer nunca en posición inicial.” -> “the "rr" digraph was never considered separate, probably because it never appears at the very beginning of a word.” Regards Antoine.

Tres Seaver writes:
FWIW, I was taught that Spanish had 30 letters in the alfabeto: the 'ñ', plus 'ch', 'll', and 'rr' were all considered distinct characters.
That was always a Castellano vs. Americano issue, IIRC. As I wrote, Mr. Gonzalez was Castellano. I believe that the deprecation of the digraphs as separate letters occurred as the telephone became widely used in Spain, and the telephone company demanded an official proclamation from whatever Ministry is responsible for culture that it was OK to treat the digraphs as two letters (specifically, to collate them that way), so that they could use the programs that came with the OS. So this stuff is not merely variant by culture, but also by economics and politics. :-/

On 9/1/2011 11:59 PM, Stephen J. Turnbull wrote:
The main 'standards body' for Spanish is the Real Academia Española in Madrid, which works with the 21 other members of the Asociación de Academias de la Lengua Española. https://secure.wikimedia.org/wikipedia/en/wiki/Real_Academia_Española https://secure.wikimedia.org/wikipedia/en/wiki/Association_of_Spanish_Language_Academies While it has apparently been criticized as 'conservative' (which it well ought to be), it has been rather progressive in promoting changes such as 'ph' to 'f' (fisica, fone) and dropping silent 'p' in leading 'psi' (sicologia) and silent 's' in leading 'sci' (ciencia). -- Terry Jan Reedy

Terry Reedy wrote:
I find it curious that pronunciation always seems to take precedence over spelling in campaigns like this. Nowadays, especially with the internet increasingly taking over from personal interaction, we probably see words written a lot more often than we hear them spoken. Why shouldn't we change the pronunciation to match the spelling rather than the other way around? -- Greg

On 09/01/2011 11:59 PM, Stephen J. Turnbull wrote:
From a casual web search, it looks as though the RAE didn't legislate "letterness" away from the digraphs (as I learned them) until 1994 (about 25 years after I learned the 30-letter alfabeto).
Lovely. :) Tres. -- Tres Seaver +1 540-429-0999 tseaver@palladion.com Palladion Software "Excellence by Design" http://palladion.com

Guido van Rossum writes:
In the original definition of UCS-2 in draft ISO 10646 (1990), everything in the BMP except for 0xFFFF and 0xFFFE was a character, and there was no concept of "surrogate" at all. Later in ISO 10646 (1993)[1], the Surrogate Area was carved out of the Private Area, but UCS-2 implementations simply treat them as (single) characters with special properties. This was more or less backward compatible as all corporate uses of the private area used the lower code points and didn't conflict with the surrogates. Finally (in 2000 or 2003) the definition of UCS-2 in ISO 10646 was revised in a backward- incompatible way to exclude surrogates entirely, ie, nowadays it is a range-restricted version of UTF-16. Footnotes: [1] IIRC, strictly speaking this was done slightly later (1993 or 1994) in an official Amendment to ISO 10646; the Amendment was incorporated into the standard in 2000.

On Sat, 27 Aug 2011 12:17:18 +1200 Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
It also depends a lot on *actual* measured performance. As someone mentioned in the tracker, the index you use on a string usually comes from a previous string operation (like a search), perhaps with a small offset. So a caching scheme may actually give very good results with a rather small overhead (you could cache, say, the 4 most recent indices and choose the nearest when an indexing operation is done; with utf-8, scanning backward and forward is equally simple). Regards Antoine.
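A minimal sketch of the caching scheme Antoine describes, applied to a UTF-8 buffer (all names are illustrative; nothing here comes from an actual patch):

    class CachedUTF8String:
        def __init__(self, data):
            self.data = data        # UTF-8 encoded bytes
            self.cache = [(0, 0)]   # recent (char_index, byte_index) pairs

        def _byte_index(self, char_index):
            # Start from the cached position nearest to the target,
            # then scan forward or backward one character at a time.
            ci, bi = min(self.cache, key=lambda p: abs(p[0] - char_index))
            step = 1 if char_index >= ci else -1
            while ci != char_index:
                bi += step
                # Skip UTF-8 continuation bytes (0b10xxxxxx).
                while 0 < bi < len(self.data) and (self.data[bi] & 0xC0) == 0x80:
                    bi += step
                ci += step
            # Remember the 4 most recent indices, as suggested above.
            self.cache = (self.cache + [(ci, bi)])[-4:]
            return bi

        def __getitem__(self, char_index):
            start = self._byte_index(char_index)
            end = start + 1
            while end < len(self.data) and (self.data[end] & 0xC0) == 0x80:
                end += 1
            return self.data[start:end].decode('utf-8')

    s = CachedUTF8String('héllo wörld'.encode('utf-8'))
    assert s[1] == 'é' and s[7] == 'ö'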

On 8/26/2011 8:23 PM, Antoine Pitrou wrote:
Amen. Some regard O(n*n) sorts to be, by definition, 'worse' than O(n*log n). I even read that in an otherwise good book by a university professor. Fortunately for Python users, Tim Peters ignored that 'wisdom', coded the best O(n*n) sort he could, and then *measured* to find out what was better for what types and lengths of arrays. So now we have a list.sort that sometimes beats the pure O(n*log n) quicksort of C libraries. -- Terry Jan Reedy

Terry Reedy wrote:
A nice story, but Quicksort's worst case is O(n*n) too. http://en.wikipedia.org/wiki/Quicksort timsort is O(n) in the best case (all items already in order). You are right though about Tim Peters doing extensive measurements: http://bugs.python.org/file4451/timsort.txt If you haven't read the whole thing, do so. I am in awe -- not just because he came up with the algorithm, but because of the discipline Tim demonstrated in such detailed testing. A far cry from a couple of timeit runs on short-ish lists. -- Steven

Paul Moore writes:
[...]
They lose the O(1) guarantee, but that's easily defensible as a tradeoff to conform to underlying runtime semantics.
Unfortunately, I don't think it's all that easy to defend. Absent PEP 393 or a restriction to the characters in the BMP, this is a very expensive change, easily visible to interactive users, let alone performance-hungry applications. I personally do advocate the "array of code points" definition, but I don't use IronPython or Jython so PEP 393 is as close to heaven as I expect to get. OTOH, I also use Emacsen with Mule, and I have to admit that there is a perceptible performance hit in any large (>1 MB) buffer containing non-ASCII characters vs. pure ASCII (the code unit in Mule is 1 byte). I expect that if IronPython and Jython really want to retain native, code-unit-based representations, it's going to be painful to conform to an "array of code points" specification. There may need to be a compromise of the form "Implementations SHOULD provide an implementation of str that is both O(1) in indexing and an array of code points. Code that is Unicode-ly correct in Python implementing PEP 393 will need to be ported with some effort to implementations that do not satisfy this requirement, perhaps using different algorithms or extra libraries."

On Wed, Aug 24, 2011 at 1:22 AM, Stephen J. Turnbull <turnbull@sk.tsukuba.ac.jp> wrote:
Actually, the situation is that in narrow builds, they contain code units (which may have surrogates); in wide builds they contain code points. I think this is the crux of Tom Christiansen's complaints about narrow builds. Here's proof that narrow builds contain code units, not code points (i.e. use UTF-16, not UCS-2):

    $ ./python
    Python 2.7.2+ (2.7:498b03a55297, Aug 25 2011, 15:14:01)
    [GCC 4.4.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
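Presumably the demonstration went along these lines (the exact session was lost in quoting; U+12345 is a guess that matches the surrogates mentioned below):

    >>> len(u'\U00012345')
    2
    >>> u'\U00012345'[0]
    u'\ud808'
    >>> u'\U00012345'[1]
    u'\udf45'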
It's pretty clear that the interpreter is surrogate-aware, which to me indicates the use of UTF-16. Now in the PEP 393 branch:

    ./python
    Python 3.3.0a0 (pep-393:c60556059719, Aug 25 2011, 15:31:05)
    [GCC 4.4.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
And some proof that this branch does not care about surrogates:
However: a = '\ud808\udf45'
Which to me merely shows it is smart when parsing string literals. (I expect that regular 3.3 narrow builds behave similarly to the 2.7 narrow build, and 3.3 wide builds behave similarly to the pep-393 build; I didn't have those lying around.) -- --Guido van Rossum (python.org/~guido)

On 24/08/2011 02:46, Terry Reedy wrote:
I don't think that using UTF-16 with surrogate pairs is really a big problem. A lot of work has been done to hide this. For example, repr(chr(0x10ffff)) now displays '\U0010ffff' instead of two characters. Ezio recently fixed the str.is*() methods in Python 3.2+. For len(str): it's a known problem, but if you really care about the number of *characters* and not the number of UTF-16 units, it's easy to implement your own character_length() function (see the sketch below). len(str) gives the number of UTF-16 units instead of the number of characters for a simple reason: it's faster: O(1), whereas character_length() is O(n).
Yeah, you can work around UTF-16 limits using O(n) algorithms. PEP 393 provides support for the full Unicode charset (U+0000-U+10FFFF) on all platforms with a small memory footprint and only O(1) functions. Note: Java and the Qt library also use UTF-16 strings and have exactly the same "limitations" for str[n] and len(str). Victor
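A minimal sketch of the character_length() function Victor mentions (the body is a guess, assuming a narrow build where indexing yields UTF-16 code units):

    def character_length(s):
        # Count characters rather than UTF-16 units: a high surrogate
        # (U+D800-U+DBFF) followed by a low surrogate (U+DC00-U+DFFF)
        # encodes a single non-BMP character.
        n = i = 0
        while i < len(s):
            if ('\ud800' <= s[i] <= '\udbff' and i + 1 < len(s)
                    and '\udc00' <= s[i + 1] <= '\udfff'):
                i += 2
            else:
                i += 1
            n += 1
        return n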

On 8/24/2011 1:45 PM, Victor Stinner wrote:
On 24/08/2011 02:46, Terry Reedy wrote:
I greatly appreciate that he did. The * (lower,upper,title) methods apparently are not fixed yet as the corresponding new tests are currently skipped for narrow builds.
It is O(1) after a one-time O(n) preprocessing, which is the same time order as creating the string in the first place. Anyway, I think the most important deficiency is with iteration:
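Presumably the loop was along these lines (the quoted code was lost; the exact string is a guess, but any astral character triggers the failure):

    >>> from unicodedata import name
    >>> for c in 'abc\U00012345':
            print(name(c))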
    LATIN SMALL LETTER A
    LATIN SMALL LETTER B
    LATIN SMALL LETTER C
    Traceback (most recent call last):
      File "<pyshell#9>", line 2, in <module>
        print(name(c))
    ValueError: no such name

This would work on wide builds but does not here (win7) because narrow build iteration produces a naked non-character surrogate code unit that has no specific entry in the Unicode Character Database. I believe that most new people who read "Strings contain Unicode characters." would expect string iteration to always produce the Unicode characters that they put in the string. The extra time per char needed to produce the surrogate pair that represents the character entered is O(1).
I presented O(log(number of non-BMP chars)) algorithms for indexing and slicing. For the mostly BMP case, that is hugely better than O(n).
PEP-393 provides support of the full Unicode charset (U+0000-U+10FFFF) an all platforms with a small memory footprint and only O(1) functions.
For Windows users, I believe it will nearly double the memory footprint if there are any non-BMP chars. On my new machine, I should not mind that in exchange for correct behavior. -- Terry Jan Reedy

On Wed, Aug 24, 2011 at 11:37 PM, Terry Reedy <tjreedy@udel.edu> wrote:
There are two reasons for this: 1) the str.is* methods get the string and return True/False, so it's enough to iterate on the string, combine the surrogates, and check if the result islower/upper/etc. Methods like lower/upper/etc, afaiu, currently get only a copy of the string, and modify that in place. The current macros advance to the next char during reading and writing, so it's not possible to use them to read/write from/to the same string. We could either change the macros to not advance the pointer [0] (and do it manually in the other functions like is*) or change the function to get the original string too. 2) I'm on vacation. Best Regards, Ezio Melotti [0]: for lower/upper/title it should be possible to modify the string in place, because these operations never converts a non-BMP char to a BMP one (and vice versa), so if two surrogates are read, two surrogates will be written after the transformation. I'm not sure this will work with all the methods though (e.g. str.translate).

[Apologies for sending out a long stream of pointed responses, written before I have fully digested this entire mega-thread. I don't have the patience today to collect them all into a single mega-response.] On Wed, Aug 24, 2011 at 10:45 AM, Victor Stinner <victor.stinner@haypocalc.com> wrote:
Note: Java and the Qt library use also UTF-16 strings and have exactly the same "limitations" for str[n] and len(str).
Which reminds me. The PEP does not say what other Python implementations besides CPython should do. Presumably Jython and IronPython will continue to use UTF-16, so presumably the language reference will still have to document that strings contain code units (not code points) and the objections Tom Christiansen raised against this will remain true for those versions of Python. (I don't know about PyPy, they can presumably decide when they start their Py3k port.) OTOH perhaps IronPython 3.3 and Jython 3.3 can use a similar approach and we can lay the narrow build issues to rest? Can someone here speak for them? -- --Guido van Rossum (python.org/~guido)

Guido wrote:
The biggest difficulty for IronPython here would be dealing w/ .NET interop. We can certainly introduce either an IronPython specific string class which is similar to CPython's PyUnicodeObject or we could have multiple distinct .NET types (IronPython.Runtime.AsciiString, System.String, and IronPython.Runtime.Ucs4String) which all appear as the same type to Python. But when Python is calling a .NET API it's always going to return a System.String which is UTF-16. If we had to check and convert all of those strings when they cross into Python it would be very bad for performance. Presumably we could have a 4th type of "interop" string which lazily computes this but if we start wrapping .Net strings we could also get into object identity issues. We could stop using System.String in IronPython all together and say when working w/ .NET strings you get the .NET behavior and when working w/ Python strings you get the Python behavior. I'm not sure how weird and confusing that would be but conversion from an Ipy string to a .NET string could remain cheap if both were UTF-16, and conversions from .NET strings to Ipy strings would only happen if the user did so explicitly. But it's a huge change - it'll almost certainly touch every single source file in IronPython. I would think we'd get 3.2 done first and then think about what to do here.

Antoine Pitrou, 26.08.2011 12:51:
Why would PEP 393 apply to other implementations than CPython?
Not the PEP itself, just the implications of the result. The question was whether the language specification in a post PEP-393 can (and if so, should) be changed into requiring unicode objects to be defined based on code points. Narrow builds, as well as Jython and IronPython, currently deviate from this as they use UTF-16 as their native string encoding, which, for one, prevents O(1) indexing into characters as well as a direct match between length and character count (minus combining characters etc.). I think this discussion can safely be considered off-topic for this thread (which isn't exactly short enough to keep adding more topics to it). Stefan

On Friday, August 26, 2011 at 02:01:42, Dino Viehland wrote:
Python 3 encodes all Unicode strings to the OS encoding (and the result is decoded) for all syscalls and calls to libraries: to the locale encoding on UNIX, to UTF-16 on Windows. Currently, Py_UNICODE is wchar_t which is 16 bits. So Py_UNICODE* is already a UTF-16 string. I don't know if the overhead of the PEP 393 (encode to UTF-16 on Windows) for these calls is important or not. But on UNIX, pure ASCII strings don't have to be encoded anymore if the locale encoding is UTF-8 or ASCII. IronPython can wait to see how CPython+PEP 393 handles these problems, and how much slower it is.
But it's a huge change - it'll almost certainly touch every single source file in IronPython.
With the PEP 393, it's transparent: the PyUnicode_AS_UNICODE encodes the string to UTF-16 (allocate memory, etc.). Except that applications should now check if an error occurred (check for NULL).
I would think we'd get 3.2 done first and then think about what to do here.
I don't think that IronPython needs to support non-BMP characters without using surrogates. Bug reports about non-BMP characters usually don't have use cases, but just want to make Python perfect. There is no need to hurry. PEP 393 tries to reduce the memory footprint. The effect on non-BMP characters is just a *nice* side effect. Or was the PEP designed to solve narrow build issues? Victor

I have a different question about IronPython and Jython now. Do their regular expression libraries support Unicode better than CPython's? E.g. does "." match a surrogate pair? Tom C suggests that Java's regex libraries get this and many other details right despite Java's use of UTF-16 to represent strings. So hopefully Jython's re library is built on top of Java's? PS. Is there a better contact for Jython? -- --Guido van Rossum (python.org/~guido)

On Fri, Aug 26, 2011 at 3:00 PM, Guido van Rossum <guido@python.org> wrote: I'll do my best to answer though: Java 5 added a bunch of methods for dealing with Unicode that doesn't fit into 2 bytes - and looking at our code for our Unicode object, I see that we are using methods like the codePointCount method off of java.lang.String to compute length[1] and using similar methods all through that code to make sure we deal in code points when dealing with unicode. So it looks pretty good for us as far as I can tell. [1] http://download.oracle.com/javase/6/docs/api/java/lang/String.html#codePoint..., int) -Frank Wierzbicki

Oops, forgot to add the link for the gory details for Java and > 2 byte unicode: http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

On 9/8/2011 6:15 PM, fwierzbicki@gmail.com wrote:
Oops, forgot to add the link for the gory details for Java and > 2 byte unicode:
http://java.sun.com/developer/technicalArticles/Intl/Supplementary/
This is dated 2004. Basically, they considered several options, tried out 4, and ended up sticking with char[] (sequences) as UTF-16 with char = 16 bit code unit and added 32-bit Character(int) class for low-level manipulation of code points. I did not see the indexing problem mentioned. I get the impression that they encourage sequence forward-backward iteration (cursor-based access) rather than random-access indexing. -- Terry Jan Reedy

On Fri, Sep 9, 2011 at 10:16 AM, Terry Reedy <tjreedy@udel.edu> wrote:
There aren't docs, but the code is here: https://bitbucket.org/jython/jython/src/8a8642e45433/src/org/python/core/PyU... Here are (I think) the most relevant bits for random access -- note that getString() returns the internal representation of the PyUnicode which is a java.lang.String

    @Override
    protected PyObject pyget(int i) {
        if (isBasicPlane()) {
            return Py.makeCharacter(getString().charAt(i), true);
        }
        int k = 0;
        while (i > 0) {
            int W1 = getString().charAt(k);
            if (W1 >= 0xD800 && W1 < 0xDC00) {
                k += 2;
            } else {
                k += 1;
            }
            i--;
        }
        int codepoint = getString().codePointAt(k);
        return Py.makeCharacter(codepoint, true);
    }

    public boolean isBasicPlane() {
        if (plane == Plane.BASIC) {
            return true;
        } else if (plane == Plane.UNKNOWN) {
            plane = (getString().length() == getCodePointCount()) ? Plane.BASIC : Plane.ASTRAL;
        }
        return plane == Plane.BASIC;
    }

    public int getCodePointCount() {
        if (codePointCount >= 0) {
            return codePointCount;
        }
        codePointCount = getString().codePointCount(0, getString().length());
        return codePointCount;
    }

-Frank

I, for one, am very interested. It sounds like the 'unicode' datatype in Jython does not in fact have O(1) indexing characteristics if the string contains any characters in the astral plane. Interesting. I wonder if you have heard from anyone about this affecting their app's performance? --Guido On Fri, Sep 9, 2011 at 12:58 PM, fwierzbicki@gmail.com <fwierzbicki@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido)

Well, I'd be interested in how it goes, since if Jython users find this acceptable then maybe we shouldn't be quite so concerned about it for CPython... On the third hand we don't have working code for this approach in CPython, while we do have working code for the PEP 393 solution... --Guido On Fri, Sep 9, 2011 at 3:38 PM, fwierzbicki@gmail.com <fwierzbicki@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido)

On 9/9/2011 5:21 PM, Guido van Rossum wrote:
The question is whether or how often any Jython users are yet indexing/slicing long strings with astral chars. If a utf-8 xml file is directly parsed into a DOM, then the longest decoded strings will be 'paragraphs' that are seldom more than 1000 chars.
This is O(1)
This is an O(n) linear scan.
Near the beginning of this thread, I described and gave a link to my O(log k) algorithm, where k is the number of supplementary ('astral') chars. It uses bisect.bisect_left on an int array of length k constructed with a linear scan much like the one above, with one added line. The basic idea is to do the linear scan just once and save the locations (code point indexes) of the astral chars instead of repeating the scan on every access. That could be done as the string is constructed. The same array search works for slicing too. Jython is welcome to use it if you ever decide you need it. I have in mind to someday do some timing tests with the Python version. I just do not know how close the results would be to those for compiled C or Java. -- Terry Jan Reedy

On Tue, Aug 23, 2011 at 08:15, Antoine Pitrou <solipsis@pitrou.net> wrote:
So why would you need three separate implementations of the unrolled loop? You already have a macro named WRITE_FLEXIBLE_OR_WSTR.
The WRITE_FLEXIBLE_OR_WSTR macro does a check for kind and then writes. Using this macro for the fast path would be inefficient; to have a real fast path, you would need an outer if to check for kind and then, in each condition body, the matching access to the string (1, 2, or 4 bytes), and for each body also write 4 or 8 times (guarded by #ifdef, depending on platform). As all these cases bloated up the C code, we went for the simple solution with the goal of profiling the code again afterwards to see where the new performance bottlenecks would be.
To me this feels like this would complicate the C source code and decrease readability. For each function you would need a wrapper which does the kind checking logic and then, in a separate file, the implementation of the function, which then gets included three times, once for each character width. Regards, Torsten

On 8/23/2011 6:20 AM, "Martin v. Löwis" wrote:
I fully support the declared purpose of the PEP, which I understand to be to have a full, correct Unicode implementation on all new Python releases without paying unnecessary space (and consequent time) penalties. I think the erroneous length, iteration, indexing, and slicing for strings with non-BMP chars in narrow builds needs to be fixed for future versions. I think we should at least consider alternatives to the PEP 393 solution of doubling or quadrupling space if needed for at least one char. In utf16.py, attached to http://bugs.python.org/issue12729 I propose for consideration a prototype of a different solution to the 'mostly BMP chars, few non-BMP chars' case. Rather than expand every character from 2 bytes to 4, attach an array cpdex of character (i.e. code point, not code unit) indexes. Then for indexing and slicing, the correction is simple, simpler than I first expected:

    code-unit-index = char-index + bisect.bisect_left(cpdex, char-index)

where code-unit-index is the adjusted index into the full underlying double-byte array. This adds a time penalty of log2(len(cpdex)), but avoids most of the space penalty and the consequent time penalty of moving more bytes around and increasing cache misses. I believe the same idea would work for utf8 and the mostly-ascii case. The main difference is that non-ascii chars have various byte sizes rather than the 1 extra double-byte of non-BMP chars in UCS2 builds. So the offset correction would not simply be the bisect-left return but would require another lookup:

    byte-index = char-index + offsets[bisect_left(cpdex, char-index)]

If possible, I would have the with-index-array versions be separate subtypes, as in utf16.py. I believe either index-array implementation might benefit from a subtype for single multi-unit chars, as a single non-ASCII or non-BMP char does not need an auxiliary [0] array and a senseless lookup therein but does need its length fixed at 1 instead of the number of base array units. -- Terry Jan Reedy
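A small sketch of that index correction for the UTF-16 case (assuming, as described above, that cpdex holds the character indexes of the non-BMP chars; the function name is illustrative):

    import bisect

    def code_unit_index(cpdex, char_index):
        # Every non-BMP char before char_index occupies one extra
        # UTF-16 code unit, so shift by the number of such chars.
        return char_index + bisect.bisect_left(cpdex, char_index)

    # 'a' + U+12345 + 'bc' is stored as 5 UTF-16 code units; the only
    # non-BMP char is at character index 1, so cpdex == [1].
    assert code_unit_index([1], 0) == 0   # 'a'
    assert code_unit_index([1], 2) == 3   # 'b' follows the surrogate pair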

On 8/23/2011 5:46 PM, Terry Reedy wrote:
So am I correctly reading between the lines when, after reading this thread so far, and the complete issue discussion so far, that I see a PEP 393 revision or replacement that has the following characteristics:

1) Narrow builds are dropped. The conceptual idea of PEP 393 eliminates the need for narrow builds, as the internal string data structures adjust to the actuality of the data. If you want a narrow build, just don't use code points > 65535.

2) There are more, or different, internal kinds of strings, which affect the processing patterns. Here is an enumeration of the ones I can think of, as complete as possible, with recognition that benchmarking and clever algorithms may eliminate the need for some of them.

a) all ASCII
b) latin-1 (8-bit codepoints, the first 256 Unicode codepoints) This kind may not be able to support a "mostly" variation, and may be no more efficient than case b). But it might also be popular in parts of Europe :) And appropriate benchmarks may discover whether or not it has worth.
c) mostly ASCII (utf8) with clever indexing/caching to be efficient
d) UTF-8 with clever indexing/caching to be efficient
e) 16-bit codepoints
f) UTF-16 with clever indexing/caching to be efficient
g) 32-bit codepoints
h) UTF-32

When instantiating a str, a new parameter or subtype would restrict the implementation to using only a), b), d), f), and h) when fully conformant Unicode behavior is desired. No lone surrogates, no out of range code points, no illegal codepoints. A default str would prefer a), b), c), e), and g) for efficiency and flexibility. When manipulations outside of Unicode are necessary [Windows seems to use e) for example, suffering from the same sorts of backward compatibility problems as Python, in some ways], the default str type would permit them, using e) and g) kinds of representations.

Although the surrogate escape codec only uses prefix surrogates (or is it only suffix ones?) which would never match up, note that a conversion from 16-bit codepoints to other formats may produce matches between the results of the surrogate escape codec, and other unchecked data introduced by the user/program.

A method should be provided to validate and promote a string from default, unchecked str type to the subtype or variation that enforces Unicode, if it qualifies; if it doesn't qualify, an exception would be raised by the method. (This could generally be done in place if the value is bound to only a single variable, but would generate a copy and rebind the variable to the promoted copy if it is multiply referenced?)

Another parameter or subtype of the conformant str would add grapheme support, which has a different set of rules for the clever indexing/caching, but could be applied to any of a)†, c)†, d), f), or h).

† It is unnecessary to apply clever indexing/caching to a) and c) kinds of string internals, because there is a one-to-one mapping between bytes, codepoints, and graphemes in these ranges. So plain array indexing can be used in the implementation of these kinds.

PEP 393 already drops narrow builds.
2) There are more, or different, internal kinds of strings, which affect the processing patterns.
This is the basic idea of PEP 393.
These two cases are already in PEP 393.
c) mostly ASCII (utf8) with clever indexing/caching to be efficient d) UTF-8 with clever indexing/caching to be efficient
I see neither a need nor a means to consider these.
e) 16-bit codepoints
These are in PEP 393.
f) UTF-16 with clever indexing/caching to be efficient
Again, -1.
g) 32-bit codepoints
This is in PEP 393.
h) UTF-32
What's that, as opposed to g)? I'm not open to revise PEP 393 in the direction of adding more representations. Regards, Martin

On 8/24/2011 1:18 AM, "Martin v. Löwis" wrote:
I'd forgotten that.
Agreed.
Sure. Wanted to enumerate all, rather than just add-ons.
The discussion about "mostly ASCII" strings makes a convincing case that there could be significant space savings if such were implemented.
This is probably the one I would pick as least likely to be useful if the rest were implemented.
g) would permit codes greater than u+10ffff and would permit the illegal codepoints and lone surrogates. h) would be strict Unicode conformance. Sorry that the 4 paragraphs of explanation that you didn't quote didn't make that clear.
I'm not open to revise PEP 393 in the direction of adding more representations.
It's your PEP.

On 24/08/2011 11:22, Glenn Linderman wrote:
Antoine's optimization in the UTF-8 decoder has been removed. It doesn't change the memory footprint, it is just slower to create the Unicode object. When you decode a UTF-8 string:
- "abc" string uses "latin1" (8 bits) units
- "aé" string uses "latin1" (8 bits) units <= cool!
- "a€" string uses UCS2 (16 bits) units
- "a\U0010FFFF" string uses UCS4 (32 bits) units
Victor

On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy <tjreedy@udel.edu> wrote:
Interesting idea, but putting on my C programmer hat, I say -1. Non-uniform cell size = not a C array = standard C array manipulation idioms don't work = pain (no matter how simple the index correction happens to be). The nice thing about PEP 383 is that it gives us the smallest storage array that is both an ordinary C array and has sufficiently large individual elements to handle every character in the string. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 8/24/2011 9:00 AM, Stefan Behnel wrote:
Yes, this sounds like a nice benefit, but the problem is it is false. The correct statement would be: The nice thing about PEP 383 is that it gives us the smallest storage array that is both an ordinary C array and has sufficiently large individual elements to handle every Unicode codepoint in the string. As Tom eloquently describes in the referenced issue (is Tom ever non-eloquent?), not all characters can be represented in a single codepoint. It seems there are three concepts in Unicode: code units, codepoints, and characters, none of which are equivalent (and the first of which varies according to the encoding). It also seems (to me) that Unicode has failed in its original premise, of being an easy way to handle "big char" for "all languages" with fixed size elements, but it is not clear that its original premise is achievable regardless of the size of "big char", when mixed directionality is desired, and it seems that support of some single languages requires mixed directionality, not to mention mixed language support. Given the required variability of character size in all presently Unicode defined encodings, I tend to agree with Tom that UTF-8, together with some technique of translating character index to code unit offset, may provide the best overall space utilization, and adequate CPU efficiency. On the other hand, there are large subsets of applications that simply do not require support for bidirectional text or composed characters, and for those that do not, it remains to be seen if the price to be paid for supporting those features is too high a price for such applications. So far, we don't have implementations to benchmark to figure that out! What does this mean for Python? Well, if Python is willing to limit its support for applications to the subset for which the "big char" solution is sufficient, then PEP 393 provides a way to do that, that looks to be pretty effective for reducing memory consumption for those applications that use short strings most of which can be classified by content into the 1 byte or 2 byte representations. Applications that support long strings are more likely to be bitten by the occasional "outlier" character that is longer than the average character, doubling or quadrupling the space needed to represent such strings, and eliminating a significant portion of the space savings the PEP is providing for other applications. Benchmarks may or may not fully reflect the actual requirements of all applications, so conclusions based on benchmarking can easily be blind-sided by the realities of other applications, unless the benchmarks are carefully constructed. It is possible that the ideas in PEP 393, with its support for multiple underlying representations, could be the basis for some more complex representations that would better support characters rather than only supporting code points, but Martin has stated he is not open to additional representations, so the PEP itself cannot be that basis (although with care which may or may not be taken in the implementation of the PEP, the implementation may still provide that basis).

On Wed, Aug 24, 2011 at 11:52 AM, Glenn Linderman <v+python@g.nevcal.com> wrote:
(PEP 393, I presume. :-)
As Tom eloquently describes in the referenced issue (is Tom ever non-eloquent?), not all characters can be represented in a single codepoint.
But this is also besides the point (except insofar where we have to remind ourselves not to confuse the two in docs).
I see nothing wrong with having the language's fundamental data types (i.e., the unicode object, and even the re module) be defined in terms of codepoints, not characters, and I see nothing wrong with len() returning the number of codepoints (as long as it is advertised as such). After all UTF-8 also defines an encoding for a sequence of code points. Characters that require two or more codepoints are not represented specially in UTF-8 -- they are represented as two or more encoded codepoints. The added requirement that UTF-8 must only be used to represent valid characters is just that -- it doesn't affect how strings are encoded, just what is considered valid at a higher level.
There is no doubt that UTF-8 is the most space efficient. I just don't think it is worth giving up O(1) indexing of codepoints -- it would change programmers' expectations too much. OTOH I am sold on getting rid of the added complexities of "narrow builds" where not even all codepoints can be represented without using surrogate pairs (i.e. two code units per codepoint) and indexing uses code units instead of codepoints. I think this is an area where PEP 393 has a huge advantage: users can get rid of their exceptions for narrow builds.
I think you are saying that many apps can ignore the distinction between codepoints and characters. Given the complexity of bidi rendering and normalization (which will always remain an issue) I agree; this is much less likely to be a burden than the narrow-build issues with code units vs. codepoints. What should the stdlib do? It should try to skirt the issue where it can (using the garbage-in-garbage-out principle) and advertise what it supports where there is a difference. I don't see why all the stdlib should be made aware of multi-codepoint-characters and other bidi requirements, but it should be clear to the user who has such requirements which stdlib operations they can safely use.
This seems more of an intuition than a fact. I could easily imagine the facts being that even for large strings, usually either there are no outliers, or there is a significant number of outliers. (E.g. Tom Christiansen's OSCON preso falls in the latter category :-). As long as it *works* I don't really mind that there are some extreme cases that are slow. You'll always have that.
Yeah, it's a learning process.
There is always the possibility of representations that are defined purely by userland code and can only be manipulated by that specific code. But expecting C extensions to support new representations that haven't been defined yet sounds like a bad idea. -- --Guido van Rossum (python.org/~guido)

On 8/24/2011 12:34 PM, Guido van Rossum wrote:
This statement might yet be made true :)
In the docs, yes, and in programmer's minds (influenced by docs).
Me neither.
Yes, this is true. In one sense, though, since UTF-8-supporting code already has to deal with variable length codepoint encoding, support for variable length character encoding seems like a minor extension, not upsetting any concept of fixed-width optimizations, because such cannot be used.
Programmers that have to deal with bidi or composed characters shouldn't have such expectations, of course. But there are many programmers who do not, or at least who think they do not, and they can retain their O(1) expectations, I suppose, until it bites them.
Yep, the only justification for narrow builds is in interfacing to underlying broken OS that happen to use that encoding... it might be slightly more efficient when doing API calls to such an OS. But most interesting programs do much more than I/O.
It would seem helpful if the stdlib could have some support for efficient handling of Unicode characters in some representation. It would help address the class of applications that does care. Adding extra support for Unicode character handling sooner rather than later could be a performance boost to applications that do care about full character support, and I can only see the number of such applications increasing over time. Such could be built as a subtype of str, perhaps, but if done in Python, there would likely be a significant performance hit when going from str to "unicodeCharacterStr".
Yes, it is intuition, regarding memory consumption. It is not at all clear how different the "occasional outlier character" is than your "significant number of outliers". Tom's presentation certainly was regarding bodies of text which varied from ASCII to fully non-ASCII. The memory characteristics of long string handling would certainly be non-intuitive, when you can process a file of size N with a particular program, but can't process a smaller file because it has a funny character in it, and suddenly you are out of space.
While they can and should be prototyped in Python for functional correctness, I would rather expect such representations to be significantly slower in Python than in C. But that is just intuition also. The PEP makes a nice extension to str representations, but I'm not sure it picks the most useful ones, in that while it is picking cases that are well understood and are in use, they may not be the most effective ones (due to the strange memory consumption characteristics that outliers can introduce). My intuition says that a UTF-8 representation (or Tom's/Perl's looser utf8) would be a handy representation to have. But maybe it should be a different type than str... str8? I suppose that is -ideas land.

On Wed, Aug 24, 2011 at 3:29 PM, Glenn Linderman <v+python@g.nevcal.com> wrote:
I claim that we have insufficient understanding of their needs to put anything in the stdlib. Wait and see is a good strategy here.
Sounds like overengineering to me. The right time to add something to the stdlib is when a large number of apps *currently* need something, not when you expect that they might need it in the future. (There just are too many possible futures to plan for them all. YAGNI rules.) -- --Guido van Rossum (python.org/~guido)

Guido van Rossum writes:
In fact, the Unicode Standard, Version 6, goes farther (to code units):

    2.7 Unicode Strings

    A Unicode string data type is simply an ordered sequence of code units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit code units, a Unicode 16-bit string is an ordered sequence of 16-bit code units, and a Unicode 32-bit string is an ordered sequence of 32-bit code units. Depending on the programming environment, a Unicode string may or may not be required to be in the corresponding Unicode encoding form. For example, strings in Java, C#, or ECMAScript are Unicode 16-bit strings, but are not necessarily well-formed UTF-16 sequences. (p. 32)

On Wed, Aug 24, 2011 at 5:36 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
I am assuming that that definition only applies to use of the term "unicode string" within the standard and has no bearing on how programming languages are allowed to use the term, as that would be preposterous. (They can define what they mean by terms like well-formed and conforming etc., and I won't try to go against that. But limiting what can be called a unicode string feels like unproductive coddling.) -- --Guido van Rossum (python.org/~guido)

On Wednesday, August 24, 2011 at 20:52:51, Glenn Linderman wrote:
UTF-8 can use more space than latin1 or UCS2:
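Presumably the examples were along these lines (byte counts of the encoded forms; illustrative only):

    >>> s = "é" * 10                 # Latin-1 range
    >>> len(s.encode("latin-1")), len(s.encode("utf-8"))
    (10, 20)
    >>> t = "€" * 10                 # BMP, fits in UCS2
    >>> len(t.encode("utf-16-le")), len(t.encode("utf-8"))
    (20, 30)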
UTF-8 uses less space than PEP 393 only if you have few non-ASCII characters (or few non-BMP characters). About speed, I guess that O(n) (UTF-8 indexing) is slower than O(1) (PEP 393 indexing).
In these worst cases, PEP 393 is not worse than the current implementation: it uses just as much memory as Python in wide mode (the mode used on Linux and Mac OS X because wchar_t is 32 bits). But it uses double the memory of Python in narrow mode (Windows). I agree that UTF-8 is better in these corner cases, but I also bet that most Python programs will use less memory and will be faster with the PEP 393. You can already try the pep-393 branch on your own programs.
I used stringbench and "./python -m test test_unicode". I plan to try iobench. Which other benchmark tool should be used? Should we write a new one?
I don't think that the *default* Unicode type is the best place for this. The base Unicode type has to be *very* efficient. If you have unusual needs, write your own type. Maybe based on the base type? Victor

On 25 August 2011 07:10, Victor Stinner <victor.stinner@haypocalc.com> wrote:
I think that the PyPy benchmarks (or at least selected tests such as slowspitfire) would probably exercise things quite well. http://speed.pypy.org/about/ Tim Delaney

On Tuesday, August 23, 2011 at 00:14:40, Antoine Pitrou wrote:
I posted a patch to re-add it: http://bugs.python.org/issue12819#msg142867 Victor

On Tue, Aug 23, 2011 at 18:27, Victor Stinner <victor.stinner@haypocalc.com> wrote:
I posted a patch to re-add it: http://bugs.python.org/issue12819#msg142867
Thank you for the patch! Note that this patch adds the fast path only to the helper function which determines the length of the string and the maximum character. The decoding part is still without a fast path for ASCII runs. Regards, Torsten

On 24/08/2011 04:41, Torsten Becker wrote:
Ah? If utf8_max_char_size_and_has_errors() returns no error and maxchar <= 127, memcpy() is used. You mean that memcpy() is too slow? :-)

    maxchar = utf8_max_char_size_and_has_errors(s, size,
                                                &unicode_size, &has_errors);
    if (has_errors) {
        ...
    }
    else {
        unicode = (PyUnicodeObject *)PyUnicode_New(unicode_size, maxchar);
        if (!unicode)
            return NULL;
        /* When the string is ASCII only, just use memcpy and return. */
        if (maxchar < 128) {
            assert(unicode_size == size);
            Py_MEMCPY(PyUnicode_1BYTE_DATA(unicode), s, unicode_size);
            return (PyObject *)unicode;
        }
        ...
    }

But yes, my patch only optimizes ASCII-only strings, not "mostly-ASCII" strings (e.g. 100 ASCII + 1 latin1 character). It can be optimized later. I didn't benchmark my patch. Victor

On Tuesday, August 23, 2011 at 00:14:40, Antoine Pitrou wrote:
Some raw numbers. stringbench:
"147.07 203.07 72.4 TOTAL" for the PEP 393
"146.81 140.39 104.6 TOTAL" for default
=> PEP is 45% slower

run test_unicode 50 times:
0m19.487s for PEP
0m17.187s for default
=> PEP is 13% slower

time ./python -m test -j4 ("real" time):
3m16.886s (334 tests) for the PEP
3m21.984s (335 tests) for default
... default has 1 more test!

Only 13% slower on test_unicode is *good*. There is still a lot of code using the legacy API in unicode.c, so it can be much better. stringbench only shows the overhead of the conversion from compact unicode to Py_UNICODE* (wchar_t*). stringlib does still use the legacy API. Victor

On 8/23/2011 6:38 PM, Victor Stinner wrote:
I ran the same benchmark and couldn't make a distinction in performance between them:

pep-393.txt              182.17 175.47 103.8 TOTAL
cpython.txt              183.26 177.97 103.0 TOTAL
pep-393-wide-unicode.txt 181.61 198.69  91.4 TOTAL
cpython-wide-unicode.txt 181.27 195.58  92.7 TOTAL

I ran it a couple times and have seen either default or pep-393 being up to +/- 10 sec slower on the unicode tests. The results of the 8-bit string tests seem to have less variance on my test machine.
$ time ./python -m test `python -c 'print "test_unicode " * 50'`

pep-393-wide-unicode.txt real 0m33.409s
cpython-wide-unicode.txt real 0m33.489s

Nothing in it for me.. except your system is obviously faster, in general. -- Scott Dial scott@scottdial.com

On 8/24/2011 4:11 AM, Victor Stinner wrote:
You are right. I used the "Get Source" link on bitbucket to save pulling the whole clone, but the "Get Source" link seems to be whatever branch has the latest revision (maybe?) even if you switch branches on the webpage. To correct my previous post:

cpython.txt              183.26 177.97 103.0 TOTAL
cpython-wide-unicode.txt 181.27 195.58  92.7 TOTAL
pep-393.txt              181.40 270.34  67.1 TOTAL

And,

cpython.txt              real 0m32.493s
cpython-wide-unicode.txt real 0m33.489s
pep-393.txt              real 0m36.206s

-- Scott Dial scott@scottdial.com

On Mon, Aug 22, 2011 at 18:14, Antoine Pitrou <solipsis@pitrou.net> wrote:
- You could trim the debug results from the benchmark results, this may make them more readable.
Good point, I removed them from the wiki page. On Tue, Aug 23, 2011 at 18:38, Victor Stinner <victor.stinner@haypocalc.com> wrote:
Thank you Victor for running stringbench, I did not get to it in time. Regards, Torsten

Torsten Becker, 22.08.2011 20:58:
Very cool! I've started fixing up Cython for it. One thing I noticed: on platforms where wchar_t is signed, the comparison to "128U" in the Py_UNICODE_ISSPACE() macro may issue a warning when applied to a Py_UNICODE value (on which it was previously officially defined). For the sake of portability of existing code, this may be worth a work-around. Personally, I wouldn't really mind getting this warning, given that it's better to use Py_UCS4 instead of Py_UNICODE. But it may turn out to be an annoyance for users, because their code that does this isn't actually broken in the new world. And one thing that I find unfortunate is that we need a new (unexpected) _GET_LENGTH() next to the existing (and obvious) _GET_SIZE(), but I guess that's a somewhat small price to pay for backwards compatibility... Stefan

Torsten Becker, 22.08.2011 20:58:
One thing that occurred to me regarding the object struct:

    typedef struct {
        PyObject_HEAD
        Py_ssize_t length;       /* Number of code points in the string */
        void *str;               /* Canonical, smallest-form Unicode buffer */
        Py_hash_t hash;          /* Hash value; -1 if not set */
        int state;               /* != 0 if interned. In this case the two
                                  * references from the dictionary to this
                                  * object are *not* counted in ob_refcnt.
                                  * See SSTATE_KIND_* for other bits */
        Py_ssize_t utf8_length;  /* Number of bytes in utf8, excluding the
                                  * terminating \0. */
        char *utf8;              /* UTF-8 representation (null-terminated) */
        Py_ssize_t wstr_length;  /* Number of code points in wstr, possible
                                  * surrogates count as two code points. */
        wchar_t *wstr;           /* wchar_t representation (null-terminated) */
    } PyUnicodeObject;

Wouldn't the "normal" approach be to use a union for the str field? I.e.

    union str {
        unsigned char* latin1;
        Py_UCS2* ucs2;
        Py_UCS4* ucs4;
    }

Given that they're all pointers, all fields have the same size, but I find it more readable to write

    u.str.latin1

than

    ((const unsigned char*)u.str)

Plus, the three types would be given by the struct, rather than by a per-usage cast. Has this been considered before? Was there a reason to decide against it? Stefan

Has this been considered before? Was there a reason to decide against it?
I think we simply didn't consider it. An early version of the PEP used the lower bits for the pointer to encode the kind, in which case it even stopped being a pointer. Modules are not expected to access this pointer except through the macros, so it may not matter that much. OTOH, it's certainly not too late to change it. Regards, Martin

"Martin v. Löwis", 23.08.2011 15:17:
The difference is that you *could* access them directly in a safe way, if it was a union. So, for an efficient character loop, replicated for performance reasons or for character range handling reasons or whatever, you could just check the string kind and then jump to the loop implementation that handles that type, without using any further macros. Stefan

On Tue, 23 Aug 2011 16:02:54 +0200 Stefan Behnel <stefan_ml@behnel.de> wrote:
Macros are useful to shield the abstraction from the implementation. If you access the members directly, and the unicode object is represented differently in some future version of Python (say e.g. with tagged pointers), your code doesn't compile anymore. Regards Antoine.

Antoine Pitrou, 23.08.2011 16:08:
Even with tagged pointers, you could just provide a macro that unpacks the pointer to the buffer for a given string kind. I don't think there's much more to be done to keep up the abstraction. I don't see a reason to prevent users from accessing the memory buffer directly, especially not by (accidental, as I understand it) obfuscation through a void*. Stefan

Even with tagged pointers, you could just provide a macro that unpacks the pointer to the buffer for a given string kind.
These macros are indeed available.
It's not about preventing them from accessing the representation. It's an "internal public" structure just as all other object layouts (i.e. feel free to use them, but expect them to change with the next release). However, I still think that people rarely will:
- most code treats strings as opaque, just as any other PyObject*
- code that is aware of strings typically wants them in an encoded form, often UTF-8, or whatever the underlying C library expects.
- code that does need to look at individual characters should be fine with the accessor macros.
That said, I can readily believe that Cython would have a use for direct access to the structure. I just wouldn't want people to rewrite their code in four versions (three for the different 3.3 representations, plus one for 3.2 and earlier). Regards, Martin

On Tue, Aug 23, 2011 at 10:08, Antoine Pitrou <solipsis@pitrou.net> wrote:
I agree with Antoine: from the experience of porting C code from 3.2 to the PEP 393 unicode API, the additional encapsulation by macros made it much easier to change the implementation of what is a field, what is a field's actual name, and what needs to be calculated through a function. So, I would like to keep primary access as a macro but I see the point that it would make the struct clearer to access and I would not mind changing the struct to use a union. But then most access currently is through macros so I am not sure how much benefit the union would bring as it mostly complicates the struct definition. Also, common, now simple, checks for "unicode->str == NULL" would look more ambiguous with a union ("unicode->str.latin1 == NULL"). Regards, Torsten

Torsten Becker, 24.08.2011 04:41:
Also, common, now simple, checks for "unicode->str == NULL" would look more ambiguous with a union ("unicode->str.latin1 == NULL").
You could just add yet another field "any", i.e.

    union {
        unsigned char* latin1;
        Py_UCS2* ucs2;
        Py_UCS4* ucs4;
        void* any;
    } str;

That way, the above test becomes

    if (!unicode->str.any)

or

    if (unicode->str.any == NULL)

Or maybe even call it "initialised" to match the intended purpose:

    if (!unicode->str.initialised)

That being said, I don't mind "unicode->str.latin1 == NULL" either, given that it will (as mentioned by others) be hidden behind a macro most of the time anyway. Stefan

On 24/08/2011 04:41, Torsten Becker wrote:
A union helps debugging in gdb: you don't have to cast manually to unsigned char*/Py_UCS2*/Py_UCS4*.
Also, common, now simple, checks for "unicode->str == NULL" would look more ambiguous with a union ("unicode->str.latin1 == NULL").
We can rename "str" to something else, to "data" for example. Victor

On Tue, Aug 23, 2011 at 7:41 PM, Torsten Becker <torsten.becker@gmail.com> wrote:
+1
Also, common, now simple, checks for "unicode->str == NULL" would look more ambiguous with a union ("unicode->str.latin1 == NULL").
You could add an extra union field for that: unicode->str.voidptr == NULL -- --Guido van Rossum (python.org/~guido)

On Monday, August 22, 2011 at 20:58:51, Torsten Becker wrote:
state: lowest 2 bits (mask 0x03) - interned-state (SSTATE_*) as in 3.2
       next 2 bits (mask 0x0C) - form of str:
           00 => reserved
           01 => 1 byte (Latin-1)
           10 => 2 byte (UCS-2)
           11 => 4 byte (UCS-4)
       next bit (mask 0x10): 1 if str memory follows PyUnicodeObject

kind=0 is used and public, it's PyUnicode_WCHAR_KIND. Is it still necessary? It looks to be only used in PyUnicode_DecodeUnicodeEscape(). Victor

On Wednesday, August 24, 2011 at 00:46:16, Victor Stinner wrote:
If it can be removed, it would be nice to have kind in [0; 2] instead of kind in [1; 2], to be able to have a list (of 3 items) => callback or label. I suppose that compilers prefer a switch with all cases defined, 0 as the first item, and contiguous values. We may need an enum. Victor

On Tue, Aug 23, 2011 at 18:56, Victor Stinner <victor.stinner@haypocalc.com> wrote:
It is also used in PyUnicode_DecodeUTF8Stateful() and there might be some cases which I missed converting checks for 0 when I introduced the macro. The question was more if this should be written as 0 or as a named constant. I preferred the named constant for readability. An alternative would be to have kind values be the same as the number of bytes for the string representation so it would be 0 (wstr), 1 (1-byte), 2 (2-byte), or 4 (4-byte). I think the value for wstr/uninitialized/reserved should not be removed. The wstr representation is still used in the error case in the utf8 decoder because these strings can be resized. Also having one designated value for "uninitialized" limits comparisons in the affected functions to the kind value, otherwise they would need to check the str field for NULL to determine in which buffer to write a character.
I suppose that compilers prefer a switch with all cases defined, 0 as the first item, and contiguous values. We may need an enum.
During the Summer of Code, Martin and I did an experiment with GCC and it did not seem to produce a jump table as an optimization for three cases but generated comparison instructions anyway. I am not sure how much we should optimize for potential compiler optimizations here. Regards, Torsten

On 24/08/2011 04:56, Torsten Becker wrote:
Please don't do that: it's more common to need contiguous arrays (for a jump table/callback list) than having to know the character size. You can use an array giving the character size: CHARACTER_SIZE[kind] which is the array {0, 1, 2, 4} (or maybe sizeof(wchar_t) instead of 0 ?).
In Python, you can resize an object if it has only one reference. Why is it not possible in your branch? Oh, I missed the UTF-8 decoder because you wrote "kind = 0": please, use PyUnicode_WCHAR_KIND instead! I don't like the "reserved" value, especially if its value is 0, the first value. See Microsoft file formats: they waste a lot of space because most fields are reserved, and 10 years later, these fields are still unused. Can't we add the value 4 when we need a new kind?
I have to read the code more carefully, I don't know this "uninitialized" state. For kind=0: "wstr" means that str is NULL but wstr is set? I didn't understand that str can be NULL for an initialized string. I should read the PEP again :-)
You mean with a switch with a case for each possible value? I don't think that GCC knows that all cases are defined if you don't use an enum.
I am not sure how much we should optimize for potential compiler optimizations here.
Oh, it was just a suggestion. Sure, it's not the best moment to care about micro-optimizations. Victor

If you use the new API to create a string (knowing how many characters you have, and what the maximum character is), the Unicode object is allocated as a single memory block. It can then not be resized. If you allocate in the old style (i.e. giving NULL as the data pointer, and a length), it still creates a second memory block for the Py_UNICODE[], and allows resizing. When you then call PyUnicode_Ready, the object gets frozen.
I don't get the analogy, or the relationship with the value 0. "Reserving" the value 0 is entirely different from reserving a field. In a field, it wastes space; the value 0 however fills the same space as the values 1,2,3. It's just used to denote an object where the str pointer is not filled out yet, i.e. which can still be resized.
No, a computed jump on the assembler level. Consider this code:

    enum kind {null, ucs1, ucs2, ucs4};
    void foo(void *d, enum kind k, int i, int v)
    {
        switch (k) {
        case ucs1: ((unsigned char*)d)[i] = v; break;
        case ucs2: ((unsigned short*)d)[i] = v; break;
        case ucs4: ((unsigned int*)d)[i] = v; break;
        }
    }

gcc 4.6.1 compiles this to

    foo:
    .LFB0:
        .cfi_startproc
        cmpl    $2, %esi
        je      .L4
        cmpl    $3, %esi
        je      .L5
        cmpl    $1, %esi
        je      .L7
        .p2align 4,,5
        rep ret
        .p2align 4,,10
        .p2align 3
    .L7:
        movslq  %edx, %rdx
        movb    %cl, (%rdi,%rdx)
        ret
        .p2align 4,,10
        .p2align 3
    .L5:
        movslq  %edx, %rdx
        movl    %ecx, (%rdi,%rdx,4)
        ret
        .p2align 4,,10
        .p2align 3
    .L4:
        movslq  %edx, %rdx
        movw    %cx, (%rdi,%rdx,2)
        ret
        .cfi_endproc

As you can see, it generates a chain of compares, rather than an indirect jump through a jump table. Regards, Martin