I was recently surprised to learn that chr(i) can produce a string of length 2 in Python 3.x. I suspect that I am not alone in finding this behavior non-obvious, given that a mistake in the Python manual stating the contrary survived several releases. [1]

Note that I am not arguing that the change was bad. In Python 2.x, \U escapes have been producing surrogate pairs on narrow builds for a long time, if not since the introduction of unicode. I do believe, however, that a change like this [2] and its consequences should be better publicized. I have not found any discussion of this change in PEPs or "What's new" documents. The closest find was a mention of a related issue #3280 in the 3.0 NEWS file. [3]

Since this feature will be first documented in the Library Reference in 3.2, I wonder if it will be appropriate to mention it in "What's new in 3.2"?

[1] http://bugs.python.org/issue7828
[2] http://svn.python.org/view?view=rev&revision=56395
[3] http://www.python.org/download/releases/3.0.1/NEWS.txt
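P.S. A minimal illustration of the behavior in question (the output assumes a narrow build, i.e. sys.maxunicode == 65535):

    >>> len(chr(0x10140))    # U+10140 is outside the BMP
    2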
On Fri, 19 Nov 2010 11:53:58 -0500, Alexander Belopolsky wrote:
Since this feature will be first documented in the Library Reference in 3.2, I wonder if it will be appropriate to mention it in "What's new in 3.2"?
No, since it's not new in 3.2. No need to further confuse users. If there's a porting guide to 3.x, it should be mentioned there.

Regards, Antoine.
Hi,

On Friday 19 November 2010 17:53:58, Alexander Belopolsky wrote:
I was recently surprised to learn that chr(i) can produce a string of length 2 in Python 3.x.
Yes, but only on a narrow build. E.g. Debian and Ubuntu compile Python 3.1 in wide mode (sys.maxunicode == 1114111).
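You can check which kind of build you are running from Python itself (a quick check; the output here is from a wide build):

    >>> import sys
    >>> sys.maxunicode    # 65535 on a narrow build
    1114111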
I suspect that I am not alone in finding this behavior non-obvious, given that a mistake in the Python manual stating the contrary survived several releases. [1]
It was a documentation bug and you fixed it. Non-BMP characters are rare, so few people (maybe only you?) noticed the documentation bug. I consider the behaviour an improvement of non-BMP support in Python3.

Python is unclear about non-BMP characters: the narrow build was called "ucs2" for a long time, even though it is UTF-16 (each character is encoded to one or two UTF-16 words). Python2 accepts non-BMP characters with the \U syntax, but not with chr(). This is inconsistent and I see this as a bug. But I don't want to touch Python2 about non-BMP characters, and the "bug" is already fixed in Python3!
I do believe, however, that a change like this [2] and its consequences should be better publicized.
The change was made before the release of Python 3.0. Do you want to patch the "What's new in Python 3.0?" document?
I have not found any discussion of this change in PEPs or "What's new" documents. The closest find was a mention of a related issue #3280 in the 3.0 NEWS file. [3] Since this feature will be first documented in the Library Reference in 3.2, I wonder if it will be appropriate to mention it in "What's new in 3.2"?
In my opinion, the question is more why it was not fixed in Python2. I suppose that the answer is something ugly like "backward compatibility" or "historical reasons" :-)

Victor
In my opinion, the question is more why it was not fixed in Python2. I suppose that the answer is something ugly like "backward compatibility" or "historical reasons" :-)
No, there was a deliberate decision to not support that; see

http://www.python.org/dev/peps/pep-0261/

There had been a long discussion on this specific detail when PEP 261 was written, and in the end, an explicit, deliberate, considered decision was made to raise a ValueError.

Regards, Martin
On Fri, Nov 19, 2010 at 4:43 PM, "Martin v. Löwis" wrote:
In my opinion, the question is more why it was not fixed in Python2. I suppose that the answer is something ugly like "backward compatibility" or "historical reasons" :-)
No, there was a deliberate decision to not support that, see
http://www.python.org/dev/peps/pep-0261/
There had been a long discussion on this specific detail when PEP 261 was written, and in the end, an explicit, deliberate, considered decision was made to raise a ValueError.
Yes, the existence of PEP 261 was one of the reasons I was surprised that a change like this was made without a deliberation.

Personally, I've never used chr() or ord() other than at the python command prompt. Processing text one character at a time is just too slow in Python, so for my own use cases the change is quite welcome. I also find that bytes() items being int in 3.x more or less removes the need for ord().

On the other hand, any 2.x program that uses unichr() and ord() is very likely to exhibit subtly buggy behavior when ported to 3.x. I don't think len(chr(i)) == 2 is likely to cause problems, but map(ord, s) not being an iterator over code points is likely to break naive programs. This is especially true because, as far as I can tell, there is no easy way to iterate over code points in a Python string on a narrow build.
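For what it's worth, the helper one currently has to write by hand might look like this (a sketch; the name iter_code_points is made up, and there is no error handling for lone surrogates):

    def iter_code_points(s):
        it = iter(s)
        for ch in it:
            o = ord(ch)
            if 0xD800 <= o < 0xDC00:
                # narrow build: combine a high surrogate with the
                # following low surrogate into one code point
                o = 0x10000 + ((o - 0xD800) << 10) + (ord(next(it)) - 0xDC00)
            yield o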
Victor Stinner wrote:
Hi,
On Friday 19 November 2010 17:53:58 Alexander Belopolsky wrote:
I was recently surprised to learn that chr(i) can produce a string of length 2 in Python 3.x.
Yes, but only on a narrow build. E.g. Debian and Ubuntu compile Python 3.1 in wide mode (sys.maxunicode == 1114111).
I suspect that I am not alone in finding this behavior non-obvious, given that a mistake in the Python manual stating the contrary survived several releases. [1]
It was a documentation bug and you fixed it. Non-BMP characters are rare, so few people (maybe only you?) noticed the documentation bug. I consider the behaviour an improvement of non-BMP support in Python3.
Python is unclear about non-BMP characters: the narrow build was called "ucs2" for a long time, even though it is UTF-16 (each character is encoded to one or two UTF-16 words).
No, no, no :-) UCS2 and UCS4 are more appropriate than "narrow" and "wide", or even "UTF-16" and "UTF-32".

It's rather common to confuse a transfer encoding with a storage format. UCS2 and UCS4 refer to code units (the storage format). You can use UCS2 and UCS4 code units to represent UTF-16 and UTF-32 resp., but those are not the same things. In UTF-16, 0xD800 has a special meaning; in UCS2, it doesn't.

Python uses UCS2 internally. It does not assign a special meaning to those surrogate code point ranges. However, when it comes to codecs, we do try to make use of the fact that UCS2 can easily be used to represent an UTF-16 encoding, and that's why you often see surrogates being created for code points that wouldn't otherwise fit into UCS2, and you see those surrogates being converted back to single code units in UCS4 builds.

I don't know who invented the terms "narrow" and "wide" builds for Python3. Not me, that's for sure :-) They don't have any meaning in Unicode terminology and thus cause even more confusion than UCS2 and UCS4. E.g. the import errors you get when importing extensions built for a different Unicode version (correctly) refer to UCS2 vs. UCS4, and now give even less of a clue that they relate to a difference in Unicode builds (since these are now labeled "narrow" and "wide").

IMO, we should go back to the Python2 terms UCS2 and UCS4, which are correct and provide a clear description of what Python uses internally for code units.
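A small illustration of that distinction, for the record: Python happily stores a lone surrogate as an ordinary code unit, on either build:

    >>> lone = '\ud800'            # a lone high surrogate
    >>> len(lone), hex(ord(lone))
    (1, '0xd800')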
Python2 accepts non-BMP characters with the \U syntax, but not with chr(). This is inconsistent and I see this as a bug. But I don't want to touch Python2 about non-BMP characters, and the "bug" is already fixed in Python3!
I do believe, however, that a change like this [2] and its consequences should be better publicized.
The change was made before the release of Python 3.0. Do you want to patch the "What's new in Python 3.0?" document?
Perhaps add a section "What we forgot to mention in 3.0" or "What's not so new in 3.2" to "What's new in 3.2" :-)
I have not found any discussion of this change in PEPs or "What's new" documents. The closest find was a mention of a related issue #3280 in the 3.0 NEWS file. [3] Since this feature will be first documented in the Library Reference in 3.2, I wonder if it will be appropriate to mention it in "What's new in 3.2"?
In my opinion, the question is more why it was not fixed in Python2. I suppose that the answer is something ugly like "backward compatibility" or "historical reasons" :-)
Backwards compatibility. Python2 applications don't expect unichr(i) to return anything other than a single character. If you need this in Python2, it's easy enough to get around, though, with a little helper function; see the sketch below.
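Something along these lines would do (a sketch for a narrow Python 2 build; the name wide_unichr is made up):

    def wide_unichr(i):
        # Like unichr(), but returns a surrogate pair for non-BMP code
        # points on narrow builds instead of raising ValueError.
        try:
            return unichr(i)
        except ValueError:
            i -= 0x10000  # split into a UTF-16 surrogate pair
            return unichr(0xD800 + (i >> 10)) + unichr(0xDC00 + (i & 0x3FF))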
-- Marc-Andre Lemburg
It's rather common to confuse a transfer encoding with a storage format. UCS2 and UCS4 refer to code units (the storage format).
Actually, they don't. Instead, they refer to "coded character sets", in W3C terminology: a mapping of characters to natural numbers. See

http://unicode.org/faq/basic_q.html#14

The term "UCS-2" denotes a character set that can encode only 65536 characters; it thus refers to Unicode 1.1. According to the Unicode Consortium's FAQ, the term UCS-2 should be avoided these days.
IMO, we should go back to the Python2 terms UCS2 and UCS4, which are correct and provide a clear description of what Python uses internally for code units.
No, we shouldn't. The term UCS-2 is deprecated; see above.

Regards, Martin
"Martin v. Löwis" writes:
The term "UCS-2" is a character set that can encode only encode 65536 characters; it thus refers to Unicode 1.1. According to the Unicode Consortium's FAQ, the term UCS-2 should be avoided these days.
So what do you propose we call the Python implementation? You can call it "code-unit-oriented" if you like, but in fact it is identical to UCS-2 for all non-hairsplitting purposes.

AFAICS the Unicode Consortium deprecates the *term* UCS-2 because they would like us to avoid *implementations* that don't encode the full Unicode character set, not because the term is technically incorrect. Strictly speaking, internally Python only encodes 65536 characters in 2-octet builds. Its (Unicode) string-handling code does not know about surrogates at all, AFAIK, and therefore is not UTF-16 conforming. (The anomalies discussed here are type transformations, not string-handling, for my purpose.)

I really don't see why we shouldn't call a UCS-2 implementation by its name. AFAIK this was not supposed to change in Python 3; indexing and slicing go by code unit (isomorphic to UCS-n), not character, and due to PEP 383 4-octet builds do not conform (internally) to UTF-32, and can produce output that conforms to Unicode not at all (as a user option, of course, but it's still non-conformant).
IMO, we should go back to the Python2 terms UCS2 and UCS4, which are correct and provide a clear description of what Python uses internally for code units.
No, we shouldn't. The term UCS-2 is deprecated; see above.
Too bad for the Unicode Consortium, I say. UCS-2 is the closest term that folks who are not Unicode geeks will have a chance of understanding. I agree with Marc-Andre that "narrow" and "wide" are too ambiguous to be useful. Many people will interpret that as "UTF-16" (or even "UTF-8") and "UTF-32", respectively, which is dead wrong. Others won't have a clue. Using "UCS-2" and "UCS-4" has the correct connotations to Unicode geeks, and they are easy to look up for non-geeks who care about precise definitions.

Cf. the second half of the FAQ you quote:

    Instead, "UCS-2" has sometimes been used in the past to indicate
    that an implementation does not support supplementary characters
    and doesn't interpret pairs of surrogate code points as
    characters. Such an implementation would not handle processing
    like character properties, codepoint boundaries, collation, etc.
    for supplementary characters.

"Hey, Python, I'm looking at you!" (Strictly speaking, Python libraries do some of that for us, but the Python *language* does not.)
On 20.11.2010 05:11, Stephen J. Turnbull wrote:
"Martin v. Löwis" writes:
The term "UCS-2" is a character set that can encode only encode 65536 characters; it thus refers to Unicode 1.1. According to the Unicode Consortium's FAQ, the term UCS-2 should be avoided these days.
So what do you propose we call the Python implementation?
A technically correct description would be to say that Python uses either 16-bit code units or 32-bit code units; for brevity, these can be called narrow and wide code units.
Strictly speaking, internally Python only encodes 65536 characters in 2-octet builds. Its (Unicode) string-handling code does not know about surrogates at all, AFAIK
Here you are mistaken: it does indeed know about UTF-16 and surrogates in several places, e.g. in the UTF-8 codec, or in the repr() implementation; likewise in the parser.
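For instance, on a narrow build the UTF-8 codec combines a surrogate pair into a single four-byte sequence (a quick check; the result is the same on a wide build):

    >>> '\U00010140'.encode('utf-8')
    b'\xf0\x90\x85\x80'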
and therefore is not UTF-16 conforming.
I disagree. Python does "conform" to "UTF-16" (certainly in the sense that no UTF-16 specification ever mandates a certain Python API, and that Python follows all general requirements of the UTF-16 specification).
AFAIK this was not supposed to change in Python 3; indexing and slicing go by code unit (isomorphic to UCS-n), not character, and due to PEP 383 4-octet builds do not conform (internally) to UTF-32, and can produce output that conforms to Unicode not at all (as a user option, of course, but it's still non-conformant).
What behavior specifically do you consider non-conforming, and what specific specification do you think it is not conforming to? For example, it *is* fully conforming with UTF-8. Regards, Martin
On Sat, Nov 20, 2010 at 4:05 AM, "Martin v. Löwis" wrote:
A technically correct description would be to say that Python uses either 16-bit code units or 32-bit code units; for brevity, these can be called narrow and wide code units.
+1

PEP 261 introduced the terms "wide Py_UNICODE" and "narrow Py_UNICODE," but when the discussion is at the Python level, I don't think we should use the names of C typedefs. I think "wide/narrow Unicode builds" describes the two options clearly and unambiguously. I prefer Python-specific terminology to Unicode terms because in the Python reference documentation we often discuss details that are outside the scope of the Unicode Standard. For example, the interpretation of lone surrogates on narrow builds is one such detail.
"Martin v. Löwis" writes:
On 20.11.2010 05:11, Stephen J. Turnbull wrote:
"Martin v. Löwis" writes:
The term "UCS-2" is a character set that can encode only encode 65536 characters; it thus refers to Unicode 1.1. According to the Unicode Consortium's FAQ, the term UCS-2 should be avoided these days.
So what do you propose we call the Python implementation?
A technically correct description would be to say that Python uses either 16-bit code units or 32-bit code units; for brevity, these can be called narrow and wide code units.
I agree that's technically correct. Unfortunately, it's also useless to anybody who doesn't already know more about Unicode than anybody should have to know.
and therefore is not UTF-16 conforming.
I disagree. Python does "conform" to "UTF-16"
I'm sure the codecs do. But the Unicode standard doesn't care about the parts of the process, it cares about what it does as a whole. Python's internal coding does not conform to UTF-16, and that internal coding can, under certain conditions, escape to the outside world as invalid "Unicode" output.
AFAIK this was not supposed to change in Python 3; indexing and slicing go by code unit (isomorphic to UCS-n), not character, and due to PEP 383 4-octet builds do not conform (internally) to UTF-32, and can produce output that conforms to Unicode not at all (as a user option, of course, but it's still non-conformant).
What behavior specifically do you consider non-conforming, and what specific specification do you think it is not conforming to? For example, it *is* fully conforming with UTF-8.
Oh,

    f = open('/tmp/broken', 'wt', encoding='utf8', errors='surrogateescape')
    f.write(chr(int('dc80', 16)))
    f.close()

for one. That produces a non-UTF-8 file in a 32-bit-code-unit build. You can say, "oh, but that's not really a UTF-8 codec", and I'd agree. Nevertheless, the program is able to produce output from internal "Unicode" strings that does not conform to Unicode at all. A Unicode-conforming Python implementation would error at the chr() call, or perhaps would not provide surrogateescape error handlers. It is, of course, possible to write Python programs that conform (and easier than in any other language I know), but Python itself does not conform to post-1.1 Unicode standards. Too bad for the standards: "Although practicality beats purity."

The point is that internal code is *not* UTF-16 (or -32), but it *is* isomorphic to UCS-2 (or -4). *That is very useful information to users*; it's not a technical detail of interest only to Unicode geeks. It means that if you stick to defined characters in the BMP when giving Python input, then slicing and indexing unicode (Python 2) or str (Python 3) objects gives only valid output, even in builds with 16-bit code units. OTOH, invalid processing (involving functions like 'chr' or input using surrogateescape codecs) can lead to invalid output even in builds with 32-bit code units.

IMO, saying "UCS-2" or "UCS-4" tells ordinary developers most of what they need to know about the limitations of their Python vis-a-vis full conformance, at least with respect to the string manipulation functions.
On Sun, 21 Nov 2010 21:55:12 +0900, "Stephen J. Turnbull" wrote:
"Martin v. Löwis" writes:
On 20.11.2010 05:11, Stephen J. Turnbull wrote:
"Martin v. Löwis" writes:
The term "UCS-2" is a character set that can encode only encode 65536 characters; it thus refers to Unicode 1.1. According to the Unicode Consortium's FAQ, the term UCS-2 should be avoided these days.
So what do you propose we call the Python implementation?
A technically correct description would be to say that Python uses either 16-bit code units or 32-bit code units; for brevity, these can be called narrow and wide code units.
I agree that's technically correct. Unfortunately, it's also useless to anybody who doesn't already know more about Unicode than anybody should have to know.
[...]
The point is that internal code is *not* UTF-16 (or -32), but it *is* isomorphic to UCS-2 (or -4). *That is very useful information to users*, it's not a technical detail of interest only to Unicode geeks. It means that if you stick to defined characters in the BMP when giving Python input, then slicing and indexing unicode (Python 2) or str (Python 3) objects gives only valid output even in builds with 16-bit code units. OTOH, invalid processing (involving functions like 'chr' or input using surrogateescape codecs) can lead to invalid output even in builds with 32-bit code units.
IMO, saying "UCS-2" or "UCS-4" tells ordinary developers most of what they need to know about the limitations of their Python vis-a-vis full conformance, at least with respect to the string manipulation functions.
I'm sorry, but I have to disagree. As a relative unicode ignoramus, "UCS-2" and "UCS-4" convey almost no information to me, and the bits I have heard about them on this list have only confused me.

On the other hand, I understand that 'narrow' means that fewer bytes are used for each internal character, meaning that some unicode characters need to be represented by more than one string element, and thus that slicing strings containing such characters on a narrow build causes problems. Now, you could tell me the same information using the terms 'UCS-2' and 'UCS-4' instead of 'narrow' and 'wide', but to my ear 'narrow' and 'wide' convey a better gut-level feeling for what is going on than 'UCS-2' and 'UCS-4' do. And it avoids any question of whether or not Python's internal representation actually conforms to whatever standard it is that UCS refers to, a point on which there seems to be some dissension.

Having written the above, I googled for UCS-2 and got the Wikipedia article on UTF-16/UCS-2 [1]. Scanning that article, I do not see anything that would clue me in to the problems of slicing strings in a Python narrow build. Indeed, reading that article with my limited unicode knowledge, if I were told Python used UCS-2, I would assume that non-BMP characters could not be processed by a Python narrow build.

-- R. David Murray www.bitdance.com

[1] http://en.wikipedia.org/wiki/UTF-16/UCS-2
On Nov 21, 2010, at 9:38 AM, R. David Murray wrote:
I'm sorry, but I have to disagree. As a relative unicode ignoramus, "UCS-2" and "UCS-4" convey almost no information to me, and the bits I have heard about them on this list have only confused me.
From the user's point of view, it doesn't much matter which encoding is used internally. Neither UTF-16 nor UCS-2 is exactly correct anyway. The former encodes the entire range of unicode characters in a variable-length code (a character is usually 2 bytes but is sometimes 4 bytes long). The latter encodes only a subset of unicode (the basic multilingual plane) in a fixed-length code of two bytes per character.

What we use internally looks like UTF-16, but a character encoded with 4 bytes is treated as two 2-byte characters (hence the subject of this thread). Our hybrid internal coding lets us handle the entire range of unicode while getting speed and simplicity by doing len() and slicing with a surrogate pair being treated as two separate characters.

For the "wide" build, the entire range of unicode is encoded at 4 bytes per character, and slicing/len operate correctly since every character is the same length. This used to be called UCS-4 and is now UTF-32. So, with "wide" builds there isn't much confusion (except perhaps unfamiliar terminology). The real issue seems to be that for "narrow" builds, none of the usual encoding names is exactly correct.

From a user's point of view, the actual encoding or encoding name doesn't matter much. They just need to be able to predict the relevant behaviors (memory consumption and len/slicing behavior).

For the narrow build, that behavior is:
- Characters in the BMP consume 2 bytes and count as one char for purposes of len and slicing.
- Characters above the BMP consume 4 bytes and count as two distinct chars for purposes of len and slicing.

For wide builds, all characters are 4 bytes and count as a single char for len and slicing.

Hope this helps, Raymond
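P.S. To make that concrete, a small sketch (Python 3; run it under each build to see both results):

    import sys
    s = '\U00010140'                # one non-BMP character
    if sys.maxunicode == 0xFFFF:    # narrow build
        assert len(s) == 2          # stored as a surrogate pair
    else:                           # wide build
        assert len(s) == 1          # one code unit per character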
On Sun, 21 Nov 2010 10:17:57 -0800, Raymond Hettinger wrote:
On Nov 21, 2010, at 9:38 AM, R. David Murray wrote:
I'm sorry, but I have to disagree. As a relative unicode ignoramus, "UCS-2" and "UCS-4" convey almost no information to me, and the bits I have heard about them on this list have only confused me.
[...]
From a user's point of view, the actual encoding or encoding name doesn't matter much. They just need to be able to predict the relevant behaviors (memory consumption and len/slicing behavior).
For the narrow build, that behavior is:
- Characters in the BMP consume 2 bytes and count as one char for purposes of len and slicing.
- Characters above the BMP consume 4 bytes and count as two distinct chars for purposes of len and slicing.
For wide builds, all characters are 4 bytes and count as a single char for len and slicing.
Hope this helps,
Thank you, that nicely summarizes and confirms what I thought I knew about wide versus narrow builds. And as I said, using the names UCS-2/UCS-4 would only *confuse* that understanding, not clarify it. -- R. David Murray www.bitdance.com
Raymond Hettinger writes:
Neither UTF-16 nor UCS-2 is exactly correct anyway.
From a standards lawyer point of view, UCS-2 is exactly correct, as far as I can tell upon rereading ISO 10646-1, especially Annexes H ("retransmitting devices") and Q ("UTF-16"). Annex Q makes it clear that UTF-16 was intentionally designed so that Python-style processing could be done in a UCS-2 context.
For the "wide" build, the entire range of unicode is encoded at 4 bytes per character and slicing/len operate correctly since every character is the same length. This used to be called UCS-4 and is now UTF-32.
That's inaccurate, I believe. UCS-4 is not a UTF, and doesn't satisfy the range restrictions of a UTF.
So, with "wide" builds there isn't much confusion (except perhaps unfamiliar terminology). The real issue seems to be that for "narrow" builds, none of the usual encoding names is exactly correct.
I disagree. I do see a problem with "UCS-2", because it fails to tell us that Python implements a large number of features that make it easy to do a very good job of working with non-BMP data in 16-bit builds of Python, with no extra effort. Python is not perfect, and (rarely) some of the imperfections may be very distressing. But it's very good, and deserves to be advertised as such. However, I don't see how "narrow" tells us more than "UCS-2" does. If "UCS-2" is equally (or more) informative, I prefer it because it is the technically precise, already well-defined, term.
From a user's point of view, the actual encoding or encoding name doesn't matter much. They just need to be able to predict the relevant behaviors (memory consumption and len/slicing behavior).
"UCS-2" indicates those behaviors precisely and concisely. The problems are (a) the lack of familiarity of users with this term, if David is reasonably representative, and (b) the fact that it fails to advertise Python's UTF-16 capabilities. "Narrow" suffers from both of those problems, and further from the fact that it has no independent standard definition. Furthermore, "wide" has a very widespread, platform-dependent meaning derived from wchar_t. If we have to document what the terms we choose mean anyway, why not document the existing terms and reduce entropy, rather than invent new ones and increase entropy?
On 22.11.2010 11:48, Stephen J. Turnbull wrote:
Raymond Hettinger writes:
Neither UTF-16 nor UCS-2 is exactly correct anyway.
From a standards lawyer point of view, UCS-2 is exactly correct, as far as I can tell upon rereading ISO 10646-1, especially Annexes H ("retransmitting devices") and Q ("UTF-16"). Annex Q makes it clear that UTF-16 was intentionally designed so that Python-style processing could be done in a UCS-2 context.
I could only find the FCD of 10646:2010, where annex H was integrated into section 10:

http://www.itscj.ipsj.or.jp/sc2/open/02n4125/FCD10646-Main.pdf

There they have stopped using the term UCS-2, and added a note:

# NOTE – Former editions of this standard included references to a
# two-octet BMP form called UCS-2 which would be a subset
# of the UTF-16 encoding form restricted to the BMP UCS scalar values.
# The UCS-2 form is deprecated.

I think they are now acknowledging that UCS-2 was a misleading term, making it ambiguous whether this refers to a CCS, a CEF, or a CES; like "ASCII", people have been using it for all three of them. Apparently, the ISO WG interprets earlier revisions as saying that UCS-2 is a CEF that restricted UTF-16 to the BMP. THIS IS NOT WHAT PYTHON DOES. In a narrow Python build, the character set is *not* restricted to the BMP. Instead, Unicode strings are meant to be interpreted (by applications) as UTF-16.
For the "wide" build, the entire range of unicode is encoded at 4 bytes per character and slicing/len operate correctly since every character is the same length. This used to be called UCS-4 and is now UTF-32.
That's inaccurate, I believe. UCS-4 is not a UTF, and doesn't satisfy the range restrictions of a UTF.
Not sure what it says in your copy; in mine, section 9.3 says:

# 9.3 UTF-32 (UCS-4)
# UTF-32 (or UCS-4) is the UCS encoding form that assigns each UCS
# scalar value to a single unsigned 32-bit code unit. The terms UTF-32
# and UCS-4 can be used interchangeably to designate this encoding
# form.

so they (now) view the two as synonyms. I think that when ISO 10646 started, they were also fairly confused about these issues (as the group/plane/row/cell structure demonstrates, IMO). This is not surprising, since the notion of byte-based character sets had been ingrained for so long. It took 20 years to learn that a UCS scalar value really is *not* a sequence of bytes, but a natural number.
However, I don't see how "narrow" tells us more than "UCS-2" does. If "UCS-2" is equally (or more) informative, I prefer it because it is the technically precise, already well-defined, term.
But it's not. It is a confusing term, one that the relevant standards bodies are abandoning. After reading FCD 10646:2010, I could agree to call the two implementations UTF-16 and UTF-32 (as these terms designate CEFs). Unfortunately, they also designate CESs.
If we have to document what the terms we choose mean anyway, why not document the existing terms and reduce entropy, rather than invent new ones and increase entropy?
Because the proposed existing term is deprecated. Regards, Martin
Martin, it is really irrelevant whether the standards have decided to no longer use the terms UCS-2 and UCS-4 in their latest standard documents. The definitions still stand (just like Unicode 2.0 is still a valid standard, even if it's ten years old):

* UCS-2 is defined as "Universal Character Set coded in 2 octets" by ISO 10646 (see http://www.unicode.org/versions/Unicode5.2.0/appC.pdf)
* UCS-4 is defined as "Universal Character Set coded in 4 octets" by ISO 10646.

Those two terms have been in use for many years. They refer to the Unicode character set as it can be represented in 2 or 4 bytes. As such they don't include any of the special meanings associated with the UTF transfer encodings: there are no invalid sequences, no invalid code points, etc. as you can find in the UTF encodings. And that's an important detail.

If you interpret them as encodings, they are 1-1 mappings of Unicode code point ordinals to integers represented using 2 or 4 bytes. UCS-2 only supports BMP code points and can conveniently be interpreted as UTF-16, if you need to encode non-BMP code points (which we do in the UTF codecs). UCS-4 also supports non-BMP code points directly.

Now, from an ISO or Unicode Consortium point of view, deprecating the term UCS-2 in *their* standard papers is only natural, since they are actively starting to assign non-BMP code points which cannot be represented in UCS-2. However, this deprecation is only relevant for the purpose of defining the standard. The above definitions are still useful when it comes to defining code units, i.e. the storage format (as opposed to the transfer format). For the purpose of describing the code units we are using in Python they are (still) the most correct terms, and that's also the reason why we chose to use them when introducing the configure options in Python2. There are no other accurate definitions we could use. The terms "narrow" and "wide" are simply too inaccurate to be used as descriptions of UCS-2 and UCS-4 code units.

Please also note that we have used the terms UCS-2 and UCS-4 in Python2 for 9+ years now and users are just starting to learn the difference and get acquainted with the fact that Python uses these two forms. Confronting them with "narrow" and "wide" builds is only going to cause more confusion, not less, and adding those strings to Python package files isn't going to help much either, since the terms don't convey any relationship to Unicode:

    package-3.1.3.linux-x86_64-py2.6_ucs2.egg
    vs.
    package-3.1.3.linux-x86_64-py2.6_narrow.egg

I opt for switching to the following config options:

    --with-unicode=ucs2 (default)
    --with-unicode=ucs4

and using "UCS-2" and "UCS-4" in the Python documentation when describing the two different build modes. We can add glossary entries for the two which clarify the differences.

Python2 used --enable-unicode=ucs2/ucs4, but since Python3 doesn't build without Unicode support, the above two versions appear more appropriate. We can keep the alternative --with-wide-unicode as an alias for --with-unicode=ucs4 to maintain 3.x backwards compatibility.

Cheers,

-- Marc-Andre Lemburg
Why don't y'all just call them "--unichar-width=16/32"? That describes precisely what the options do, and doesn't invite any quibbling over definitions.

James
On Mon, Nov 22, 2010 at 10:47 PM, M.-A. Lemburg wrote:
Please also note that we have used the terms UCS-2 and UCS-4 in Python2 for 9+ years now and users are just starting to learn the difference and get acquainted with the fact that Python uses these two forms.
Confronting them with "narrow" and "wide" builds is only going to cause more confusion, not less, and adding those strings to Python package files isn't going to help much either, since the terms don't convey any relationship to Unicode:
I was personally surprised to learn in this discussion that there had even been an *attempt* to change the names of the two build variants to anything other than UCS2/UCS4. The concrete API implementations certainly still use those two terms to prevent inadvertent linkage with the wrong version of the C API.

For practical purposes, UCS2/UCS4 convey far more inherent information than narrow/wide:

- many developers will recognise them as Unicode related, even if they don't know exactly what they mean
- even those that don't recognise them can soon learn that they're Unicode related just by plugging them into Google*
- a bit more digging should reveal that they're Unicode storage formats closely related to the UTF-16 and UTF-32 transfer encodings respectively*

*(The first Google hit for "ucs2" is the UTF-16/UCS-2 article on Wikipedia, the first hit for "ucs4" is the UTF-32/UCS-4 article)

All that just armed with Google, without even looking at the Python docs specifically. So don't just think about "what will developers know?", also think about "what will developers know, and what will a quick trip to a search engine tell them?". And once you take that stance, the overly generic narrow/wide terms fail, badly.

+1 for MAL's suggested tweaks to the Py3k configure options.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Mon, Nov 22, 2010 at 10:37 AM, Nick Coghlan wrote:
*(The first Google hit for "ucs2" is the UTF-16/UCS-2 article on Wikipedia, the first hit for "ucs4" is the UTF-32/UCS-4 article)
Do you think these articles are helpful for someone learning how to use chr() and ord() in Python for the first time?
On Tue, Nov 23, 2010 at 2:03 AM, Alexander Belopolsky wrote:
On Mon, Nov 22, 2010 at 10:37 AM, Nick Coghlan wrote:
[...]
*(The first Google hit for "ucs2" is the UTF-16/UCS-2 article on Wikipedia, the first hit for "ucs4" is the UTF-32/UCS-4 article)
Do you think these articles are helpful for someone learning how to use chr() and ord() in Python for the first time?
No, that's what the documentation of chr() and ord() is for. For that use case, it doesn't matter *what* the terms are. They could say "in a FOO build this will do X, in a BAR build it will do Y, see <link> for a detailed explanation of the differences between FOO and BAR builds of Python" and be perfectly adequate for the task. If there is no appropriate documentation link to point to (probably somewhere in the C API docs if it isn't anywhere else) then that is a key issue that needs to be fixed, rather than trying to change the terms that have been in use for the better part of a decade already.

The raw meaning of UCS2/UCS4 mainly comes into the story when people are encountering this as a config option when building Python. The whole idea of changing the terms for the two build types *should* have been short-circuited by the "status quo wins a stalemate" guideline, but apparently that didn't happen at the time.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Mon, Nov 22, 2010 at 11:13 AM, Nick Coghlan wrote:
Do you think these articles are helpful for someone learning how to use chr() and ord() in Python for the first time?
No, that's what the documentation of chr() and ord() is for. For that use case, it doesn't matter *what* the terms are.
I recently updated chr() and ord() documentation and used "narrow/wide" terms. I thought USC2/4 proponents objected to that on the basis that these terms are imprecise.

http://docs.python.org/dev/library/functions.html#chr
http://docs.python.org/dev/library/functions.html#ord
They could say "in a FOO build this will do X, in a BAR build it will do Y, see <link> for a detailed explanation of the differences between FOO and BAR builds of Python" and be perfectly adequate for the task. If there is no appropriate documentation link to point to (probably somewhere in the C API docs if it isn't anywhere else) then that is a key issue that needs to be fixed, rather than trying to change the terms that have been in use for the better part of a decade already.
That's the point that I was trying to make. Using somewhat vague narrow/wide terms gives us an opportunity to describe exactly what is going on without confusing the reader with the intricacies of the Unicode Standard or Python's compliance with a particular version of it.
The raw meaning of UCS2/UCS4 mainly comes into the story when people are encountering this as a config option when building Python. The whole idea of changing the terms for the two build types *should* have been short circuited by the "status quo wins a stalemate" guideline, but apparently that didn't happen at the time.
It also comes up in the "Data model" reference section on strings, which is currently out of date:

"""
Strings
  The items of a string object are Unicode code units. A Unicode code
  unit is represented by a string object of one item and can hold
  either a 16-bit or 32-bit value representing a Unicode ordinal (the
  maximum value for the ordinal is given in sys.maxunicode, and
  depends on how Python is configured at compile time). Surrogate
  pairs may be present in the Unicode object, and will be reported as
  two separate items. The built-in functions chr() and ord() convert
  between code units and nonnegative integers representing the
  Unicode ordinals as defined in the Unicode Standard 3.0. Conversion
  from and to other encodings are possible through the string method
  encode().
"""

http://docs.python.org/dev/reference/datamodel.html

The out-of-date part is the reference to the Unicode Standard 3.0. I don't think we should refer to a specific version of Unicode here. It has little consequence for the "Python data model" and AFAICT does not come into play anywhere except unicodedata, which is currently at version 6.0. The description of chr() and ord() is also not accurate on narrow builds, and neither is the statement "The items of a string object are Unicode code units."
On Mon, 22 Nov 2010 12:00:14 -0500, Alexander Belopolsky wrote:
I recently updated chr() and ord() documentation and used "narrow/wide" terms. I thought USC2/4 proponents objected to that on the basis that these terms are imprecise.
For reference, a grep in py3k/Doc reveals that there are currently exactly 23 lines mentioning UCS2 or UCS4 in the docs. Most are in the unicode part of the c-api, and 6 are in what's new for 2.2:

c-api/arg.rst: Convert a null-terminated buffer of Unicode (UCS-2 or UCS-4) data to a Python
c-api/arg.rst: Convert a Unicode (UCS-2 or UCS-4) data buffer and its length to a Python
c-api/unicode.rst: for :c:type:`Py_UNICODE` and store Unicode values internally as UCS2. It is also
c-api/unicode.rst: possible to build a UCS4 version of Python (most recent Linux distributions come
c-api/unicode.rst: with UCS4 builds of Python). These builds then use a 32-bit type for
c-api/unicode.rst: :c:type:`Py_UNICODE` and store Unicode data internally as UCS4. On platforms
c-api/unicode.rst: short` (UCS2) or :c:type:`unsigned long` (UCS4).
c-api/unicode.rst:Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
c-api/unicode.rst: values is interpreted as an UCS-2 character.
whatsnew/2.2.rst:usually stored as UCS-2, as 16-bit unsigned integers. Python 2.2 can also be
whatsnew/2.2.rst:compiled to use UCS-4, 32-bit unsigned integers, as its internal encoding by
whatsnew/2.2.rst:supplying :option:`--enable-unicode=ucs4` to the configure script. (It's also
whatsnew/2.2.rst:When built to use UCS-4 (a "wide Python"), the interpreter can natively handle
whatsnew/2.2.rst:compiled to use UCS-2 (a "narrow Python"), values greater than 65535 will still
whatsnew/2.2.rst:Marc-André Lemburg. The changes to support using UCS-4 internally were
howto/unicode.rst:.. comment Additional topic: building Python w/ UCS2 or UCS4 support
howto/unicode.rst: - [ ] Building Python (UCS2, UCS4)
library/sys.rst: characters are stored as UCS-2 or UCS-4.
library/json.rst: specified. Encodings that are not ASCII based (such as UCS-2) are not
faq/extending.rst:When importing module X, why do I get "undefined symbol: PyUnicodeUCS2*"?
faq/extending.rst:If instead the name of the undefined symbol starts with ``PyUnicodeUCS4``, the
faq/extending.rst: ... print('UCS4 build')
faq/extending.rst: ... print('UCS2 build')

-- R. David Murray www.bitdance.com
On Mon, Nov 22, 2010 at 12:30 PM, R. David Murray wrote:
For reference, a grep in py3k/Doc reveals that there are currently exactly 23 lines mentioning UCS2 or UCS4 in the docs.
Did you grep for USC-2 and USC-4 as well? I have to admit that my aversion to these terms is mostly due to the fact that I don't know how to spell them correctly. :-)
On Mon, 22 Nov 2010 12:37:59 -0500, Alexander Belopolsky wrote:
On Mon, Nov 22, 2010 at 12:30 PM, R. David Murray wrote:
[...]
For reference, a grep in py3k/Doc reveals that there are currently exactly 23 lines mentioning UCS2 or UCS4 in the docs.
Did you grep for USC-2 and USC-4 as well? I have to admit that my aversion to these terms is mostly due to the fact that I don't know how to spell them correctly. :-)
I grepped using "-ri ucs." and eliminated the false positives (of which there were only a few) by hand. -- R. David Murray www.bitdance.com
Nick Coghlan writes:
For practical purposes, UCS2/UCS4 convey far more inherent information than narrow/wide:
That was my stance, but in fact (1) the ISO JTC1/SC2 has deliberately made them ambiguous by changing their definitions over the years [1], and (2) the more recent definitions and "interpretations" of UCS-2 *prohibit* use of surrogates in UCS-2, as far as I can tell. And that's what you'll see everywhere you look, because Wikipedia and friends pick up the most recent versions of everything.
So don't just think about "what will developers know?", also think about "what will developers know, and what will a quick trip to a search engine tell them?".
It will tell them that UCS-2 cannot even *express* non-BMP characters. Terry and David are *not* dummies, and that's what they got from more or less careful study of the issue.
And once you take that stance, the overly generic narrow/wide terms fail, badly.
I still agree that something more accurate would be nice, but face it: the ISO will redefine and deprecate such terms as soon as they notice us using them.<wink>
+1 for MAL's suggested tweaks to the Py3k configure options.
Despite my natural sympathy for your arguments, and MAL's, I'm still -1. I really wish I could switch back, but it seems to me that "UCS-2" is a liability we don't need, *especially* on Windows where the default build is presumably going to be UCS2 forever.

Footnotes:
[1] You'd think it would be hard to change the definition of UCS-4, but they managed. :-(
If you don't care about the ISO standard, but only about Python, Martin's right, I was wrong. You can stop reading now.<wink>

"Martin v. Löwis" writes:
I could only find the FCD of 10646:2010, where annex H was integrated into section 10:
Thank you for the reference. I referred to two older versions, 10646-1:1993 (for the annexes and Amendment, and my basic understanding) and 10646:2003 (for the detailed definition of UCS-2 in Sections 7, 8 and 13; unfortunately, I missed the most important detail, which is in Section 9). In :2003 the Annex I referred to as "Annex H" is Annex J, and "Annex Q" is partly in Section 9.1 and mostly in Annex C. I don't know where the former is in the 2010 FCD, and the latter is section 9.2.
I think they are now acknowledging that UCS-2 was a misleading term, making it ambiguous whether this refers to a CCS, a CEF, or a CES; like "ASCII", people have been using it for all three of them.
In :1993 it wasn't ambiguous, they simply didn't make those distinctions. They were not needed for ISO 10646's published versions, although they certainly are for Unicode. Now, quite clearly, the ISO has *changed the definition* in every new version, progressively adding new restrictions that go beyond clarifying ambiguity. But even in :2003, in view of 4.2, 6.2, 6.3, and 13.1, UCS-2 is clearly well-defined as a CM according to UTR#17, which can probably be identified with CCS in :2003 terminology. Ie, returning to UTR#17 terminology, it is the composition of a CES, a CEF, and a CCS, which are not defined individually.

Note: The definition of "coded character" changed between :2003 and the 2010 FCD, from "character with representation" to "character with integer". There is a NOTE indicating that 16-bit integers may be used in processing. Given that this is a non-normative note, I take it to mean that in an array of 16-bit integers, "most significant octet" is to be interpreted in the natural way for the architecture rather than by the representation in memory, which might be little-endian. IMO it's unnatural to think that that changes the definition of UCS-2 to be either a CEF, or a composition of a CEF and a CCS.
Apparently, the ISO WG interprets earlier revisions as saying that UCS-2 is a CEF that restricted UTF-16 to the BMP.
I think that ISO 10646-1:1993 admits only one interpretation, a CM restricted to the BMP (including surrogates), and ISO 10646:2003 admits only one interpretation, a CM restricted to the BMP (not including surrogates). The note under Table 4 on p.24 of the FCD is, uh, well, a lie. Earlier versions certainly did not restrict to "scalar values"; they had no such concept.
THIS IS NOT WHAT PYTHON DOES.
Well, no shit, Sherlock. You don't have to yell at me, I know what Python does. The question is, what does UCS-2 do? The answer is that in :1993, AFAICT it did what Python does. In :2003, they added (last sentence, section 9.1):

    UCS-2 cannot be used to represent any characters on the
    supplementary planes.

I assume they maintain that position in 2010, so End Of Thread. I apologize for missing that when I was reviewing the standard earlier, but I expected restrictions on UCS-2 to be explained in 13.1 or perhaps 14. And 13.1 simply requires that characters in the BMP be represented by their defined code positions, truncated to two octets. Like earlier versions, it doesn't prohibit use of surrogates or say that non-BMP characters can't be represented.
Not sure what it says in your copy; in mine, section 9.3 says
[snip] Mine (:2003) says "NOTE 2 - When confined to the code positions in Planes 00 to 10, UCS-4 is also referred to as UCS Transformation Format 32 (UTF-32)." Then it references the Unicode Standard (v4.0) as the authority for UTF-32.

Obviously they continued to be confused at this point in time; by the draft you have, apparently the WG had decided to pretty much completely synchronize the whole standard to a subset of Unicode. This seems pointless to me (unlike, say, the work that has been done on standardizing criteria for repertoire changes). In particular, the :1993 definition of UCS-2 was a perfectly good standard for describing the processing Python actually does internally. The current definition of UCS-2 as identical to the BMP is useless, and good riddance, I'm perfectly happy to have them deprecate it.
On 11/22/2010 5:48 AM, Stephen J. Turnbull wrote:
I disagree. I do see a problem with "UCS-2", because it fails to tell us that Python implements a large number of features that make it easy to do a very good job of working with non-BMP data in 16-bit builds of Python.
Yes. As I read the standard, UCS-2 is limited to BMP chars. So I was a bit confused when Python was described as UCS-2, until I realized that the term was inaccurate. Using that term punishes people like me who take the time to read the standard or otherwise learn what the term means. What Python does might be called USC-2+ or UCS-2e (xtended). -- Terry Jan Reedy
On Nov 22, 2010, at 9:41 AM, Terry Reedy wrote:
On 11/22/2010 5:48 AM, Stephen J. Turnbull wrote:
I disagree. I do see a problem with "UCS-2", because it fails to tell us that Python implements a large number of features that make it easy to do a very good job of working with non-BMP data in 16-bit builds of Python.
Yes. As I read the standard, UCS-2 is limited to BMP chars. So I was a bit confused when Python was described as UCS-2, until I realized that the term was inaccurate. Using that term punishes people like me who take the time to read the standard or otherwise learn what the term means.
Bingo! Thanks for the excellent summary of the problem.
What Python does might be called USC-2+ or UCS-2e (xtended).
That would be a step in the right direction. Raymond
On Mon, Nov 22, 2010 at 12:41 PM, Terry Reedy wrote:
What Python does might be called USC-2+ or UCS-2e (xtended).
Wow! I am not the only one who can't get the order of letters right in these acronyms. (I am usually consistent within one sentence, though.) :-) I-can't-spell-three-letter-acronyms-right-ly yours ...
Terry Reedy writes:
Yes. As I read the standard, UCS-2 is limited to BMP chars.
Et tu, Terry? OK, I change my vote on the suggestion of "UCS2" to -1. If a couple of conscientious blokes like you and David both understand it that way, I can't see any way to fight it.

FWIW, ISO/IEC 10646 (which is authoritative for UCS-2 and UCS-4) is available via http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html

Probably I'm the last non-author to ever read that document!
On Nov 22, 2010, at 2:48 AM, Stephen J. Turnbull wrote:
Raymond Hettinger writes:
Neither UTF-16 nor UCS-2 is exactly correct anyway.
From a standards lawyer point of view, UCS-2 is exactly correct,
You're twisting yourself into definitional knots. Any explanation we give users needs to let them know two things:

* that we cover the entire range of unicode, not just the BMP
* that sometimes len(chr(i)) is one and sometimes two

The term UCS-2 is a complete communications failure in that regard. If someone looks up the term, they will immediately see something like the wikipedia entry which says, "UCS-2 cannot represent code points outside the BMP". How is that helpful?

Raymond
Raymond Hettinger wrote:
Any explanation we give users needs to let them know two things:
* that we cover the entire range of unicode, not just the BMP
* that sometimes len(chr(i)) is one and sometimes two
The term UCS-2 is a complete communications failure in that regard. If someone looks up the term, they will immediately see something like the wikipedia entry which says, "UCS-2 cannot represent code points outside the BMP". How is that helpful?
It's very helpful, since it explains why a UCS-2 build of Python requires a surrogate pair to represent a non-BMP code point, and explains why chr(i) gives you a length-2 string rather than a length-1 string. A UCS-4 build does not need to use surrogates for this, hence you get a length-1 string from chr(i).

There are two levels we have to explain to users:

1. the transfer level
2. the storage level

The UTF encodings address the transfer level and are what you deal with in I/O. These provide variable-length encodings of the complete Unicode code point range, regardless of whether you have a UCS-2 or a UCS-4 build.

The storage level becomes important if you want to work on strings using indexing and slicing. Here you do have to know whether you're dealing with a UCS-2 or a UCS-4 build, since the indexes will vary if you're using non-BMP code points.

Finally, to tie both together, we have to explain that UTF-16 (the transfer encoding) maps to UCS-2 in a straightforward way, so it is possible to work with a UCS-2 build of Python and still use the complete Unicode code point range - you only have to take into consideration that Python's string indexing will not necessarily point you to the n-th code point in a string, but may well give you one half of a surrogate pair.

Note that while that last aspect may appear like a good argument for UCS-4 builds, in reality it is not. UCS-4 has the same issue on a different level: the letters that get printed on the screen or printer (graphemes) may well be made up of multiple combining code points, e.g. an "e" and an "´". Those again map to two indexes in the Python string, even though they appear to be one character on output; see the example below.

Now try to explain all of the above using the terms "narrow" and "wide" (while remembering "explicit is better than implicit" and "avoid the temptation to guess") :-)

It is not really helpful to replace a correct and accurate term with a fuzzy term: either way we're stuck with the semantics. However, the correct and accurate terms at least give you a chance to figure out and understand the reasoning behind the design. UCS-2 vs. UCS-4 is a trade-off, "narrow" and "wide" is marketing talk with an implicit emphasis on one side :-)
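To see the combining-character point in action (a small example; U+0301 is COMBINING ACUTE ACCENT):

    >>> import unicodedata
    >>> s = 'e\u0301'                         # rendered as one letter on output
    >>> len(s)                                # but two code points, on any build
    2
    >>> len(unicodedata.normalize('NFC', s))  # the precomposed form is one
    1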
On Mon, Nov 22, 2010 at 1:13 PM, Raymond Hettinger
Any explanation we give users needs to let them know two things:
* that we cover the entire range of Unicode, not just the BMP
* that sometimes len(chr(i)) is one and sometimes two
This discussion motivated me to start looking into how well the Python library itself is prepared to deal with len(chr(i)) = 2. I was not surprised to find that textwrap does not handle the issue that well:
>>> len(wrap(' \U00010140' * 80, 20))
12
>>> len(wrap(' \U00000140' * 80, 20))
8
That module should probably be rewritten to properly implement the Unicode line breaking algorithm http://unicode.org/reports/tr14/tr14-22.html. Yet finding a bug in a str object method after a 5 min review was a bit discouraging:
>>> 'xyz'.center(20, '\U00010140')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: The fill character must be exactly one character long
Given the apparent difficulty of writing even basic text processing algorithms in the presence of surrogate pairs, I wonder how wise it is to expose Python users to them. As Wikipedia explains, [1]

"""
Because the most commonly used characters are all in the Basic Multilingual Plane, converting between surrogate pairs and the original values is often not tested thoroughly. This leads to persistent bugs, and potential security holes, even in popular and well-reviewed application software.
"""

Since UCS-2 (the Character Encoding Form (CEF)) is now defined [2] to cover only the BMP, maybe rather than changing the terms used in the reference manual, we should tighten the code to conform to the updated standards? Again, given that the str object itself has at least one non-BMP character bug as we are closing in on the third major release of py3k, how likely are 3rd party developers to get their libraries right as they port to 3.x?

[1] http://en.wikipedia.org/wiki/UTF-16/UCS-2
[2] http://unicode.org/reports/tr17/#CharacterEncodingForm
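[For what it's worth, this is the kind of helper that narrow-build-aware code ends up needing - a sketch only; codepoint_len is a hypothetical name, and it assumes well-formed surrogate pairs:

import sys

def codepoint_len(s):
    # On a wide build, len() already counts code points.
    if sys.maxunicode > 0xFFFF:
        return len(s)
    # On a narrow build, len() counts UTF-16 code units; subtract one
    # for each low surrogate, i.e. the second half of every pair.
    return len(s) - sum(1 for ch in s if 0xDC00 <= ord(ch) <= 0xDFFF)
]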
Alexander Belopolsky wrote:
On Mon, Nov 22, 2010 at 1:13 PM, Raymond Hettinger
wrote: .. Any explanation we give users needs to let them know two things:
* that we cover the entire range of Unicode, not just the BMP
* that sometimes len(chr(i)) is one and sometimes two
This discussion motivated me to start looking into how well the Python library itself is prepared to deal with len(chr(i)) = 2. I was not surprised to find that textwrap does not handle the issue that well:
>>> len(wrap(' \U00010140' * 80, 20))
12
>>> len(wrap(' \U00000140' * 80, 20))
8
That module should probably be rewritten to properly implement the Unicode line breaking algorithm http://unicode.org/reports/tr14/tr14-22.html.
Yet finding a bug in a str object method after a 5 min review was a bit discouraging:
>>> 'xyz'.center(20, '\U00010140')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: The fill character must be exactly one character long
Given the apparent difficulty of writing even basic text processing algorithms in the presence of surrogate pairs, I wonder how wise it is to expose Python users to them.
What's the alternative? Without surrogates, Python users with a UCS-2 build (e.g. Windows Python users) would not be able to play with non-BMP code points. IMHO, it's better to fix the stdlib. This is a long process, as you can see with the Python3 stdlib evolution, but Python will eventually get there.
As Wikipedia explains, [1]
""" Because the most commonly used characters are all in the Basic Multilingual Plane, converting between surrogate pairs and the original values is often not tested thoroughly. This leads to persistent bugs, and potential security holes, even in popular and well-reviewed application software. """
Since UCS-2 (the Character Encoding Form (CEF)) is now defined [2] to cover only the BMP, maybe rather than changing the terms used in the reference manual, we should tighten the code to conform to the updated standards?
Can we please stop turning this around over and over again :-)

UCS-2 has never supported anything other than the BMP. However, you can interpret sequences of UCS-2 code units as UTF-16 and then get access to the full Unicode character set. We've been doing this in codecs ever since UCS-4 builds were introduced some 8-9 years ago.

The change to have chr(i) return surrogates on UCS-2 builds was perhaps done too early, but then, without such changes you'd never notice that your code doesn't work well with surrogates. It's just one piece of the puzzle when going from 8-bit strings to Unicode.
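[The UTF-16 interpretation MAL describes is plain arithmetic. A sketch - the function names are mine, not any stdlib API:

def decode_surrogate_pair(high, low):
    # Combine a UTF-16 high/low surrogate pair into one code point.
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def encode_surrogate_pair(cp):
    # Split a non-BMP code point into a UTF-16 surrogate pair.
    assert cp > 0xFFFF
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

assert decode_surrogate_pair(0xD800, 0xDD40) == 0x10140
assert encode_surrogate_pair(0x10140) == (0xD800, 0xDD40)
]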
Again, given that the str object itself has at least one non-BMP character bug as we are closing in on the third major release of py3k, how likely are 3rd party developers to get their libraries right as they port to 3.x?
[1] http://en.wikipedia.org/wiki/UTF-16/UCS-2 [2] http://unicode.org/reports/tr17/#CharacterEncodingForm
-- Marc-Andre Lemburg
On 11/23/2010 2:11 PM, Alexander Belopolsky wrote:
This discussion motivated me to start looking into how well the Python library itself is prepared to deal with len(chr(i)) = 2. I was not
Good idea!
surprised to find that textwrap does not handle the issue that well:
>>> len(wrap(' \U00010140' * 80, 20))
12
>>> len(wrap(' \U00000140' * 80, 20))
8
How well does textwrap handle composable pairs (letter + accent)? Does it count two code points as one char space, and does it avoid putting line breaks between them? I suspect textwrap should be regarded as (extended?)_ascii_textwrap.
That module should probably be rewritten to properly implement the Unicode line breaking algorithm http://unicode.org/reports/tr14/tr14-22.html.
Probably a good idea
Yet finding a bug in a str object method after a 5 min review was a bit discouraging:
>>> 'xyz'.center(20, '\U00010140')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: The fill character must be exactly one character long
Again, what does it do with letter + decorator combinations? It seems to me that the whole notion that one code point == one printed character space is broken once one leaves ASCII. Perhaps we need an is_uchar function to recognize multi-code sequences, including surrogate pairs, that represent one char for the purpose of character-oriented functions.
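[Something along the lines Terry suggests could be sketched with unicodedata, though this only groups combining marks and narrow-build surrogate pairs - nowhere near full grapheme clustering per UAX #29, and iter_units is a hypothetical name:

import unicodedata

def iter_units(s):
    # Yield "user-perceived" units: a base character together with any
    # trailing combining marks, keeping surrogate pairs together.
    i, n = 0, len(s)
    while i < n:
        j = i + 1
        if 0xD800 <= ord(s[i]) <= 0xDBFF and j < n:
            j += 1  # include the low surrogate of a pair
        while j < n and unicodedata.combining(s[j]):
            j += 1  # include combining marks
        yield s[i:j]
        i = j

print(list(iter_units('e\u0301x')))  # ['é', 'x']
]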
Given the apparent difficulty of writing even basic text processing algorithms in the presence of surrogate pairs, I wonder how wise it is to expose Python users to them. As Wikipedia explains, [1]
""" Because the most commonly used characters are all in the Basic Multilingual Plane, converting between surrogate pairs and the original values is often not tested thoroughly. This leads to persistent bugs, and potential security holes, even in popular and well-reviewed application software. """
So we did not test thoroughly enough and need to add appropriate unit tests as bugs are fixed. -- Terry Jan Reedy
Alexander Belopolsky wrote:
""" Because the most commonly used characters are all in the Basic Multilingual Plane, converting between surrogate pairs and the original values is often not tested thoroughly. This leads to persistent bugs, and potential security holes, even in popular and well-reviewed application software. """
Maybe Python should have used UTF-8 as its internal unicode representation. Then people who were foolish enough to assume one character per string item would have their programs break rather soon under only light unicode testing. :-) -- Greg
On Nov 23, 2010, at 6:49 PM, Greg Ewing wrote:
Maybe Python should have used UTF-8 as its internal unicode representation. Then people who were foolish enough to assume one character per string item would have their programs break rather soon under only light unicode testing. :-)
You put a smiley, but, in all seriousness, I think that's actually the right thing to do if anyone writes a new programming language. It is clearly the right thing if you don't have to be concerned with backwards-compatibility: nobody really needs to be able to access the Nth codepoint in a string in constant time, so there's not really any point in storing a vector of codepoints. Instead, provide bidirectional iterators which can traverse the string by byte, codepoint, or by grapheme (that is: the set of combining characters + base character that go together, making up one thing which a human would think of as a character). James
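[In UTF-8 those codepoint boundaries are self-delimiting, which is what makes such iterators cheap: every continuation byte matches the bit pattern 10xxxxxx. A sketch over a bytes object - the function name is mine:

def iter_codepoint_offsets(data):
    # Yield the byte offset at which each code point starts.
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:  # not a continuation byte
            yield i

text = 'a\u00e9\U00010140'.encode('utf-8')
print(list(iter_codepoint_offsets(text)))  # [0, 1, 3]
]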
On Nov 23, 2010, at 7:22 PM, James Y Knight wrote:
On Nov 23, 2010, at 6:49 PM, Greg Ewing wrote:
Maybe Python should have used UTF-8 as its internal unicode representation. Then people who were foolish enough to assume one character per string item would have their programs break rather soon under only light unicode testing. :-)
You put a smiley, but, in all seriousness, I think that's actually the right thing to do if anyone writes a new programming language. It is clearly the right thing if you don't have to be concerned with backwards-compatibility: nobody really needs to be able to access the Nth codepoint in a string in constant time, so there's not really any point in storing a vector of codepoints.
Instead, provide bidirectional iterators which can traverse the string by byte, codepoint, or by grapheme (that is: the set of combining characters + base character that go together, making up one thing which a human would think of as a character).
I really hope that this idea is not just for new programming languages. If you switch from doing unicode "wrong" to doing unicode "right" in Python, you quadruple the memory footprint of programs which primarily store and manipulate large amounts of text.

This is especially ridiculous in PyGTK applications, where the internal representation required by the GUI is UTF-8 anyway, so the round-tripping of string data back and forth to the exploded UTF-32 representation is wasting gobs of memory and time. It at least makes sense when your C library's idea about character width and your Python build match up. But, in a desktop app this is unlikely to be a performance concern; in servers, it's a big deal; measurably so.

I am pretty sure that in the server apps that I work on, we are eventually going to need our own string type and UTF-8 logic that does exactly what James suggested - certainly if we ever hope to support Py3. (I dimly recall that both James and I have made this point before, but it's pretty important, so it bears repeating.)
James Y Knight writes:
You put a smiley, but, in all seriousness, I think that's actually the right thing to do if anyone writes a new programming language. It is clearly the right thing if you don't have to be concerned with backwards-compatibility: nobody really needs to be able to access the Nth codepoint in a string in constant time, so there's not really any point in storing a vector of codepoints.
A sad commentary on the state of Emacs usage, "nobody".

The theory is that accessing the first character of a region in a string often occurs as a primitive operation in O(N) or worse algorithms, sometimes without enough locality at the "collection of regions" level to give a reasonably small average access time.

In practice, any *Emacs user can tell you that yes, we do need to be able to access the Nth codepoint in a buffer in constant time. The O(N) behavior of current Emacs implementations means that people often use a binary coding system on large files. Yes, some position caching is done, but if you have a large file (eg, a mail file) which is virtually segmented using pointers to regions, locality gets lost. (This is not a design bug, this is a fundamental requirement: consider fast switching between threaded view and author-sorted view.) And of course an operation that sorts regions in a buffer using character pointers will have the same problem.

Working with memory pointers, OTOH, sucks more than that; GNU Emacs recently bit the bullet and got rid of their higher-level memory-oriented APIs, all of the Lisp structures now work with pointers, and only the very low-level structures know about character-to-memory pointer translation.

This performance issue is perceptible even on 3GHz machines with not so large (50MB) mbox files. It's *horrid* if you do something like "occur" on a 1GB log file, then try randomly jumping to detected log entries.
On Nov 23, 2010, at 9:44 PM, Stephen J. Turnbull wrote:
James Y Knight writes:
You put a smiley, but, in all seriousness, I think that's actually the right thing to do if anyone writes a new programming language. It is clearly the right thing if you don't have to be concerned with backwards-compatibility: nobody really needs to be able to access the Nth codepoint in a string in constant time, so there's not really any point in storing a vector of codepoints.
A sad commentary on the state of Emacs usage, "nobody".
The theory is that accessing the first character of a region in a string often occurs as a primitive operation in O(N) or worse algorithms, sometimes without enough locality at the "collection of regions" level to give a reasonably small average access time.
I'm not sure what you mean by "the theory is". Whose theory? About what?
In practice, any *Emacs user can tell you that yes, we do need to be able to access the Nth codepoint in a buffer in constant time. The O(N) behavior of current Emacs implementations means that people often use a binary coding system on large files. Yes, some position caching is done, but if you have a large file (eg, a mail file) which is virtually segmented using pointers to regions, locality gets lost. (This is not a design bug, this is a fundamental requirement: consider fast switching between threaded view and author-sorted view.)
Sounds like a design bug to me. Personally, I'd implement "fast switching between threaded view and author-sorted view" the same way I'd address any other multiple-views-on-the-same-data problem. I'd retain data structures for both, and update them as the underlying model changed. These representations may need to maintain cursors into the underlying character data, if they must retain giant wads of character data as an underlying representation (arguably the _main_ design bug in Emacs, that it encourages you to do that for everything, rather than imposing a sensible structure), but those cursors don't need to be code-point counters; they could be byte offsets, or opaque handles whose precise meaning varied with the potentially variable underlying storage. Also, please remember that Emacs couldn't be implemented with giant Python strings anyway: crucially, all of this stuff is _mutable_ in Emacs.
And of course an operation that sorts regions in a buffer using character pointers will have the same problem. Working with memory pointers, OTOH, sucks more than that; GNU Emacs recently bit the bullet and got rid of their higher-level memory-oriented APIs, all of the Lisp structures now work with pointers, and only the very low-level structures know about character-to-memory pointer translation.
This performance issue is perceptible even on 3GHz machines with not so large (50MB) mbox files. It's *horrid* if you do something like "occur" on a 1GB log file, then try randomly jumping to detected log entries.
Case in point: "occur" needs to scan the buffer anyway; you can't do better than linear time there. So you're going to iterate through the buffer, using one of the techniques that James proposed, and remember some locations. Why not just have those locations be opaque cursors into your data? In summary: you're right, in that James missed a spot. You need bidirectional, *copyable* iterators that can traverse the string by byte, codepoint, grapheme, or decomposed glyph.
Note that I'm not saying that there shouldn't be a UTF-8 string type; I'm just saying that for some purposes it might be a good idea to keep UTF-16 and UTF-32 string types around. Glyph Lefkowitz writes:
The theory is that accessing the first character of a region in a string often occurs as a primitive operation in O(N) or worse algorithms, sometimes without enough locality at the "collection of regions" level to give a reasonably small average access time.
I'm not sure what you mean by "the theory is". Whose theory? About what?
Mine. About why somebody somewhere someday would need fast random access to character positions. "Nobody ever needs that" is a strong claim.
In practice, any *Emacs user can tell you that yes, we do need to be able to access the Nth codepoint in a buffer in constant time. The O(N) behavior of current Emacs implementations means that people often use a binary coding system on large files. Yes, some position caching is done, but if you have a large file (eg, a mail file) which is virtually segmented using pointers to regions, locality gets lost. (This is not a design bug, this is a fundamental requirement: consider fast switching between threaded view and author-sorted view.)
Sounds like a design bug to me. Personally, I'd implement "fast switching between threaded view and author-sorted view" the same way I'd address any other multiple-views-on-the-same-data problem. I'd retain data structures for both, and update them as the underlying model changed.
Um, that's precisely the design I'm talking about. But as you recognize later, the message content is not part of those structures because there's no real point in copying it *if you have fast access to character positions*. In a variable-width-character, character-addressed design, there can be a perceptible delay in accessing even the "next" message's content if you're in the wrong view.
These representations may need to maintain cursors into the underlying character data, if they must retain giant wads of character data as an underlying representation (arguably the _main_ design bug in Emacs, that it encourages you to do that for everything, rather than imposing a sensible structure), but those cursors don't need to be code-point counters; they could be byte offsets, or opaque handles whose precise meaning varied with the potentially variable underlying storage.
Both byte offsets and opaque handles really really suck to design, implement, and maintain, if Lisp- or Python-level users can use them. They're hard enough to do when you can hide them behind internal APIs, but if they're accessible to users they're an endless source of user bugs. What was that you were saying about the difficulty of remembering which argument is the fd? It's like that. Sure, you can design APIs to help get that right, but it's not easy to provide one that can be used for all the different applications out there.
Also, please remember that Emacs couldn't be implemented with giant Python strings anyway: crucially, all of this stuff is _mutable_ in Emacs.
No, that's a red herring. The use-cases where Emacs users complain most are browsing giant logs and reading old mail; neither needs the content to be mutable (although of course it's a convenience in the mail case if you delete messages or fetch new mail, but that could be done with transaction logs that get appended to the on-disk file).
Case in point: "occur" needs to scan the buffer anyway; you can't do better than linear time there. So you're going to iterate through the buffer, using one of the techniques that James proposed, and remember some locations. Why not just have those locations be opaque cursors into your data?
They are. But unless you're willing to implement correct character motion, they need to be character indices, which will be slow to access the actual locations. We've implemented caches, as does Emacs, but they don't always get hits. Finding an arbitrary position once can involve a perceptible delay on up to 1GHz machines; doing it in a loop (which mail programs have a habit of doing) could be very painful.
In summary: you're right, in that James missed a spot. You need bidirectional, *copyable* iterators that can traverse the string by byte, codepoint, grapheme, or decomposed glyph.
That's a good start, yes. But once you talk about "remembering some locations", you're implicitly talking about random access. Either you maintain position indexes which, naively implemented, can easily be close to the size of the text buffer (indexes are going to be at least 4 bytes, possibly 8, per position, and something like "occur" can generate a lot of positions) -- in which case you might as well just use a representation that is an array in the first place -- or you need to implement a position cache which can be very hairy to do well. Or you can give user programs memory indices, and enjoy the fun as the poor developers do things like "pos += 1" which works fine on the ASCII data they have lying around, then wonder why they get Unicode errors when they take substrings. I'm sure it all can be done, but I don't think it will be done right the first time around.

You may be right that designs better adapted to large data sets than Emacs's "everything is a big buffer" will almost always be available with reasonable effort. But remember, a lot of good applications start small, when a flat array might make lots of sense as the underlying structure, and then need to scale. If you need to scale for the paying customers, well, "ouch!" but you can afford it, but for many volunteer or startup projects that takes the wind right out of your sails.

Note that if the user doesn't use private space, in a UCS-2 build you have about 1.5K code points available for compressing non-BMP characters into a 2-byte, valid Unicode representation (of course you need to save off the table somewhere if that ever gets out of your program, but that's easy). I find it hard to imagine that there will be many use-cases that need more than that many non-BMP characters. So probably you can tell those few users who care to use a UCS-4 build; most of the array use-cases can be served by UCS-2.

Note that in my Japanese corpuses, UTF-8 averages just about 2 bytes per character anyway, and those are mail files, where two lines of Japanese may be preceded by 2KB of ASCII-only header. I suspect Hebrew, Arabic, and Cyrillic users will have similar experiences.

By the way, to send the ball back into your court, I have this feeling that the demand for UTF-8 is once again driven by native English speakers who are very shortly going to find themselves, and the data they are most familiar with, very much in the minority. Of course the market that benefits from UTF-8 compression will remain very large for the immediate future, but in the grand scheme of things, most of the world is going to prefer UTF-16 by a substantial margin. N.B. I'm not talking about persistent storage, where it's 6 of one and half a dozen of the other; you can translate UTF-8 to UTF-16 way faster than you can read content from disk, of course.
On Nov 24, 2010, at 12:07 AM, Stephen J. Turnbull wrote:
Or you can give user programs memory indicies, and enjoy the fun as the poor developers do things like "pos += 1" which works fine on the ASCII data they have lying around, then wonder why they get Unicode errors when they take substrings.
a) You seem to be hung up on implementation details of emacs. But yes, positions should be stored as a byte offset into the utf8 string, NOT as a number of codepoints since the beginning of the string. Probably you want it to be somewhat opaque, so that you actually have to specify whether you wanted to go to +1 byte, codepoint, or grapheme.

b) Those poor developers are *already* screwed if they're using pos += 1 when pos is a codepoint index and they then take a substring based on that! They will get half a character when the string contains combining characters... Pretending that "codepoints" are a useful abstraction just makes poor developers get by without doing the correct thing (incrementing to the next grapheme boundary) for a little bit longer. But once you [the language implementor] are providing correct abstractions for grapheme movement, it's just as easy to also provide an abstraction for codepoint movement, and make your low-level implementation of the iterator object be a byte-offset into a UTF8 buffer. James
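[A toy version of that opaque-marker idea - all names hypothetical; a real design would add grapheme-aware movement on top:

class Marker:
    # A position in a UTF-8 buffer, held internally as a byte offset.
    def __init__(self, buf, offset=0):
        self.buf, self.offset = buf, offset

    def next_codepoint(self):
        # Step over the current lead byte, then any continuation bytes.
        i = self.offset + 1
        while i < len(self.buf) and (self.buf[i] & 0xC0) == 0x80:
            i += 1
        return Marker(self.buf, i)

m = Marker('a\u00e9b'.encode('utf-8')).next_codepoint()
print(m.offset)  # 1 -- the two-byte 'é' starts at byte 1
]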
James Y Knight writes:
a) You seem to be hung up on implementation details of emacs.
Hung up? No. It's the program whose text model I know best, and even if its design could theoretically be a lot better for this purpose, I can't say I've seen a real program whose model is obviously better for the purpose of a language for implementing text editors.[1] So it's not obvious to me that its model can be ruled out on a priori grounds. If not, it would be nice if your new language could implement it efficiently without contorted programming.
But yes, positions should be stored as an byte offset into the utf8 string. NOT as number of codepoints since the beginning of the string. Probably you want it to be somewhat opaque, so that you actually have to specify whether you wanted to go to +1 byte, codepoint, or grapheme.
Well, first of all, +1 byte should not be available to a text iterator, at least not with the same iterator/position object that implements character and/or grapheme movement. (You seem to have thought about this issue a lot, but mixing bytes with text units makes me wonder how much practical implementation you've done.)

Second, incrementing to grapheme boundaries is relatively easy to do efficiently, just as incrementing to a UTF-8 character boundary is easy to do. We already do the latter; the former is pragmatically harder, but not a conceptual stretch.

That's not the question. The question is how do we identify an arbitrary position in the text? Sometimes it's nice to have a numerical measure of size or location. It is not obvious that position by grapheme count is going to be the obvious way to determine position in a text. Eg, for languages with variable metric characters, character counts as a way of lining up table columns is going the way of Tyrannosaurus. In the Han-using languages, yes, column counts within lines are going to be important forever, because the characters are literally square for most practical purposes ... but they don't use composing characters (all the Japanese kana are precomposed, for example), so position by grapheme is going to be very close to position by character, and fine positioning will be done either by mouse or by incrementing the last few characters.

Nor do I think operations like "advance 1,000,000 characters" will have less meaning than "advance 1,000,000 graphemes." Both of them are just a way of saying "go way far away", end up in about the same place, and where there's a bias, it will be pretty consistent in a statistical sense for any given natural language (and therefore, for 99% of users).
But once you [the language implementor] are providing correct abstractions for grapheme movement, it's just as easy to also provide an abstraction for codepoint movement, and make your low-level implementation of the iterator object be a byte-offset into a UTF8 buffer.
Sure, that's fine for something that just iterates over the text. But if you actually need to remember positions, or regions, to jump to later or to communicate to other code that manipulates them, doing this stuff the straightforward way (just copying the whole iterator object to hang on to its state) becomes expensive. You end up proliferating types that all do the same kind of thing. Judicious use of inheritance helps, but getting the fundamental abstraction right is hard. Or at least, Emacs hasn't found it in 20 years of trying. OTOH, all that stuff "just works", and just works efficiently, up to the grapheme vs. character issue, with an array.

About that issue, to go back to tired old Emacs, *all* of the things I can think of that I might want to do by grapheme (display, insert, delete, move a few places) do fit the "increment until done" model. These things already work quite well for the variable-width buffer that "multilingual" Emacsen use, whether the old Mule encoding or UTF-8. So I can see how the UTF-8 model with appropriate iterators for characters and graphemes can work well for lots of applications and use cases.

But Emacs already has opaque "markers", yet nevertheless the use of integer character positions in strings and buffers has survived. That *may* have to do with mutability, and the "all the world is a buffer" design, as Glyph suggested, but I think it more likely that markers are very expensive to create and use compared to integers. Perhaps an editor of power similar to Emacs could be implemented with string operations on lines, or the like, and these issues would go away. But it's not obvious to me.

Footnotes: [1] Yes, I know that not all programs are text editors. So shoot me. It's still the text manipulation program I know best, and it's not obvious to me that it's the unique class that would need these features.
On 24/11/10 22:03, Stephen J. Turnbull wrote:
But if you actually need to remember positions, or regions, to jump to later or to communicate to other code that manipulates them, doing this stuff the straightforward way (just copying the whole iterator object to hang on to its state) becomes expensive.
If the internal representation of a text pointer (I won't call it an iterator because that means something else in Python) is a byte offset or something similar, it shouldn't take up any more space than a Python int, which is what you'd be using anyway if you represented text positions by grapheme indexes or whatever. If you want the text pointer to also remember which string it points into, it'll be a bit bigger, but again, no bigger than you would need to get the same functionality using a grapheme index plus a reference to the original string. Probably smaller, because it would all be encapsulated in one object. So I don't really see what you're arguing for here. How do *you* think positions in unicode strings should be represented? -- Greg
Greg Ewing writes:
On 24/11/10 22:03, Stephen J. Turnbull wrote:
But if you actually need to remember positions, or regions, to jump to later or to communicate to other code that manipulates them, doing this stuff the straightforward way (just copying the whole iterator object to hang on to its state) becomes expensive.
If the internal representation of a text pointer (I won't call it an iterator because that means something else in Python) is a byte offset or something similar, it shouldn't take up any more space than a Python int, which is what you'd be using anyway if you represented text positions by grapheme indexes or whatever.
That's not necessarily true. Eg, in Emacs ("there you go again"), Lisp integers are not only immediate (saving one pointer), but the type is encoded in the lower bits, so that there is no need for a type pointer -- the representation is smaller than the opaque marker type. Altogether, up to 8 of 12 bytes saved on a 32-bit platform, or 16 of 24 bytes on a 64-bit platform.

In Python it's true that markers can use the same data structure as integers and simply provide different methods, and it's arguable that Python's design is better. But if you use bytes internally, then you have problems. Do you expose that byte value to the user? Can users (programmers using the language and end users) specify positions in terms of byte values? If so, what do you do if the user specifies a byte value that points into a multibyte character? What if the user wants to specify position by number of characters? Can you translate efficiently?

As I say elsewhere, it's possible that there really never is a need to efficiently specify an absolute position in a large text as a character (grapheme, whatever) count. But I think it would be hard to implement an efficient text-processing *language*, eg, a Python module for *full conformance* in handling Unicode, on top of UTF-8. Any time you have an algorithm that requires efficient access to arbitrary text positions, you'll spend all your skull sweat fighting the representation. At least, that's been my experience with Emacsen.
So I don't really see what you're arguing for here. How do *you* think positions in unicode strings should be represented?
I think what users should see is character positions, and they should be able to specify them numerically as well as via an opaque marker object. I don't care whether that position is represented as bytes or characters internally, except that the experience of Emacsen is that representation as byte positions is both inefficient and fragile. The representation as character positions is more robust but slightly more inefficient.
On Nov 24, 2010, at 10:55 PM, Stephen J. Turnbull wrote:
Greg Ewing writes:
On 24/11/10 22:03, Stephen J. Turnbull wrote:
But if you actually need to remember positions, or regions, to jump to later or to communicate to other code that manipulates them, doing this stuff the straightforward way (just copying the whole iterator object to hang on to its state) becomes expensive.
If the internal representation of a text pointer (I won't call it an iterator because that means something else in Python) is a byte offset or something similar, it shouldn't take up any more space than a Python int, which is what you'd be using anyway if you represented text positions by grapheme indexes or whatever.
That's not necessarily true. Eg, in Emacs ("there you go again"), Lisp integers are not only immediate (saving one pointer), but the type is encoded in the lower bits, so that there is no need for a type pointer -- the representation is smaller than the opaque marker type. Altogether, up to 8 of 12 bytes saved on a 32-bit platform, or 16 of 24 bytes on a 64-bit platform.
Yes, yes, lisp is very clever. Maybe some other runtime, like PyPy, could make this optimization. But I don't think that anyone is filling up main memory with gigantic piles of character indexes and needs to squeeze out that extra couple of bytes of memory on such a tiny object. Plus, this would allow such a user to stop copying the character data itself just to decode it, and on mostly-ascii UTF-8 text (a common use-case) this is a 2x savings right off the bat.
In Python it's true that markers can use the same data structure as integers and simply provide different methods, and it's arguable that Python's design is better. But if you use bytes internally, then you have problems.
No, you just have design questions.
Do you expose that byte value to the user?
Yes, but only if they ask for it. It's useful for computing things like quota and the like.
Can users (programmers using the language and end users) specify positions in terms of byte values?
Sure, why not?
If so, what do you do if the user specifies a byte value that points into a multibyte character?
Go to the beginning of the multibyte character. Report that position; if the user then asks the requested marker object for its position, it will report that byte offset, not the originally-requested one. (Obviously, do the same thing for surrogate pair code points.)
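[The snapping Glyph describes is a two-line loop for UTF-8 - a sketch that assumes well-formed data, with a name of my choosing:

def snap_to_lead_byte(buf, pos):
    # Back up past continuation bytes (10xxxxxx) to the lead byte of
    # the code point containing byte offset pos.
    while pos > 0 and (buf[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos

buf = '\u00e9'.encode('utf-8')      # b'\xc3\xa9'
assert snap_to_lead_byte(buf, 1) == 0
]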
What if the user wants to specify position by number of characters?
Part of the point that we are trying to make here is that nobody really cares about that use-case. In order to know anything useful about a position in a text, you have to have traversed to that location in the text. You can remember interesting things like the offsets of starts of lines, or the x/y positions of characters.
Can you translate efficiently?
No, because there's no point :). But you _could_ implement an overlay that cached things like the beginning of lines, or the x/y positions of interesting characters.
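[Such an overlay can be as simple as a sorted array of newline offsets plus a binary search - a sketch; building the index is one linear scan:

import bisect

def line_index(data):
    # Record the byte offset just past each newline.
    return [i + 1 for i, b in enumerate(data) if b == 0x0A]

def line_of_offset(index, offset):
    # Which (0-based) line does this byte offset fall on?
    return bisect.bisect_right(index, offset)

data = b'one\ntwo\nthree\n'
idx = line_index(data)
assert line_of_offset(idx, 5) == 1  # byte 5 is inside 'two'
]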
As I say elsewhere, it's possible that there really never is a need to efficiently specify an absolute position in a large text as a character (grapheme, whatever) count.
But I think it would be hard to implement an efficient text-processing *language*, eg, a Python module for *full conformance* in handling Unicode, on top of UTF-8.
Still: why? I guess if I have some free time I'll try my hand at it, and maybe I'll run into a wall and realize you're right :).
Any time you have an algorithm that requires efficient access to arbitrary text positions, you'll spend all your skull sweat fighting the representation. At least, that's been my experience with Emacsen.
What sort of algorithm would that be, though? The main thing that I could think of is a text editor trying to efficiently allow the user to scroll to the middle of a large file without reading the whole thing into memory. But, in that case, you could use byte positions to estimate, and display a heuristic number while calculating the real line numbers. (This is what 'less' does, and it seems to work well.)
So I don't really see what you're arguing for here. How do *you* think positions in unicode strings should be represented?
I think what users should see is character positions, and they should be able to specify them numerically as well as via an opaque marker object. I don't care whether that position is represented as bytes or characters internally, except that the experience of Emacsen is that representation as byte positions is both inefficient and fragile. The representation as character positions is more robust but slightly more inefficient.
Is it really the representation as byte positions which is fragile (i.e. the internal implementation detail), or the exposure of that position to calling code, and the idiomatic usage of that number as an integer?
Glyph Lefkowitz writes:
But I don't think that anyone is filling up main memory with gigantic piles of character indexes and need to squeeze out that extra couple of bytes of memory on such a tiny object.
How do you think editors and browsers represent the regions that they highlight, then? How do you think that structure-oriented editors represent the structures that they work with, then? In a detailed analysis of a C or Java file, it's easy to end up with almost a 1:2 positions-to-characters ratio. Note that *buffer* characters are typically smaller than a platform word, so saving one word in the representation of a position means a 100% or more increase in the character count of the buffer. Even in the case of UCS-4 on a 32-bit platform, that's a 50% increase in the maximum usable size of a buffer before a parser starts raising OOM errors.

There are two plausible ways to represent these structures that I can think of offhand. The first is to do it the way Emacs does, by reading the text into a buffer and using position offsets to map to display or structure attributes. The second is to use a hierarchical document model, and render the display by traversing the document hierarchy. It's not obvious to me that forcing use of the second representation is a good idea for performance in an editor, and I would think that they have similar memory requirements.
Plus, this would allow such a user to stop copying the character data itself just to decode it, and on mostly-ascii UTF-8 text (a common use-case) this is a 2x savings right off the bat.
Which only matters if you're a server in the business of shoveling octets really fast but are CPU bound (seems unlikely to me, but I'm no expert; WDYT?), and even then it is only that big a savings if you can push off the issue of validating the purported UTF-8 text on others. If you're not validating, you may as well acknowledge that you're processing binary data, not text.[1] But we're talking about text.

And of course, if you copy mostly-Han UTF-8 text (a common use-case) to UCS-2, this is a 1.5x memory savings right off the bat, and a 3x time savings when iterating in most architectures (one increment operation per character instead of three).

As I've already said, I don't think this is an argument in favor of either representation. Sometimes one wins, sometimes the other. I don't think supplying both is a great idea, although I've proposed it myself for XEmacs (but made as opaque as possible).
In Python it's true that markers can use the same data structure as integers and simply provide different methods, and it's arguable that Python's design is better. But if you use bytes internally, then you have problems.
No, you just have design questions.
Call them what you like, they're as yet unanswered. In any given editing scenario, I'd concede that it's a "SMOD". But if you're designing a language for text processing, it's a restriction that I believe to be a hindrance to applications. Many applications may prefer to use a straightforward array implementation of text and focus their design efforts on the real problems of their use cases.
Do you expose that byte value to the user? If so, what do you do if the user specifies a byte value that points into a multibyte character?
Go to the beginning of the multibyte character. Report that position; if the user then asks the requested marker object for its position, it will report that byte offset, not the originally-requested one. (Obviously, do the same thing for surrogate pair code points.)
I will guarantee that some use cases will prefer that you go to the beginning of the *next* character. For an obvious example, your algorithm will infloop if you iterate "pos += 1". (And the opposite problem appears for "beginning of next character" combined with "pos -= 1".) Of course this trivial example is easily addressed by saying "the user should be using the character iterator API here", but I expect the issue can arise where that is not an easy answer. Either the API becomes complex, or the user/developers will have to do complex bookkeeping that should be done by the text implementation. Nor is it obvious that surrogate pairs will be present in a UCS-2 representation. Specifically, they can be encoded to single private space characters in almost all applications, at a very small cost in performance.
What if the user wants to specify position by number of characters?
Part of the point that we are trying to make here is that nobody really cares about that use-case. In order to know anything useful about a position in a text, you have to have traversed to that location in the text.
Binary search of an ordered text is useful. Granted, this particular example can be addressed usefully in terms of byte positions (viz. your example of less), but your basic premise is falsified.
You can remember interesting things like the offsets of starts of lines, or the x/y positions of characters.
Can you translate efficiently?
No, because there's no point :). But you _could_ implement an overlay that cached things like the beginning of lines, or the x/y positions of interesting characters.
Emacs does, and a lot of effort has gone into it, and it still sucks compared to an array representation. Maybe _you_ _could_ do better, but as yet we haven't managed to pull it off. :-(
But I think it would be hard to implement an efficient text-processing *language*, eg, a Python module for *full conformance* in handling Unicode, on top of UTF-8.
Still: why? I guess if I have some free time I'll try my hand at it, and maybe I'll run into a wall and realize you're right :).
I'd rather have you make it plausible to me that there's no point in having efficient access to arbitrary character positions. Then maybe you can delegate that implementation to me. :-) But my Emacs experience says otherwise, and IIUC the intuition and/or experience of MAL and Guido says this is not a YAGNI.
Any time you have an algorithm that requires efficient access to arbitrary text positions, you'll spend all your skull sweat fighting the representation. At least, that's been my experience with Emacsen.
What sort of algorithm would that be, though? The main thing that I could think of is a text editor trying to efficiently allow the user to scroll to the middle of a large file without reading the whole thing into memory.
Reading into memory or not is a red herring, I think. For many legacy encodings you have to pretty much read the whole thing because they are stateful, and it's just not very expensive compared to the text processing itself (unless your application is shoveling octets as fast as possible, in which case character positions are indeed a YAGNI). The question is whether opaque markers are always sufficient. For example, XEmacs does use byte positions internally for markers and extents (objects representing regions of text that can carry arbitrary properties but are tuned for display properties). Obviously, we have the marker objects you propose as sufficient, and indeed the representation is as efficient as you claim. However, these positions are not exposed as integers to end users, Lisp, or even most of the C code. If a client (end user or code) requests a position, they get a character position. Such requests are frequent enough that they constitute a major drag on many practical applications. It may be that this is unnecessary, as less shows for its application. But less is not an editor, let alone a language for writing editors. Do you know of an editor language of power comparable to Emacs Lisp that is not based on an array representation of text?
Is it really the representation as byte positions which is fragile (i.e. the internal implementation detail), or the exposure of that position to calling code, and the idiomatic usage of that number as an integer?
It's the latter. Sufficient effort can make it safe to use byte positions, and the effort is not all that great as long as you don't demand efficiency. XEmacs vs. Emacs implementation of Mule demonstrates that.

We at XEmacs never did expose byte positions to even the C code (other than to buffer and string methods), and that implementation has not had to change much, if at all, in 15 years. The caching mechanism to make character position access reasonably efficient, however, has been buggy and not so efficient, and so complex that RMS said "I was going to implement your [position cache] in Emacs but it was too hard for me to understand". (OTOH, the alternative Emacs had implemented turned out to be O(n**2) or worse, so he had to replace it. Translating byte positions to character positions seems to be a real loser.)

Emacs did expose byte positions for efficiency reasons, and has had at least four regressions of the "\201 bug". "\201" prefixes a Latin-1 character in internal code, and code that treated byte positions as character positions would often result in this being duplicated, because all trailing bytes in Mule code are also Latin-1 code points. (Don't ask me about the exact mechanism; XEmacs's implementation is quite different and never suffered from this bug.)

Note that a \201-like bug is very unlikely to occur in Python's UCS-2 representation because the semantics of surrogate values in Unicode are unambiguous. However, I believe similar bugs would be possible in a UTF-8 representation -- if code is allowed to choose whether to view UTF-8 in binary or text mode -- because trailing byte values are Latin-1 code points.

Maybe I'm just an old granny, scared of my shadow.<wink>

Footnotes: [1] I have no objection to providing "text" algorithms (such as regexps) for use on "binary" data. But then they don't provide any guarantees that transformations of purported text remain text.
On Nov 24, 2010, at 4:03 AM, Stephen J. Turnbull wrote:
You end up proliferating types that all do the same kind of thing. Judicious use of inheritance helps, but getting the fundamental abstraction right is hard. Or at least, Emacs hasn't found it in 20 years of trying.
Emacs hasn't even figured out how to do general purpose iteration in 20 years of trying either. The easiest way I've found to loop across an arbitrary pile of 'stuff' is the CL 'loop' macro, which you're not even supposed to use. Even then, you still have to make the arcane and pointless distinction of using 'across' or 'in' or 'on'. Python, on the other hand, has iteration pretty well tied up nicely in a bow. I don't know how to respond to the rest of your argument. Nothing you've said has in any way indicated to me why having code-point offsets is a good idea, only that people who know C and elisp would rather sling around piles of integers than have good abstract types. For example:
I think it more likely that markers are very expensive to create and use compared to integers.
What? When you do 'for x in str' in Python, you are already creating an iterator object, which has to store the exact same amount of state that our proposed 'marker' or 'character pointer' would have to store. The proposed UTF-8 marker would have to do a tiny bit more work when iterating because it would have to combine multibyte characters, but in exchange for that you get to skip a whole ton of copying when encoding and decoding. How is this expensive to create and use? For every application I have ever designed, encountered, or can even conjecture about, this would be cheaper. (Assuming not just a UTF-8 string type, but one for UTF-16 as well, where native data is in that format already.)

For what it's worth, not wanting to use abstract types in Emacs makes sense to me: I've written my share of elisp code, and it is hard to create reasonable abstractions in Emacs, because the facilities for defining types and creating polymorphic logic are so crude. It's a lot easier to just assume your underlying storage is an array, because at the end of the day you're going to need to call some functions on it which care whether it's an array or an alist or a list or a vector anyway, so you might as well just say so up front. But in Python we could just call 'mystring.by_character()' or 'mystring.by_codepoint()' and get an iterator object back and forget about all that junk.
On Nov 24, 2010, at 12:07 AM, Stephen J. Turnbull wrote:
By the way, to send the ball back into your court, I have this feeling that the demand for UTF-8 is once again driven by native English speakers who are very shortly going to find themselves, and the data they are most familiar with, very much in the minority. Of course the market that benefits from UTF-8 compression will remain very large for the immediate future, but in the grand scheme of things, most of the world is going to prefer UTF-16 by a substantial margin.
No, the demand for UTF-8 is because that's what much of the internet (and not coincidentally, unix) world has standardized on. The main pieces of software using UTF-16 (Windows, Java) started doing so before it became apparent that 16 bits wasn't enough to actually hold a unicode codepoint, so they were actually implementing UCS-2. In those days, UCS-2 was a fairly sensible choice.

But, now, if your choices are UTF-8 or UTF-16, UTF-8 is clearly superior. Not because it's smaller -- it's pretty much a tossup -- but because it is an ASCII superset, and thus more easily compatible with other software. That also makes it most commonly used for internet communication. (So, there's a huge advantage for using it internally as well right there: no transcoding necessary for writing your HTML output).

UTF-16 is incompatible with ASCII, and furthermore, it's still a variable-width encoding, with all the same issues that causes. As such, there's really very little to be said in favor of it. If you really want a fixed-width encoding, you have to go to UTF-32, which is excessively large. UTF-32 is a losing choice, simply because of the wasted memory usage.

But that's all a side issue: even if you do choose UTF-16 as your underlying encoding, you *still* need to provide iterators that work by "byte" (only now bytes are 16-bits), by codepoint, and by grapheme. Of course, people who implement UTF-16 (such as python, java, and windows) often pretend they're still implementing UCS-2, and don't bother even providing their users with the necessary APIs to do things correctly. Which, you can often get away with...just so long as you don't mind that you sometimes end up splitting a string in the middle of a codepoint and causing a unicode error! James
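[The failure mode James mentions at the end is easy to reproduce on a narrow build - a sketch; on a wide build the slice keeps the whole character and encodes fine:

s = 'a\U00010140'       # 'a' plus one non-BMP character
half = s[:2]            # on a narrow build this ends with a lone high surrogate
half.encode('utf-8')    # encoding the lone surrogate raises UnicodeEncodeError
]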
James Y Knight writes:
But, now, if your choices are UTF-8 or UTF-16, UTF-8 is clearly superior [...] because it is an ASCII superset, and thus more easily compatible with other software. That also makes it most commonly used for internet communication.
Sure, UTF-8 is very nice as a protocol for communicating text. So what? If your application involves shoveling octets real fast, don't convert and shovel those octets. If your application involves significant text processing, well, conversion can almost always be done as fast as you can do I/O so it doesn't cost wallclock time, and generally doesn't require a huge percentage of CPU time compared to the actual text processing. It's just a specialization of serialization, that we do all the time for more complex data structures. So wire protocols are not a killer argument for or against any particular internal representation of text.
(So, there's a huge advantage for using it internally as well right there: no transcoding necessary for writing your HTML output).
I don't know your use cases but for mine, transcoding (whether in Lisp or Python or C) is invariably the least of my worries. *Especially* transcoding to UTF-8, which is the default codec for me, and I *never* mix bytes and text, so having not bothered to set the codec, I don't bother to transcode explicitly.
If you really want a fixed-width encoding, you have to go to UTF-32
Not really. I never bothered implementing the codec, because I haven't yet seen a non-BMP Unicode character in the wild (I still see a lot of non-Unicode characters, but hey, that's the price you pay for living in the land that invented sushi, sake, and anime). For most use cases, those are going to be rare, where by "rare" I mean "you aren't going to see 6400 *different* non-BMP characters."[1] So instead of having the codec produce UTF-16, you have it produce (Holy CEF, Batman!) "pure" UCS-2 with the non-BMP characters registered on demand and encoded in the BMP private area. Python, of course, will never know the difference, and your language won't need to care, either.
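[Stephen's on-demand registration into the private use area might look like this sketch - purely illustrative; the BMP private use area U+E000-U+F8FF supplies the 6400 slots his footnote mentions, and all names here are mine:

class PUARegistry:
    # Map non-BMP code points to BMP private-use code points on demand.
    def __init__(self):
        self.fwd, self.rev = {}, {}
        self.next_slot = 0xE000  # start of the BMP private use area

    def register(self, cp):
        if cp not in self.fwd:
            if self.next_slot > 0xF8FF:
                raise RuntimeError('private use area exhausted')
            self.fwd[cp] = self.next_slot
            self.rev[self.next_slot] = cp
            self.next_slot += 1
        return self.fwd[cp]

reg = PUARegistry()
assert reg.register(0x10140) == 0xE000  # first non-BMP code point seen
assert reg.rev[0xE000] == 0x10140       # and it round-trips
]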
But that's all a side issue: even if you do choose UTF-16 as your underlying encoding, you *still* need to provide iterators that work by "byte" (only now bytes are 16-bits), by codepoint,
Nope, see above. Codepoints can be bytes and vice versa. The needed codec is no harder to use than any other codec, and only slightly less efficient than the normal UTF-8 codec unless you're basically restricted to a rather uncommon script (and even then there are optimizations).
and by grapheme.
Sure, but as I point out elsewhere, the use cases where grapheme movement is distinguished from character movement I can come up with are all iterative, and I don't need array behavior for both anyway. So since I *can* have a character array in Unicode, and I *can't* have a grapheme array (except maybe by a scheme like the above), I'll go for the character array. Unless maybe you convince me I don't need it, but I'm yet to be convinced.
away with...just so long as you don't mind that you sometimes end up splitting a string in the middle of a codepoint and causing a unicode error!
I *do* mind, but I like Python anyway.<wink> Footnotes: [1] OK, in practice a lot of the private space will be taken by existing system characters, such as the Apple logo (absolutely essential for writing email on Mac, at least in Japan). Whose use-case is going to see 1000 different non-BMP characters in a session? I do know a couple of Buddhist dictionary editors, but aside from them, I can't think of anybody. Lara Croft, maybe.
On Wed, 24 Nov 2010 18:51:49 +0900
"Stephen J. Turnbull"
James Y Knight writes:
But, now, if your choices are UTF-8 or UTF-16, UTF-8 is clearly superior [...] because it is an ASCII superset, and thus more easily compatible with other software. That also makes it most commonly used for internet communication.
Sure, UTF-8 is very nice as a protocol for communicating text. So what? If your application involves shoveling octets real fast, don't convert and shovel those octets. If your application involves significant text processing, well, conversion can almost always be done as fast as you can do I/O so it doesn't cost wallclock time, and generally doesn't require a huge percentage of CPU time compared to the actual text processing. It's just a specialization of serialization, that we do all the time for more complex data structures.
So wire protocols are not a killer argument for or against any particular internal representation of text.
Agreed. Decoding and encoding utf-8 is so fast that it should be dwarfed by any actual processing done on the text. Regards Antoine.
Alexander Belopolsky writes:
Yet finding a bug in a str object method after a 5-minute review was a bit discouraging:
>>> 'xyz'.center(20, '\U00010140')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: The fill character must be exactly one character long
Given the apparent difficulty of writing even basic text processing algorithms in the presence of surrogate pairs, I wonder how wise it is to expose Python users to them.
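[For context, the center() failure above is presumably a direct consequence of the narrow-build representation: the fill "character" arrives as a string of two UTF-16 code units, and the method checks its length. A session illustrating this, assuming a narrow (sys.maxunicode == 65535) build of 3.1/3.2:

    >>> import sys; sys.maxunicode
    65535
    >>> fill = '\U00010140'
    >>> len(fill)       # one character, but two code units
    2
]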
"Consenting adults" applies here. What to do? Write tests, fix the stdlib. Raise the probability of surrogate pair tests in the fuzzer. But "expose the users to surrogate pairs in an efficient (ie, UCS-2) implementation" is a fundamental design principle of Python. Tightening up the internal implementation is -10 unacceptable IMO YMMV.
Again, given that the str object itself has at least one non-BMP character bug as we are closing in on the third major release of py3k, how likely are 3rd party developers to get their libraries right as they port to 3.x?
Not our problem, really. We need to fix the stdlib, but 3rd party libraries know what they're doing. I guess we could provide a fuzztest module that generates known nasty data (zero, very big numbers, "\x00", "\U00010140", etc) that people would be able to plug in as a data source for their own code. Of course that doesn't replace conventional unittests based on analysis of edge cases and tests designed to tickle them, but it would be a start for many projects.
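[A minimal sketch of such a fuzz-data source, assuming nothing beyond the values listed above; the module name and its contents are illustrative, not an actual stdlib proposal:

    # nastydata.py -- illustrative sketch, not a real module.
    # Values chosen to tickle common edge cases.

    NASTY_STRINGS = [
        "",               # empty string
        "\x00",           # NUL
        "\ud800",         # lone high surrogate
        "\U00010140",     # non-BMP character (a surrogate pair on narrow builds)
        "\U0010ffff",     # largest code point
    ]

    NASTY_NUMBERS = [0, -1, 2**31 - 1, 2**63, 10**100]

    def nasty_values():
        """Yield values that commonly expose edge-case bugs."""
        for value in NASTY_STRINGS + NASTY_NUMBERS:
            yield value
]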
R. David Murray writes:
I'm sorry, but I have to disagree. As a relative unicode ignoramus, "UCS-2" and "UCS-4" convey almost no information to me, and the bits I have heard about them on this list have only confused me.
OK, point taken.
On the other hand, I understand that 'narrow' means that fewer bytes are used for each internal character, meaning that some unicode characters need to be represented by more than one string element, and thus that slicing strings containing such characters on a narrow build causes problems. Now, you could tell me the same information using the terms 'UCS-2' and 'UCS-4' instead of 'narrow' and 'wide', but to my ear 'narrow' and 'wide' convey a better gut level feeling for what is going on than 'UCS-2' and 'UCS-4' do.
I think that is probably conditioned by your long experience with Python's Unicode features, specifically the knowledge that Python's Unicode strings are not arrays of characters, which often is referred to on this list. My guess is that very few newbies would know that, and it is not implied by "narrow". For example, both Emacs (for sure) and Perl (IIUC) index strings of variable-width characters by character (at great expense of performance in Emacs, at least), not by code unit.
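[To make the slicing problem mentioned above concrete, here is what a narrow build does with a string containing one non-BMP character (a session assuming sys.maxunicode == 65535; a wide build would answer 3, '\U00010140' and 'a\U00010140' respectively):

    >>> s = 'a\U00010140b'
    >>> len(s)          # 3 characters, but 4 code units
    4
    >>> s[1]            # a lone high surrogate, not a character
    '\ud800'
    >>> s[:2]           # the slice splits the surrogate pair
    'a\ud800'
]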
And it avoids any question of whether or not Python's internal representation actually conforms to whatever standard it is that UCS refers to, a point on which there seems to be some dissension.
UCS-2 refers to ISO 10646, Annex 1 IIRC.[1] Anyway, it's somewhere in ISO 10646. I don't think there's actually dissension on conformance to UCS-2, as that's very easy to achieve. Rather, Guido explicitly pronounced that Python processes arrays of code units, not characters. My point is that if you pretend that Python is processing *characters* according to UCS-2 rules for characters, you'll always come to the same conclusion about what Python will do as if you use the technically correct terminology of code units. (At least for the BMP and UTF-16 private areas. There will necessarily be some confusion about surrogates, since in UCS-2 they are characters while in UTF-16 they're merely "code points", and the Unicode characters they represent can't be represented at all in UCS-2.)
Indeed, reading that article with my limited unicode knowledge, if I were told Python used UCS-2, I would assume that non-BMP characters could not be processed by a Python narrow build.
Actually, I'm almost happy with that. That is, the precise formulation is "could not be processed *safely without extra care* by a Python narrow build." Specifically, AFAIK if you range check characters that have been indexed out of a string, or are located at slice boundaries, or produced by chr() or a surrogateescape input codec, you're safe.

But practically speaking few apps will actually do those checks and therefore they are unsafe: processing non-BMP characters can easily lead to show-stopping Exceptions. It's very analogous to the kind of show-stopping "bad character in a header" exception that plagued Mailman for so long, and had to be fixed on a case-by-case basis. But the restriction to BMP characters is much more reasonable (at least for now) than RFC 822's restriction to ASCII!

But evidently you take it much more stringently. So the question is, "what fraction of developers who think as you do would therefore be put off from using Python to build their applications?" If most would say "OK, we'll stick with BMP for now and use UCS-4 or some hack to deal with extended characters later -- it can't really be true that it's absolutely impossible to use non-BMP characters," I don't mind that misunderstanding. OTOH, yes, it would be bad if the use of "UCS-2" were to imply to more than a couple of developers that 16-bit builds of Python can't handle UTF-16 *at all*.

Footnotes:

[1] It simply says "we have a subset of the Unicode character set all of whose code points can be represented in 16 bits, excluding 0xFFFF." It goes on to define a private area, reserved for use by applications that will never be standardized, and it says that if you don't know what a code point in the character area is, don't change it (you can delete it, however). ISTR that a later Amendment added 0xFFFE to the short-list of non-characters. The surrogate area was taken out of the private area, so a UCS-2 application will simply consider each surrogate to be an unknown character and pass it through unchanged -- unless it deletes it, or inserts other characters between the code points of a surrogate pair. And that's why UCS-2 isn't UTF-16 conforming -- which is basically why Python isn't either.
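[The "range check" Stephen mentions might look like the following sketch; the helper name is invented, and it merely flags code units in the surrogate range:

    def is_lone_surrogate(ch):
        """True for a code unit in the surrogate range (U+D800..U+DFFF),
        as produced by indexing/slicing non-BMP text on a narrow build,
        by chr(), or by a surrogateescape input codec."""
        return 0xD800 <= ord(ch) <= 0xDFFF

Code that indexes single characters out of a string would test each one with something like this before treating it as a complete character.]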
I disagree. Python does "conform" to "UTF-16"
I'm sure the codecs do. But the Unicode standard doesn't care about the parts of the process, it cares about what it does as a whole.
Chapter and verse?
Python's internal coding does not conform to UTF-16, and that internal coding can, under certain conditions, escape to the outside world as invalid "Unicode" output.
I'm fairly certain there are provisions in the Unicode standard for such behavior (taking into account "certain conditions").
What behavior specifically do you consider non-conforming, and what specific specification do you think it is not conforming to? For example, it *is* fully conforming with UTF-8.
Oh,
f = open('/tmp/broken','wt',encoding='utf8',errors='surrogateescape')
f.write(chr(int('dc80',16)))
f.close()
for one. That produces a non-UTF-8 file
Right. You are using an API that does not promise to create UTF-8, and hence isn't UTF-8. The Unicode standard certainly allows implementations to use character encoding schemes other than UTF-8; this one being "UTF-8 with surrogate escapes", which is different from "UTF-8" (IANA MIBEnum 106).
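[Concretely, the surrogateescape handler maps U+DC80 back to the single byte 0x80, so the file contains data that a strict UTF-8 decoder rejects. Continuing the session above:

    >>> open('/tmp/broken', 'rb').read()
    b'\x80'
    >>> open('/tmp/broken', encoding='utf8').read()
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
    invalid start byte
]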
You can say, "oh, but that's not really a UTF-8 codec", and I'd agree.
See above :-)
Nevertheless, the program is able to produce output from internal "Unicode" strings that does not conform to Unicode at all.
*Any* Unicode implementation will do that, since they all have to support legacy encodings in some form. This is certainly conforming to the Unicode standard, and in fact one of the primary Unicode design principles.
A Unicode-conforming Python implementation would error at the chr() call, or perhaps would not provide surrogateescape error handlers.
Chapter and verse?
"Although practicality beats purity."
The Unicode standard itself is based on practicality. It wouldn't have received the success it did if it was based on purity only (and indeed, was often rejected in cases where it put purity over practicality, e.g. with the Hangul syllables). Regards, Martin
"Martin v. Löwis" writes:
Chapter and verse?
Unicode 5.0, Chapter 3, verse C9:

    When a process generates a code unit sequence which purports to be in a Unicode character encoding form, it shall not emit ill-formed code sequences.

I think anything called "UTF-8 something" is likely to be taken to "purport". Furthermore, users don't necessarily see which error handlers are being used. A user who specifies "utf8" as the output codec is likely to be rather surprised if non-UTF-8 is emitted because the app specified surrogateescape. Eg, consider a script which munges file descriptions into reasonable-length file names on Unix. Yes, technically the non-Unicode output is the app's fault, but I expect many users will put some blame on Python.

I am in full agreement with you about the technicalities, but I am looking for ways to clue in users that (a) the technicalities matter, and (b) that Python does a *very* good job of making things as safe as possible without becoming unable to handle bytes.

I think "wide" vs. "narrow" fails at both. It focuses on storage issues, which of course are important, but at the cost of ignoring the fact that for users of non-BMP characters 32-bit code units are much safer. Users who need non-BMP characters are relatively few, and at least at the present time most are painfully aware of the need to care for technicalities. I expect them to be pleasantly surprised by how easy it is to get reasonably safe behavior even from a 16-bit build.
Python's internal coding does not conform to UTF-16, and that internal coding can, under certain conditions, escape to the outside world as invalid "Unicode" output.
I'm fairly certain there are provisions in the Unicode standard for such behavior (taking into account "certain conditions").
Sure. There's nothing in the Unicode standard that says you have to conform to it unless you claim to conform to it. So it is valid to say that Python's Unicode codecs without surrogateescape do conform.

The point is that Python does not, even if all of the input is valid Unicode, because of the provision of surrogateescape and the lack of Unicode conformance-checking for certain internal functionality like chr() and slicing.

You can say "we don't make any such claim", but IMO the distinction in question is too fine a point for most users, and requires a very large amount of Unicode knowledge (not to mention standards geekiness) to even understand the precise statement. "Unicode support" to users should mean that Python does the right thing, not that if you look hard enough in the documentation you will discover that Python doesn't claim to do the right thing even though in practice it mostly does.

IMO, "UCS-2" is a pretty good description of what the user can leave up to Python in perfect safety. RDM's reply worries me a little, but I'll reply to his message separately.
*Any* Unicode implementation will do that, since they all have to support legacy encodings in some form. This is certainly conforming to the Unicode standard, and in fact one of the primary Unicode design principles.
No. Support for legacy encodings takes you outside of the realm of Unicode conformance by definition. Their names tell you that, however. "UTF-8 with surrogate escapes" on the other hand is an entirely different kettle of fish. It pretends to be UTF-8, but isn't.

I think that users who give Python valid input should be able to expect valid output, but they can't. Chapter 3, verse C7:

    When a process purports not to modify the interpretation of a valid coded character sequence, it shall make no change to that coded character sequence other than the possible replacement of character sequences by their canonical-equivalent sequences, or the deletion of *noncharacter* code points.

Sure, you can tell users the truth: "Python may modify your Unicode characters if you slice or index Unicode strings. It may even silently turn them into invalid codes which will eventually raise Errors." Then you are conformant, but why would anyone want to use such a program?

If you tell them "UCS-2[sic] Python is safe to use with *no* extra care if you use only UCS-2 [or BMP] characters", suddenly Python looks very nice indeed again. "UCS-4" Python is even better; all you have to do is to avoid surrogateescape codecs. However, you're still vulnerable to hard-to-diagnose errors at the output stage in case of program bugs, because not enough checking of values is done by Python itself.
A Unicode-conforming Python implementation would error at the chr() call, or perhaps would not provide surrogateescape error handlers.
Chapter and verse?
Chapter 3, verse C9 again.
"Although practicality beats purity."
The Unicode standard itself is based on practicality. It wouldn't have received the success it did if it was based on purity only (and indeed, was often rejected in cases where it put purity over practicality, e.g. with the Hangul syllables).
Python practicality is very different from Unicode practicality.
Unicode 5.0, Chapter 3, verse C9:
When a process generates a code unit sequence which purports to be in a Unicode character encoding form, it shall not emit ill-formed code sequences.
A Unicode-conforming Python implementation would error at the chr() call, or perhaps would not provide surrogateescape error handlers.
Chapter and verse?
Chapter 3, verse C9 again.
I agree that the surrogateescape error handler is non-conforming, but, as you say, it doesn't claim to, either (would your concern about utf-8 being misleading here have been resolved if the thing had been called "utf-8b"?). More interesting (and closer to the subject) is chr(): how did you arrive at C9 banning Python3's definition of chr()? This chr function puts the code sequence into well-formed UTF-16; that's the whole point of UTF-16. Regards, Martin
"Martin v. Löwis" writes:
More interesting (and closer to the subject) is chr(): how did you arrive at C9 banning Python3's definition of chr()? This chr function puts the code sequence into well-formed UTF-16; that's the whole point of UTF-16.
No, it doesn't, in the specific case of surrogate code points. In 3.1.2 from MacPorts on an iBook G4 and from Gentoo on AMD64, chr(0xd800) returns "\ud800". I don't know if that's by design (eg, so that it can be used in the implementation of the surrogateescape error handler) or a correctable oversight, but it's not conformant.
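[The behavior is the same on wide builds, and the resulting lone surrogate is rejected by the strict utf-8 codec; a session under Python 3.1/3.2:

    >>> chr(0xd800)
    '\ud800'
    >>> chr(0xd800).encode('utf-8')
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800'
    in position 0: surrogates not allowed
]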
On 22.11.2010 11:47, Stephen J. Turnbull wrote:
"Martin v. Löwis" writes:
More interesting (and closer to the subject) is chr(): how did you arrive at C9 banning Python3's definition of chr()? This chr function puts the code sequence into well-formed UTF-16; that's the whole point of UTF-16.
No, it doesn't, in the specific case of surrogate code points. In 3.1.2 from MacPorts on a iBook G4 and from Gentoo on AMD64, chr(0xd800) returns "\ud800".
Ah, I see - this is *not* the subject's issue, right?
I don't know if that's by design (eg, so that it can be used in the implementation of the surrogateescape error handler) or a correctable oversight, but it's not conformant.
I disagree: Quoting from Unicode 5.0, section 5.4:

# The individual components of implementations may have different
# levels of support for surrogates, as long as those components are
# assembled and communicate correctly. Low-level string processing,
# where a Unicode string is not interpreted but is handled simply as an
# array of code units, may ignore surrogate pairs. With such strings,
# for example, a truncation operation with an arbitrary offset might
# break a surrogate pair. (For further discussion, see Section 2.7,
# Unicode Strings.) For performance in string operations, such behavior
# is reasonable at a low level, but it requires higher-level processes
# to ensure that offsets are on character boundaries so as to guarantee
# the integrity of surrogate pairs.

So lower-level routines (of which I claim chr() is one) are allowed to create lone surrogates. The formal requirement behind this is C1:

# A process shall not interpret a high-surrogate code point or a
# low-surrogate code point as an abstract character.

I also claim that Python, in both narrow and wide mode, conforms to this requirement. Notice that the requirement is a ban on interpreting the code point as a character. In particular, unicodedata.category claims that the code point is of class Cs (surrogate), which I consider conforming. By the same line of reasoning, it is also OK that chr() allows the creation of unassigned code points, even though C2 says that they must not be interpreted as abstract characters.

The rationale for supporting these characters in chr() goes back much further than the surrogateescape handler - as Python unicode strings are sequences of code points, it would be impractical if you couldn't create some of them, or even would have to consult the UCD before determining whether they can be created.

Regards,
Martin
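[The unicodedata claim is easy to check:

    >>> import unicodedata
    >>> unicodedata.category('\ud800')   # Cs = "Surrogate"
    'Cs'
]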
"Martin v. Löwis" writes:
I disagree: Quoting from Unicode 5.0, section 5.4:
# The individual components of implementations may have different
# levels of support for surrogates, as long as those components are
# assembled and communicate correctly.
"Assembly" is the problem. If chr() or a slice creates a lone surrogate and surrogateescape passes it back out, Python as a whole is non-conforming. Technically, you can hide behind "none of slicing, chr(), or surrogateescape promises to conform", and maybe that would fly to a standards lawyer; I'd have to see the precise statement. Here's a more convincing example. A user specifies "utf8" as her locale charset. Then she specifies a string containing a non-BMP character as the "description" of a file, and internal code munges this via slicing into a file name conforming to some specification (eg, length limit + uniquifier if needed). Then if the non-BMP character is in the "right" place, she will get either a broken file name, which will either get written to disk or raise an exception, depending on whether the munging program has enabled surrogateescape or not. I claim both of those results are non-conforming to the specification of UTF-16, and therefore Python Unicode processing as a whole must be considered non-conforming. It's still pretty damn good. But I've elaborated that point elsewhere.
The rationale for supporting these characters in chr() goes back much further than the surrogateescape handler - as Python unicode strings are sequences of code points, it would be impractical if you couldn't create some of them, or even would have to consult the UCD before determining whether they can be created.
The Zen is irrelevant to determining conformance to Unicode, which has its own Zen.
On Friday 19 November 2010 23:25:03 you wrote:
Python is unclear about non-BMP characters: the narrow build was called "ucs2" for a long time, even though it is UTF-16 (each character is encoded to one or two UTF-16 words).
No, no, no :-)
UCS2 and UCS4 are more appropriate than "narrow" and "wide" or even "UTF-16" and "UTF-32".
Ok for Python 2:

$ ./python
Python 2.7.0+ (release27-maint:84618M, Sep 8 2010, 12:43:49)
>>> import sys; sys.maxunicode
65535
>>> x=u'\U0010ffff'; len(x)
2
>>> ord(x)
Traceback (most recent call last):
  ...
TypeError: ord() expected a character, but string of length 2 found
But Python 3 does use UTF-16 for a narrow build:

$ ./python
Python 3.2a3+ (py3k:86396:86399M, Nov 10 2010, 15:24:09)
>>> import sys; sys.maxunicode
65535
>>> c=chr(0x10ffff); len(c)
2
>>> ord(c)
1114111
Victor