flaming vs accuracy [was Re: Performance of int/long in Python 3]

Benjamin Kaplan benjamin.kaplan at case.edu
Thu Mar 28 22:52:27 CET 2013

On Thu, Mar 28, 2013 at 2:11 PM, jmfauth <wxjmfauth at gmail.com> wrote:
> On 28 mar, 21:29, Benjamin Kaplan <benjamin.kap... at case.edu> wrote:
>> On Thu, Mar 28, 2013 at 10:48 AM, jmfauth <wxjmfa... at gmail.com> wrote:
>> > On 28 mar, 17:33, Ian Kelly <ian.g.ke... at gmail.com> wrote:
>> >> On Thu, Mar 28, 2013 at 7:34 AM, jmfauth <wxjmfa... at gmail.com> wrote:
>> >> > The flexible string representation takes the problem from the
>> >> > other side, it attempts to work with the characters by using
>> >> > their representations and it (can only) fails...
>> >> This is false.  As I've pointed out to you before, the FSR does not
>> >> divide characters up by representation.  It divides them up by
>> >> codepoint -- more specifically, by the *bit-width* of the codepoint.
>> >> We call the internal format of the string "ASCII" or "Latin-1" or
>> >> "UCS-2" for conciseness and a point of reference, but fundamentally
>> >> all of the FSR formats are simply byte arrays of *codepoints* -- you
>> >> know, those things you keep harping on.  The major optimization
>> >> performed by the FSR is to consistently truncate the leading zero
>> >> bytes from each codepoint when it is possible to do so safely.  But
>> >> regardless of to what extent this truncation is applied, the string is
>> >> *always* internally just an array of codepoints, and the same
>> >> algorithms apply for all representations.
>> > -----
>> > You know, we can discuss this ad nauseam. What is important
>> > is Unicode.
>> > You have transformed Python back in an ascii oriented product.
>> > If Python had imlemented Unicode correctly, there would
>> > be no difference in using an "a", "é", "€" or any character,
>> > what the narrow builds did.
>> > If I am practically the only one, who speakes /discusses about
>> > this, I can ensure you, this has been noticed.
>> > Now, it's time to prepare the Asparagus, the "jambon cru"
>> > and a good bottle a dry white wine.
>> > jmf
>> You still have yet to explain how Python's string representation is
>> wrong. Just how it isn't optimal for one specific case. Here's how I
>> understand it:
>> 1) Strings are sequences of stuff. Generally, we talk about strings as
>> either sequences of bytes or sequences of characters.
>> 2) Unicode is a format used to represent characters. Therefore,
>> Unicode strings are character strings, not byte strings.
>> 2) Encodings  are functions that map characters to bytes. They
>> typically also define an inverse function that converts from bytes
>> back to characters.
>> 3) UTF-8 IS NOT UNICODE. It is an encoding- one of those functions I
>> mentioned in the previous point. It happens to be one of the five
>> standard encodings that is defined for all characters in the Unicode
>> standard (the others being the little and big endian variants of
>> UTF-16 and UTF-32).
>> 4) The internal representation of a character string DOES NOT MATTER.
>> All that matters is that the API represents it as a string of
>> characters, regardless of the representation. We could implement
>> character strings by putting the Unicode code-points in binary-coded
>> decimal and it would be a Unicode character string.
>> 5) The String type that .NET and Java (and unicode type in Python
>> narrow builds) use is not a character string. It is a string of
>> shorts, each of which corresponds to a UTF-16 code point. I know this
>> is the case because in all of these, the length of "\u1f435" is 2 even
>> though it only consists of one character.
>> 6) The new string representation in Python 3.3 can successfully
>> represent all characters in the Unicode standard. The actual number of
>> bytes that each character consumes is invisible to the user.
> ----------
> I shew enough examples. As soon as you are using non latin-1 chars
> your "optimization" just became irrelevant and not only this, you
> are penalized.
> I'm sorry, saying Python now is just covering the whole unicode
> range is not a valuable excuse. I prefer a "correct" version with
> a narrower range of chars, especially if this range represents
> the "daily used chars".
> I can go a step further, if I wish to write an application for
> Western European users, I'm better served if I'm using a coding
> scheme covering all thesee languages/scripts. What about cp1252 [*]?
> Does this not remind somthing?
> Python can do better, it only succeeds to do worth!
> [*] yes, I kwnow, internally ....
> jmf

By that logic, we should all be using ASCII because it's "correct" for
the 127 characters that I (as an English speaker) use, and therefore
it's all that we should care about. I don't care if é counts as two
characters, it's faster and more memory efficient for all of my
strings to just count bytes. There are certain domains where
characters outside the basic multilingual plane are used. Python's job
is to be correct in all of those circumstances, not just the ones you
care about.

More information about the Python-list mailing list