flaming vs accuracy [was Re: Performance of int/long in Python 3]

jmfauth wxjmfauth at gmail.com
Thu Mar 28 22:11:18 CET 2013


On 28 mar, 21:29, Benjamin Kaplan <benjamin.kap... at case.edu> wrote:
> On Thu, Mar 28, 2013 at 10:48 AM, jmfauth <wxjmfa... at gmail.com> wrote:
> > On 28 mar, 17:33, Ian Kelly <ian.g.ke... at gmail.com> wrote:
> >> On Thu, Mar 28, 2013 at 7:34 AM, jmfauth <wxjmfa... at gmail.com> wrote:
> >> > The flexible string representation takes the problem from the
> >> > other side, it attempts to work with the characters by using
> >> > their representations and it (can only) fails...
>
> >> This is false.  As I've pointed out to you before, the FSR does not
> >> divide characters up by representation.  It divides them up by
> >> codepoint -- more specifically, by the *bit-width* of the codepoint.
> >> We call the internal format of the string "ASCII" or "Latin-1" or
> >> "UCS-2" for conciseness and a point of reference, but fundamentally
> >> all of the FSR formats are simply byte arrays of *codepoints* -- you
> >> know, those things you keep harping on.  The major optimization
> >> performed by the FSR is to consistently truncate the leading zero
> >> bytes from each codepoint when it is possible to do so safely.  But
> >> regardless of to what extent this truncation is applied, the string is
> >> *always* internally just an array of codepoints, and the same
> >> algorithms apply for all representations.
>
> > -----
>
> > You know, we can discuss this ad nauseam. What is important
> > is Unicode.
>
> > You have transformed Python back into an ASCII-oriented product.
>
> > If Python had implemented Unicode correctly, there would
> > be no difference in using an "a", "é", "€" or any character,
> > which is what the narrow builds did.
>
> > If I am practically the only one who speaks about or discusses
> > this, I can assure you, this has been noticed.
>
> > Now, it's time to prepare the asparagus, the "jambon cru"
> > and a good bottle of dry white wine.
>
> > jmf
>
> You still have yet to explain how Python's string representation is
> wrong. Just how it isn't optimal for one specific case. Here's how I
> understand it:
>
> 1) Strings are sequences of stuff. Generally, we talk about strings as
> either sequences of bytes or sequences of characters.
>
> 2) Unicode is a format used to represent characters. Therefore,
> Unicode strings are character strings, not byte strings.
>
> 2) Encodings are functions that map characters to bytes. They
> typically also define an inverse function that converts from bytes
> back to characters.
>
> 3) UTF-8 IS NOT UNICODE. It is an encoding, one of those functions I
> mentioned in the previous point. It happens to be one of the five
> standard encodings that is defined for all characters in the Unicode
> standard (the others being the little and big endian variants of
> UTF-16 and UTF-32).
>
> 4) The internal representation of a character string DOES NOT MATTER.
> All that matters is that the API represents it as a string of
> characters, regardless of the representation. We could implement
> character strings by putting the Unicode code-points in binary-coded
> decimal and it would be a Unicode character string.
>
> 5) The String type that .NET and Java (and the unicode type in Python
> narrow builds) use is not a character string. It is a string of
> shorts, each of which corresponds to a UTF-16 code unit. I know this
> is the case because in all of these, the length of "\U0001F435" is 2 even
> though it only consists of one character.
>
> 6) The new string representation in Python 3.3 can successfully
> represent all characters in the Unicode standard. The actual number of
> bytes that each character consumes is invisible to the user.
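
The distinction drawn in the list above between code points, encoded
bytes and UTF-16 code units can be checked directly. A minimal sketch,
assuming Python 3.3 or later; the names monkey, utf8_bytes and
utf16_units are illustrative and do not come from the thread:

```python
# U+1F435 (MONKEY FACE) lies outside the Basic Multilingual Plane.
monkey = "\U0001F435"

# In Python 3.3+ a str is a sequence of code points, so this is one item.
code_points = len(monkey)

# Encodings map the same character to different byte sequences.
utf8_bytes = monkey.encode("utf-8")
utf16_bytes = monkey.encode("utf-16-le")

# UTF-16 needs a surrogate pair here, i.e. two 16-bit code units.
utf16_units = len(utf16_bytes) // 2

print(code_points, len(utf8_bytes), utf16_units)

# Decoding is the inverse mapping and gives back the original string.
assert utf8_bytes.decode("utf-8") == monkey
assert utf16_bytes.decode("utf-16-le") == monkey
```

A string type built on 16-bit code units would report a length of 2 for
this character, which is the behaviour point 5 describes.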

----------


I have shown enough examples. As soon as you are using non-Latin-1 chars,
your "optimization" becomes irrelevant, and not only that, you
are penalized.
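
The memory side of this claim can be measured rather than argued. A
rough sketch, assuming CPython 3.3 or later, where sys.getsizeof reports
the allocated size of a str object; the helper bytes_per_char is
hypothetical, and the sketch says nothing about speed:

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character for a string made only of ch.

    Taking the difference of two sizes cancels the fixed object header.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

samples = [
    ("a", "ASCII"),
    ("\u00e9", "Latin-1 (e with acute accent)"),
    ("\u20ac", "BMP beyond Latin-1 (euro sign)"),
    ("\U0001F435", "outside the BMP (monkey face)"),
]

widths = [bytes_per_char(ch) for ch, _ in samples]
for (ch, label), width in zip(samples, widths):
    print("U+{:04X} {}: {} byte(s) per character".format(ord(ch), label, width))
```

On CPython 3.3+ the four widths should come out as 1, 1, 2 and 4 bytes.
For comparison, earlier narrow builds stored 2 bytes per code unit and
wide builds 4 bytes per character, whatever the content of the string.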

I'm sorry, saying that Python now covers the whole Unicode
range is not a valid excuse. I prefer a "correct" version with
a narrower range of chars, especially if this range represents
the "daily used chars".

I can go a step further: if I wish to write an application for
Western European users, I'm better served if I'm using a coding
scheme covering all these languages/scripts. What about cp1252 [*]?
Does this not remind you of something?
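
What a single-byte code page such as cp1252 offers, and what it gives
up, can be shown with the standard codecs. A minimal sketch; the sample
strings are illustrative only:

```python
# Two Western European characters: e with acute accent and the euro sign.
western = "\u00e9\u20ac"
encoded = western.encode("cp1252")  # one byte per character

# GREEK CAPITAL LETTER OMEGA is not part of the cp1252 repertoire.
greek = "\u03a9"
try:
    greek.encode("cp1252")
    representable = True
except UnicodeEncodeError:
    representable = False

print(encoded, representable)
```

Text limited to the cp1252 repertoire takes one byte per character in
that encoding, while anything outside it cannot be encoded at all.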

Python can do better, it only succeeds in doing worse!

[*] yes, I know, internally ....

jmf


