Flexible string representation, unicode, typography, ...
wxjmfauth at gmail.com
Sat Aug 25 11:47:52 EDT 2012
On Saturday, 25 August 2012 11:46:34 UTC+2, Frank Millman wrote:
> On 25/08/2012 10:58, Mark Lawrence wrote:
> > On 25/08/2012 08:27, wxjmfauth at gmail.com wrote:
> >>
> >> Unicode design: a flat table of code points, where all code
> >> points are "equals".
> >> As soon as one attempts to escape from this rule, one has to
> >> "pay" for it.
> >> The creator of this machinery (flexible string representation)
> >> can not even benefit from it in his native language (I think
> >> I'm correctly informed).
> >>
> >> Hint: Google -> "Das grosse Eszett"
> >>
> >> jmf
> >>
> >
> > It's Saturday morning, I'm stone cold sober, had a good sleep and I'm
> > still baffled as to the point if any. Could someone please enlighten me?
> >
>
> Here's what I think he is saying. I am posting this to test the water. I
> am also confused, and if I have got it wrong hopefully someone will
> correct me.
>
> In Python 3.3, unicode strings are now stored as follows -
>
> if all characters can be represented by 1 byte, the entire string is
> composed of 1-byte characters
> else if all characters can be represented by 1 or 2 bytes, the entire
> string is composed of 2-byte characters
> else the entire string is composed of 4-byte characters
>
> There is an overhead in making this choice, to detect the lowest number
> of bytes required.
>
> jmfauth believes that this only benefits 'english-speaking' users, as
> the rest of the world will tend to have strings where at least one
> character requires 2 or 4 bytes. So they incur the overhead, without
> getting any benefit.
>
> Therefore, I think he is saying that he would have preferred that Python
> standardise on 4-byte characters, on the grounds that the saving in
> memory does not justify the performance overhead.
>
> Frank Millman
Very well explained. Thanks.
More precisely, this is not only a matter of 'English-speaking'
users versus everyone else: it affects every user whose text
contains non-latin-1 characters.
(See the title of this topic, ... typography).
Being latin-1 compliant and Unicode compliant at the same time
is a plain absurdity in the mathematical sense.
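
To make the storage rule quoted above concrete, here is a minimal
sketch, assuming CPython 3.3 or later (sys.getsizeof reports
implementation-specific sizes and may behave differently or be
unavailable on other interpreters). The helper name bytes_per_char is
hypothetical and not part of any library; it only estimates the
per-character width by comparing two repetitions of the same string.

```python
import sys


def bytes_per_char(sample):
    """Estimate how many bytes CPython stores per character of `sample`.

    Hypothetical helper, not a standard API. Two freshly built repetitions
    of the string are compared so that the fixed object header cancels out.
    """
    if not sample:
        raise ValueError("sample must be non-empty")
    double = sample * 2
    triple = sample * 3
    # The size difference is exactly len(sample) characters of storage.
    return (sys.getsizeof(triple) - sys.getsizeof(double)) // len(sample)


# ASCII only, latin-1 (e acute), outside latin-1 (euro sign), outside the BMP.
for text in ("abc", "caf\u00e9", "abc\u20ac", "abc\U0001F600"):
    print(ascii(text), bytes_per_char(text))
```

On such an interpreter I would expect widths of 1, 1, 2 and 4: a
single euro sign is enough to move the whole string to 2 bytes per
character, and a single character outside the BMP moves it to 4.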
---
For those who do not know, the Go language has introduced
the rune type. As far as I know, nobody is complaining, and I
have not even seen a discussion related to this subject.
100% Unicode compliant from day 0. Congratulations.
jmf