RE Module Performance
wxjmfauth at gmail.com
Sun Jul 28 14:13:29 EDT 2013
On Sunday, 28 July 2013 05:53:22 UTC+2, Ian wrote:
> On Sat, Jul 27, 2013 at 12:21 PM, <wxjmfauth at gmail.com> wrote:
> > Back to utf. utfs are not only elements of a unique set of encoded
> > code points. They have an interesting feature. Each "utf chunk"
> > intrinsically holds the character (in fact the code point) it is
> > supposed to represent. In utf-32, the obvious case, it is just
> > the code point. In utf-8, it is the first chunk that helps, and
> > utf-16 is a mixed case (utf-8 / utf-32). In other words, in an
> > implementation using bytes, for any pointer position it is always
> > possible to find the corresponding encoded code point and from this
> > the corresponding character without any "programmed" information. See
> > my editor example: how to find the char under the caret? In fact,
> > a silly example: how can the caret be positioned or moved if
> > the underlying corresponding encoded code point cannot be
> > discerned!
>
> Yes, given a pointer location into a utf-8 or utf-16 string, it is
> easy to determine the identity of the code point at that location.
> But this is not often a useful operation, save for resynchronization
> in the case that the string data is corrupted. The caret of an editor
> does not conceptually correspond to a pointer location, but to a
> character index. Given a particular character index (e.g. 127504), an
> editor must be able to determine the identity and/or the memory
> location of the character at that index, and for UTF-8 and UTF-16,
> without an auxiliary data structure, that is an O(n) operation.
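
The two operations contrasted above can be sketched in Python. This is only an illustration working on a `bytes` object; the helper names `utf8_char_at` and `utf8_offset_of_index` are hypothetical and not taken from any editor or from CPython. Decoding the code point under a given byte position needs only a short local scan, whereas locating the n-th code point requires walking the buffer from the start.

```python
def utf8_char_at(data: bytes, offset: int) -> str:
    """Decode the code point covering the byte at `offset` (pointer -> character)."""
    start = offset
    # Continuation bytes have the form 0b10xxxxxx; step back to the lead byte.
    while data[start] & 0xC0 == 0x80:
        start -= 1
    lead = data[start]
    # The lead byte alone tells how many bytes the sequence occupies.
    length = 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
    return data[start:start + length].decode("utf-8")


def utf8_offset_of_index(data: bytes, index: int) -> int:
    """Return the byte offset of code point number `index` (index -> pointer).

    Without an auxiliary index this is a linear scan over the buffer.
    """
    if index < 0:
        raise IndexError(index)
    seen = 0
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # every non-continuation byte starts a code point
            if seen == index:
                return offset
            seen += 1
    raise IndexError(index)


data = "a€b\U0001F600".encode("utf-8")  # 1 + 3 + 1 + 4 bytes
print(utf8_char_at(data, 2))            # byte 2 lies inside the euro sign
print(utf8_offset_of_index(data, 2))    # where does character index 2 start?
```

The first function touches at most four bytes regardless of buffer size; the second visits every byte up to the target, which is the cost the paragraph above refers to.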
>
> > 2) Take a look at this. Get rid of the overhead.
> >
> > >>> sys.getsizeof('b'*1000000 + 'c')
> > 1000026
> > >>> sys.getsizeof('b'*1000000 + '€')
> > 2000040
> >
> > What does it mean? It means that Python has to
> > reencode a str every time it is necessary because
> > it works with multiple codings.
>
> Large strings in practical usage do not need to be resized like this
> often. Python 3.3 has been in production use for months now, and you
> still have yet to produce any real-world application code that
> demonstrates a performance regression. If there is no real-world
> regression, then there is no problem.
>
> > 3) Unicode compliance. We know retrospectively that latin-1
> > was a bad choice. Unusable for 17 European languages.
> > Believe it or not, 20 years of Unicode incubation is not
> > long enough to learn it. When discussing once with a French
> > Python core dev, one with commit access, he did not know one
> > cannot use latin-1 for the French language!
>
> Probably because for many French strings, one can. As far as I am
> aware, the only characters that are missing from Latin-1 are the Euro
> sign (an unfortunate victim of history), the ligature œ (I have no
> doubt that many users just type oe anyway), and the rare capital Ÿ
> (the minuscule version is present in Latin-1). All French strings
> that are fortunate enough to lack these characters can be
> represented in Latin-1 and so will have a 1-byte width in the FSR.
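
The claim about which French characters fall outside Latin-1 is easy to check. A minimal sketch, where `fits_latin1` and `missing_from_latin1` are illustrative helper names and the sample strings are made up for the test:

```python
def fits_latin1(text: str) -> bool:
    """True if every character of `text` can be encoded as Latin-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


def missing_from_latin1(text: str) -> list:
    """Characters of `text` that lie above U+00FF, i.e. outside Latin-1."""
    return sorted({ch for ch in text if ord(ch) > 0xFF})


# Sample strings chosen for illustration only.
for sample in ["élève à Noël", "cœur", "10 €", "ÿ", "Ÿ"]:
    print(repr(sample), fits_latin1(sample), missing_from_latin1(sample))
```

Accented letters such as é, è, à and ë pass; œ (U+0153), € (U+20AC) and Ÿ (U+0178) are the ones reported as missing, while ÿ (U+00FF) is the last code point that still fits.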
------
Latin-1? That's not even true.
>>> sys.getsizeof('a')
26
>>> sys.getsizeof('ü')
38
>>> sys.getsizeof('aa')
27
>>> sys.getsizeof('aü')
39
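
The totals above combine a fixed per-object header with the per-character cost (note that 27 - 26 and 39 - 38 are both 1). One way to separate the two is to measure how much a string grows when more copies of the same character are appended. This is a rough sketch that assumes CPython 3.3 or later; `sys.getsizeof` results are implementation details and differ between versions and platforms, and `bytes_per_char` is a hypothetical helper.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Marginal cost in bytes of one more `ch`; the fixed header cancels out."""
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


for ch in ("a", "ü", "€", "\U0001F600"):
    # Report the code point and its marginal storage cost.
    print("U+%04X" % ord(ch), bytes_per_char(ch))
```

Under that assumption, the ASCII and Latin-1 characters should both report 1 byte, the euro sign 2, and a character outside the BMP 4; the larger total for 'ü' compared with 'a' then comes from the header, not from the per-character width.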
jmf