FSR and unicode compliance - was Re: RE Module Performance
wxjmfauth at gmail.com
Sun Jul 28 15:23:04 EDT 2013
On Sunday, July 28, 2013 5:52:47 PM UTC+2, Michael Torrie wrote:
> On 07/27/2013 12:21 PM, wxjmfauth at gmail.com wrote:
> > Good point. FSR, nice tool for those who wish to teach
> > Unicode. It is not every day, one has such an opportunity.
>
> I had a long e-mail composed, but decided to chop it down; it was still too
> long, so I ditched a lot of the context, which jmf also seems to do.
> Apologies.
>
> 1. FSR *is* UTF-32, so it is as unicode compliant as UTF-32, since UTF-32
> is an official encoding. FSR only differs from UTF-32 in that the
> padding zeros are stripped off such that it is stored in the most
> compact form that can handle all the characters in the string, which is
> always known at string creation time. Now you can argue many things,
> but to say FSR is not unicode compliant is quite a stretch! What
> unicode entities or characters cannot be stored in strings using FSR?
> What sequences of bytes in FSR result in invalid Unicode entities?
>
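To make point 1 concrete, here is a minimal sketch (CPython 3.3 or later; the
exact byte counts depend on platform and version, so the script prints them
rather than assuming them). The per-character width is chosen from the widest
code point in the string:

import sys

samples = ['abcd',           # all code points < 128    -> 1 byte per char
           'abc\xe9',        # max code point < 256     -> 1 byte per char
           'abc\u20ac',      # max code point < 65536   -> 2 bytes per char
           'abc\U00010010']  # max code point > 65535   -> 4 bytes per char

for s in samples:
    widest = max(ord(c) for c in s)
    print(ascii(s), hex(widest), sys.getsizeof(s))
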
> 2. strings in Python *never change*. They are immutable. The +
> operator always copies strings character by character into a new string
> object, even if Python had used UTF-8 internally. If you're doing a lot
> of string concatenations, perhaps you're using the wrong data type. A
> byte buffer might be better for you, where you can stuff utf-8 sequences
> into it to your heart's content.
>
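One way to do what point 2 suggests, as a sketch only (the names are made up):
accumulate UTF-8 bytes in a mutable buffer and decode once at the end.

chunks = ['hello ', 'wörld ', '€uro']  # pretend these arrive one at a time

buf = bytearray()                      # mutable, grows in place
for chunk in chunks:
    buf += chunk.encode('utf-8')       # stuff utf-8 sequences into it

text = buf.decode('utf-8')             # one decode at the end
print(text)

The usual pure-str idiom is ''.join(chunks), which likewise avoids the
quadratic copying of repeated +.
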
> 3. UTF-8 and UTF-16 encodings, being variable width encodings, mean that
> slicing a string would be very very slow, and that's unacceptable for
> the use cases of python strings. I'm assuming you understand big O
> notation, as you talk of experience in many languages over the years.
> FSR and UTF-32 both are O(1) for slicing and lookups. UTF-8, 16 and any
> variable-width encoding are always O(n). A lot slower!
>
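A rough sketch of why that is so: to reach character n in UTF-8 bytes you have
to walk every lead byte before it. The helper below is hypothetical, for
illustration only, and assumes well-formed UTF-8.

def nth_char_utf8(data, n):
    """Return character n from UTF-8 bytes by scanning from the start."""
    i = 0
    for _ in range(n):                 # skip n characters: O(n) work
        lead = data[i]
        if lead < 0x80:
            i += 1                     # 1-byte sequence (ASCII)
        elif lead < 0xE0:
            i += 2                     # 2-byte sequence
        elif lead < 0xF0:
            i += 3                     # 3-byte sequence
        else:
            i += 4                     # 4-byte sequence
    lead = data[i]
    size = 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
    return data[i:i + size].decode('utf-8')

s = 'héllo wörld'
b = s.encode('utf-8')
print(s[7])                            # O(1) on a str (FSR / UTF-32)
print(nth_char_utf8(b, 7))             # same character, O(n) on UTF-8 bytes
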
> 4. Unicode is, well, unicode. You seem to hop all over the place from
> talking about code points to bytes to bits, using them all
> interchangeably. And now you seem to be claiming that a particular byte
> encoding standard is by definition unicode (UTF-8). Or at least that's
> how it sounds. And also claim FSR is not compliant with unicode
> standards, which appears to me to be completely false.
>
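The code point / byte distinction in point 4 is easy to see at the
interpreter: one character is one code point, but a different byte sequence
under each encoding.

ch = '\u20ac'                          # U+20AC EURO SIGN, one code point
print(len(ch))                         # 1: Python 3 strings count code points
for enc in ('utf-8', 'utf-16-le', 'utf-32-le'):
    print(enc, ch.encode(enc))         # the same character as 3, 2 and 4 bytes
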
> Is my understanding of these things wrong?
------
Compare these (a BDFL example, where I'm using a non-ascii char):
Py 3.2 (narrow build)
>>> timeit.timeit("a = 'hundred'; 'x' in a")
0.09897159682121348
>>> timeit.timeit("a = 'hundre€'; 'x' in a")
0.09079501961732461
>>> sys.getsizeof('d')
32
>>> sys.getsizeof('€')
32
>>> sys.getsizeof('dd')
34
>>> sys.getsizeof('d€')
34
Py 3.3
>>> timeit.timeit("a = 'hundred'; 'x' in a")
0.12183182740848858
>>> timeit.timeit("a = 'hundre€'; 'x' in a")
0.2365732969632326
>>> sys.getsizeof('d')
26
>>> sys.getsizeof('€')
40
>>> sys.getsizeof('dd')
27
>>> sys.getsizeof('d€')
42
Tell me which one seems to be more "unicode compliant"?
The goal of Unicode is to handle every char "equally".
Now, the problem: memory. Do not forget that an "FSR"-like
mechanism is *irrelevant* for a non-ascii user. As soon as
one uses a single non-ascii char, the ascii benefit is lost.
(That is why we have all these dedicated coding schemes,
the utfs included.)
>>> sys.getsizeof('abc' * 1000 + 'z')
3026
>>> sys.getsizeof('abc' * 1000 + '\U00010010')
12044
A bit of a secret: the larger a repertoire of characters
is, the more bits you need.
Secret #2: you cannot escape from this.
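A quick way to see that at the interpreter: latin-1 tops out at 8 bits, the
BMP at 16, and the full Unicode range (up to U+10FFFF) needs 21.
>>> (0xFF).bit_length(), (0xFFFF).bit_length(), (0x10FFFF).bit_length()
(8, 16, 21)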
jmf