RE Module Performance
wxjmfauth at gmail.com
Wed Jul 31 04:32:13 EDT 2013
FSR:
===
The 'a' in 'a€' and 'a\U0001d11e':
>>> ['{:#010b}'.format(c) for c in 'a€'.encode('utf-16-be')]
['0b00000000', '0b01100001', '0b00100000', '0b10101100']
>>> ['{:#010b}'.format(c) for c in 'a\U0001d11e'.encode('utf-32-be')]
['0b00000000', '0b00000000', '0b00000000', '0b01100001',
'0b00000000', '0b00000001', '0b11010001', '0b00011110']
This re-encoding of the 'a' has to be done.
>>> import sys
>>> sys.getsizeof('a€')
42
>>> sys.getsizeof('a\U0001d11e')
48
>>> sys.getsizeof('aa')
27
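
A minimal sketch of the same effect, assuming CPython 3.3 or later
(where the FSR of PEP 393 applies). The helper name
bytes_per_code_point is hypothetical, not something from the stdlib.
It measures how the size grows when a string is repeated, so the fixed
object header cancels out and the result should not depend on the
platform the way the absolute sizes above do.

```python
import sys

def bytes_per_code_point(s, repeats=100):
    """Estimate storage per code point for strings built from s.

    Hypothetical helper; relies on CPython's sys.getsizeof behaviour.
    """
    # s and s * (repeats + 1) share the same header size, so the
    # difference between them is pure character storage.
    grown = sys.getsizeof(s * (repeats + 1)) - sys.getsizeof(s)
    return grown // (repeats * len(s))

for sample in ('aa', 'a€', 'a\U0001d11e'):
    # ascii() keeps the output printable on any terminal encoding.
    print(ascii(sample), bytes_per_code_point(sample))
```

On such an interpreter the 'a' should cost 1, 2 or 4 bytes depending
only on the widest code point that shares the string with it.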
Unicode/utf*
============
i) ("primary key") Create and use a unique set of encoded
code points.
ii) ("secondary key") Depending on the need,
memory or performance: utf-8/16/32
Two advantages in light of the above example:
iii) The "a" never has to be re-encoded.
iv) The size of an "a" never exceeds 4 bytes.
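
Points iii) and iv) can be checked with a short sketch, using only
str.encode and the three sample strings from the example above. The
names a_sizes and a_stable are made up for illustration. It shows one
reading of the points, not a full design: within a single utf-* form,
the encoded "a" is the same bytes whatever follows it.

```python
samples = ['aa', 'a€', 'a\U0001d11e']

a_sizes = {}   # codec -> number of bytes used for 'a'
a_stable = {}  # codec -> True if 'a' is encoded identically in every sample

for codec in ('utf-8', 'utf-16-be', 'utf-32-be'):
    encoded_a = 'a'.encode(codec)
    a_sizes[codec] = len(encoded_a)
    # Each sample starts with 'a', so its encoding must start with
    # the same bytes if the 'a' is never re-encoded.
    a_stable[codec] = all(
        s.encode(codec).startswith(encoded_a) for s in samples
    )

for codec in a_sizes:
    print(codec, a_sizes[codec], a_stable[codec])
```

The "a" takes 1, 2 or 4 bytes in utf-8, utf-16 and utf-32, and that
size is fixed by the chosen form, not by the neighbouring characters.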
It is a hard job to satisfy i), ii), iii) and iv) at the same time.
Is it possible? ;-) The solution is in the problem.
jmf