New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord()
Peter Otten
__peter__ at web.de
Sun Aug 19 03:43:13 EDT 2012
Steven D'Aprano wrote:
> On Sat, 18 Aug 2012 19:34:50 +0100, MRAB wrote:
>
>> "a" will be stored as 1 byte/codepoint.
>>
>> Adding "é", it will still be stored as 1 byte/codepoint.
>
> Wrong. It will be 2 bytes, just like it already is in Python 3.2.
>
> I don't know where people are getting this myth that PEP 393 uses Latin-1
> internally, it does not. Read the PEP, it explicitly states that 1-byte
> formats are only used for ASCII strings.
From an interactive session:
Python 3.3.0a4+ (default:10a8ad665749, Jun 9 2012, 08:57:51)
[GCC 4.6.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> [sys.getsizeof("é"*i) for i in range(10)]
[49, 74, 75, 76, 77, 78, 79, 80, 81, 82]
>>> [sys.getsizeof("e"*i) for i in range(10)]
[49, 50, 51, 52, 53, 54, 55, 56, 57, 58]
>>> sys.getsizeof("é"*101)-sys.getsizeof("é")
100
>>> sys.getsizeof("e"*101)-sys.getsizeof("e")
100
>>> sys.getsizeof("€"*101)-sys.getsizeof("€")
200
I infer that
(1) both ASCII and Latin-1 strings require one byte per character.
(2) Latin-1 strings with non-ASCII characters have a constant overhead of
24 bytes (on a 64-bit system) over ASCII-only strings.
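
The same measurement can be repeated as a small script. This is only a sketch, assuming CPython 3.3 or later. The helper names bytes_per_char and overhead_vs_ascii are made up for illustration and are not part of any library. It compares two freshly built strings of different lengths so that the fixed header cancels out.

```python
import sys

def bytes_per_char(ch, n=100):
    """Estimate per-character storage: grow a string by n characters
    and divide the growth in sys.getsizeof() by n."""
    short = ch * n
    long_ = ch * (2 * n)
    return (sys.getsizeof(long_) - sys.getsizeof(short)) // n

def overhead_vs_ascii(ch, n=10):
    """Size difference between n copies of ch and n copies of an ASCII
    character; only a constant overhead if both use the same width."""
    return sys.getsizeof(ch * n) - sys.getsizeof("e" * n)

for ch in ("e", "é", "€"):
    print(ascii(ch), "bytes per character:", bytes_per_char(ch))

# The exact figure depends on the CPython version and on whether the
# build is 32-bit or 64-bit, so no expected value is given here.
print("extra header bytes for 'é':", overhead_vs_ascii("é"))
```

On the interpreter shown above this should reproduce the figures from the session: one byte per character for "e" and "é", two for "€".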