New internal string format in 3.3
Peter Otten
__peter__ at web.de
Sun Aug 19 05:37:09 EDT 2012
Steven D'Aprano wrote:
> On Sun, 19 Aug 2012 09:43:13 +0200, Peter Otten wrote:
>
>> Steven D'Aprano wrote:
>
>>> I don't know where people are getting this myth that PEP 393 uses
>>> Latin-1 internally, it does not. Read the PEP, it explicitly states
>>> that 1-byte formats are only used for ASCII strings.
>>
>> From
>>
>> Python 3.3.0a4+ (default:10a8ad665749, Jun 9 2012, 08:57:51) [GCC
>> 4.6.1] on linux
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> import sys
>> >>> [sys.getsizeof("é"*i) for i in range(10)]
>> [49, 74, 75, 76, 77, 78, 79, 80, 81, 82]
>
> Interesting. Say, I don't suppose you're using a 64-bit build? Because
> that would explain why your sizes are so much larger than mine:
>
> py> [sys.getsizeof("é"*i) for i in range(10)]
> [25, 38, 39, 40, 41, 42, 43, 44, 45, 46]
>
>
> py> [sys.getsizeof("€"*i) for i in range(10)]
> [25, 40, 42, 44, 46, 48, 50, 52, 54, 56]
Yes, I am using a 64-bit build. I thought that
>> (2) Latin1 strings have a constant overhead of 24 bytes (on a 64bit
>> system) over ASCII-only.
would convey that. The corresponding data structure
    typedef struct {
        PyASCIIObject _base;
        Py_ssize_t utf8_length;
        char *utf8;
        Py_ssize_t wstr_length;
    } PyCompactUnicodeObject;
makes for 12 extra bytes on a 32-bit build (three 4-byte fields after _base).
Both Py_ssize_t and pointers double in size (from 4 to 8 bytes) on a 64-bit
build, which gives the 24 bytes. I'm sure you can do the maths for the
embedded PyASCIIObject yourself.
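
For anyone who wants to redo the arithmetic, here is a minimal sketch. The
helper names compact_extra_bytes and bytes_per_char are made up for this post.
The first only adds up the three fields of the struct as quoted above, so it
says nothing about the layout of whatever interpreter you run it on, which may
differ in other versions. The second assumes CPython 3.3 or later, where
sys.getsizeof() reflects the PEP 393 representation.

```python
import struct
import sys


def compact_extra_bytes(ssize_t_size, pointer_size):
    """Bytes the three quoted fields add on top of the embedded PyASCIIObject.

    Illustrative helper: utf8_length and wstr_length are Py_ssize_t,
    utf8 is a char pointer. No padding is assumed, since all three
    fields have the same size on the usual 32-bit and 64-bit builds.
    """
    return 2 * ssize_t_size + pointer_size


def bytes_per_char(ch):
    """Growth of sys.getsizeof() when one more copy of ch is appended.

    Assumes CPython 3.3+; other implementations may not support
    sys.getsizeof() or may store strings differently.
    """
    return sys.getsizeof(ch * 3) - sys.getsizeof(ch * 2)


if __name__ == "__main__":
    print("32-bit build:", compact_extra_bytes(4, 4))
    print("64-bit build:", compact_extra_bytes(8, 8))
    # Same sum, using the Py_ssize_t and pointer sizes of the running build.
    print("this build:  ",
          compact_extra_bytes(struct.calcsize("n"), struct.calcsize("P")))
    for ch in ("a", "é", "€"):
        print(repr(ch), "grows by", bytes_per_char(ch), "byte(s) per character")
```

The per-character growth is the part that matters for the original question:
"é" costs one byte per character, the same as ASCII, while "€" costs two.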