string storage [was: Re: imaplib: is this really so unwieldy?]
Alan Gauld
alan.gauld at yahoo.co.uk
Wed May 26 03:18:48 EDT 2021
On 25/05/2021 23:23, Terry Reedy wrote:
> In CPython's Flexible String Representation all characters in a string
> are stored with the same number of bytes, depending on the largest
> codepoint.
I'm learning lots of new things in this thread!
Does that mean that if I give Python a UTF-8 string that is mostly
single-byte characters but contains one 4-byte character, Python will
store the whole string as 4-byte characters?
If so, doesn't that introduce a pretty big storage overhead for
large strings?
>
> >>> sys.getsizeof('\U00011111')
> 80
> >>> sys.getsizeof('\U00011111'*2)
> 84
> >>> sys.getsizeof('a\U00011111')
> 84
Which is what this seems to be saying.
I confess I had just assumed that Unicode strings were stored
internally as UTF-8.
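
For anyone wanting to check the overhead on their own build, here is a
rough sketch. It assumes CPython 3.3 or later (where the Flexible String
Representation applies); per_char_bytes is just a helper name made up
for this example, and it infers the per-character width from how much
sys.getsizeof grows when the string is doubled. Other implementations
may store strings differently.

```python
import sys


def per_char_bytes(s: str) -> int:
    """Estimate the storage width of each character in s (CPython only).

    Doubling the string keeps the widest code point the same, so the
    growth in reported size is len(s) times the per-character width.
    """
    return (sys.getsizeof(s * 2) - sys.getsizeof(s)) // len(s)


# 1000 ASCII characters versus 999 ASCII characters plus one
# code point above U+FFFF (the same one used in the quoted session).
ascii_text = "a" * 1000
mixed_text = "a" * 999 + "\U00011111"

print(per_char_bytes(ascii_text))        # expected on CPython 3.3+: 1
print(per_char_bytes(mixed_text))        # expected on CPython 3.3+: 4
print(len(mixed_text.encode("utf-8")))   # UTF-8 length: 999 + 4 = 1003
```

If the widths come out as 1 and 4, the in-memory character data for the
mixed string is roughly four times its UTF-8 encoding, which is the
overhead asked about above.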
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos