hex dump w/ or w/out utf-8 chars
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Thu Jul 11 23:18:44 EDT 2013
On Thu, 11 Jul 2013 11:42:26 -0700, wxjmfauth wrote:
> And what to say about this "ucs4" char/string '\U0001d11e' which
> weighs 18 bytes more than an "a".
>
>>>> sys.getsizeof('\U0001d11e')
> 44
>
> A total absurdity.
You should stick to Python 3.1 and 3.2 then:
py> print(sys.version)
3.1.3 (r313:86834, Nov 28 2010, 11:28:10)
[GCC 4.4.5]
py> sys.getsizeof('\U0001d11e')
36
py> sys.getsizeof('a')
36
Now all your strings will be just as heavy: every single variable name
and attribute name will use four times as much memory. Happy now?
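For context, a minimal sketch of why the sizes differ under Python 3.3's
flexible string representation (PEP 393): a str is stored with 1, 2 or 4
bytes per code point, chosen by the widest character it contains, so
'\U0001d11e' reports a larger size than 'a'. The exact byte counts printed
will vary by platform and version:

import sys

# Under PEP 393, each string uses 1, 2 or 4 bytes per code point,
# so the reported size grows with the widest character it contains.
for s in ('a', '\xe9', '\u20ac', '\U0001d11e'):
    print(repr(s), sys.getsizeof(s))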
> How does this come about? Very simple: once you split Unicode
> into subsets, not only do you have to handle these subsets, you have to
> create "markers" to differentiate them. And once you produce "markers",
> you have to handle the mess generated by these "markers". Hiding these
> markers in the overhead of the class does not mean that they should not
> be counted as part of the coding scheme. BTW, since when does a serious
> coding scheme need an external marker?
Since always.
How do you think that (say) a C compiler can tell the difference between
the long 1199876496 and the float 67923.125? They both have exactly the
same four bytes:
py> import struct
py> struct.pack('f', 67923.125)
b'\x90\xa9\x84G'
py> struct.pack('l', 1199876496)
b'\x90\xa9\x84G'
*Everything* in a computer is bytes. The only way to tell them apart is
by external markers.
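To make that concrete, a minimal sketch (assuming a little-endian reading
of the four bytes above) that reinterprets the same byte string as both a
32-bit integer and a 32-bit float; the '<i' and '<f' format codes pin down
the size and byte order, so the result does not depend on the platform's
native long size:

import struct

raw = b'\x90\xa9\x84G'
# The same four bytes decode to two very different values depending on
# which external "marker" (the format code) we apply.
as_float, = struct.unpack('<f', raw)   # -> 67923.125
as_int,   = struct.unpack('<i', raw)   # -> 1199876496
print(as_float, as_int)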
--
Steven