hex dump w/ or w/out utf-8 chars
wxjmfauth at gmail.com
Thu Jul 11 14:44:16 EDT 2013
On Thursday, 11 July 2013 20:42:26 UTC+2, wxjm... at gmail.com wrote:
> On Thursday, 11 July 2013 15:32:00 UTC+2, Chris Angelico wrote:
> > On Thu, Jul 11, 2013 at 11:18 PM, <wxjmfauth at gmail.com> wrote:
> > > Just to stick with this funny character ẞ, a ucs-2 char
> > > in the Flexible String Representation nomenclature.
> > >
> > > It seems to me that, when one needs more than ten bytes
> > > to encode it,
> > >
> > > >>> sys.getsizeof('a')
> > > 26
> > > >>> sys.getsizeof('ẞ')
> > > 40
> > >
> > > this is far away from the perfection.
> >
> > Better comparison is to see how much space is used by one copy of it,
> > and how much by two copies:
> >
> > >>> sys.getsizeof('aa')-sys.getsizeof('a')
> > 1
> > >>> sys.getsizeof('ẞẞ')-sys.getsizeof('ẞ')
> > 2
> >
> > String objects have overhead. Big deal.
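The one-copy versus two-copies comparison quoted above can be sketched as a small helper. This is only an illustration, assuming CPython 3.3 or later (PEP 393 strings); the name `marginal_cost` is made up here, and absolute `sys.getsizeof` values vary by platform and version, so only differences are taken.

```python
import sys

def marginal_cost(ch, n=1):
    """Bytes added by one more copy of ch to a string of n copies.

    Hypothetical helper, not part of any library.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

# The fixed per-object overhead cancels in the subtraction,
# leaving only the per-character storage cost.
ascii_cost = marginal_cost('a')
eszett_cost = marginal_cost('\u1e9e')  # 'ẞ', LATIN CAPITAL LETTER SHARP S
print(ascii_cost, eszett_cost)
```

Taking a difference is a deliberate choice: it measures how the string grows, not what an empty string object costs.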
> >
> > > BTW, for a modern language, is not ucs2 considered
> > > as obsolete since many, many years?
> >
> > Clearly. And similarly, the 16-bit integer has been completely
> > obsoleted, as there is no reason anyone should ever bother to use it.
> > Same with the float type - everyone uses double or better these days,
> > right?
> >
> > http://www.postgresql.org/docs/current/static/datatype-numeric.html
> > http://www.cplusplus.com/doc/tutorial/variables/
> >
> > Nope, nobody uses small integers any more, they're clearly completely obsolete.
>
> Sure, there is some overhead because a str is a class.
> It still remains that a "ẞ" weighs 14 bytes more than
> an "a".
>
> In "aẞ", the ẞ weighs 16 bytes.
>
> >>> sys.getsizeof('a')
> 26
> >>> sys.getsizeof('aẞ')
> 42
>
> and in "aẞẞ", the second ẞ weighs 2 bytes
>
> >>> sys.getsizeof('aẞẞ')
>
> And what to say about this "ucs4" char/string '\U0001d11e', which
> weighs 18 bytes more than an "a"?
>
> >>> sys.getsizeof('\U0001d11e')
> 44
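The jumps in the figures above come from the whole string being stored at the width of its widest character. Here is a rough sketch of that effect, assuming CPython 3.3 or later; `growth_per_char` is a name invented for this illustration, and only size differences are compared because absolute sizes depend on the build.

```python
import sys

def growth_per_char(filler, widest=''):
    """Bytes added per extra `filler` char in a string that also holds `widest`.

    Hypothetical helper for illustration only.
    """
    base = widest + filler
    return sys.getsizeof(base + filler) - sys.getsizeof(base)

# Appending 'a' costs more once a wider character is present,
# because every character of the string is stored at the same width.
in_ascii = growth_per_char('a')                 # pure ASCII string
in_bmp = growth_per_char('a', '\u1e9e')         # string containing 'ẞ'
in_astral = growth_per_char('a', '\U0001d11e')  # string containing U+1D11E
print(in_ascii, in_bmp, in_astral)
```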
>
> A total absurdity. How does it come about? Very simple: once you
> split Unicode into subsets, not only do you have to handle these
> subsets, you also have to create "markers" to differentiate them.
> And not only do you produce "markers", you have to handle the
> mess generated by these "markers". Hiding these markers
> in the overhead of the class does not mean that they should
> not be counted as part of the coding scheme. BTW, since
> when does a serious coding scheme need an external marker?
>
> >>> sys.getsizeof('aa') - sys.getsizeof('a')
> 1
>
> In short, if my algebra is still correct:
>
> (overhead + marker + 2*'a') - (overhead + marker + 'a')
> = (overhead + marker + 2*'a') - overhead - marker - 'a'
> = overhead - overhead + marker - marker + 2*'a' - 'a'
> = 0 + 0 + 'a'
> = 1
>
> The "marker" has magically disappeared.
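The cancellation in the algebra above can be checked numerically. This is a sketch under assumptions: CPython 3.3 or later, and per-character widths of 1, 2 and 4 bytes as described in PEP 393. The helper `fixed_part` and the split into "fixed part" and "payload" are illustrative only; the fixed part lumps together everything that does not grow with length.

```python
import sys

def fixed_part(ch, width, n):
    """Size of ch*n minus n*width bytes of payload (hypothetical helper)."""
    return sys.getsizeof(ch * n) - n * width

# (character, assumed bytes per character under PEP 393)
samples = [('a', 1), ('\u1e9e', 2), ('\U0001d11e', 4)]

for ch, width in samples:
    parts = {fixed_part(ch, width, n) for n in range(1, 6)}
    # A single value means the fixed part does not depend on length,
    # which is why it drops out of getsizeof(2 copies) - getsizeof(1 copy).
    print(hex(ord(ch)), sorted(parts))
```

The fixed part may differ between the three kinds of string, but within one kind it stays the same for every length, so any difference of two sizes shows only the payload.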
>
> jmf