byte count unicode string
sjmachin at lexicon.net
Wed Sep 20 09:39:34 CEST 2006
> Marc 'BlackJack' Rintsch:
> >In <mailman.313.1158732191.10491.python-l... at python.org>, willie wrote:
> >> # What's the correct way to get the
> >> # byte count of a unicode (UTF-8) string?
> >> # I couldn't find a builtin method
> >> # and the following is memory inefficient.
> >> ustr = "example\xC2\x9D".decode('UTF-8')
> >> num_chars = len(ustr) # 8
> >> buf = ustr.encode('UTF-8')
> >> num_bytes = len(buf) # 9
> >That is the correct way.
> # Apologies if I'm being dense, but it seems
> # unusual that I'd have to make a copy of a
> # unicode string, converting it into a byte
> # string, before I can determine the size (in bytes)
> # of the unicode string. Can someone provide the rationale
> # for that or correct my misunderstanding?
You initially asked "What's the correct way to get the byte count of a
unicode (UTF-8) string".
It appears you meant "How can I find how many bytes there are in the
UTF-8 representation of a Unicode string without manifesting the UTF-8
form?"
The answer is, "You can't", and the rationale would have to be that
nobody thought of a use case for counting the length of the UTF-8 form
but not creating the UTF-8 form. What is your use case?
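
If the count really is needed without building the encoded copy, it can be
derived from the code points themselves, since UTF-8 uses a fixed number of
bytes per code point range. The following is only a minimal sketch, assuming
Python 3 (where str is a Unicode string); utf8_len is a hypothetical helper
name, not a builtin, and a pure-Python loop like this is likely to be slower
than simply calling encode() and taking len() of the result.

```python
def utf8_len(text):
    """Count the bytes `text` would occupy in UTF-8 without encoding it.

    Hypothetical helper for illustration. Lone surrogates (which a strict
    UTF-8 encoder would reject) are counted as 3 bytes here.
    """
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:        # U+0000..U+007F: 1 byte
            total += 1
        elif cp < 0x800:     # U+0080..U+07FF: 2 bytes
            total += 2
        elif cp < 0x10000:   # U+0800..U+FFFF: 3 bytes
            total += 3
        else:                # U+10000..U+10FFFF: 4 bytes
            total += 4
    return total


ustr = "example\u009d"       # same text as in the original question
num_chars = len(ustr)        # 8
num_bytes = utf8_len(ustr)   # 9, matches len(ustr.encode("utf-8"))
```

This trades the temporary byte string for a per-character loop, so it only
pays off when memory, not speed, is the real constraint.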