A few questions about encoding
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Thu Jun 13 03:11:08 EDT 2013
On Thu, 13 Jun 2013 09:09:19 +0300, Νικόλαος Κούρας wrote:
> On 13/6/2013 3:13 πμ, Steven D'Aprano wrote:
>> Open an interactive Python session, and run this code:
>>
>> c = ord(16474)
>> len(c.encode('utf-8'))
>>
>>
>> That will tell you how many bytes are used for that example.
> This is actually wrong.
>
> ord()'s argument must be a character whose ordinal value we want to
> look up.
Gah!
That's twice I've screwed that up. Sorry about that!
> >>> chr(16474)
> '䁚'
>
> Some Chinese symbol.
> So code-point '䁚' has a Unicode ordinal value of 16474, correct?
Correct.
> whereas encoding this glyph's ordinal value to binary gives us
> the following bytes:
>
> >>> bin(16474).encode('utf-8')
> b'0b100000001011010'
No! That creates a string from 16474 in base two:
'0b100000001011010'
The leading 0b is just syntax to tell you "this is base 2, not base 8
(0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.
Then you encode the string '0b100000001011010' into UTF-8. There are 17
characters in this string, and they are all ASCII characters so they take
up 1 byte each, giving you bytes b'0b100000001011010' (in ASCII form). In
hex form, they are:
b'\x30\x62\x31\x30\x30\x30\x30\x30\x30\x30\x31\x30\x31\x31\x30\x31\x30'
which takes up a lot more room, which is why Python prefers to show ASCII
characters as characters rather than as hex.
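A quick sketch you can paste into the interpreter to check those numbers
(the names s and encoded are only for illustration):

```python
s = bin(16474)                # a str of digit characters, not a number
encoded = s.encode('utf-8')   # one byte per ASCII character

print(s)                      # the text form, including the '0b' prefix
print(len(s), len(encoded))   # characters in the str, bytes after encoding
print(encoded.hex())          # the same bytes written out in hex
```

Both lengths come out the same because every character in that string is
ASCII, and UTF-8 encodes each ASCII character as a single byte.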
What you want is:
chr(16474).encode('utf-8')
[...]
> Thus, we count 15 bits there.
> So it says 15 bits, which is 1 bit less than 2 bytes. Are the above
> statements correct please?
No. There are 17 BYTES there. The string "0" doesn't get turned into a
single bit. It still takes up a full byte, 0x30, which is 8 bits.
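A small sketch of the difference between the character "0" and a zero
bit (text and data are just illustrative names):

```python
text = "0"                    # the digit character, not the number zero
data = text.encode('utf-8')

print(data)                   # the encoded bytes object
print(len(data))              # number of bytes used
print(len(data) * 8)          # number of bits used
print(hex(ord(text)))         # the byte value that represents "0"
```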
> but thinking this through more and more:
>
> >>> chr(16474).encode('utf-8')
> b'\xe4\x81\x9a'
> >>> len(b'\xe4\x81\x9a')
> 3
>
> it seems that the bytestring the encode process produces is of length 3.
Correct! Now you have got the right idea.
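If you are curious where those three bytes come from, here is a rough
sketch that builds them by hand. It assumes the code point falls in the
range U+0800 to U+FFFF, where UTF-8 uses the three-byte layout
1110xxxx 10xxxxxx 10xxxxxx; the names cp, b1, b2, b3 and manual are only
for illustration:

```python
cp = 16474                              # assumed to be in U+0800..U+FFFF

b1 = 0b11100000 | (cp >> 12)            # 1110xxxx: top 4 bits
b2 = 0b10000000 | ((cp >> 6) & 0x3F)    # 10xxxxxx: middle 6 bits
b3 = 0b10000000 | (cp & 0x3F)           # 10xxxxxx: low 6 bits

manual = bytes([b1, b2, b3])
print(manual)
print(manual == chr(cp).encode('utf-8'))  # compare with Python's encoder
```

So the 15 significant bits of 16474 are spread across three bytes, with
the remaining bits of each byte used as UTF-8 marker bits.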
--
Steven