A few questions about encoding
Νικόλαος Κούρας
support at superhost.gr
Thu Jun 13 03:42:40 EDT 2013
On 13/6/2013 10:11 am, Steven D'Aprano wrote:
>> >>> chr(16474)
>> '䁚'
>>
>> Some Chinese symbol.
>> So code-point '䁚' has a Unicode ordinal value of 16474, correct?
>
> Correct.
>
>
>> where encoding this glyph's ordinal value to binary gives us
>> the following bytes:
>>
>> >>> bin(16474).encode('utf-8')
>> b'0b100000001011010'
One observation here, which I would like you to confirm as valid.
1. A code-point and the code-point's ordinal value are associated in the
Unicode charset. They have a so-called 1:1 mapping.
So, I was under the impression that encoding the code-point into
UTF-8 was the same as encoding the code-point's ordinal value into UTF-8.
That is why I tried:
bin(16474).encode('utf-8') instead of chr(16474).encode('utf-8')
So, now I believe they are two different things.
The code-point *is what actually* needs to be encoded and *not* its
ordinal value.
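
To make the difference between the two calls concrete, here is a small sketch, assuming Python 3 (where str is Unicode text). The variable names are only for illustration and do not come from the thread.

```python
# Assumes Python 3. Names below are illustrative only.
code_point = 16474

# Encodes the single character U+405A: three UTF-8 bytes.
as_character = chr(code_point).encode('utf-8')

# Encodes the 17-character text '0b100000001011010': one byte per character.
as_digits = bin(code_point).encode('utf-8')

print(as_character)                        # b'\xe4\x81\x9a'
print(as_digits)                           # b'0b100000001011010'
print(len(as_character), len(as_digits))   # 3 17
```

The first result is the UTF-8 form of the glyph; the second is just the digits of the number written out as text.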
> The leading 0b is just syntax to tell you "this is base 2, not base 8
> (0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.
But bytes objects are displayed with '\x' rather than the aforementioned
'0x'. Why is that?
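
As far as I can tell, '\x' is an escape sequence used inside a string or bytes literal to stand for one byte, while '0x' is the prefix of an integer literal written in base 16. A short sketch of the two notations side by side, assuming Python 3 (names are illustrative):

```python
# Assumes Python 3. '\xNN' appears inside bytes/str literals;
# '0xNN' is how an integer literal is written in base 16.
data = b'\xe4\x81\x9a'

# Iterating over a bytes object yields plain integers.
values = list(data)
print(values)                      # [228, 129, 154]

# hex() turns each integer into text with the '0x' prefix.
hex_texts = [hex(b) for b in data]
print(hex_texts)                   # ['0xe4', '0x81', '0x9a']

# The same bytes built from integer literals written with 0x.
rebuilt = bytes([0xe4, 0x81, 0x9a])
print(rebuilt == data)             # True
```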
> No! That creates a string from 16474 in base two:
> '0b100000001011010'
I disagree here.
16474 is a number in base 10. By doing bin(16474) we get the binary
representation of the number 16474 and not a string.
Why do you say we receive a string when Python presents a binary number?
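
One way to check this point directly in the interpreter is to ask for the type of the result. A minimal sketch, assuming Python 3:

```python
# Assumes Python 3.
text = bin(16474)
print(type(text))            # <class 'str'>
print(repr(text))            # '0b100000001011010'  (note the quotes)

# Written without quotes, the same digits are an integer literal.
number = 0b100000001011010
print(type(number), number)  # <class 'int'> 16474
```

The quotes in the interpreter output are the visible sign that bin() hands back text rather than a number.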
> Then you encode the string '0b100000001011010' into UTF-8. There are 17
> characters in this string, and they are all ASCII characters so they take
> up 1 byte each, giving you bytes b'0b100000001011010' (in ASCII form).
To me, 0b100000001011010 stands for a number in base 2, not a string.
Have I understood something wrong?
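
A small sketch, assuming Python 3, of what the encoded result contains byte by byte, and of how the text can be turned back into the number (names are illustrative):

```python
# Assumes Python 3.
encoded = bin(16474).encode('utf-8')
print(len(encoded))            # 17

# The first three bytes are the ASCII codes of '0', 'b' and '1'.
first_three = list(encoded[:3])
print(first_three)             # [48, 98, 49]

# Decoding gives the text back; int() with base 2 parses it as a number.
recovered = int(encoded.decode('ascii'), 2)
print(recovered)               # 16474
```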