A few questiosn about encoding

Νικόλαος Κούρας support at superhost.gr
Thu Jun 13 02:09:19 EDT 2013


On 13/6/2013 3:13 πμ, Steven D'Aprano wrote:
> On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:
>
>> So, how many bytes does UTF-8 stored for codepoints > 127 ?
>
> Two, three or four, depending on the codepoint.

The amount of bytes needed by UTF-8 to store a code-point(character), 
depends on the ordinal value of the code-point in the Unicode charset, 
correct?

If this is correct then the higher the ordinal value(which is an decimal 
integer) in the Unicode charset the more bytes needed for storage.

Its like the bigger a decimal integer is the bigger binary number it 
produces.

Is this correct?


>> example for codepoint 256, 1345, 16474 ?
>
> You can do this yourself. I have already given you enough information in
> previous emails to answer this question on your own, but here it is again:
>
> Open an interactive Python session, and run this code:
>
> c = ord(16474)
> len(c.encode('utf-8'))
>
>
> That will tell you how many bytes are used for that example.
This si actually wrong.

ord()'s arguments must be a character for which we expect its ordinal value.

 >>> chr(16474)
'䁚'

Some Chinese symbol.
So code-point '䁚' has a Unicode ordinal value of 16474, correct?

where in after encoding this glyph's ordinal value to binary gives us 
the following bytes:

 >>> bin(16474).encode('utf-8')
b'0b100000001011010'

Now, we take tow symbols out:

'b' symbolism which is there to tell us that we are looking a bytes 
object as well as the
'0b' symbolism which is there to tell us that we are looking a binary 
representation of a bytes object

Thus, there we count 15 bits left.
So it says 15 bits, which is 1-bit less that 2 bytes.
Is the above statements correct please?


but thinking this through more and more:

 >>> chr(16474).encode('utf-8')
b'\xe4\x81\x9a'
 >>> len(b'\xe4\x81\x9a')
3

it seems that the bytestring the encode process produces is of length 3.

So i take it is 3 bytes?

but there is a mismatch of what >>> bin(16474).encode('utf-8') and >>> 
chr(16474).encode('utf-8') is telling us here.

Care to explain that too please ?









More information about the Python-list mailing list