A few questions about encoding
Νικόλαος Κούρας
support at superhost.gr
Thu Jun 13 02:21:28 EDT 2013
On 12/6/2013 11:30 PM, Nobody wrote:
> On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:
>
>> So, how many bytes does UTF-8 store for code points > 127?
>
> U+0000..U+007F 1 byte
> U+0080..U+07FF 2 bytes
> U+0800..U+FFFF 3 bytes
> >=U+10000 4 bytes
'U' stands for a Unicode code point, which means a character, right?
How can you tell up to which character UTF-8 needs 1 byte, 2 bytes,
or 3?
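
One way to check the quoted boundaries empirically is to encode the first and last code point of each range and count the bytes. This is only a sketch, assuming Python 3; `utf8_len` is a made-up helper name, not part of any library.

```python
# Sketch (Python 3 assumed): count the UTF-8 bytes at each range boundary.
# utf8_len is a hypothetical helper name chosen for this example.
def utf8_len(codepoint):
    """Return the number of bytes in the UTF-8 encoding of a code point."""
    return len(chr(codepoint).encode("utf-8"))

# First and last code point of each range from the quoted table.
for cp in (0x0000, 0x007F, 0x0080, 0x07FF, 0x0800, 0xFFFF, 0x10000, 0x10FFFF):
    print("U+%04X -> %d byte(s)" % (cp, utf8_len(cp)))
```

The byte count changes exactly at U+0080, U+0800 and U+10000.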
And some of the bits in each byte are used to tell where a code point's
representation stops, right? I mean, if we have a code point that needs
2 bytes to be stored, then the high bit must be set to 1 to signify that
this character's encoding stops at 2 bytes.
I just know that 2^8 = 256, so at first look a byte gives 256 places, which
means 256 positions to hold a code point, which in turn means a character.
We take the high bit out and then we have 2^7 = 128, which is enough positions
for the 0-127 standard ASCII range. The high bit is set to '0' to signify that
the char is encoded in 1 byte.
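
The 1-byte case can be tested directly. The following is a minimal sketch, assuming Python 3; `is_single_byte` is a hypothetical name used only here. It checks that a character encodes to one byte and that this byte's high bit (0x80) is clear.

```python
# Sketch (Python 3 assumed): a character is "1-byte" in UTF-8 when its
# encoding has length 1 and the high bit (0x80) of that byte is 0.
# is_single_byte is a hypothetical helper name for this example.
def is_single_byte(ch):
    data = ch.encode("utf-8")
    return len(data) == 1 and (data[0] & 0x80) == 0

# All 2^7 = 128 ASCII code points should pass the check.
ascii_ok = all(is_single_byte(chr(cp)) for cp in range(128))
print(ascii_ok)
# U+0080 is the first code point that no longer fits in 7 bits.
print(is_single_byte(chr(0x80)))
```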
Please tell me whether I have understood correctly so far.
But how about 2, 3, or 4 bytes?
Am I saying it correctly?
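
The 2-, 3- and 4-byte cases can be inspected by printing the bits of every encoded byte. This is a sketch only, assuming Python 3; `utf8_bits` is a made-up helper name and the four sample characters are arbitrary picks, one per length.

```python
# Sketch (Python 3 assumed): show each UTF-8 byte of a character in binary.
# utf8_bits is a hypothetical helper name for this example.
def utf8_bits(ch):
    return " ".join(format(b, "08b") for b in ch.encode("utf-8"))

# One arbitrary sample per encoded length: U+0041, U+00E9, U+20AC, U+10348.
for ch in ("A", "\u00e9", "\u20ac", "\U00010348"):
    print("U+%04X  %s" % (ord(ch), utf8_bits(ch)))
```

In the output, the first byte starts with 0 for a 1-byte sequence, 110 for 2 bytes, 1110 for 3 bytes and 11110 for 4 bytes, and every following byte starts with 10. So the length is announced by the leading bits of the first byte, not by a single high bit.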