A few questions about encoding

Νικόλαος Κούρας support at superhost.gr
Wed Jun 12 05:09:05 EDT 2013


>> (*) in fact UTF-8 also indicates the end of each character

> Up to a point.  The initial byte encodes the length and the top few
> bits, but the subsequent octets aren’t distinguishable as final in
> isolation.  0x80-0xBF can all be either medial or final.
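A minimal sketch of the point quoted above, assuming Python 3 (where iterating over bytes yields integers). The helper names classify and describe are made up for illustration. It labels each byte of an encoded string by its high bits only:

```python
def classify(byte):
    """Label one UTF-8 byte by looking only at its high bits."""
    if byte < 0x80:
        return "single (0xxxxxxx)"
    if byte < 0xC0:
        return "continuation (10xxxxxx)"
    if byte < 0xE0:
        return "lead of 2 (110xxxxx)"
    if byte < 0xF0:
        return "lead of 3 (1110xxxx)"
    # Valid UTF-8 never contains 0xF8-0xFF; str.encode will not produce them.
    return "lead of 4 (11110xxx)"


def describe(text):
    """Return (hex value, label) for every byte of text encoded as UTF-8."""
    return [(hex(b), classify(b)) for b in text.encode("utf-8")]


for item in describe("a€"):
    print(item)
```

For "€" the lead byte 0xe2 announces a 3-byte sequence, while 0x82 and 0xac both get the same "continuation" label, so neither one shows on its own whether it is the last byte.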


So, the high bits of the first byte are a marker that UTF-8 uses to tell
how many bytes each character is encoded in.

Do code points 0-127 (characters) use 1 bit to signal that they need
1 byte of storage, and the remaining 7 bits to actually store the
character?

And do code points 128-256 (characters) use 2 bits to signal that they
need 2 bytes of storage, and the remaining 14 bits to actually store the
character?

Isn't 14 bits way too many to store a character?
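
One way to check these numbers rather than guess is to print the actual bits. This is a small sketch assuming Python 3; the helper name utf8_bits is made up for illustration. In the UTF-8 layout, a 2-byte sequence has the form 110xxxxx 10xxxxxx, so the markers take 5 bits and 11 bits are left for the code point:

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of one character as 8-bit binary strings."""
    return [format(b, "08b") for b in ch.encode("utf-8")]


# One sample each for the 1-byte, 2-byte and 3-byte forms.
for ch in ("A", "é", "€"):
    print("U+%04X" % ord(ch), utf8_bits(ch))
```

This prints one byte for "A" (U+0041), two for "é" (U+00E9) and three for "€" (U+20AC). Every byte after the first starts with 10, which is why fewer bits are available for the character than the byte count alone suggests.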


