Trying to set a cookie within a python script

MRAB python at mrabarnett.plus.com
Tue Aug 3 15:04:07 EDT 2010


Dave Angel wrote:
> ¯º¿Â wrote:
>>> On 3 Aug, 18:41, Dave Angel <da... at ieee.org> wrote:
>>>    
>>>> Different encodings equal different ways of storing the data to the
>>>> media, correct?
>>>>       
>>> Exactly. The file is a stream of bytes, and Unicode has more than 256
>>> possible characters. Further, even the subset of characters that *do*
>>> take one byte are different for different encodings. So you need to tell
>>> the editor what encoding you want to use.
>>>     
>>
>> For example, is an 'a' char in iso-8859-1 stored differently than an
>> 'a' char in iso-8859-7 or an 'a' char in utf-8?
>>
>>
>>   
> Nope, the ASCII subset is identical. It's the codes between 0x80 and 
> 0xff that differ, and of course not all of those. Further, some of the 
> codes that are one byte in the ISO-8859 encodings are two bytes in 
> utf-8.
> 
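A quick sketch in Python 3 shows both halves of that: the ASCII subset
encodes to the same single byte in all three encodings, while a Greek
letter gets different byte sequences:

```python
# 'a' is in the ASCII subset, so all three encodings produce
# the same single byte.
print('a'.encode('iso-8859-1'))  # b'a'
print('a'.encode('iso-8859-7'))  # b'a'
print('a'.encode('utf-8'))       # b'a'

# A Greek alpha is one byte in iso-8859-7 but two bytes in utf-8.
print('\u03b1'.encode('iso-8859-7'))  # b'\xe1'
print('\u03b1'.encode('utf-8'))       # b'\xce\xb1'
```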
> You *could* just decide that you're going to hardwire the assumption 
> that you'll be dealing with a single character set that does fit in 8 
> bits, and most of this complexity goes away. But if you do that, do 
> *NOT* use utf-8.
> 
> But if you do want to be able to handle more than 256 characters, or 
> more than one encoding, read on.
> 
> Many people confuse encoding and decoding. A Unicode character is an 
> abstraction which represents a raw character, independent of how it's 
> stored. For convenience, the first 128 code points map directly onto 
> the 7-bit encoding called ASCII. But before Unicode there were several 
> incompatible extensions of ASCII to 256 characters. For example, a 
> byte that was a European accented character in one such encoding might 
> be a katakana character in another. Each encoding was 8 bits, but it 
> was difficult for a single program to handle more than one of them at 
> a time.
> 
One encoding might be ASCII + accented Latin, another ASCII + Greek,
another ASCII + Cyrillic, etc. If you wanted ASCII + accented Latin +
Greek then you'd need more than 1 byte per character.
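You can see that in Python 3 by decoding the same byte under different
legacy code pages (iso-8859-5 here standing in for the Cyrillic case):

```python
# The same byte means a different character in each code page.
raw = b'\xe1'
print(raw.decode('iso-8859-1'))  # Latin small a with acute
print(raw.decode('iso-8859-7'))  # Greek small alpha
print(raw.decode('iso-8859-5'))  # Cyrillic small es
```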

If you're working with multiple alphabets it gets very messy, which is
where Unicode comes in. It contains all those characters, and UTF-8 can
encode all of them in a straightforward manner.
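A small sketch of that: a string mixing accented Latin and Greek can't
fit in either single-byte code page, but utf-8 handles it without fuss:

```python
text = 'caf\xe9 \u03b1\u03b2\u03b3'  # 'café αβγ'

print(text.encode('utf-8'))  # works

for enc in ('iso-8859-1', 'iso-8859-7'):
    try:
        text.encode(enc)
    except UnicodeEncodeError as e:
        # Latin-1 has no Greek letters; the Greek set has no 'é'.
        print(enc, 'failed:', e)
```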

> So along comes unicode, which is typically implemented in 16 or 32 bit 
> cells. And it has an 8 bit encoding called utf-8 which uses one byte for 
> the first 192 characters (I think), and two bytes for some more, and 
> three bytes beyond that.
> 
[snip]
In UTF-8 the first 128 codepoints are encoded to 1 byte each; codepoints
up to U+07FF take 2 bytes, those up to U+FFFF take 3 bytes, and the rest
(up to U+10FFFF) take 4 bytes.
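A quick check of those boundaries in Python 3:

```python
# Byte length of the utf-8 encoding for a code point in each range:
# up to U+007F -> 1, up to U+07FF -> 2, up to U+FFFF -> 3, rest -> 4.
for ch in ('A', '\xe9', '\u20ac', '\U0001d11e'):
    print('U+%04X -> %d byte(s)' % (ord(ch), len(ch.encode('utf-8'))))
```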



More information about the Python-list mailing list