Trying to set a cookie within a python script

Dave Angel davea at ieee.org
Tue Aug 3 16:41:54 EDT 2010



MRAB wrote:
> Dave Angel wrote:
>> ¯º¿Â wrote:
>>>> On 3 Aug, 18:41, Dave Angel <da... at ieee.org> wrote:
>>>>> Different encodings equal different ways of storing the data to the
>>>>> media, correct?
>>>> Exactly. The file is a stream of bytes, and Unicode has more than 256
>>>> possible characters. Further, even the subset of characters that *do*
>>>> take one byte are different for different encodings. So you need to 
>>>> tell
>>>> the editor what encoding you want to use.
>>>
>>> For example an 'a' char in iso-8859-1 is stored differently than an 'a'
>>> char in iso-8859-7 and an 'a' char in utf-8?
>>>
>>>
>> Nope, the ASCII subset is identical. It's the ones between 0x80 and
>> 0xff that differ, and of course not all of those. Further, some of the
>> codes that are one byte in 8859 are two bytes in utf-8.
>>
>> You *could* just decide that you're going to hardwire the assumption 
>> that you'll be dealing with a single character set that does fit in 8 
>> bits, and most of this complexity goes away. But if you do that, do 
>> *NOT* use utf-8.
>>
>> But if you do want to be able to handle more than 256 characters, or 
>> more than one encoding, read on.
>>
>> Many people confuse encoding and decoding. A unicode character is an 
>> abstraction which represents a raw character. For convenience, the 
>> first 128 code points map directly onto the 7 bit encoding called 
>> ASCII. But before Unicode there were several other extensions to 256, 
>> which were incompatible with each other. For example, a byte which 
>> might be a European character in one such encoding might be a 
>> kata-kana character in another one. Each encoding was 8 bits, but it 
>> was difficult for a single program to handle more than one such 
>> encoding.
>>
> One encoding might be ASCII + accented Latin, another ASCII + Greek,
> another ASCII + Cyrillic, etc. If you wanted ASCII + accented Latin +
> Greek then you'd need more than 1 byte per character.
>
> If you're working with multiple alphabets it gets very messy, which is
> where Unicode comes in. It contains all those characters, and UTF-8 can
> encode all of them in a straightforward manner.
>
>> So along comes unicode, which is typically implemented in 16 or 32 
>> bit cells. And it has an 8 bit encoding called utf-8 which uses one 
>> byte for the first 192 characters (I think), and two bytes for some 
>> more, and three bytes beyond that.
>>
> [snip]
> In UTF-8 the first 128 codepoints are encoded to 1 byte.
>
>
Thanks for the correction. As I said, I wasn't sure. I wrote a utf-8
encoder and decoder about a dozen years ago, and I remember that parts of
it use the top two bits specially. But I've checked now, and you're right,
the cutoff is 0x7f.
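
For anyone who wants to check these points at the interpreter, here is a
minimal sketch. It assumes Python 3, where str is Unicode and encode()
returns bytes. The sample byte 0xe1 and the helper names utf8_len and
top_two_bits are my own choices for illustration, not anything from the
original script.

```python
# Illustrative sketch only: assumes Python 3; sample values are arbitrary.

# The ASCII subset is stored identically in all three encodings.
for enc in ("iso-8859-1", "iso-8859-7", "utf-8"):
    assert "a".encode(enc) == b"\x61"

# The same byte above 0x7f decodes differently in different 8859 variants.
raw = b"\xe1"
latin = raw.decode("iso-8859-1")   # LATIN SMALL LETTER A WITH ACUTE
greek = raw.decode("iso-8859-7")   # GREEK SMALL LETTER ALPHA

# Each was one byte in its 8859 encoding, but takes two bytes in UTF-8.
latin_utf8 = latin.encode("utf-8")
greek_utf8 = greek.encode("utf-8")


def utf8_len(code_point):
    """Return how many bytes UTF-8 needs for one code point (hypothetical helper)."""
    return len(chr(code_point).encode("utf-8"))


# Byte counts on either side of each UTF-8 length boundary.
lengths = {cp: utf8_len(cp)
           for cp in (0x7f, 0x80, 0x7ff, 0x800, 0xffff, 0x10000)}


def top_two_bits(byte):
    """Return the two most significant bits of a byte value (hypothetical helper)."""
    return byte >> 6


# In a multi-byte sequence the lead byte starts with bits 11 and every
# continuation byte starts with bits 10.
lead, continuation = (top_two_bits(b) for b in greek_utf8)

print(lengths)
print(latin_utf8, greek_utf8, bin(lead), bin(continuation))
```

The lengths dictionary shows the one-byte cutoff at 0x7f, and also that
code points from 0x10000 upward need four bytes.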

DaveA



