Changing filenames from Greeklish => Greek (subprocess complain)
Andreas Perstinger
andipersti at gmail.com
Mon Jun 10 04:15:38 EDT 2013
On 10.06.2013 09:10, nagia.retsina at gmail.com wrote:
> On Sunday, 9 June 2013 3:31:44 PM UTC+3, Steven D'Aprano wrote:
>
>> py> c = 'α'
>> py> ord(c)
>> 945
>
> The number 945 is the ordinal value of the character 'α' in the Unicode charset, correct?
Yes, the Unicode character set is essentially a big numbered list of
characters. The entry at position 945 (counting from 0) happens to be 'α'.
> The command in the python interactive session to show me how many bytes
> this character will take upon encoding to utf-8 is:
>
>>>> s = 'α'
>>>> s.encode('utf-8')
> b'\xce\xb1'
>
> I see that the encoding of this char takes 2 bytes. But why two exactly?
That's how the encoding is designed. Haven't you read the Wikipedia
article which was already mentioned several times?
> How do i calculate how many bits are needed to store this char into bytes?
You need to understand how UTF-8 works. Read the Wikipedia article.
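As a rough sketch of the rule that article describes: the number of bytes
UTF-8 uses depends only on the range the code point falls into. The helper
name utf8_length is made up for this illustration; it is not part of Python.

```python
def utf8_length(ch):
    """Return how many bytes UTF-8 needs for a single character (hypothetical helper)."""
    cp = ord(ch)
    if cp < 0x80:       # fits in 7 bits: 1 byte
        return 1
    if cp < 0x800:      # fits in 11 bits: 2 bytes
        return 2
    if cp < 0x10000:    # fits in 16 bits: 3 bytes
        return 3
    return 4            # everything up to U+10FFFF: 4 bytes

for ch in ('a', 'α', '€'):
    # Compare the rule with what the encoder actually produces.
    print('U+%04X' % ord(ch), utf8_length(ch), len(ch.encode('utf-8')))
```

'α' has the ordinal 945, which needs 10 bits, so it no longer fits into one
byte but does fit into the two-byte range.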
> Trying to do the same here but it gave me no bytes back.
>
>>>> s = 'a'
>>>> s.encode('utf-8')
> b'a'
The encode method returns a bytes object. Its length will tell you how
many bytes there are:
>>> len(b'a')
1
>>> len(b'\xce\xb1')
2
The Python interpreter displays byte values below 128 as ASCII
characters if they are printable:
>>> ord(b'a')
97
>>> hex(97)
'0x61'
>>> b'\x61' == b'a'
True
The Python designers have decided to display b'a' instead of b'\x61'.
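A small sketch of that display rule, using four arbitrarily chosen byte
values: only the printable ASCII byte shows up as a character, the others
are shown as escape sequences.

```python
# 0x61 is printable ASCII, 0xce and 0xb1 are not ASCII, 0x0a is a newline.
data = bytes([0x61, 0xce, 0xb1, 0x0a])

print(data)        # the repr mixes characters and escape sequences
print(list(data))  # the same four bytes as plain integers
```

Whatever way they are displayed, the object still holds four bytes.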
>>py> c.encode('utf-8')
>> b'\xce\xb1'
>
> 2 bytes here. why 2?
Same as your first question.
>> py> c.encode('utf-16be')
>> b'\x03\xb1'
>
> 2 bytes here also. but why 3 different bytes? the ordinal value of
> char 'a' is the same in unicode. the encoding system just takes the
> ordinal value and encodes it, but since it uses 2 bytes should these 2 bytes
> be the same?
'utf-16be' is a different encoding scheme, thus it uses other rules to
determine how each character is translated into a byte sequence.
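To make the difference concrete, here is a minimal sketch that rebuilds both
byte sequences for 'α' by hand. It only covers this one case (a code point
in the two-byte UTF-8 range); the variable names are chosen just for this
example.

```python
c = 'α'
cp = ord(c)  # 945, i.e. 0x3b1

# UTF-16BE stores a character like this one directly as a
# 16-bit number, high byte first.
utf16 = c.encode('utf-16be')
assert utf16 == cp.to_bytes(2, 'big')

# UTF-8 spreads the same bits over the pattern 110xxxxx 10xxxxxx.
utf8 = c.encode('utf-8')
manual = bytes([0b11000000 | (cp >> 6), 0b10000000 | (cp & 0b111111)])
assert utf8 == manual

print(utf16, utf8)
```

Both encodings carry the same ordinal value, they just pack its bits into
the bytes in different ways.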
>> py> c.encode('iso-8859-7')
>> b'\xe1'
>
> And also does '\x' mean that the value is being represented in hex?
> and when i bin(65) i see '0b1000001'
>
> I should expect to see 8 bits of 1s and 0s. what is the 'b' trying to say?
>
'\x' is an escape sequence and means that the following two characters
should be interpreted as a number in hexadecimal notation (see also the
table of allowed escape sequences:
http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
).
'0b' tells you that the number is printed in binary notation.
Leading zeros are usually discarded when a number is printed:
>>> bin(70)
'0b1000110'
>>> 0b1000110 == 0b01000110
True
>>> 0b1000110 == 0b0000000001000110
True
It's the same with decimal notation. You wouldn't say 00123 is different
from 123, would you?
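If you do want to see all 8 bits, one way is to ask format() for a fixed
width with zero padding. A short sketch, using 65 as the example value:

```python
n = 65

plain = bin(n)                 # leading zeros dropped
padded = format(n, '08b')      # 8 binary digits, no prefix
prefixed = format(n, '#010b')  # '0b' prefix plus 8 digits (width 10 in total)

print(plain, padded, prefixed)
```

The value is the same in all three cases, only the way it is printed differs.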
Bye, Andreas