Changing filenames from Greeklish => Greek (subprocess complain)

Νικόλαος Κούρας nikos.gr33k at gmail.com
Mon Jun 10 10:27:04 EDT 2013


On Monday, 10 June 2013 2:59:03 PM UTC+3, Steven D'Aprano wrote:
> On Mon, 10 Jun 2013 00:10:38 -0700, nagia.retsina wrote:
>
> > On Sunday, 9 June 2013 3:31:44 PM UTC+3, Steven D'Aprano wrote:
> >
> >> py> c = 'α'
> >> py> ord(c)
> >> 945
> >
> > The number 945 is the ordinal value of the character 'α' in the Unicode
> > charset, correct?
>
> Correct.
>
> > The command in the Python interactive session to show me how many bytes
> > this character will take upon encoding to utf-8 is:
> >
> > >>> s = 'α'
> > >>> s.encode('utf-8')
> > b'\xce\xb1'
> >
> > I see that the encoding of this char takes 2 bytes. But why two exactly?
>
> Because that's how UTF-8 works. If it was a different encoding, it might
> be 4 bytes, or 2, or 1, or 101, or 7, or 3. But it is UTF-8, so it takes
> 2 bytes. If you want to understand how UTF-8 works, look it up on
> Wikipedia.
>
> > How do I calculate how many bits are needed to store this char into
> > bytes?
>
> Every byte is made of 8 bits. There are two bytes. So multiply 8 by 2.
>
> > Trying to do the same here but it gave me no bytes back.
> >
> > >>> s = 'a'
> > >>> s.encode('utf-8')
> > b'a'
>
> There is a byte there. The byte is printed by Python as b'a', which in my
> opinion is a design mistake. That makes it look like a string, but it is
> not a string, and would be better printed as b'\x61'. But regardless of
> the display, it is still a single byte.


Perhaps for bytes in the ASCII range (0 to 127) Python thinks it is better for a human to read the character representation of the stored byte instead of the hex escape. Just a guess.
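
A quick way to test that guess in the interpreter is sketched below. The variable name `data` and the sample byte values are my own picks for illustration. As far as I can tell, only the printable ASCII bytes are shown as characters, while control bytes such as 0x00 and 0x7f still come out as hex escapes, so the guess looks only partly right.

```python
# Encode a plain ASCII letter and inspect the result.
data = 'a'.encode('utf-8')

print(data)             # b'a'  (repr shows the printable character)
print(len(data))        # 1     (still exactly one byte)
print(data[0])          # 97    (indexing bytes gives the integer value)
print(data == b'\x61')  # True  (same byte, written as a hex escape)

# Sample bytes: printable 'A', DEL (0x7f), NUL (0x00), and a non-ASCII byte.
mixed = bytes([0x41, 0x7f, 0x00, 0xce])
print(mixed)            # b'A\x7f\x00\xce'
```

So b'a' and b'\x61' are the same single byte, only displayed differently.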

> Just like 0o1234 uses octal, "o" for Octal.
> And 0x123EF uses hexadecimal. "x" for heXadecimal.

Why the leading zero before octal's 'o', hex's 'x' and binary's 'b'?
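
One likely reason, which I can at least check in the interpreter: without the leading digit, something like x123EF is a perfectly valid variable name, so the parser could not tell it apart from a number. The sketch below uses only built-ins, and the sample values are the ones quoted above plus a binary one of my own.

```python
# The three prefixed literal forms all produce ordinary ints.
print(0o1234)    # 668    (octal)
print(0x123EF)   # 74735  (hexadecimal)
print(0b1010)    # 10     (binary)

# Without the leading zero the text would be a legal identifier,
# so it could not be recognised as a numeric literal.
print('x123EF'.isidentifier())    # True
print('0x123EF'.isidentifier())   # False

# int() with an explicit base parses the digits without any prefix.
print(int('123EF', 16))           # 74735
```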


I am not going to tire you any more, because I have exhausted myself for two days now trying to get my head around this.

Please confirm I have understood correctly:

I did, but the docs confuse me even more. Can you please put it simply?

Unicode, as I understand it, was created out of the need for a bigger character set than ASCII, one able to hold all the world's symbols. ASCII could hold only 128 characters (codes 0 to 127), and the extended versions of it up to 256.

ASCII and Unicode are character sets.

Everything else seems to be an encoding system that works upon those character sets.
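
As a concrete case of an encoding working on a character set, here is my attempt at building the two UTF-8 bytes of 'α' by hand from its code point 945. The function name `utf8_two_byte` is made up for this sketch, and it only handles the two-byte range (U+0080 to U+07FF), not UTF-8 in general.

```python
def utf8_two_byte(code_point):
    """Encode a code point in U+0080..U+07FF by hand (two-byte UTF-8 form)."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("this sketch only handles the two-byte range")
    # First byte: marker bits 110 followed by the top 5 bits of the code point.
    first = 0b11000000 | (code_point >> 6)
    # Second byte: marker bits 10 followed by the low 6 bits.
    second = 0b10000000 | (code_point & 0b111111)
    return bytes([first, second])


alpha = 'α'
by_hand = utf8_two_byte(ord(alpha))

print(ord(alpha))                        # 945
print(by_hand)                           # b'\xce\xb1'
print(by_hand == alpha.encode('utf-8'))  # True
print(len(by_hand) * 8)                  # 16 bits, i.e. 8 bits times 2 bytes
```

The code point 945 needs more than 7 bits, so it does not fit the one-byte form, and that seems to be why UTF-8 spends two bytes on it.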

If what I said is true, the last thing that still confuses me is that iso-8859-7 (256 chars) seems like a character set and an encoding method too.
Can it be both, or is iso-8859-7 an encoding method of the Unicode character set, in the same way that UTF-8 is also an encoding method of Unicode?
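
One way to probe this in the interpreter, as a rough sketch: in Python 3 a str holds Unicode code points, and both 'iso-8859-7' and 'utf-8' can be passed to encode() to turn the same code point into different bytes. The names `alpha` and `encoded` are my own, and 'utf-16-le' is added only for comparison.

```python
alpha = 'α'
print(ord(alpha))  # 945, the same code point whichever encoding is used later

# Encode the same character with three different encodings.
encoded = {name: alpha.encode(name)
           for name in ('iso-8859-7', 'utf-8', 'utf-16-le')}
for name, data in encoded.items():
    print(name, data, len(data))
# iso-8859-7 b'\xe1' 1
# utf-8 b'\xce\xb1' 2
# utf-16-le b'\xb1\x03' 2

# The byte 0xe1 has no meaning by itself: the decoder chosen decides
# which character it becomes.
print(b'\xe1'.decode('iso-8859-7') == alpha)  # True
print(b'\xe1'.decode('latin-1') == 'á')       # True
```

From Python's side, at least, iso-8859-7 behaves like one more encoding of Unicode text, one that can only represent the 256-entry repertoire it was designed for.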


