Changing filenames from Greeklish => Greek (subprocess complain)
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Fri Jun 7 11:33:31 EDT 2013
On Fri, 07 Jun 2013 04:53:42 -0700, Νικόλαος Κούρας wrote:
> Do you mean that utf-8, latin-iso, greek-iso and ASCII have the 1st
> 0-127 codepoints similar?
You can answer this yourself. Open a terminal window and start a Python
interactive session. Then try it and see what happens:
s = ''.join(chr(i) for i in range(128))
bytes_as_utf8 = s.encode('utf-8')
bytes_as_latin1 = s.encode('latin-1')
bytes_as_greek_iso = s.encode('ISO-8859-7')
bytes_as_ascii = s.encode('ascii')
bytes_as_utf8 == bytes_as_latin1 == bytes_as_greek_iso == bytes_as_ascii
What result do you get? True or False?
And now you know the answer, without having to ask.
> For example char 'a' has the value of '65' for all of those character
> sets? Is that what you mean?
You can answer that question yourself.
c = 'a'
for encoding in ('utf-8', 'latin-1', 'ISO-8859-7', 'ascii'):
print(c.encode(encoding))
By the way, I believe that Python has made a strategic mistake in the way
that bytes are printed. I think it leads to more confusion, not less.
Better would be something like this:
c = 'a'
for encoding in ('utf-8', 'latin-1', 'ISO-8859-7', 'ascii'):
print(hex(c.encode(encoding)[0]))
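(With the first loop you will see b'a' printed four times, which hides the
fact that a byte is just a number; with the hex() version you will see 0x61
four times, which makes the numeric value obvious.)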
For historical reasons, most (but not all) charsets are supersets of
ASCII. That is, the first 128 characters in the charset are the same as
the 128 characters in ASCII.
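If you want a counterexample, try one of the EBCDIC codecs that ship with
Python, such as cp500, which does not put 'a' at byte 0x61:
py> 'a'.encode('ascii')
b'a'
py> 'a'.encode('cp500')  # EBCDIC: not a superset of ASCII
b'\x81'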
> s = 'a' (This is unicode right? Why when we assign a string to a
> variable that string's type is always unicode
Strings in Python 3 are Unicode strings. That's just the way Python
works. Unicode was chosen because it includes over a million possible
characters (well, potentially over a million; most of them are currently
unassigned), and its character repertoire is a strict superset of *all*
the common legacy codepages from the old DOS and Windows 95 days.
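For example (the characters here are arbitrary picks), a single Python 3
string can mix characters that would once have needed several different
codepages, because the string is just a sequence of Unicode code points:
py> s = 'aλж€'   # Latin, Greek, Cyrillic and the euro sign in one string
py> len(s)
4
py> [hex(ord(c)) for c in s]
['0x61', '0x3bb', '0x436', '0x20ac']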
> and does not automatically
> become utf-8 which includes all available world-wide characters? Unicode
> is something different than a character set?)
Unicode is a character set. It is an enormous set of over one million
characters (technically "code points", but don't worry about the
difference right now) which can be collected into strings.
UTF-8 is an encoding: a way of converting a string of Unicode characters
into bytes, and back again. Sometimes people are lazy and say "UTF-8"
when they mean "Unicode", or vice versa.
UTF-16 and UTF-32 are two different encodings for the same purpose, but
for various technical reasons UTF-8 is better for files.
'λ' is a character which exists in some charsets but not others. It is
not in the ASCII charset, nor is it in Latin-1, nor Big-5. It is in the
ISO-8859-7 charset, and of course it is in Unicode.
In ISO-8859-7, the character 'λ' is stored as byte 0xEB (decimal 235),
just as the character 'a' is stored as byte 0x61 (decimal 97).
In UTF-8, the character λ is stored as two bytes 0xCE 0xBB.
In UTF-16 (big-endian), the character λ is stored as two bytes 0x03 0xBB.
In UTF-32 (big-endian), the character λ is stored as four bytes 0x00 0x00
0x03 0xBB.
That's four different ways of "spelling" the same character as bytes,
just as "three", "trois", "drei", "τρία", "três" are all different ways
of spelling the same number 3.
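You can check those four spellings for yourself at the interactive prompt:
py> 'λ'.encode('ISO-8859-7')
b'\xeb'
py> 'λ'.encode('utf-8')
b'\xce\xbb'
py> 'λ'.encode('utf-16be')
b'\x03\xbb'
py> 'λ'.encode('utf-32be')
b'\x00\x00\x03\xbb'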
> utf8_byte = s.encode('utf-8')
>
> Now if we are to decode this back to utf8 we will receive the char 'a'.
> I believe the same thing will happen with latin, greek, ascii isos. Correct?
Why don't you try it for yourself and see?
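Here is one way to run that experiment; the loop just round-trips the
character through each encoding and prints what comes back:
s = 'a'
for encoding in ('utf-8', 'latin-1', 'ISO-8859-7', 'ascii'):
    # encode to bytes, then decode those bytes with the same encoding
    print(encoding, s.encode(encoding).decode(encoding))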
> The characters that will not decode correctly are those whose
> codepoints are greater than 127?
Maybe, maybe not. It depends on which codepoint, and which encodings.
Some encodings use the same bytes for the same characters; some use
different bytes. It all depends on the encoding, just as American and
British English both spell 3 "three", while French spells it "trois".
> for example if s = 'α' (greek character equivalent to english 'a')
In Latin-1, 'α' does not exist:
py> 'α'.encode('latin-1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u03b1' in
position 0: ordinal not in range(256)
In the legacy Greek charset ISO-8859-7, 'α' is stored as byte 0xE1:
py> 'α'.encode('ISO-8859-7')
b'\xe1'
But in the legacy *Russian* charset ISO-8859-5, the byte 0xE1 means
a completely different character, CYRILLIC SMALL LETTER ES:
py> b'\xE1'.decode('ISO-8859-5')
'с'
(Don't be fooled just because it looks like the English letter 'c'; it is
not the same character.)
In Unicode, 'α' is always codepoint 0x3B1 (decimal 945):
py> ord('α')
945
but before you can store that on a disk, or as a file name, it needs to
be converted to bytes, and which bytes you get depends on which encoding
you use:
py> 'α'.encode('utf-8')
b'\xce\xb1'
py> 'α'.encode('utf-16be')
b'\x03\xb1'
py> 'α'.encode('utf-32be')
b'\x00\x00\x03\xb1'
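And if you decode bytes using the *wrong* encoding, you often won't get an
error at all; you will just silently get the wrong characters. For example,
decoding UTF-8 bytes as if they were Latin-1:
py> 'α'.encode('utf-8').decode('latin-1')
'Î±'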
--
Steven