Changing filenames from Greeklish => Greek (subprocess complain)
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Sun Jun 9 08:31:44 EDT 2013
On Sun, 09 Jun 2013 02:00:46 -0700, Νικόλαος Κούρας wrote:
> Steven wrote:
>>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
>>> values up to 256?
>
>>Because then how do you tell when you need one byte, and when you need
>>two? If you read two bytes, and see 0x4C 0xFA, does that mean two
>>characters, with ordinal values 0x4C and 0xFA, or one character with
>>ordinal value 0x4CFA?
>
> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I
> meant up to 256, not above 256.
Think about it. Draw up a big table of one million plus characters:
Ordinal   Character
========  ====================
0         NUL control code
1         SOH control code
...
84        LATIN CAPITAL LETTER T
85        LATIN CAPITAL LETTER U
...
255       LATIN SMALL LETTER Y WITH DIAERESIS
256       LATIN CAPITAL LETTER A WITH MACRON
...
8485      OUNCE SIGN
and so on, all the way to 1114111. Now, suppose you read a file, and see
two bytes, shown in decimal: 84 followed by 85, or in hexadecimal, 0x54
followed by 0x55.
How do you tell whether that means two characters, T followed by U, or a
single character, ℥ (OUNCE SIGN)?
With UTF-32, you can, because every value takes exactly the same space.
So a T followed by a U is:
0x00000054
0x00000055
while a single ℥ is:
0x00002125
and it is easy to tell them apart: each block of 4 bytes is exactly one
character. But notice how many NUL bytes there are? In the three
characters shown, there are eight NUL bytes. Most text will be filled
with NUL bytes, which is very wasteful.
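These counts can be checked in Python 3. This is only a small sketch; the
variable names are made up for illustration:

```python
# Encode T, U and the OUNCE SIGN (U+2125) as big-endian UTF-32.
text = 'TU\u2125'
data = text.encode('utf-32-be')  # the explicit -be codec adds no byte order mark

# Every code point occupies exactly four bytes.
chunks = [data[i:i + 4] for i in range(0, len(data), 4)]
for ch, chunk in zip(text, chunks):
    print('U+%04X' % ord(ch), chunk.hex())

total_bytes = len(data)    # 3 characters * 4 bytes each
nul_bytes = data.count(0)  # bytes that are pure padding
print(total_bytes, 'bytes, of which', nul_bytes, 'are NUL')
```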
UTF-8 is designed to be compact, and also to be backwards-compatible with
ASCII. Characters which are in ASCII will be a single byte, so there are
no null bytes used for padding (except for NUL itself, of course). So
the three characters TU℥ will be:
0x54
0x55
0xE2
0x84
0xA5
Five bytes in total, instead of 12 for UTF-32. The only tricky part is
that the character with ordinal value 0xE2 (decimal 226, â) cannot be
encoded as the single byte 0xE2, otherwise you would mistake the three
bytes 0xE2 0x84 0xA5 for 'â' followed by two more characters. And
indeed, 'â' is encoded as two bytes:
0xC3
0xA2
Likewise, the character with ordinal value 0xC3 (decimal 195, Ã) is also
encoded as two bytes:
0xC3
0x83
And so on. This way, there is never any confusion as to whether (say)
three bytes are three one-byte characters, or one three-byte character.
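Here is a rough sketch of how a reader of the byte stream can tell the
cases apart, just from the value of each byte. The helper name classify
is invented for this example, and it does not reject the few byte values
that never appear in valid UTF-8:

```python
text = 'TU\u2125'               # T, U, OUNCE SIGN
encoded = text.encode('utf-8')  # b'TU\xe2\x84\xa5'

def classify(byte):
    """Describe the role of one UTF-8 byte, judging only by its high bits."""
    if byte < 0x80:
        return 'single-byte character (ASCII)'   # 0xxxxxxx
    if byte < 0xC0:
        return 'continuation byte'                # 10xxxxxx
    if byte < 0xE0:
        return 'lead byte of a 2-byte sequence'   # 110xxxxx
    if byte < 0xF0:
        return 'lead byte of a 3-byte sequence'   # 1110xxxx
    return 'lead byte of a 4-byte sequence'       # 11110xxx

for b in encoded:
    print(hex(b), classify(b))

# The two characters discussed above each need two bytes.
a_circumflex = '\u00e2'.encode('utf-8')
a_tilde = '\u00c3'.encode('utf-8')
print(a_circumflex, a_tilde)
```

Because 0xE2 is always a lead byte and never a complete character, the
sequence 0xE2 0x84 0xA5 can only be read one way.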
>>> UTF-8 and UTF-16 and UTF-32
>>> I though the number beside of UTF- was to declare how many bits the
>>> character set was using to store a character into the hdd, no?
>
>>Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
>>UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
>>values to make a surrogate pair.
>
> A surrogate pair is like hitting for example Ctrl-A, which means it is a
> combination character that consists of 2 different characters? Is this
> what a surrogate is? a pair of 2 chars?
Yes, a surrogate pair is a pair of two "characters". But they're not
*real* characters. They don't exist in any human language. They are just
values that tell the program "these go together, and count as a single
character".
(This is why Unicode prefers to talk about *code points* rather than
characters. Some code points are characters, and some are not.)
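A small sketch of what a surrogate pair looks like in practice. The code
point U+1F600 is picked here purely as an example of something above
U+FFFF; it is not from the discussion above:

```python
import struct

ch = '\U0001F600'              # one code point, too large for 16 bits
data = ch.encode('utf-16-be')  # four bytes, i.e. two 16-bit values

# Split the four bytes into two big-endian unsigned 16-bit units.
units = struct.unpack('>2H', data)
print([hex(u) for u in units])

# The same pair computed by hand: subtract 0x10000, then split the
# remaining 20 bits into a high half and a low half of 10 bits each.
offset = ord(ch) - 0x10000
high = 0xD800 + (offset >> 10)
low = 0xDC00 + (offset & 0x3FF)
print(hex(high), hex(low))
```

Neither 0xD83D nor 0xDE00 is a character on its own; only the pair
together stands for the code point.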
>>UTF-8 uses 8-bit values, but sometimes it combines two, three or four of
>>them to represent a single code-point.
>
> 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
Correct, 'a' needs one byte (although its ordinal is 97; 65 is 'A').
> 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is >
> 127 )
That looks like two characters to me, 'α' followed by '΄'. That will take
4 bytes, two for 'α' and two for '΄'.
> 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be stored
> ? (since ordinal > 65000 )
Not necessarily four bytes. Could be three. Depends on the ideogram.
> The amount of bytes needed to store a character solely depends on the
> character's ordinal value in the Unicode table?
Yes.
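As a sketch, the rule can be written as a function of the ordinal alone.
The name utf8_length is made up for this example, and the function
assumes a valid code point (it does not check for surrogates or values
above 0x10FFFF):

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for a code point, based on its range."""
    if code_point < 0x80:       # up to 7 bits: ASCII
        return 1
    if code_point < 0x800:      # up to 11 bits
        return 2
    if code_point < 0x10000:    # up to 16 bits
        return 3
    return 4                    # everything up to 0x10FFFF

# Compare against Python's own encoder for a few sample characters:
# A, Greek alpha, OUNCE SIGN, and two CJK ideographs (one in the first
# 65536 code points, one beyond them).
samples = ['A', '\u03b1', '\u2125', '\u4e2d', '\U00020000']
for ch in samples:
    print('U+%04X' % ord(ch), utf8_length(ord(ch)), len(ch.encode('utf-8')))
```

The last two samples show why an ideogram may need either three or four
bytes, depending on where it sits in the table.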
>>UTF-8 solves this problem by reserving some values to mean "this byte,
>>on its own", and others to mean "this byte, plus the next byte,
>>together", and so forth, up to four bytes.
>
> Some of the utf-8 bits that are used to represent a character's ordinal
> value are actually also being used to separate or join the ordinal values
> themselves? Can you give an example please? How are they being
> separated?
Did you look up UTF-8 on Wikipedia like I suggested?
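In short, the high bits of every byte are markers and the remaining bits
carry the ordinal value. A hand-rolled sketch of the bit layout follows.
The function utf8_encode is hypothetical, written only for illustration;
in real code you would just call str.encode('utf-8'), and this version
does not validate its input:

```python
def utf8_encode(code_point):
    """Pack a code point into UTF-8 bytes by hand (no validation)."""
    if code_point < 0x80:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# Show the marker bits and payload bits for A, Greek alpha and OUNCE SIGN.
for ch in 'A\u03b1\u2125':
    encoded = utf8_encode(ord(ch))
    print('U+%04X' % ord(ch), ' '.join(format(b, '08b') for b in encoded))
```

In the output, a byte starting with 0 stands alone, a byte starting with
110, 1110 or 11110 opens a sequence of two, three or four bytes, and
every byte starting with 10 continues the sequence before it.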
>>Computers are digital and work with numbers.
>
> So character 'A' <-> 65 (in decimal uses in charset's table) <->
> 01011100 (as binary stored in disk) <-> 0xEF (as hex, when we open the
> file with a hex editor)
>
> Is this how the thing works? (above values are fictional)
You can check this in Python:
py> c = 'A'
py> ord(c)
65
py> bin(65)
'0b1000001'
py> hex(65)
'0x41'
py> c = 'α'
py> ord(c)
945
py> c.encode('utf-8')
b'\xce\xb1'
py> c.encode('utf-16be')
b'\x03\xb1'
py> c.encode('utf-32be')
b'\x00\x00\x03\xb1'
py> c.encode('iso-8859-7')
b'\xe1'
--
Steven