Changing filenames from Greeklish => Greek (subprocess complain)
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Sun Jun 9 02:25:10 EDT 2013
On Sun, 09 Jun 2013 07:46:40 +0300, Νικόλαος Κούρας wrote:
> Why does every character in a character set needs to be associated with
> a numeric value?
Because computers are digital, not analog, and because bytes are numbers.
Here are a few of the 256 possible bytes, written in binary, decimal and
hexadecimal:
0b00000000 0 0x00
0b00000001 1 0x01
0b00000010 2 0x02
[...]
0b01111111 127 0x7F
0b10000000 128 0x80
[...]
0b11111110 254 0xFE
0b11111111 255 0xFF
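A minimal Python 3 sketch that prints the same table. The helper name describe_byte is made up for this illustration:

```python
def describe_byte(n):
    """Return one table row (binary, decimal, hex) for a byte value."""
    if not 0 <= n <= 255:
        raise ValueError("a byte must be between 0 and 255")
    # 08b = eight binary digits, zero padded; 02X = two uppercase hex digits
    return "0b{:08b} {:d} 0x{:02X}".format(n, n, n)

for value in (0, 1, 2, 127, 128, 254, 255):
    print(describe_byte(value))
```

All three columns are the same number, just written in different notations.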
EVERYTHING in a computer is a number, because everything is stored as
bytes. Text is stored as bytes. Sound files are stored as bytes. Images
are stored as bytes. Programs are stored as bytes. So everything is being
stored as numbers. But the *meaning* we give to those numbers depends on
what we do with them, whether we treat them as characters, bitmapped
images, floating point values, or something else.
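To make that concrete, here is a small sketch (Python 3) that takes the same four bytes and reads them four different ways. The choice of b"ABCD" and of little-endian 32-bit formats is arbitrary, picked only for the illustration:

```python
import struct

raw = b"ABCD"  # four bytes with the values 65, 66, 67, 68

as_numbers = list(raw)                 # the raw byte values
as_text = raw.decode("ascii")          # treated as ASCII characters
as_int = struct.unpack("<I", raw)[0]   # one little-endian unsigned 32-bit int
as_float = struct.unpack("<f", raw)[0] # one little-endian 32-bit float

print(as_numbers)
print(as_text)
print(hex(as_int))
print(as_float)
```

Nothing about the bytes changes between those lines. Only the interpretation does.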
Once we decide we want to store the character "A" as bytes, we need to
decide which number it should be. That is the job of the charset.
ASCII:
65 <--> 'A'
66 <--> 'B'
67 <--> 'C'
etc.
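In Python 3 you can see this mapping for yourself with the built-in ord() and chr() functions. A quick sketch (the variable name pairs is just for this example):

```python
# ord() gives the number for a character, chr() goes the other way.
pairs = [(ord(ch), ch) for ch in "ABC"]
for number, ch in pairs:
    print(number, "<-->", repr(ch))

# Encoding to ASCII stores exactly that number as a single byte.
encoded = "A".encode("ascii")
print(list(encoded))
```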
> I mean couldn't we just have characters sets that wouldn't have numeric
> associations like:
>
> 'A' => encoding process(i.e. utf-8) => bytes bytes => decoding
> process(i.e. utf-8) => character 'A'
No. How would you store it in a computer's memory, or on a hard drive? By
carving a tiny, microscopic "A" onto the hard drive? How would you read
it back?
It is theoretically possible to build an analog computer, out of
clockwork, or water flowing through pipes, or something, but nobody
really bothers because it is much harder and not very useful.
> An ordinal = ordered numbers like 7, 8, 9, 10 and so on?
Yes.
> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
> values up to 256?
Because then how do you tell when you need one byte, and when you need
two? If you read two bytes, and see 0x4C 0xFA, does that mean two
characters, with ordinal values 0x4C and 0xFA, or one character with
ordinal value 0x4CFA?
UTF-8 solves this problem by reserving some values to mean "this byte, on
its own", and others to mean "this byte, plus the next byte, together",
and so forth, up to four bytes.
If you look up UTF-8 on Wikipedia, you will see more about this.
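As a rough sketch of how that works, the function below looks at the first byte of a UTF-8 sequence and reports how many bytes belong to the character. The function name is invented for this example, and it is deliberately simplified: a real decoder also rejects a few lead bytes (such as 0xC0, 0xC1 and 0xF5 and above) that this one accepts.

```python
def utf8_sequence_length(lead_byte):
    """Guess the length of a UTF-8 sequence from its first byte (simplified)."""
    if lead_byte < 0x80:           # 0xxxxxxx: this byte, on its own
        return 1
    if 0xC0 <= lead_byte < 0xE0:   # 110xxxxx: this byte plus one more
        return 2
    if 0xE0 <= lead_byte < 0xF0:   # 1110xxxx: this byte plus two more
        return 3
    if 0xF0 <= lead_byte < 0xF8:   # 11110xxx: this byte plus three more
        return 4
    # 10xxxxxx is a continuation byte; 0xF8 and above never start a character
    raise ValueError("0x{:02X} cannot start a UTF-8 sequence".format(lead_byte))

# The code point with ordinal 0x4CFA is stored as three bytes...
three_bytes = chr(0x4CFA).encode("utf-8")
print([hex(b) for b in three_bytes], utf8_sequence_length(three_bytes[0]))

# ...while the raw bytes 0x4C 0xFA are not ambiguous, they are simply invalid.
try:
    b"\x4c\xfa".decode("utf-8")
except UnicodeDecodeError as err:
    print("not valid UTF-8:", err.reason)
```

So in UTF-8 the byte 0x4C always means "L" on its own, and the two readings from my example can never be confused.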
> UTF-8 and UTF-16 and UTF-32
> I thought the number beside of UTF- was to declare how many bits the
> character set was using to store a character into the hdd, no?
Not exactly, but close. UTF-32 always uses 32-bit (4-byte) values.
UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
values to make a surrogate pair. UTF-8 uses 8-bit values, but sometimes
it combines two, three or four of them to represent a single code-point.
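A quick way to check this in Python 3 is to encode a few characters and count the bytes. The sample characters are my own choice, and I use the "-le" variants of the codecs only so that Python does not prepend a byte order mark, which would inflate the counts:

```python
samples = ["A", "\u03bb", "\u20ac", "\U0001F600"]  # A, Greek lambda, euro sign, an emoji

sizes = {}
for ch in samples:
    sizes[ch] = (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )
    print("U+{:04X}: utf-8={} utf-16={} utf-32={} bytes".format(ord(ch), *sizes[ch]))
```

UTF-32 is always 4 bytes, UTF-16 is 2 bytes except for the last sample, which needs a surrogate pair, and UTF-8 ranges from 1 to 4.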
> > "Narrow" Unicode uses two bytes per character. Since two bytes is only
> > enough for about 65,000 characters, not 1,000,000+, the rest of the
> > characters are stored as pairs of two-byte "surrogates".
>
> Can you please explain this line "the rest of the characters are stored
> as pairs of two-byte "surrogates"" more easily for me to understand it?
> I'm still having trouble understanding what a surrogate is.
Look up UTF-16 and "surrogate pair" on Wikipedia.
But basically, there are 65,536 different possible 16-bit values
available for UTF-16 to use. Some of those values are reserved to mean
"this value is not a character, it is half of a surrogate pair". Since
they are *pairs*, they must always come in twos. A surrogate pair makes
up a valid character. Half of a surrogate pair, on its own, is an error.
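Here is a sketch of the arithmetic behind a surrogate pair, following the standard UTF-16 scheme. The function name to_surrogate_pair is made up for the example, and U+1F600 is just a convenient code point above 0xFFFF:

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into its two UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a surrogate pair")
    offset = code_point - 0x10000       # a 20-bit number
    high = 0xD800 + (offset >> 10)      # top 10 bits go in the first half
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go in the second half
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))

# Python's own UTF-16 encoder produces the same two 16-bit values.
print("\U0001F600".encode("utf-16-be"))

# Half of a pair followed by an ordinary character is an error.
try:
    b"\xd8\x3d\x00\x41".decode("utf-16-be")
except UnicodeDecodeError as err:
    print("error:", err.reason)
```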
A lot of this complexity is because of historical reasons. For example,
when Unicode was first invented, it had room for only about 65 thousand
characters, and a fixed 16 bits was all you needed. But it was soon learned
that 65 thousand was not enough (there are more than 65,000 Asian characters
alone!) and so UTF-16 developed the trick with surrogate pairs to cover
the extras.
[...]
> When locale to linux system is set to utf-8 that would mean that the
> linux applications, should try to encode string into hdd by using
> system's default encoding to utf-8 and read them back from bytes by
> also using utf-8. Is that correct?
Yes.
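A small sketch of that round trip in Python 3. What locale.getpreferredencoding() reports depends entirely on how your system is configured, so the example passes encoding="utf-8" explicitly rather than relying on it. The file name is hypothetical and lives in a temporary directory:

```python
import locale
import os
import tempfile

text = "Νικόλαος"

# What Python would use by default for text files on this system.
print("preferred encoding:", locale.getpreferredencoding(False))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "name.txt")
    with open(path, "w", encoding="utf-8") as f:   # characters -> bytes
        f.write(text)
    with open(path, "rb") as f:                    # look at the stored bytes
        raw = f.read()
    with open(path, encoding="utf-8") as f:        # bytes -> characters
        round_tripped = f.read()

print(len(text), "characters stored as", len(raw), "bytes")
print(round_tripped == text)
```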
--
Steven