[Tutor] close, but no cigar

Steven D'Aprano steve at pearwood.info
Tue Jul 23 16:46:38 CEST 2013


On 23/07/13 04:27, Jim Mooney wrote:
> Okay, I'm getting there, but this should be translating A umlaut to an old
> DOS box character, according to my ASCII table,


I understand what you mean, but I should point out that what you say is *literally impossible*, since neither Ä nor any box-drawing characters are part of ASCII. What you are saying is figuratively equivalent to this:

...should be translating Москва to モスクワ according to my Latin to French dictionary...

Even if the ancient Romans knew of the city of Moscow, they didn't write it in Cyrillic and you certainly can't get Japanese characters by translating it to French.

Remember that ASCII only has 128 characters, and *everything else* is non-ASCII, whether they are line-drawing characters, European accented letters, East Asian characters, emoticons, or ancient Egyptian. People who talk about "extended ASCII" are confused, and all you need to do to show up their confusion is to ask "which extended ASCII do you mean?" There are dozens.
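
A minimal sketch of that 128-character limit, assuming Python 3 (the helper name is_ascii is made up for illustration): Python's "ascii" codec refuses anything outside code points 0 to 127, so it can be used to check whether a character is really ASCII.

```python
def is_ascii(text):
    """Return True if every character in text is one of ASCII's 128 code points."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


# A plain letter, A umlaut, and a box-drawing character.
for ch in ("A", "\u00c4", "\u2500"):
    print(repr(ch), ord(ch), is_ascii(ch))
```

Only the first of the three passes the check; the other two are non-ASCII no matter which "extended ASCII" table lists them.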

For example, ordinal value 0xC4 (hex; 196 in decimal) has the following meanings, depending on which version of "extended ASCII" you use:

Ä LATIN CAPITAL LETTER A WITH DIAERESIS
 ִ HEBREW POINT HIRIQ
Δ GREEK CAPITAL LETTER DELTA
ؤ ARABIC LETTER WAW WITH HAMZA ABOVE
─ BOX DRAWINGS LIGHT HORIZONTAL
ƒ LATIN SMALL LETTER F WITH HOOK


using encodings Latin1, CP1255, ISO-8859-7, ISO-8859-6, IBM866, and MacRoman, in that order. And there are many others.
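
The list above can be reproduced with a short sketch, assuming Python 3 and its standard codec names for those six encodings. The names CODECS and decode_byte are hypothetical, chosen only for this example.

```python
import unicodedata

# Python codec names for the six encodings listed above, in the same order.
CODECS = ["latin-1", "cp1255", "iso8859-7", "iso8859-6", "cp866", "mac_roman"]


def decode_byte(value, codecs=CODECS):
    """Map each codec name to the character it assigns to one byte value."""
    raw = bytes([value])
    return {codec: raw.decode(codec) for codec in codecs}


for codec, char in decode_byte(0xC4).items():
    # Print the code point and official Unicode name rather than the glyph,
    # since not every terminal font can display all six characters.
    print("%-10s U+%04X %s" % (codec, ord(char), unicodedata.name(char)))
```

The same single byte goes in each time; only the codec changes, and with it the character that comes out.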

So the question is, if you have a file name with byte 196 in it, which character is intended? In isolation, you cannot possibly tell. As an English speaker, I've used at least four of the above six, although only three in file names. A single-byte encoding is limited to a mere 256 characters (128 of which are already locked down to the ASCII charset[1]), so no one such encoding can hold all of the above. To have them all at once, you need Unicode[2].
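
As a rough illustration, again assuming Python 3 (the variable names are invented for the example): in Unicode each of the six characters has its own code point, so one string can hold them all, whereas a byte written under one code page and read back under another silently becomes a different character.

```python
# The six characters from the list above, written as Unicode escapes.
SIX = "\u00c4\u05b4\u0394\u0624\u2500\u0192"

# Each character has a distinct code point, so they can coexist in one string.
code_points = [ord(c) for c in SIX]
print(["U+%04X" % cp for cp in code_points])

# UTF-8 stores them as multi-byte sequences and round-trips without loss.
encoded = SIX.encode("utf-8")
print(len(SIX), "characters ->", len(encoded), "bytes")

# By contrast, write A umlaut as a Latin1 byte, then read it back as IBM866.
written = "\u00c4".encode("latin-1")
misread = written.decode("cp866")
print(written, "read as cp866 gives", "U+%04X" % ord(misread))
```

The last step is the A-umlaut-turns-into-a-box-character effect: nothing was corrupted, the byte was simply interpreted under a different table.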

The old "code pages" technology is sheer chaos, and sadly we'll be living with it for years to come. But eventually, maybe in another 30 years or so, everyone will use Unicode all the time, except for specialist and legacy needs, and gradually we'll get past this nonsense of dozens of encodings and mojibake and other crap.





[1] Not all encodings are ASCII-compatible, but most of them are.

[2] Or something like it. In Japan, there is a proprietary charset called TRON which includes even more characters than Unicode. Both TRON and Unicode aim to include every character that humans have ever used, but they disagree as to what counts as a distinct character. In a nutshell, there are tens of thousands of characters which are written the same way in Chinese, Japanese and Korean, but used differently. Unicode's policy is that you can tell from context which is meant, and gives them a single code-point each, while TRON gives them three code-points. This is not quite as silly as saying that an English E, a German E and a French E should be considered three distinct characters, but (in my opinion) not far off it.


-- 
Steven

