An assessment of the Unicode standard

Tom Morris tom at
Thu Sep 17 15:58:29 CEST 2009

On 2009-09-15, r <rt8396 at> wrote:
> Are you telling us people using a language that does not have a word
> for window somehow cannot comprehend what a window is, are you mad
> man?  Words are simply text attributes attached to objects. the text
> attribute doesn't change the object in any way. just think of is
> __repr__

Err, no, it's a bit more complicated than that. Words map to material
objects, to concepts, abstracta, sets, relations, states of affairs, to
mental states, to different senses of the same object. What object is
the word "bachelor" attached to? And why is it that suddenly the label -
that's all it is, after all - stops being applicable after a person gets
married?

To use the classic example: "The Morning Star is the Evening Star." The
object is the same - Venus. But the senses in which the words are used
are different: you wouldn't say "that's the Evening Star!" in the
morning. If words are just dumb strings attached to objects, then
someone saying "The Morning Star is the Evening Star" is saying no more
than "a = a".

Your review of the Unicode standard is utterly naïve. The fact is that
even if we could wave a magic wand and ensure that everyone on
the planet spoke the same language - English, for the sake of argument -
that would not negate the need for using other character sets. A
historian wants to typeset a book on ancient Greek civilization, where
Greek characters are used alongside English characters. Here,
having a uniform character set for every character one might feasibly
want to use, from all known civilizations throughout history, so far as
it is practical to represent them, is superbly useful. Other
areas of life use their own symbols, many of which are present in the
Unicode specification including mathematical symbols, logic symbols,
musical notes, IPA phonetic symbols, currency symbols, chess and playing
card symbols, dingbats and much more. For basic typesetting, the Unicode
standard also contains a variety of spaces, dashes and other
typographical components which are not represented in Latin-1.
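
To make the Latin-1 point concrete, here is a rough Python 3 sketch. The
sample characters and the helper names fits_latin1 and describe are my own
picks for illustration, one character per category listed above; they are
not anything defined by the Unicode standard itself. It only checks whether
each character can be encoded as Latin-1 (ISO 8859-1) and looks up its
Unicode name.

```python
import unicodedata

# Hypothetical sample: one character per category mentioned in the text.
SAMPLES = [
    "\u03b1",  # Greek letter
    "\u2200",  # logic symbol
    "\u266a",  # musical note
    "\u0259",  # IPA phonetic symbol
    "\u20ac",  # currency symbol
    "\u265e",  # chess piece
    "\u2702",  # dingbat
    "\u2013",  # dash
    "\u2009",  # space
]


def fits_latin1(ch):
    """Return True if the character can be encoded in Latin-1."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


def describe(ch):
    """Return the code point and the official Unicode character name."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))


for ch in SAMPLES:
    print(describe(ch), "- fits in Latin-1:", fits_latin1(ch))
```

Every sample sits above U+00FF, so none of them should encode as Latin-1,
whereas something like "\u00e9" (e with an acute accent) does.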

The fact is that every language with characters in the Unicode standard
generally has a large body of literature behind it - not necessarily
literature like Shakespeare, but things which tell the story of a
culture. How would you digitise those for search and study? Without the
characters to represent those languages, you could say that it would be
ideal to just translate them into the global language. Great. Do you
trust the translators to do the job once and forever? Take any ancient
text which still has relevance today for religion or culture or
philosophy, and you'll find that anyone who *really* wants to understand
it goes back to the original text in the original language. I'd really
love to have some excellent language-to-language compilers that could,
say, turn Ruby into Python into Java into C and vice versa. And do so
reliably. Where are they? Show me perfect machine translation and then
we can maybe stop bothering about other languages.


Tom Morris
