Newbie question about text encoding
Rustom Mody
rustompmody at gmail.com
Tue Mar 3 23:16:02 EST 2015
On Wednesday, March 4, 2015 at 9:35:28 AM UTC+5:30, Rustom Mody wrote:
> On Wednesday, March 4, 2015 at 8:24:40 AM UTC+5:30, Steven D'Aprano wrote:
> > Rustom Mody wrote:
> >
> > > On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:
> > >> On 2/26/2015 8:24 AM, Chris Angelico wrote:
> > >> > On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote:
> > >> >> Wrote something up on why we should stop using ASCII:
> > >> >> http://blog.languager.org/2015/02/universal-unicode.html
> > >>
> > >> I think that the main point of the post, that many Unicode chars are
> > >> truly planetary rather than just national/regional, is excellent.
> > >
> > > <snipped>
> > >
> > >> You should add emoticons, but not call them or the above 'gibberish'.
> > >> I think that this part of your post is more 'unprofessional' than the
> > >> character blocks. It is very jarring and seems contrary to your main
> > >> point.
> > >
> > > Ok Done
> > >
> > > References to gibberish removed from
> > > http://blog.languager.org/2015/02/universal-unicode.html
> >
> > I consider it unethical to make semantic changes to a published work in
> > place without acknowledgement. Fixing minor typos or spelling errors, or
> > dead links, is okay. But any edit that changes the meaning should be
> > commented on, either by an explicit note on the page itself, or by striking
> > out the previous content and inserting the new.
>
> Dunno what you are grumping about…
>
> Anyway the attribution is made more explicit – footnote 5 in
> http://blog.languager.org/2015/03/whimsical-unicode.html.
>
> Note that Terry Reedy, who mainly objected, was already acked earlier.
> I've just added one more ack¹.
> And JFTR, the 'publication' (O how archaic!) is the whole blog, not a single page, just as it is for any other dead-tree publication.
>
> >
> > As for the content of the essay, it is currently rather unfocused.
>
> True.
>
> > It
> > appears to be more of a list of "here are some Unicode characters I think
> > are interesting, divided into subgroups, oh and here are some I personally
> > don't have any use for, which makes them silly" than any sort of discussion
> > about the universality of Unicode. That makes it rather idiosyncratic and
> > parochial. Why should obscure maths symbols be given more importance than
> > obscure historical languages?
>
> Idiosyncratic ≠ parochial
>
>
> >
> > I think that the universality of Unicode could be explained in a single
> > sentence:
> >
> > "It is the aim of Unicode to be the one character set anyone needs to
> > represent every character, ideogram or symbol (but not necessarily distinct
> > glyph) from any existing or historical human language."
> >
> > I can expand on that, but in a nutshell that is it.
> >
> >
> > You state:
> >
> > "APL and Z Notation are two notable languages APL is a programming language
> > and Z a specification language that did not tie themselves down to a
> > restricted charset ..."
>
> Tsk tsk – dishonest snipping. I wrote:
>
> | APL and Z Notation are two notable languages APL is a programming language
> | and Z a specification language that did not tie themselves down to a
> | restricted charset even in the day that ASCII ruled.
>
> so it's clear that 'restricted' applies to ASCII.
> >
> > You list ideographs such as Cuneiform under "Icons". They are not icons.
> > They are a mixture of symbols used for consonants, syllables, and
> > logophonetic, consonantal alphabetic and syllabic signs. That sits them
> > firmly in the same categories as modern languages with consonants, ideogram
> > languages like Chinese, and syllabary languages like Cheyenne.
>
> OK, changed to iconic.
> Obviously, 2-3 millennia ago, when people wrote in hieroglyphs or cuneiform, those were living languages.
> In 2015, when someone sees them and recognizes them, they are 'those things that
> Sumerians/Egyptians wrote'. No one except a rare expert knows those languages.
>
> >
> > Just because native readers of Cuneiform are all dead doesn't make Cuneiform
> > unimportant. There are probably more people who need to write Cuneiform
> > than people who need to write APL source code.
> >
> > You make a comment:
> >
> > "To me – a unicode-layman – it looks unprofessional… Billions of computing
> > devices world over, each having billions of storage words having their
> > storage wasted on blocks such as these??"
> >
> > But that is nonsense, and it contradicts your earlier quoting of Dave Angel.
> > Why are you so worried about an (illusionary) minor optimization?
>
> 2 bytes < 4 bytes as far as I am concerned.
> [If you disagree, one man's illusionary is another's waking.]
>
> >
> > Whether code points are allocated or not doesn't affect how much space they
> > take up. There are millions of unused Unicode code points today. If they
> > are allocated tomorrow, the space your documents take up will not increase
> > one byte.
> >
> > Allocating code points to Cuneiform has not increased the space needed by
> > Unicode at all. Two bytes alone is not enough for even existing human
> > languages (thanks China). For hardware related reasons, it is faster and
> > more efficient to use four bytes than three, so the obvious and "dumb" (in
> > the simplest thing which will work) way to store Unicode is UTF-32, which
> > takes a full four bytes per code point, regardless of whether there are
> > 65537 code points or 1114112. That makes it less expensive than floating
> > point numbers, which take eight. Would you like to argue that floating
> > point doubles are "unprofessional" and wasteful?
> >
> > As Dave pointed out, and you apparently agreed with him enough to quote him
> > TWICE (once in each of two blog posts), history of computing is full of
> > premature optimizations for space. (In fact, some of these may have been
> > justified by the technical limitations of the day.) Technically Unicode is
> > also limited, but it is limited to over one million code points, 1114112 to
> > be exact, although some of them are reserved as invalid for technical
> > reasons, and there is no indication that we'll ever run out of space in
> > Unicode.
> >
> > In practice, there are three common Unicode encodings that nearly all
> > Unicode documents will use.
> >
> > * UTF-8 will use between one and (by memory) four bytes per code
> > point. For Western European languages, that will be mostly one
> > or two bytes per character.
> >
> > * UTF-16 uses a fixed two bytes per code point in the Basic Multilingual
> > Plane, which is enough for nearly all Western European writing and
> > much East Asian writing as well. For the rest, it uses a fixed four
> > bytes per code point.
> >
> > * UTF-32 uses a fixed four bytes per code point. Hardly anyone uses
> > this as a storage format.
> >
> >
> > In *all three cases*, the existence of hieroglyphs and cuneiform in Unicode
> > doesn't change the space used. If you actually include a few hieroglyphs to
> > your document, the space increases only by the actual space used by those
> > hieroglyphs: four bytes per hieroglyph. At no time does the existence of a
> > single hieroglyph in your document force you to expand the non-hieroglyph
> > characters to use more space.
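
A quick Python check makes this concrete. (The hieroglyph and the sample string below are arbitrary choices of mine, and I use the BOM-less UTF-16/32 variants so the counts aren't muddied by a 2- or 4-byte signature.)

    plain = "Hello, world"
    mixed = plain + "\U00013000"   # EGYPTIAN HIEROGLYPH A001, an astral-plane character

    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        print(enc, len(plain.encode(enc)), len(mixed.encode(enc)))

    # utf-8      12 -> 16   (the hieroglyph alone costs 4 bytes)
    # utf-16-le  24 -> 28   (surrogate pair, again 4 extra bytes)
    # utf-32-le  48 -> 52   (4 bytes, same as every other character)

The pre-existing ASCII text costs exactly the same before and after the hieroglyph is added, in all three encodings.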
> >
> >
> > > What I was trying to say expanded here
> > > http://blog.languager.org/2015/03/whimsical-unicode.html
> >
> > You have at least two broken links, referring to a non-existent page:
> >
> > http://blog.languager.org/2015/03/unicode-universal-or-whimsical.html
>
> Thanks, corrected.
>
> >
> > This essay seems to be even more rambling and unfocused than the first. What
> > does the cost of semi-conductor plants have to do with whether or not
> > programmers support Unicode in their applications?
> >
> > Your point about the UTF-8 "BOM" is valid only if you interpret it as a Byte
> > Order Mark. But if you interpret it as an explicit UTF-8 signature or mark,
> > it isn't so silly. If your text begins with the UTF-8 mark, treat it as
> > UTF-8. It's no more silly than any other heuristic, like HTML encoding tags
> > or text editor's encoding cookies.
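
FWIW, that "signature" reading is exactly what Python's 'utf-8-sig' codec implements. A minimal sketch, with made-up sample bytes:

    import codecs

    raw = codecs.BOM_UTF8 + "café".encode("utf-8")   # bytes as they might arrive from a file

    print(repr(raw.decode("utf-8-sig")))   # 'café'        - signature recognized and stripped
    print(repr(raw.decode("utf-8")))       # '\ufeffcafé'  - signature leaks into the text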
> >
> > Your discussion of "complexifiers and simplifiers" doesn't seem to be
> > terribly relevant, or at least if it is relevant, you don't give any reason
> > for it. The whole thing about Moore's Law and the cost of semi-conductor
> > plants seems irrelevant to Unicode except in the most over-generalised
> > sense of "things are bigger today than in the past, we've gone from
> > five-bit Baudot codes to 23 bit Unicode". Yeah, okay. So what's your point?
>
> - Most people need only 16 bits.
> - Many notable examples of software fail going from 16 to 23.
> - If you are a software writer and you fail going from 16 to 23, it's OK, but try to
> give useful errors.
Uh… 21.
That's what makes 3 chars per 64-bit word a possibility.
A possibility that can become realistic if/when Intel decides to add 'packed-unicode' string instructions.
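
No such instructions exist today, of course, so the following is only a back-of-the-envelope Python sketch of the packing arithmetic (3 × 21 = 63 bits, which fits in one 64-bit word); the function names are mine:

    # Pack three 21-bit code points into one 64-bit word, and get them back out.
    def pack3(a, b, c):
        assert all(0 <= cp <= 0x10FFFF for cp in (a, b, c))
        return a | (b << 21) | (c << 42)          # 3 * 21 = 63 bits <= 64

    def unpack3(word):
        mask = (1 << 21) - 1
        return word & mask, (word >> 21) & mask, (word >> 42) & mask

    w = pack3(ord('a'), ord('λ'), 0x13000)        # ASCII, Greek, and a hieroglyph
    print(hex(w), [hex(cp) for cp in unpack3(w)])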