Glyphs and graphemes [was Re: Cult-like behaviour]

Richard Damon Richard at Damon-family.org
Mon Jul 16 19:02:36 EDT 2018


> On Jul 16, 2018, at 3:28 PM, Terry Reedy <tjreedy at udel.edu> wrote:
> 
>> On 7/16/2018 1:11 PM, Richard Damon wrote:
>> 
>> Many consider that UTF-32 is a variable-width encoding because of the combining characters. It can take multiple ‘codepoints’ to define what should be a single ‘character’ for display.
> 
> I hope you realize that this is not the standard meaning of 'variable-width encoding', which is 'variable number of bytes for a codepoint'.  UTF-16 and UTF-8 are variable width.  If one expands the definition enough, Ascii is 'variable width' because 'fi' is two bytes, or more realistically, because <= and >= are two bytes instead of one (as they can be in Unicode!).
> 
> If one is using a broader definition than usual, it is clearer to say so.
> 
> -- 
> Terry Jan Reedy
> 

You are defining a variable/fixed-width codepoint set. Many others want to deal with CHARACTER sets. The Unicode Consortium agrees that a code point is not necessarily a character (which is one reason they came up with the term). When actually trying to do work with text strings, the fact that some codepoints are combining codes that need to ‘stick’ to their mate becomes important. One of the claimed advantages of fixed-width character set encodings is that you aren’t supposed to need to worry about where you break a string in two, but that doesn’t work in Unicode: you need to make sure you aren’t breaking a combining sequence.
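
To make the splitting problem concrete, here is a rough Python sketch. The helper name safe_split is made up for illustration, and it only checks the canonical combining class through unicodedata.combining(), so it is not full grapheme-cluster segmentation (it will not keep ZWJ emoji sequences or variation selectors together, for instance).

```python
import unicodedata

def safe_split(text, index):
    """Hypothetical helper: move the split point left while the character
    at the split point is a combining mark (nonzero combining class), so
    the mark stays attached to its base character."""
    while 0 < index < len(text) and unicodedata.combining(text[index]) != 0:
        index -= 1
    return text[:index], text[index:]

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: six code points in total
s = "cafe\u0301s"

# Naive slicing by code point index separates the accent from its 'e'
naive = (s[:4], s[4:])

# The adjusted split backs up so the right half starts at the base 'e'
safe = safe_split(s, 4)
```

The naive split leaves the right-hand piece starting with an orphaned combining accent, even though every code point has the same width in UTF-32.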

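Even worse, rendering a substring of Unicode text correctly can require looking back arbitrarily far (up to the start of the paragraph), because Unicode uses shift-like control codes, such as the bidirectional embedding and override characters, for things like left-to-right/right-to-left rendering control.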
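
A simplified Python sketch of that look-back, under loud assumptions: the function name open_controls is hypothetical, and it tracks only the four embedding/override characters and U+202C POP DIRECTIONAL FORMATTING. The real bidirectional algorithm also has isolates, paragraph boundaries and depth limits, none of which are modelled here.

```python
# Explicit embedding/override characters that "open" a directional state
OPENERS = {
    "\u202A": "LRE",  # LEFT-TO-RIGHT EMBEDDING
    "\u202B": "RLE",  # RIGHT-TO-LEFT EMBEDDING
    "\u202D": "LRO",  # LEFT-TO-RIGHT OVERRIDE
    "\u202E": "RLO",  # RIGHT-TO-LEFT OVERRIDE
}
POP = "\u202C"  # POP DIRECTIONAL FORMATTING closes the most recent opener

def open_controls(text, start):
    """Hypothetical helper: scan everything before `start` and return the
    stack of directional controls still open at that position."""
    stack = []
    for ch in text[:start]:
        if ch in OPENERS:
            stack.append(OPENERS[ch])
        elif ch == POP and stack:
            stack.pop()
    return stack

s = "abc \u202Edef ghi\u202C jkl"

# s[9:12] is "ghi": it contains no control character itself, yet the
# right-to-left override opened at index 4 is still in effect there
inside = open_controls(s, 9)

# By index 14 the override has been popped again
after = open_controls(s, 14)
```

The substring "ghi" taken on its own gives no hint that it sits inside an override; only a scan of the text before it reveals that.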

This doesn’t mean that UTF-32 is an awful system, just that it isn’t the magical cure that some were hoping for.

