Flexible string representation, unicode, typography, ...

Fri Aug 24 10:38:11 EDT 2012

On Thursday, 23 August 2012 18:17:29 UTC+5:30, (unknown)  wrote:
> This is neither a complaint nor a question, just a comment.
> 
> 
> 
> In the previous discussion related to the flexible
> 
> string representation, Roy Smith added this comment:
> 
> 
> 
> http://groups.google.com/group/comp.lang.python/browse_thread/thread/2645504f459bab50/eda342573381ff42
> 
> 
> 
> Not only do I agree with his sentence:
> 
> "Clearly, the world has moved to a 32-bit character set."
> 
> 
> 
> he also used a very interesting word in his comment: "punctuation".
> 
> 
> 
> There is a point which is, in my mind, not very well understood,
> 
> "digested", underestimated or neglected by many developers:
> 
> the relation between the coding of the characters and the typography.
> 
> 
> 
> Unicode (the consortium) does not only deal with the coding of
> 
> the characters; it has also worked on the *classification* of characters.
> 
> 
> 
> A deliberately simplistic representation: "letters" at the bottom
> 
> of the table, lower code points/integers; "typographic characters"
> 
> like punctuation, common symbols, ... high in the table, high code
> 
> points/integers. 
> 
> 
> 
> The conclusion is inescapable: if one wishes to work in a "unicode
> 
> mode", one is forced to use the whole palette of the unicode
> 
> code points, this is the *nature* of Unicode.
> 
> 
> 
> Technically, believing that it is possible to optimize only a subrange
> 
> of the unicode code points range is simply an illusion. A lot of
> 
> work, probably quite complicated, which in the end solves nothing.
> 
> 
> 
> Python, in my mind, fell into this trap.
> 
> 
> 
> "Simple is better than complex."
> 
>   -> hard to maintain
> 
> "Flat is better than nested." 
> 
>   -> code points range
> 
> "Special cases aren't special enough to break the rules."
> 
>   -> special unicode code points?
> 
> "Although practicality beats purity."
> 
>   -> or the opposite?
> 
> "In the face of ambiguity, refuse the temptation to guess."
> 
>   -> guessing that a user will only work with the "optimized" char subrange.
> 
> ...
> 
> 
> 
> Small illustration. Take an A4 page containing 50 lines of 80 ASCII
> 
> characters, add a single 'EM DASH' or a 'BULLET' (code points > 0x2000),
> 
> and you will see all the optimization efforts destroyed.
> 
> 
> 
> >>> import sys
> 
> >>> sys.getsizeof('a' * 80 * 50)
> 
> 4025
> 
> >>> sys.getsizeof('a' * 80 * 50 + '•')
> 
> 8040
> 
> 
> 
> Just my 2 € (code point 0x20ac) cents.
> 
> 
> 
> jmf

The Zen of Python is simply a guideline.
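
For anyone who wants to reproduce the size jump from the quoted
illustration, here is a rough sketch. It assumes CPython 3.3 or later
(the flexible string representation); bytes_per_char is a helper made
up for this post, not part of any library, and the absolute numbers
printed by sys.getsizeof will differ between builds and platforms.

```python
import sys
import unicodedata


def bytes_per_char(text):
    """Estimate storage per character by doubling the string.

    Hypothetical helper: the fixed object header cancels out in the
    difference, leaving only the per-character cost (CPython-specific).
    """
    doubled = text + text
    return (sys.getsizeof(doubled) - sys.getsizeof(text)) // len(text)


# 50 lines of 80 ASCII characters, as in the quoted example.
page = 'a' * 80 * 50

samples = [
    ('ASCII only', page),
    ('plus EURO SIGN', page + '\u20ac'),
    ('plus BULLET', page + '\u2022'),
    ('plus EM DASH', page + '\u2014'),
    ('plus emoji', page + '\U0001F600'),
]

for label, text in samples:
    last = text[-1]
    # Show the last character's code point and Unicode general category,
    # then the total size and the estimated per-character width.
    print('{:<16} U+{:04X} {}  {:>6} bytes  {} byte(s)/char'.format(
        label, ord(last), unicodedata.category(last),
        sys.getsizeof(text), bytes_per_char(text)))
```

On such a build the per-character column should read 1 for the pure
ASCII page, 2 once a single EURO SIGN, BULLET or EM DASH is appended,
and 4 once a character outside the Basic Multilingual Plane is appended,
since the width of the whole string is chosen from its highest code point.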