Flexible string representation, unicode, typography, ...
Mark Lawrence
breamoreboy at yahoo.co.uk
Thu Aug 23 10:18:05 EDT 2012
On 23/08/2012 13:47, wxjmfauth at gmail.com wrote:
> This is neither a complaint nor a question, just a comment.
> In the previous discussion related to the flexible
> string representation, Roy Smith added this comment:
> http://groups.google.com/group/comp.lang.python/browse_thread/thread/2645504f459bab50/eda342573381ff42
> Not only do I agree with his sentence:
> "Clearly, the world has moved to a 32-bit character set."
> but he also used in his comment a very interesting word: "punctuation".
> There is a point which is, in my mind, not very well understood,
> "digested", underestimated or neglected by many developers:
> the relation between the coding of the characters and the typography.
> Unicode (the consortium) does not only deal with the coding of
> the characters; it has also worked on the *classification* of characters.
> A deliberately simplistic representation: "letters" in the bottom
> of the table, lower code points/integers; "typographic characters"
> like punctuation, common symbols, ... high in the table, high code
> points/integers.
> The conclusion is inescapable: if one wishes to work in a "unicode
> mode", one is forced to use the whole palette of the unicode
> code points. This is the *nature* of Unicode.
> Technically, believing that it is possible to optimize only a subrange
> of the unicode code point range is simply an illusion. A lot of
> work, probably quite complicated, which finally solves nothing.
> Python, in my mind, fell in this trap.
> "Simple is better than complex."
> -> hard to maintain
> "Flat is better than nested."
> -> code points range
> "Special cases aren't special enough to break the rules."
> -> special unicode code points?
> "Although practicality beats purity."
> -> or the opposite?
> "In the face of ambiguity, refuse the temptation to guess."
> -> guessing a user will only work with the "optimized" char subrange.
> ...
> Small illustration. Take an a4 page containing 50 lines of 80 ascii
> characters, add a single 'EM DASH' or a 'BULLET' (code points > 0x2000),
> and you will see all the optimization efforts destroyed.
> >>> import sys
> >>> sys.getsizeof('a' * 80 * 50)
> 4025
> >>> sys.getsizeof('a' * 80 * 50 + '•')
> 8040
> Just my 2 € (code point 0x20ac) cents.
> jmf
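For anyone who wants to reproduce the figures quoted above, here is a minimal sketch. It assumes CPython 3.3 or later, where the flexible string representation (PEP 393) is in place; the helper name bytes_per_char is made up for illustration, and the absolute sizes reported will differ between versions and platforms.

```python
import sys


def bytes_per_char(s):
    """Estimate the storage width per character of a non-empty str.

    Hypothetical helper, not part of any library. Doubling the string
    cancels the fixed object header, so the size difference equals
    len(s) * width (assumes CPython 3.3+ with PEP 393 strings).
    """
    return (sys.getsizeof(s + s) - sys.getsizeof(s)) // len(s)


page = 'a' * 80 * 50                 # 50 lines of 80 ASCII characters
page_with_bullet = page + '\u2022'   # the same page plus a single BULLET

for label, text in [('ascii only', page), ('with bullet', page_with_bullet)]:
    print(label, sys.getsizeof(text), 'bytes,',
          bytes_per_char(text), 'byte(s) per character')
```

Under the same assumption, appending a character outside the Basic Multilingual Plane should push the reported width from 2 to 4, because the whole string is stored at the width of its widest code point.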
I'm looking forward to all the patches you are going to provide to
correct all these (presumably) CPython deficiencies. When do they start
arriving on the bug tracker?
Mark Lawrence.