How to waste computer memory?

Steven D'Aprano steve at pearwood.info
Fri Mar 18 07:46:09 EDT 2016


On Fri, 18 Mar 2016 06:00 pm, Ian Kelly wrote:

> On Thu, Mar 17, 2016 at 1:21 PM, Rick Johnson
> <rantingrickjohnson at gmail.com> wrote:
>> In the event that i change my mind about Unicode, and/or for
>> the sake of others, who may want to know, please provide a
>> list of languages that *YOU* think handle Unicode better than
>> Python, starting with the best first. Thanks.

Better than Python? Easy-peasy:

List of languages with Unicode handling which is better than Python = []

I'm not aware of any language with better or more complete Unicode
functionality than Python's. (That doesn't necessarily mean that they don't
exist.)


> jmf has been asked this before, and as I recall he seems to feel that
> UTF-8 should be used for all purposes, ignoring the limitations of
> that encoding such as that indexing becomes a O(n) operation.

Technically, UTF-8 doesn't *necessarily* imply indexing is O(n). For
instance, your UTF-8 string might consist of an array of bytes containing
the string, plus an array of indexes to the start of each code point. For
example, the string:

“abcπßЊ•𒀁”

(including the quote marks) is 10 code points in length and 22 bytes as
UTF-8. Grouping the (hex) bytes for each code point, we have:

e2809c 61 62 63 cf80 c39f d08a e280a2 f0928081 e2809d

so we could get an O(1)-indexable UTF-8 string by recording the bytes (in hex)
plus the byte offsets (in decimal) at which each code point starts:

e2809c616263cf80c39fd08ae280a2f0928081e2809d

0 3 4 5 6 8 10 12 15 19

but (assuming each index needs 2 bytes, which supports strings of up to about
64 KiB of UTF-8, since the indexes are byte offsets), that's actually LESS
memory efficient than UTF-32: 42 bytes (22 + 2*10) versus 40 (4*10).
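
A minimal sketch of that scheme in Python 3, to check the arithmetic. The
helper names build_index and char_at are made up for illustration; this is
not how any real implementation stores its strings.

```python
def build_index(text):
    """Return (utf8_bytes, offsets); offsets[i] is the byte offset of code point i."""
    data = bytearray()
    offsets = []
    for ch in text:
        offsets.append(len(data))      # where this code point starts
        data += ch.encode("utf-8")     # 1 to 4 bytes per code point
    return bytes(data), offsets


def char_at(data, offsets, i):
    """O(1) lookup of code point i (non-negative i only) via the offset table."""
    start = offsets[i]
    end = offsets[i + 1] if i + 1 < len(offsets) else len(data)
    return data[start:end].decode("utf-8")


# The example string, quote marks included, spelled with escapes.
s = "\u201cabc\u03c0\u00df\u040a\u2022\U00012001\u201d"
data, offsets = build_index(s)

indexed_size = len(data) + 2 * len(offsets)   # UTF-8 bytes plus 2-byte indexes
utf32_size = len(s.encode("utf-32-le"))       # 4 bytes per code point, no BOM
```

Here indexed_size comes to 42 and utf32_size to 40, matching the figures
above.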


> He has 
> pointed at Go as an example of a language wherein Unicode "just
> works", although I think that others do not necessarily agree [1].

I think it is typical of JMF that his idea of a language where Unicode "just
works" is one where it *doesn't work at all* (at least not as strings). Python
1.5 strings supported Unicode just as well as Go's string type.
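
A rough illustration of what "not as strings" means, written in Python 3
rather than Go. It assumes, as I understand it, that a Go string is a
sequence of bytes, so length and indexing count bytes, which is also how a
Python bytes object behaves:

```python
text = "abc\u03c0"            # 'abc' plus GREEK SMALL LETTER PI: 4 code points
raw = text.encode("utf-8")    # the same text as a byte string

n_chars = len(text)           # counts code points
n_bytes = len(raw)            # counts bytes; pi takes two of them

last_char = text[3]           # a whole character
last_byte = raw[4]            # only the second byte of pi, not a character
```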

In Go, the right way to handle Unicode is to use "runes", not strings. I
don't know how well that works though -- I suspect it is still pretty
primitive.


> [1] https://coderwall.com/p/k7zvyg/dealing-with-unicode-in-go

Nice link, thanks!




-- 
Steven



