How to waste computer memory?
Steven D'Aprano
steve at pearwood.info
Fri Mar 18 07:46:09 EDT 2016
On Fri, 18 Mar 2016 06:00 pm, Ian Kelly wrote:
> On Thu, Mar 17, 2016 at 1:21 PM, Rick Johnson
> <rantingrickjohnson at gmail.com> wrote:
>> In the event that i change my mind about Unicode, and/or for
>> the sake of others, who may want to know, please provide a
>> list of languages that *YOU* think handle Unicode better than
>> Python, starting with the best first. Thanks.
Better than Python? Easy-peasy:
List of languages with Unicode handling which is better than Python = []
I'm not aware of any language with better or more complete Unicode
functionality than Python's. (That doesn't necessarily mean that they don't
exist.)
> jmf has been asked this before, and as I recall he seems to feel that
> UTF-8 should be used for all purposes, ignoring the limitations of
> that encoding such as that indexing becomes a O(n) operation.
Technically, UTF-8 doesn't *necessarily* imply indexing is O(n). For
instance, your UTF-8 string might consist of an array of bytes containing
the string, plus an array of indexes to the start of each code point. For
example, the string:
“abcπßЊ•𒀁”
(including the quote marks) is 10 code points in length and 22 bytes as
UTF-8. Grouping the (hex) bytes for each code point, we have:
e2809c 61 62 63 cf80 c39f d08a e280a2 f0928081 e2809d
so we could get an O(1) UTF-8 string by recording the bytes (in hex) plus the
indexes (in decimal) at which each code point starts:
e2809c616263cf80c39fd08ae280a2f0928081e2809d
0 3 4 5 6 8 10 12 15 19
but (assuming each index needs 2 bytes, which supports strings of up to 65535
or so bytes of UTF-8), that's actually LESS memory efficient than UTF-32:
42 bytes versus 40.
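
A minimal Python sketch of that scheme, just to check the arithmetic. The
helper names build_index and char_at are made up for illustration, and the
2-byte index size is the same assumption as above; this is not how any real
implementation stores strings.

```python
def build_index(s):
    """Return the UTF-8 bytes of s plus the byte offset of each code point."""
    data = s.encode("utf-8")
    offsets = []
    pos = 0
    for ch in s:
        offsets.append(pos)
        pos += len(ch.encode("utf-8"))  # 1 to 4 bytes per code point
    return data, offsets


def char_at(data, offsets, i):
    """O(1) lookup of code point i (non-negative indexes only)."""
    if not 0 <= i < len(offsets):
        raise IndexError("index out of range")
    start = offsets[i]
    # The code point ends where the next one starts, or at the end of the data.
    end = offsets[i + 1] if i + 1 < len(offsets) else len(data)
    return data[start:end].decode("utf-8")


s = "“abcπßЊ•𒀁”"
data, offsets = build_index(s)

indexed_size = len(data) + 2 * len(offsets)  # assumes 2 bytes per index entry
utf32_size = len(s.encode("utf-32-le"))      # 4 bytes per code point, no BOM

print(offsets)
print(char_at(data, offsets, 8))
print(indexed_size, utf32_size)
```

Each lookup is two list accesses and a slice, so it doesn't depend on the
length of the string, but you pay for the index table on every string.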
> He has
> pointed at Go as an example of a language wherein Unicode "just
> works", although I think that others do not necessarily agree [1].
I think it is typical of JMF that his idea of a language where Unicode "just
works" is one where it *doesn't work at all* (at least not as strings). Python
1.5 strings supported Unicode just as well as Go's string type.
In Go, the right way to handle Unicode is to use "runes", not strings. I
don't know how well that works though -- I suspect it is still pretty
primitive.
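
As a rough analogy in Python 3 terms (this is Python, not Go, and only meant
to show the byte-versus-code-point distinction): indexing a bytes object gives
you one byte, while indexing a str gives you a whole code point.

```python
s = "πß"
b = s.encode("utf-8")

# Indexing bytes yields a single byte value (an int), not a character.
first_byte = b[0]

# Indexing str yields a whole code point.
first_char = s[0]

print(len(b), len(s))
print(hex(first_byte), first_char)
```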
> [1] https://coderwall.com/p/k7zvyg/dealing-with-unicode-in-go
Nice link, thanks!
--
Steven