How to waste computer memory?

Fri Mar 18 04:26:30 EDT 2016

Ian Kelly writes:

> On Thu, Mar 17, 2016 at 1:21 PM, Rick Johnson
> <rantingrickjohnson at gmail.com> wrote:
>> In the event that i change my mind about Unicode, and/or for
>> the sake of others, who may want to know, please provide a
>> list of languages that *YOU* think handle Unicode better than
>> Python, starting with the best first. Thanks.
>
> jmf has been asked this before, and as I recall he seems to feel that
> UTF-8 should be used for all purposes, ignoring the limitations of
> that encoding, such as indexing becoming an O(n) operation. He has
> pointed at Go as an example of a language wherein Unicode "just
> works", although I think that others do not necessarily agree [1].

...

> [1] https://coderwall.com/p/k7zvyg/dealing-with-unicode-in-go

I think Julia's way of dealing with its strings-as-UTF-8 [2] is more
promising. Indexing is by bytes (1-based in Julia) but the value at a
valid index is the whole UTF-8 character at that point, and an invalid
index raises an exception.

The letters "ö" and "ä" are two bytes each in UTF-8.

julia> s = "myöhä"
"myöhä"

julia> s[3]
'ö'

julia> s[4]
ERROR: UnicodeError: invalid character index
 in next at ./unicode/utf8.jl:65
 in getindex at strings/basic.jl:37

julia> s[5]
'h'
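
For comparison, here is a rough Python sketch of the same indexing
behaviour over a bytes object. It is not how Julia implements it; the
function name char_at is made up for illustration, offsets are 0-based
as usual in Python (so Julia's 3 becomes 2), and the input is assumed to
be valid UTF-8.

```python
def char_at(data, i):
    """Return the whole character that starts at byte offset i (0-based).

    Hypothetical helper; assumes data holds valid UTF-8.
    """
    if not 0 <= i < len(data):
        raise IndexError("byte index out of range")
    first = data[i]
    if first & 0xC0 == 0x80:
        # A continuation byte (10xxxxxx): i points into the middle of a
        # character, which corresponds to Julia's "invalid character index".
        raise ValueError("invalid character index")
    if first < 0x80:
        width = 1          # 0xxxxxxx: ASCII
    elif first >> 5 == 0b110:
        width = 2          # 110xxxxx
    elif first >> 4 == 0b1110:
        width = 3          # 1110xxxx
    else:
        width = 4          # 11110xxx
    return data[i:i + width].decode("utf-8")


s = "myöhä".encode("utf-8")
print(char_at(s, 2))       # the two-byte character at offsets 2..3
print(char_at(s, 4))       # the one-byte character after it
```

Calling char_at(s, 3) raises ValueError, because offset 3 is the second
byte of "ö".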

Julia provides access to the character at an index together with the
next valid index after it:

julia> next(s, 3)
('ö',5)
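
A similar pairing of character and following index can be sketched in
Python. The name next_char is hypothetical, offsets are 0-based (so the
result for "ö" is 4 rather than Julia's 5), and valid UTF-8 is assumed.

```python
def next_char(data, i):
    """Return (character starting at byte offset i, offset of the next one).

    Hypothetical helper; assumes data holds valid UTF-8.
    """
    if not 0 <= i < len(data):
        raise IndexError("byte index out of range")
    if data[i] & 0xC0 == 0x80:
        raise ValueError("invalid character index")
    j = i + 1
    # Skip the continuation bytes (10xxxxxx) that belong to this character.
    while j < len(data) and data[j] & 0xC0 == 0x80:
        j += 1
    return data[i:j].decode("utf-8"), j


s = "myöhä".encode("utf-8")
print(next_char(s, 2))

# Walking the whole string by repeatedly following the returned index.
i = 0
chars = []
while i < len(s):
    c, i = next_char(s, i)
    chars.append(c)
print(chars)
```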

The last valid index:

julia> endof(s)
6

Special syntax to index at the end of a string:

julia> s[end - 1:end]
"hä"

That's not quite right. The expression end - 1 is a byte index, and it
worked only because the penultimate character, "h", happens to be one
byte. At least incorrect indexing results in an exception rather than an
incorrect value. There is a proper function for getting the previous
valid index (prevind) - I should have used that.
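
To illustrate what stepping back to a previous valid index involves,
here is a small Python sketch over UTF-8 bytes. The helper names
prev_index and last_chars are invented for this example, offsets are
0-based, and the input is assumed to be valid UTF-8.

```python
def prev_index(data, i):
    """Return the start offset of the character that ends just before i.

    Hypothetical helper; i may be len(data) to find the last character.
    """
    if not 0 < i <= len(data):
        raise IndexError("no character before this index")
    j = i - 1
    # Back up over continuation bytes (10xxxxxx) to the leading byte.
    while j > 0 and data[j] & 0xC0 == 0x80:
        j -= 1
    return j


def last_chars(data, n):
    """Return the last n characters of data as a str."""
    i = len(data)
    for _ in range(n):
        i = prev_index(data, i)
    return data[i:].decode("utf-8")


print(last_chars("myöhä".encode("utf-8"), 2))
# Here the penultimate character is two bytes wide, so plain byte
# arithmetic on the last index would land inside it.
print(last_chars("höä".encode("utf-8"), 2))
```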

Also, the length of a string is the number of characters rather than
bytes, decoupled from the indexing.

julia> length("myöhä")
5
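
The difference between the two counts is easy to see in Python as well.
In this sketch, char_count is a made-up name; it counts every byte that
is not a UTF-8 continuation byte, which gives the number of characters
for valid input.

```python
def char_count(data):
    """Count characters in valid UTF-8 by counting non-continuation bytes."""
    return sum(1 for b in data if b & 0xC0 != 0x80)


s = "myöhä".encode("utf-8")
print(len(s))          # number of bytes
print(char_count(s))   # number of characters
```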

I work with text all the time, but I don't think I ever _need_ arbitrary
access to an nth character. What I require is access to the start and
end of a string, searching, and splitting. These all seem compatible
with using UTF-8 representations. Same with iterating over the string
(forward or backward).
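
As a rough check of that claim, the following Python sketch does
searching, splitting, start and end tests, and backward iteration
directly on UTF-8 bytes, without ever indexing by character number. The
sample text and the name chars_backward are made up for illustration.
Searching with an encoded needle is safe because a complete UTF-8
sequence cannot match in the middle of another character.

```python
def chars_backward(data):
    """Yield the characters of valid UTF-8 data from last to first."""
    end = len(data)
    while end > 0:
        start = end - 1
        # Back up over continuation bytes (10xxxxxx) to the leading byte.
        while start > 0 and data[start] & 0xC0 == 0x80:
            start -= 1
        yield data[start:end].decode("utf-8")
        end = start


text = "myöhä ilta".encode("utf-8")

# Searching: the result is a byte offset that starts a character.
pos = text.find("hä".encode("utf-8"))

# Splitting and testing the ends work on the bytes as they are.
words = [w.decode("utf-8") for w in text.split(b" ")]
starts = text.startswith("myö".encode("utf-8"))
ends = text.endswith("ta".encode("utf-8"))

reversed_text = "".join(chars_backward(text))
print(pos, words, starts, ends, reversed_text)
```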

Just in case: I've been quite happy with Unicode in Python 3. It's just
interesting to see a different way that also seems to work.

[2] http://docs.julialang.org/en/release-0.4/manual/strings/