How to waste computer memory?
Jussi Piitulainen
jussi.piitulainen at helsinki.fi
Fri Mar 18 04:26:30 EDT 2016
Ian Kelly writes:
> On Thu, Mar 17, 2016 at 1:21 PM, Rick Johnson
> <rantingrickjohnson at gmail.com> wrote:
>> In the event that i change my mind about Unicode, and/or for
>> the sake of others, who may want to know, please provide a
>> list of languages that *YOU* think handle Unicode better than
>> Python, starting with the best first. Thanks.
>
> jmf has been asked this before, and as I recall he seems to feel that
> UTF-8 should be used for all purposes, ignoring the limitations of
> that encoding such as that indexing becomes a O(n) operation. He has
> pointed at Go as an example of a language wherein Unicode "just
> works", although I think that others do not necessarily agree [1].
...
> [1] https://coderwall.com/p/k7zvyg/dealing-with-unicode-in-go
I think Julia's way of dealing with its strings-as-UTF-8 [2] is more
promising. Indexing is by bytes (1-based in Julia) but the value at a
valid index is the whole UTF-8 character at that point, and an invalid
index raises an exception.
The letters "ö" and "ä" are two bytes each in UTF-8.
julia> s = "myöhä"
"myöhä"
julia> s[3]
'ö'
julia> s[4]
ERROR: UnicodeError: invalid character index
in next at ./unicode/utf8.jl:65
in getindex at strings/basic.jl:37
julia> s[5]
'h'
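For comparison, here is a rough Python sketch of the same indexing rule, applied to a bytes object holding UTF-8. The function name char_at is made up for illustration, offsets are 0-based (unlike Julia's 1-based indices), and the input is assumed to be valid UTF-8.

```python
def char_at(data: bytes, i: int) -> str:
    """Return the whole character that starts at byte offset i (0-based)."""
    first = data[i]
    # Continuation bytes look like 10xxxxxx; they never start a character.
    if first & 0xC0 == 0x80:
        raise IndexError(f"byte offset {i} is inside a character")
    # The lead byte tells how many bytes the character occupies.
    if first < 0x80:
        width = 1
    elif first >> 5 == 0b110:
        width = 2
    elif first >> 4 == 0b1110:
        width = 3
    else:
        width = 4
    return data[i:i + width].decode("utf-8")


s = "myöhä".encode("utf-8")
print(char_at(s, 2))  # ö
print(char_at(s, 4))  # h
```

Offset 3 falls in the middle of "ö", so char_at(s, 3) raises IndexError, much like s[4] does in the Julia session.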
Julia provides access to the next character at an index and the valid
index after that:
julia> next(s, 3)
('ö',5)
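The same "character plus next valid index" idea can be sketched in Python over UTF-8 bytes. This is only an illustration: next_char is a hypothetical name, offsets are 0-based, and valid UTF-8 is assumed.

```python
def next_char(data: bytes, i: int):
    """Return (character at byte offset i, offset of the following character)."""
    if data[i] & 0xC0 == 0x80:
        raise IndexError(f"byte offset {i} is inside a character")
    # Skip over the continuation bytes that belong to this character.
    j = i + 1
    while j < len(data) and data[j] & 0xC0 == 0x80:
        j += 1
    return data[i:j].decode("utf-8"), j


s = "myöhä".encode("utf-8")
print(next_char(s, 2))  # ('ö', 4)

# Walking the whole string by repeatedly asking for the next index.
chars = []
i = 0
while i < len(s):
    ch, i = next_char(s, i)
    chars.append(ch)
print(chars)  # ['m', 'y', 'ö', 'h', 'ä']
```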
The last valid index:
julia> endof(s)
6
Special syntax to index at the end of a string:
julia> s[end - 1:end]
"hä"
That's not quite right: "end - 1" steps back one byte, not one
character. It worked only because the penultimate character happened to
be one byte long. At least incorrect indexing results in an exception
rather than an incorrect value. There is a proper method to get a
previous valid index - I should have used that.
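Stepping back to a previous valid index can be sketched in Python as follows. The names prev_index and last_n_chars are hypothetical, offsets are 0-based, and valid UTF-8 is assumed. As far as I know the Julia function for this is prevind, but the sketch does not depend on that.

```python
def prev_index(data: bytes, i: int) -> int:
    """Return the start offset of the character that ends just before offset i."""
    if i <= 0:
        raise IndexError("no character before the start of the string")
    j = i - 1
    # Back up over continuation bytes (10xxxxxx) to reach the lead byte.
    while j > 0 and data[j] & 0xC0 == 0x80:
        j -= 1
    return j


def last_n_chars(data: bytes, n: int) -> str:
    """Return the last n characters, stepping back character by character."""
    start = len(data)
    for _ in range(n):
        start = prev_index(data, start)
    return data[start:].decode("utf-8")


print(last_n_chars("myöhä".encode("utf-8"), 2))  # hä
print(last_n_chars("pää".encode("utf-8"), 2))    # ää
```

The second call is the case where plain byte arithmetic would go wrong, because the penultimate character is two bytes long.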
Also, the length of a string is the number of characters rather than
bytes, decoupled from the indexing.
julia> length("myöhä")
5
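To illustrate how the character count is decoupled from the byte indices, here is a small Python sketch over UTF-8 bytes. The name char_count is made up, and valid UTF-8 is assumed: every character has exactly one byte that is not a continuation byte, so counting those gives the length in characters.

```python
def char_count(data: bytes) -> int:
    """Count characters by counting bytes that are not continuation bytes."""
    return sum(1 for b in data if b & 0xC0 != 0x80)


s = "myöhä".encode("utf-8")
print(len(s))         # 7 bytes
print(char_count(s))  # 5 characters
```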
I work with text all the time, but I don't think I ever _need_ random
access to the nth character. What I require is access to the start and
end of a string, searching, and splitting. These all seem compatible
with using UTF-8 representations. Same with iterating over the string
(forward or backward).
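A small Python sketch of why searching and splitting fit UTF-8: the encoding is self-synchronizing, so a validly encoded needle can only match at a character boundary, and plain byte operations give the right answers. The sample text is made up for the example.

```python
text = "myöhä ja pää".encode("utf-8")
needle = "pää".encode("utf-8")

# Byte-level search returns a byte offset that is also a character boundary.
pos = text.find(needle)
print(pos)  # 11

# Splitting on an ASCII separator never cuts a multi-byte character in half.
words = [w.decode("utf-8") for w in text.split(b" ")]
print(words)  # ['myöhä', 'ja', 'pää']

# Start and end tests work directly on the encoded bytes.
print(text.startswith("myö".encode("utf-8")))  # True
print(text.endswith("ää".encode("utf-8")))     # True
```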
Just in case: I've been quite happy with Unicode in Python 3. It's just
interesting to see a different way that also seems to work.
[2] http://docs.julialang.org/en/release-0.4/manual/strings/