[Tutor] Counting number of occurrence of each character in a string

Wed Sep 2 09:47:18 EDT 2020

On 02/09/2020 13:42, Peter Otten wrote:

>> Which could be done in a one-liner as:
>>
>> d2 = {k:a.count(k) for k in sorted(set(a))}
> 
> This may be acceptable if the alphabet is small, but will cause trouble 
> otherwise. 

Its not the base alphabet that is the issue its the number of
characters actually used, which in real world cases is likely
to be far smaller than the available alphabet

However, you do raise a very valid point...

> To illustrate, here are some timings for the pathological case 
> (each character occurs just once):
> 
> $ python3 -m timeit -s 'from collections import Counter; import sys; s = "".join(map(chr, range(sys.maxunicode//50)))'  '{c: s.count(c) for c in set(s)}'
> 10 loops, best of 3: 955 msec per loop

But what about the other extreme where only one character occurs?

###################
import sys, time

s1 = "".join(map(chr, range(sys.maxunicode//50)))
s2 = "a" * (sys.maxunicode//50)

start = time.time()
d = {c: s1.count(c) for c in set(s1)}
end = time.time()
print("s1 duration: ", end-start)

start = time.time()
d = {c: s2.count(c) for c in set(s2)}
end = time.time()
print("s2 duration: ", end-start)

start = time.time()
from collections import Counter
end = time.time()
print("Import time: ", end-start)

start = time.time()
c = Counter(s1)
end = time.time()
print("s1 Duration2: ", end-start)

start = time.time()
c = Counter(s2)
end = time.time()
print("s2 Duration2: ", end-start)
#######################

produces:
s1 duration:  0.86718416219398926
s2 duration:  0.00393986701965332
Import time:  1.4066696166992188e-05
s1 Duration2:  0.008202075958251953
s2 Duration2:  0.0035851001739501953

So Counter is still faster but not by so much.
More important that studying edge cases is that
it has a much narrower spread of results so is
clearly a better choice.

There's always a compromise, you have to pick which is
easiest to read and maintain and delivers acceptable
performance for your specific data set. Which usually
means testing...

In this case Counter wins on all aspects: simpler & faster
but at the expense of including an extra module dependency,
albeit one from the standard library.

Which brings us back to the OPs post (and several others
from the same author). Comparing arcane idioms of the core
language often distracts from the fact that a better solution
exists in the standard library. Spending time with detailed
study of the language is usually wasteful compared to time
spent studying the library. (Of course once you fully know
the library (does anyone?!) studying the language details
may be beneficial.

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos