I noticed that `np.triu_indices` takes quite a while, and discovered that it creates an upper-triangular array and then uses `np.where`. This seems quite inefficient, and I was curious whether something like the following would be better:
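
    import numpy as np
    # dim is the matrix size and k >= 0 the diagonal offset, as in np.triu_indices(dim, k)
    tmp_range = np.arange(dim - k)
    rows = np.repeat(tmp_range, (tmp_range + 1)[::-1])  # row i repeats dim-k-i times
    cols = np.ones(rows.shape, dtype=int)               # column step of +1 within a row
    inds = np.cumsum(tmp_range[1:][::-1] + 1)           # positions where rows 1.. start
    cols[inds] = np.arange(k - dim + 2, 1)              # jump back from column dim-1 to i+k
    cols[0] = k                                         # first column of row 0
    np.cumsum(cols, out=cols)                           # turn the steps into column indices
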
This is just a first pass at the function, and unfortunately it does not work for k<0. However, it returns the correct results for k>=0 and is between 2 and 8 times faster on my machine than `np.triu_indices`. Any thoughts on this?
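
For anyone who wants to check the k>=0 claim, here is a minimal sketch that wraps the same repeat-and-cumsum idea in a function and compares it with `np.triu_indices`. The name `triu_indices_fast` is just a placeholder of mine, not a NumPy API, and the guard for an empty result (`dim - k <= 0`) is my own addition. It makes no attempt to handle k<0.

```python
import numpy as np


def triu_indices_fast(dim, k=0):
    """Hypothetical helper: upper-triangle indices of a dim x dim array, k >= 0 only."""
    if k < 0:
        raise ValueError("this sketch only handles k >= 0")
    n_rows = dim - k  # number of rows that still contain entries
    if n_rows <= 0:
        empty = np.empty(0, dtype=np.intp)
        return empty, empty.copy()

    tmp_range = np.arange(n_rows)
    counts = (tmp_range + 1)[::-1]             # entries per row: n_rows, ..., 1
    rows = np.repeat(tmp_range, counts)

    cols = np.ones(rows.shape, dtype=np.intp)  # +1 column step inside a row
    inds = np.cumsum(counts[:-1])              # flat positions where rows 1.. begin
    cols[inds] = np.arange(k - dim + 2, 1)     # step from column dim-1 back to i+k
    cols[0] = k                                # row 0 starts at column k
    np.cumsum(cols, out=cols)                  # accumulate steps into column indices
    return rows, cols


# Compare against the reference implementation for small sizes and offsets.
for dim in range(1, 8):
    for k in range(0, dim + 2):
        expected_rows, expected_cols = np.triu_indices(dim, k)
        got_rows, got_cols = triu_indices_fast(dim, k)
        assert np.array_equal(got_rows, expected_rows)
        assert np.array_equal(got_cols, expected_cols)
```

The check only covers square arrays, so it says nothing about the `m` argument of `np.triu_indices` or about timing on other machines.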