Custom alphabetical sort

Pander Musubi pander.musubi at gmail.com
Tue Dec 25 00:19:14 CET 2012


On Monday, December 24, 2012 7:12:43 PM UTC+1, Joshua Landau wrote:
> On 24 December 2012 16:18, Roy Smith <r... at panix.com> wrote:
> 
> 
> 
> 
> In article <40d108ec-b019-4829-a969-c8ef513866f1 at googlegroups.com>,
> 
>  Pander Musubi <pander... at gmail.com> wrote:
> 
> 
> 
> > Hi all,
> 
> 
> >
> 
> > I would like to sort according to this order:
> 
> >
> 
> > (' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a',
> 
> > 'A', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', 'b', 'B', 'c', 'C',
> 
> > '?', '?', 'd', 'D', 'e', 'E', '?', '?', '?', '?', '?', '?', '?', '?', 'f',
> 
> > 'F', 'g', 'G', 'h', 'H', 'i', 'I', '?', '?', '?', '?', '?', '?', '?', '?',
> 
> > 'j', 'J', 'k', 'K', 'l', 'L', 'm', 'M', 'n', '?', 'N', '?', 'o', 'O', '?',
> 
> > '?', '?', '?', '?', '?', '?', '?', '?', '?', 'p', 'P', 'q', 'Q', 'r', 'R',
> 
> > 's', 'S', 't', 'T', 'u', 'U', '?', '?', '?', '?', '?', '?', '?', '?', 'v',
> 
> 
> > 'V', 'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z')
> 
> >
> 
> 
> > How can I do this? The default sorted() does not give the desired result.
> 
> 
> 
> <snip> 
> 
> 
> 
> 
> Given all that, I would start by writing some code which turned your
> 
> alphabet into a pair of dicts.  One maps from the code point to a
> 
> collating sequence number (i.e. ordinals), the other maps back.
> 
> Something like (for python 2.7):
> 
> 
> 
> alphabet = (' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5',
> 
>             '6', '7', '8', '9', 'a', 'A', '?', '?', '?', '?',
> 
>             [...]
> 
> 
>             'v', 'V', 'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z')
> 
> 
> 
> map1 = {c: n for n, c in enumerate(alphabet)}
> 
> map2 = {n: c for n, c in enumerate(alphabet)}
> 
> 
> 
> Next, I would write some functions which encode your strings as lists of
> 
> ordinals (and back again)
> 
> 
> 
> def encode(s):
> 
>    "encode('foo') ==> [34, 19, 19]"  # made-up ordinals
> 
>    return [map1[c] for c in s]
> 
> 
> 
> def decode(l):
> 
>    "decode([34, 19, 19]) ==> 'foo'"
> 
>     return ''.join(map2[i] for i in l)
> 
> 
> 
> Use these to convert your strings to lists of ints which will sort as
> 
> per your specified collating order, and then back again:
> 
> 
> 
> encoded_strings = [encode(s) for s in original_list]
> 
> encoded_strings.sort()
> 
> sorted_strings = [decode(l) for l in encoded_strings]
> 
> 
> 
> This isn't needed and the not-so-new way to do this is through .sort's key attribute.
> 
> 
> 
> 
> encoded_strings = [encode(s) for s in original_list]
> encoded_strings.sort()
> sorted_strings = [decode(l) for l in encoded_strings]
> 
> 
> 
> changes to
> 
> 
> 
> 
> encoded_strings.sort(key=encode)
> 
> 
> 
> [Which happens to be faster </reasonable_guess>]
> 
> 
> 
> 
> Hence you neither need map2 or decode:
> 
> 
> ## CODE ##
> 
> 
> 
> 
> 
> alphabet = (
> 	' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'A', 'ä', 'Ä', 'á', 'Á', 'â', 'Â',
> 
> 
> 	'à', 'À', 'å', 'Å', 'b', 'B', 'c', 'C', 'ç', 'Ç', 'd', 'D', 'e', 'E', 'ë', 'Ë', 'é', 'É', 'ê', 'Ê', 'è', 'È',
> 
> 
> 	'f', 'F', 'g', 'G', 'h', 'H', 'i', 'I', 'ï', 'Ï', 'í', 'Í', 'î', 'Î', 'ì', 'Ì', 'j', 'J', 'k', 'K', 'l', 'L',
> 
> 
> 	'm', 'M', 'n', 'ñ', 'N', 'Ñ', 'o', 'O', 'ö', 'Ö', 'ó', 'Ó', 'ô', 'Ô', 'ò', 'Ò', 'ø', 'Ø', 'p', 'P', 'q', 'Q',
> 
> 
> 	'r', 'R', 's', 'S', 't', 'T', 'u', 'U', 'ü', 'Ü', 'ú', 'Ú', 'û', 'Û', 'ù', 'Ù', 'v', 'V', 'w', 'W', 'x', 'X',
> 
> 
> 	'y', 'Y', 'z', 'Z'
> )
> 
> 
> 
> hashindex = {character:index for index, character in enumerate(alphabet)}
> 
> def string2sortlist(string):
> 	return [hashindex[s] for s in string]
> 
> 
> 
> 
> # Quickly make some stuff to sort. Let's try 200k, as that's what's suggested.
> import random
> things_to_sort = ["".join(random.sample(alphabet, random.randint(4, 6))) for _ in range(200000)]
> 
> 
> 
> 
> print(things_to_sort[:15])
> 
> 
> things_to_sort.sort(key=string2sortlist)
> 
> 
> 
> 
> print(things_to_sort[:15])
> 
> 
> ## END CODE ##
> 
> 
> 
> 
> Not-so-coincidentally, this is exactly the same as Ian Kelly's extension to Tomas Bach's method.

With Python2.7 I had to use

alphabet = (
u' ', u'.', u'\'', u'-', u'0', u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8', u'9', u'a', u'A', u'ä', u'Ä', u'á', u'Á', u'â', u'Â',
u'à', u'À', u'å', u'Å', u'b', u'B', u'c', u'C', u'ç', u'Ç', u'd', u'D', u'e', u'E', u'ë', u'Ë', u'é', u'É', u'ê', u'Ê', u'è', u'È',
u'f', u'F', u'g', u'G', u'h', u'H', u'i', u'I', u'ï', u'Ï', u'í', u'Í', u'î', u'Î', u'ì', u'Ì', u'j', u'J', u'k', u'K', u'l', u'L',
u'm', u'M', u'n', u'ñ', u'N', u'Ñ', u'o', u'O', u'ö', u'Ö', u'ó', u'Ó', u'ô', u'Ô', u'ò', u'Ò', u'ø', u'Ø', u'p', u'P', u'q', u'Q',
u'r', u'R', u's', u'S', u't', u'T', u'u', u'U', u'ü', u'Ü', u'ú', u'Ú', u'û', u'Û', u'ù', u'Ù', u'v', u'V', u'w', u'W', u'x', u'X',
u'y', u'Y', u'z', u'Z'
)

to prevent

Traceback (most recent call last):
  File "./sort.py", line 23, in <module>
    things_to_sort.sort(key=string2sortlist)
  File "./sort.py", line 15, in string2sortlist
    return [hashindex[s] for s in string]
KeyError: '\xc3'

Thanks very much for this efficient code.



More information about the Python-list mailing list