[Tutor] Sort

Steven D'Aprano steve at pearwood.info
Mon Jan 30 16:17:00 CET 2012


George Nyoro wrote:
> Hey all, again:
> Thanks for the delete thing. Helped with that problem a lot. Especially the
> getattr thing came in handy.
> 
> Question 1:
> How do you guys indent and put in the triple greater than signs and all
> that when sending mail? Manually? Coz the last mail I sent was using the
> gmail indent thing and it doesnt seem to have worked. Anyway:


Copy text including >>> signs, and any other characters you want.
Paste into your email client. The characters should appear exactly as you 
copied and pasted them.

If your email client (gmail) takes it upon itself to mangle your text, then 
use a better email client that doesn't suck. (Any software which thinks it 
knows what you want better than you do should be taken out and slapped with a 
wet fish.)


> Question 2:
> I have this script that downloaded a lot of files from one server into one
> local directory and the files are named in a particular order and I want
> them to be opened in this very same order so that I can copy the contents
> of each to one main file and then convert to pdf. However, the naming
> schema is like this (Roman numerals in parts) and python doesn't seem to be
> able to sort it properly.
> 1.I.1; 1.I.2; ....; 1.I.20;
> 1.II.1.1;1.II.1.2
> 1.II.2;...1.II.10; 2.I.3;

I don't understand this. Is each line ONE file name, or are there many file 
names per line? Are the semi-colons part of the file names?

I'm going to *guess* that each file name has THREE sections only, with no 
semi-colons, and look like these 3 examples:

1.I.1
1.III.20
5.IX.7

etc. Am I close?


> Because of the roman numerals that are always in the second part fo the
> filename, it could sort based on how I wanted.
> 
> I listed all the files in the said directory:
> files=os.listdir(location)
> 
> I tried using a sort algorithm that goes through the list starting from
> index 0 and then compares each index with the next one and if they are not
> arranged as they should be, the two are switched and then it starts again.
> The thing is, it is so slow! It hardly goes past index three in the list
> and there are so many files. *about fifty*.

Congratulations! You appear to have re-invented one of the world's slowest 
sorting functions, bubblesort. <wink> Look it up on Wikipedia.

Only probably with a bug in it, because even bubblesort shouldn't get stuck 
after only three files.

Except as a learning exercise, or to prove you can, never try to program your 
own sort in Python. That's like buying a car, and then putting it on the back 
of your push-bike and trying to cycle home with it. Uphill. In the snow. With 
a flat tire.

Instead, you should use Python's built-in sort function, which if you use it 
properly is faster than anything you can write (unless you are Tim Peters, who 
invented it in the first place).

Normally, you would just do something like this:

files = os.listdir(location)
files.sort()  # sort the list in-place, without making a copy


or perhaps this:

files = sorted(os.listdir(location))  # make a copy and sort


but in this case you have special requirements, thanks to the naming 
convention for your files. In this case, you don't want the ordinary sort 
order for strings, you need a special sort order based on Roman numerals.

The best way to sort these is with a technique called:

Decorate, Sort, Undecorate

also known as DSU, which sorted() supports using a key function. It sounds 
harder than it is. In this case, you need to decorate each file name (which is 
hard to sort!) with something that is easier to sort. In this case, the 
filenames look like:

number DOT roman number DOT number

all as one string, so the obvious way to sort them is to convert to three 
numbers. The first and last numbers are easy, you just use the int() function, 
the roman number is a bit trickier because Python doesn't have a built-in 
converter.

So here is my converter, and a decorate function that uses it:

def from_roman(s):
     # Convert Roman numerals into an integer.
     table = {'I': 1, 'II': 2, 'III': 3, 'IV': 4, 'V': 5, 'VI': 6,
              'VII': 7, 'VIII': 8, 'IX': 9, 'X': 10}  # Is this enough?
     return table[s.upper()]


def decorate(filename):
     a, b, c = filename.split(".")  # Split into three pieces, at the dots.
     a = int(a)  # Convert the first and last into integers.
     c = int(c)
     b = from_roman(b)  # and the middle using roman numerals
     return (a, b, c)



And finally you can do the sorting like so:


files = sorted(os.listdir(location), key=decorate)


Note that Python will do the undecorate part automatically.

Warning: I have not tested any of this. It should work, but it's not 
unpossible for me to have make a misteak.



-- 
Steven



More information about the Tutor mailing list