[Tutor] Re: Sorting of files based on filesize

Andrei project5 at redrival.net
Mon Apr 11 22:09:32 CEST 2005


Klas Marteleur <klas.martelleur <at> telia.com> writes:

> Some of my hard drives are getting full and I would like to burn the files to 
> some cheap DVDs. File sizes range from, let's say, 1 MB to 1 GB. 
> 
> Of course I would like to optimize the size of each DVD to be as close to 4.7 GB 
> as possible (to save on this cheap media :) ).
> 
> There is plenty of software able to index my DVDs, so sorting 
> between discs is not a problem. Just size.
> 
> What I would like to do is write a Python program that could sort my files 
> (based on size) into chunks as close as possible to a given storage size 
> (DVD 4.7 GB, CD 650 MB...), minimizing the number of discs used. Has anyone 
> done something like this before in Python, that I could look at?

It depends on how optimal you want this to be. The easiest way is probably to
sort the files by size, start with the largest and keep adding the largest
remaining ones that still fit until the disc is full. Then start a new disc with
the largest file that's left and repeat the process until you're left with (part
of) a DVD full of small files. I have some doubts about how optimal this will
turn out to be.
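For the record, that greedy heuristic boils down to something like this minimal
sketch (sizes as plain numbers in MB; the fill_discs name and the capacity
default are mine):

def fill_discs(sizes, capacity=650):
    """Fill one disc at a time: on each pass, keep adding the largest
    remaining file that still fits, then start a new disc."""
    remaining = sorted(sizes, reverse=True)
    discs = []
    while remaining:
        disc, free, leftover = [], capacity, []
        for size in remaining:
            if size <= free:
                disc.append(size)
                free -= size
            else:
                leftover.append(size)
        discs.append(disc)
        remaining = leftover
    return discs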

So instead, I went and implemented a fun 100-line script that uses a simple
genetic algorithm to generate disc layouts. Lacking real files, the script
linked below first generates 500 files with normally distributed sizes (an
average of 200 MB and a standard deviation of 200 MB, with some sanity checks to
prevent files with negative sizes or files larger than 650 MB) and tries to cram
those onto 650 MB CDs.

Here's the link: http://ww3.6URL.com/FOK
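The test data generation comes down to something like the following sketch (not
the actual script; redrawing out-of-range values is one way to implement the
sanity checks):

import random

def fake_sizes(n=500, mean=200, stddev=200, cap=650):
    """Draw n file sizes (in MB) from a normal distribution,
    redrawing any that come out negative or larger than a disc."""
    sizes = []
    while len(sizes) < n:
        size = random.gauss(mean, stddev)
        if 0 < size <= cap:
            sizes.append(size)
    return sizes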

It does this by first shuffling the list of files randomly several times and
keeping the best fit among those shuffles (the solution that fills up the discs
as much as possible). This is the initial parent. It then randomly shuffles
parts of this parent several times, again keeps the best fit among those
combinations, uses that best fit to generate new children, and so on.
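To compare candidates, each ordering has to be turned into a disc layout and
scored. One plausible scorer looks like this (a sketch; the actual script may
well measure fit differently, e.g. by total wasted space):

def count_discs(order, capacity=650):
    """Lay files onto discs in the given order, starting a new
    disc whenever the next file no longer fits on the current one."""
    discs, free = 1, capacity
    for size in order:
        if size > free:
            discs += 1
            free = capacity
        free -= size
    return discs  # fewer discs == a better fit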

The siblings variable determines how many shuffles are performed before a best
fit in that group is selected. The generations variable determines how many
times the process of selecting the best is repeated.
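Put together, the loop is roughly the following (a reconstruction, not the
linked script; the slice-shuffling mutation and the count_discs scorer from the
sketch above are my assumptions about the details):

import random

def evolve(sizes, generations=100, siblings=50):
    """Keep the best of `siblings` mutated copies of the current
    parent, and repeat for `generations` rounds."""
    parent = sizes[:]
    random.shuffle(parent)
    for _ in range(generations):
        best, best_score = parent, count_discs(parent)
        for _ in range(siblings):
            child = parent[:]
            # mutate by re-shuffling a random slice of the parent
            i = random.randrange(len(child))
            j = random.randrange(i, len(child)) + 1
            piece = child[i:j]
            random.shuffle(piece)
            child[i:j] = piece
            score = count_discs(child)
            if score < best_score:
                best, best_score = child, score
        parent = best
    return parent

Running evolve(fake_sizes()) and scoring the result with count_discs gives the
kind of numbers quoted below.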

The algorithm seems to work reasonably well (I get 190-ish CDs for files that
would cover 180-ish CDs if they could be filled perfectly, whereas the initial
random file distributions give 240+ discs), but I haven't checked the results
very thoroughly. It might be doing something really stupid. It's also reasonably
fast (1000 generations of 1000 individuals each is feasible, for example, though
not necessarily very useful). There's also no guarantee that the result is the
absolute best you can get.

Yours,

Andrei


