shuffle the lines of a large file

Simon Brunning simon.brunning at gmail.com
Tue Mar 8 09:28:09 EST 2005


On Tue, 8 Mar 2005 14:13:01 +0000, Simon Brunning
<simon.brunning at gmail.com> wrote:
> On 7 Mar 2005 06:38:49 -0800, gry at ll.mit.edu <gry at ll.mit.edu> wrote:
> > As far as I can tell, what you ultimately want is to be able to extract
> > a random ("representative?") subset of sentences.
> 
> If this is what's wanted, then perhaps some variation on this cookbook
> recipe might do the trick:
> 
> http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/59865

I couldn't resist. ;-)

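import random

def randomLines(filename, lines=1):
    # One slot per requested line; each slot is filled independently.
    selected_lines = [None] * lines

    with open(filename) as source:
        for line_index, line in enumerate(source):
            for selected_line_index in range(lines):
                # Keep this line with probability 1/(line_index + 1), so
                # every line read so far is equally likely to hold the slot.
                if random.uniform(0, line_index + 1) < 1:
                    selected_lines[selected_line_index] = line

    return selected_lines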

This has the advantage that every line has the same chance of being
picked regardless of its length. Since each of the requested slots is
filled independently, it may pick the same line more than once, though.
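
If repeats are a problem, one possible variation is to keep a single
shared reservoir instead of independent slots (the usual reservoir
sampling scheme, often called Algorithm R). The following is only a
sketch under that assumption; the names sample_distinct_lines and count
are made up for illustration and are not from the recipe. It never
returns the same line of the file twice, although two different lines
with identical text can of course both appear.

```python
import random

def sample_distinct_lines(filename, count=1):
    # Hypothetical helper: picks up to `count` different lines in one pass.
    reservoir = []
    with open(filename) as source:
        for line_index, line in enumerate(source):
            if line_index < count:
                # Fill the reservoir with the first `count` lines.
                reservoir.append(line)
            else:
                # randint is inclusive at both ends, so this line lands
                # in the reservoir with probability count/(line_index + 1).
                slot = random.randint(0, line_index)
                if slot < count:
                    reservoir[slot] = line
    return reservoir
```

If the file has fewer than count lines, this simply returns all of
them. The result is in reservoir order, not file order, so sort or
shuffle it afterwards if the order matters.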

-- 
Cheers,
Simon B,
simon at brunningonline.net,
http://www.brunningonline.net/simon/blog/


