shuffle the lines of a large file

Alex Stapleton alexs at advfn.com
Mon Mar 7 09:20:40 EST 2005


Whoops, typo.

	else:
		buffer.shuffle()
		for line in buffer:
			print line

should be

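	else:
		random.shuffle(buffer)
		for buffered_line in buffer:
			print buffered_line,
		# start the next chunk with the current line so it is not dropped
		buffer = [line]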

of course
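
For anyone who wants something runnable, here is a minimal sketch of the
same chunked shuffle. It assumes Python 3 rather than the print statement
used in this thread, and the function name chunk_shuffle and the rng
parameter are made up for illustration. It also empties the buffer after
each flush and keeps the line that triggered the flush. Note that it only
shuffles within each chunk, so it is not a uniform shuffle of the whole
file.

```python
import random


def chunk_shuffle(lines, max_buffered_lines=1000, rng=random):
    """Yield the input lines chunk by chunk, each chunk in random order.

    `rng` is a hypothetical hook: anything with a shuffle() method,
    e.g. the random module itself or a seeded random.Random instance.
    """
    buffer = []
    for line in lines:
        buffer.append(line)
        if len(buffer) >= max_buffered_lines:
            rng.shuffle(buffer)
            yield from buffer
            buffer = []  # start a fresh chunk
    # flush whatever is left over after the last full chunk
    rng.shuffle(buffer)
    yield from buffer
```

Something like sys.stdout.writelines(chunk_shuffle(open('corpus.uniq')))
would then write the result to stdout without adding extra newlines.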

-----Original Message-----
From: python-list-bounces+alexs=advfn.com at python.org
[mailto:python-list-bounces+alexs=advfn.com at python.org]On Behalf Of Alex
Stapleton
Sent: 07 March 2005 14:17
To: Joerg Schuster; python-list at python.org
Subject: RE: shuffle the lines of a large file


I have not tested this. Run it (or some derivation thereof) over its own
output to get increasing randomness.
You will want to keep max_buffered_lines as high as possible, I imagine.
If shuffle() is too intensive, you could iterate over the buffer several
times, randomly removing and printing lines until the buffer is empty or
suitably small, which removes some more processing overhead.

### START ###
import random

f = open('corpus.uniq')

buffer = []
max_buffered_lines = 1000

for line in f:
	if len(buffer) < max_buffered_lines:
		buffer.append(line)
	else:
		buffer.shuffle()
		for line in buffer:
			print line

random.shuffle(buffer)
for line in buffer:
	print line


f.close()

### END ###
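
The other idea mentioned above, randomly removing lines until the buffer
is suitably small, could look roughly like the sketch below. Again this
assumes Python 3, and the names drain_shuffle, pop_random and low_water
are hypothetical, not anything from the thread. It is untested against a
file of the size discussed here.

```python
import random


def pop_random(buffer, rng):
    """Remove and return a random element of buffer in O(1)."""
    i = rng.randrange(len(buffer))
    # swap the chosen element to the end so pop() does not shift the list
    buffer[i], buffer[-1] = buffer[-1], buffer[i]
    return buffer.pop()


def drain_shuffle(lines, max_buffered_lines=1000, low_water=500, rng=random):
    """Yield lines by randomly draining a bounded buffer.

    When the buffer fills up, random lines are emitted until only
    `low_water` lines remain (an assumed tuning knob), so the remaining
    lines get mixed with later input.
    """
    buffer = []
    for line in lines:
        buffer.append(line)
        if len(buffer) >= max_buffered_lines:
            while len(buffer) > low_water:
                yield pop_random(buffer, rng)
    # end of input: drain the rest in random order
    while buffer:
        yield pop_random(buffer, rng)
```

Because some lines stay in the buffer across flushes, a line can move
further from its original position than with the plain chunked version,
though the result is still not a uniform shuffle of the whole file.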

-----Original Message-----
From: python-list-bounces+alexs=advfn.com at python.org
[mailto:python-list-bounces+alexs=advfn.com at python.org]On Behalf Of
Joerg Schuster
Sent: 07 March 2005 13:37
To: python-list at python.org
Subject: shuffle the lines of a large file


Hello,

I am looking for a method to "shuffle" the lines of a large file.

I have a corpus of sorted and "uniqed" English sentences that has been
produced with (1):

(1) sort corpus | uniq > corpus.uniq

corpus.uniq is 80G large. The fact that every sentence appears only
once in corpus.uniq is important for the processes I run on my
corpus. Yet the alphabetical order is an unwanted side effect of (1):
very often, I do not want (or rather, I do not have the computational
capacity) to apply a program to all of corpus.uniq, and any run of
consecutive lines of corpus.uniq is obviously a very lopsided set of
English sentences.

So, it would be very useful to do one of the following things:

- produce corpus.uniq in such a way that it is not sorted in any way
- shuffle corpus.uniq > corpus.uniq.shuffled

Unfortunately, none of the machines that I may use has 80G RAM.
So, using a dictionary will not help.

Any ideas?

Joerg Schuster

--
http://mail.python.org/mailman/listinfo/python-list
