shuffle the lines of a large file
vze4rx4y at verizon.net
Mon Mar 7 22:46:16 CET 2005
> I am looking for a method to "shuffle" the lines of a large file.
> I have a corpus of sorted and "uniqed" English sentences that has been
> produced with (1):
> (1) sort corpus | uniq > corpus.uniq
> corpus.uniq is 80G large.
Since the corpus is huge, the python portion should not pull it all into memory.
The best bet is to let the o/s tools take care of that part:
>>> from random import random
>>> out = open('corpus.decorated', 'w')
>>> for line in open('corpus.uniq'):
...     print >> out, '%.14f %s' % (random(), line),
...
>>> out.close()
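Then sort on the random prefix and cut it off again (the prefix is 17 characters wide: "0.", 14 digits, and a space):
sort corpus.decorated | cut -c 18- > corpus.randomized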
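
The decorate step in the post uses Python 2's "print >> out" form. For readers on Python 3, here is a minimal sketch of the same idea. The names decorate, strip_key and PREFIX_WIDTH are made up for illustration and are not from the original post; strip_key is only an in-Python stand-in for "cut -c 18-", and for an 80G file the external sort and cut remain the sensible choice.

```python
import random

# "0." + 14 digits + one space = 17 characters, which is why the
# shell pipeline keeps everything from column 18 onward.
PREFIX_WIDTH = 17


def decorate(src_path, dst_path, rng=random.random):
    """Copy src to dst, prefixing each line with a fixed-width random key.

    The file is streamed one line at a time, so memory use does not
    grow with the size of the corpus.
    """
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            # line already ends with a newline, so none is added here
            dst.write("%.14f %s" % (rng(), line))


def strip_key(line):
    """Drop the random key from one decorated line (like cut -c 18-)."""
    return line[PREFIX_WIDTH:]
```

After calling decorate('corpus.uniq', 'corpus.decorated'), the sort and cut pipeline from the post can be run unchanged on corpus.decorated.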