shuffle the lines of a large file
Raymond Hettinger
vze4rx4y at verizon.net
Mon Mar 7 16:46:16 EST 2005
[Joerg Schuster]
> I am looking for a method to "shuffle" the lines of a large file.
>
> I have a corpus of sorted and "uniqed" English sentences that has been
> produced with (1):
>
> (1) sort corpus | uniq > corpus.uniq
>
> corpus.uniq is 80G large.
Since the corpus is huge, the Python portion should not pull it all into memory.
The best bet is to let the o/s tools take care of that part:
>>> from random import random
>>> out = open('corpus.decorated', 'w')
>>> for line in open('corpus.uniq'):
...     print >> out, '%.14f %s' % (random(), line),
...
>>> out.close()
sort corpus.decorated | cut -c 18- > corpus.randomized
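For readers on Python 3, where the "print >>" form is no longer available, the decorate step might be sketched roughly as below. The function name decorate and the constant PREFIX_WIDTH are made up for illustration and are not part of the original recipe. The key is always 16 characters ('0.' plus 14 digits) followed by one space, which is why the cut command starts at column 18.

```python
import random

# Width of the decoration: '0.' + 14 digits + one separating space.
# Hypothetical constant, named here only to explain "cut -c 18-".
PREFIX_WIDTH = 17


def decorate(src_path, dst_path, rng=random.random):
    """Prefix every line of src_path with a fixed-width random key.

    The file is streamed one line at a time, so memory use stays small
    no matter how large the corpus is.
    """
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for line in src:
            # Each line keeps its own trailing newline.
            dst.write('%.14f %s' % (rng(), line))


if __name__ == '__main__':
    decorate('corpus.uniq', 'corpus.decorated')
```

The sort and cut pipeline is then run on corpus.decorated unchanged.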
Raymond Hettinger