shuffle the lines of a large file

Raymond Hettinger vze4rx4y at verizon.net
Mon Mar 7 16:46:16 EST 2005


[Joerg Schuster]
> I am looking for a method to "shuffle" the lines of a large file.
>
> I have a corpus of sorted and "uniqed" English sentences that has been
> produced with (1):
>
> (1) sort corpus | uniq > corpus.uniq
>
> corpus.uniq is 80G large.

Since the corpus is huge, the Python portion should not pull it all into memory.
The best bet is to tag each line with a random key in Python and let the o/s
tools take care of the sorting:

>>> from random import random
>>> out = open('corpus.decorated', 'w')
>>> for line in open('corpus.uniq'):
        print >> out, '%.14f %s' % (random(), line),

>>> out.close()

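Then, at the shell, sort on the random key and strip it off. The key and its
trailing space occupy the first 17 characters of each line, so the original
text starts at column 18:

sort corpus.decorated | cut -c 18- > corpus.randomized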
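
For anyone running Python 3, where the `print >> out` form is no longer valid,
the decorating step might look roughly like the sketch below. The names
decorate and KEY_WIDTH, the optional rng argument, and the choice of binary
file mode are assumptions made for illustration; none of them come from the
original post.

```python
import random

# "0." plus 14 digits plus one space = 17 characters,
# which is why the shell step uses `cut -c 18-`.
KEY_WIDTH = 17


def decorate(src_path, dst_path, rng=random):
    """Copy src_path to dst_path, prefixing every line with a random key.

    Hypothetical helper: the file is streamed one line at a time, so the
    whole corpus is never held in memory.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            # Fixed-width key so that sort orders by it and cut can strip it.
            key = ("%.14f " % rng.random()).encode("ascii")
            dst.write(key + line)


if __name__ == "__main__":
    decorate("corpus.uniq", "corpus.decorated")
```

Binary mode is used here only so that odd bytes in the corpus cannot raise a
decoding error; the sort and cut commands stay exactly as they are.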


Raymond Hettinger




