[Tutor] reading random line from a file
Aditya Lal
a_n_lal at yahoo.com
Thu Jul 19 15:14:59 CEST 2007
An alternative approach (I found Yorick's code too
slow when it is called a large number of times):
We can use the file size to pick a random byte offset
in the file, then read and discard text up to the next
newline. This avoids returning a partial line. We then
return the following line (which I guess is still random :)).
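Indicative code -

import os, random

def getrandomline(filename):
    # Pick a random byte offset; os.stat(...)[6] is the file size (st_size)
    offset = random.randint(0, os.stat(filename)[6])
    fd = open(filename, 'rb')
    try:
        fd.seek(offset)
        fd.readline()  # Read and ignore the (probably partial) line
        return fd.readline()
    finally:
        fd.close()

getrandomline("shaks12.txt")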
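Caveat: The above code will never choose the 1st line
and will return '' when the offset falls in the last
line. Lines are also not picked uniformly: a line is
returned whenever the offset lands in the line before
it, so lines that follow long lines come up more often.
Other than that it works well (even for large files).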
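
One way to handle both boundary cases from the caveat is to wrap around: if the second readline() comes back empty, the offset fell in the last line, so return the first line instead. This is only a sketch of that idea, not the code from the attachment. It is written for Python 3, and the name get_random_line_wrapped is made up for illustration. The choice is still weighted by the length of the preceding line.

```python
import os
import random

def get_random_line_wrapped(filename):
    """Return a line picked by random byte offset, wrapping at end of file."""
    size = os.path.getsize(filename)
    if size == 0:
        return b""                      # empty file: nothing to pick
    with open(filename, "rb") as fd:
        fd.seek(random.randrange(size))  # offset in [0, size - 1]
        fd.readline()                    # discard the (probably partial) line
        line = fd.readline()
        if not line:                     # offset fell in the last line:
            fd.seek(0)                   # wrap around to the first line
            line = fd.readline()
        return line
```

With this change the first line is returned exactly when the offset lands in the last line, and '' is only returned for an empty file.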
Interestingly:
On modifying this code to take a file object rather
than a filename, performance improved by ~50%. Wrapping
it in a class improved it by a further ~25%.
Calling get random line 100,000 times on a large
file (2707519 bytes, 9427 lines), the class
version finished in under 5 seconds.
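
For readers without the attachment, here is a guess at what a class version could look like. It is a sketch only (Python 3, hypothetical name RandomLineReader), assuming the gain comes from opening the file and looking up its size once instead of on every call. It keeps the same boundary behaviour as the function above.

```python
import os
import random

class RandomLineReader:
    """Hypothetical helper: keeps the file open and caches its size."""

    def __init__(self, filename):
        self.fd = open(filename, "rb")
        self.size = os.fstat(self.fd.fileno()).st_size

    def get(self):
        self.fd.seek(random.randint(0, self.size))
        self.fd.readline()         # discard the (probably partial) line
        return self.fd.readline()  # b"" if the offset fell in the last line

    def close(self):
        self.fd.close()
```

Usage would be something like reader = RandomLineReader("shaks12.txt"), then reader.get() as many times as needed, then reader.close().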
Platform: 2GHz Intel Core 2 Duo MacBook (2GB RAM) running Mac OS X (10.4.10).

Output using python 2.5.1 (stackless)
Approach using enum approach : 9.55798196793 : for [100] iterations
Approach using filename : 11.552863121 : for [100000] iterations
Approach using file descriptor : 5.97015094757 : for [100000] iterations
Approach using class : 4.46039891243 : for [100000] iterations

Output using python 2.3.5 (default python on OSX)
Approach using enum approach : 12.2886080742 : for [100] iterations
Approach using filename : 12.5682640076 : for [100000] iterations
Approach using file descriptor : 6.55952501297 : for [100000] iterations
Approach using class : 5.35413718224 : for [100000] iterations
I am attaching test program FYI.
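
In case the attachment is not available, a comparison of this kind could be timed roughly as below. This is a sketch, not the attached randline.py: it is Python 3, the function names and the stand-in data file are made up, and it produces no particular numbers.

```python
import os
import random
import tempfile
import timeit

def randline_enum(filename):
    """Single pass: line i replaces the current pick with probability 1/(i+1)."""
    line = b""
    with open(filename, "rb") as f:
        for i, candidate in enumerate(f):
            if random.randint(0, i) == i:
                line = candidate
    return line

def randline_seek(filename):
    """Seek to a random byte offset, skip to the next newline, return the next line."""
    with open(filename, "rb") as f:
        f.seek(random.randint(0, os.path.getsize(filename)))
        f.readline()
        return f.readline()

def time_approach(func, filename, iterations):
    """Return the total seconds taken by `iterations` calls of func(filename)."""
    return timeit.timeit(lambda: func(filename), number=iterations)

if __name__ == "__main__":
    # Stand-in data so the sketch runs anywhere; substitute a real file
    # such as shaks12.txt to get meaningful timings.
    with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
        tmp.write(b"".join(b"line %d\n" % n for n in range(1000)))
    try:
        for name, func, n in (("enum", randline_enum, 100),
                              ("seek", randline_seek, 10000)):
            secs = time_approach(func, tmp.name, n)
            print("Approach using %s : %s : for [%d] iterations" % (name, secs, n))
    finally:
        os.remove(tmp.name)
```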
--
Aditya
--- Nathan Coulter
<com.gmail.kyleaschmitt at pooryorick.com> wrote:
> > -------Original Message-------
> > From: Tiger12506 <keridee at jayco.net>
>
> > Yuck. Talk about a one shot function! Of course it only reads through the
> > file once! You only call the function once. Put a second print randline(f)
> > at the bottom of your script and see what happens :-)
> >
> > JS
> >
>
> *sigh*
>
> #!/bin/env python
>
> import os
> import random
>
> text = 'shaks12.txt'
> if not os.path.exists(text):
>     os.system('wget http://www.gutenberg.org/dirs/etext94/shaks12.txt')
>
> def randline(f):
>     for i,j in enumerate(file(f, 'rb')):
>         if random.randint(0,i) == i:
>             line = j
>     return line
>
> print randline(text)
> print randline(text)
> print randline(text)
>
> --
> Yorick
-------------- next part --------------
A non-text attachment was scrubbed...
Name: randline.py
Type: text/script
Size: 2039 bytes
Desc: 941365746-randline.py
Url : http://mail.python.org/pipermail/tutor/attachments/20070719/cf073b56/attachment.bin