[Tutor] reading random line from a file

Aditya Lal a_n_lal at yahoo.com
Thu Jul 19 15:14:59 CEST 2007


An alternative approach (I found Yorick's code too
slow for a large number of calls):

We can use the file size to pick a random point in the
file, then read and discard text up to the next newline.
This avoids returning a partial line. We then return the
next line (which I guess is still random :)).

Indicative code -

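import os
import random

def getrandomline(filename):
    # Pick a random byte offset somewhere inside the file.
    offset = random.randint(0, os.path.getsize(filename))
    fd = open(filename, 'rb')
    try:
        fd.seek(offset)
        fd.readline()         # read and ignore the (probably partial) line
        return fd.readline()  # next complete line, or '' at end of file
    finally:
        fd.close()

getrandomline("shaks12.txt")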

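Caveat: The above code will never choose the first line
and will return '' whenever the offset lands in the last
line. Each other line is also picked in proportion to the
length of the line before it, so the choice is not
uniform. Other than the boundary conditions it will work
well (even for large files).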
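
One way to handle the two boundary conditions is to wrap around to the start of the file whenever the offset lands in the last line. The sketch below is only an illustration of that idea, not the code from the attachment; the name get_random_line is made up here, and it is written for a current Python 3 rather than the interpreters used in this post. Lines are still weighted by the length of the preceding line, so the choice is not uniform.

```python
import os
import random

def get_random_line(filename):
    """Seek to a random byte offset and return the line that starts after it.

    Landing inside the last line wraps around to the first line, so the
    first line can be chosen and b'' is only returned for an empty file.
    """
    size = os.path.getsize(filename)
    if size == 0:
        return b''
    with open(filename, 'rb') as fd:
        fd.seek(random.randrange(size))  # offset in [0, size - 1]
        fd.readline()                    # discard the (possibly partial) line
        line = fd.readline()
        if not line:                     # offset was in the last line: wrap around
            fd.seek(0)
            line = fd.readline()
        return line
```

The file is opened in binary mode, as in the original, so lines come back as bytes with their trailing newline.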

Interestingly:

On modifying this code to take a file object rather
than a filename, the performance improved by ~50%. On
wrapping it in a class, it improved by a further ~25%.

On calling getrandomline 100,000 times on a large
file (size 2707519 with 9427 lines), the class
version finished in under 5 seconds.
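
The attached test program is not reproduced in this message, so the following is only a guess at what the file-object and class variants might look like. The names get_random_line_from and RandomLineReader are hypothetical, and the sketch assumes a non-empty file and a current Python 3. The likely source of the speed-up is that the file is opened and its size looked up once rather than on every call.

```python
import os
import random

def get_random_line_from(fd, size):
    """Same technique, but reusing an already-open binary file object.

    Assumes size > 0; random.randrange(0) would raise ValueError.
    """
    fd.seek(random.randrange(size))
    fd.readline()          # discard the (possibly partial) line
    return fd.readline()   # b'' if the offset landed in the last line


class RandomLineReader(object):
    """Opens the file and looks up its size once, then serves many calls."""

    def __init__(self, filename):
        self.size = os.path.getsize(filename)
        self.fd = open(filename, 'rb')

    def getrandomline(self):
        self.fd.seek(random.randrange(self.size))
        self.fd.readline()         # discard the (possibly partial) line
        return self.fd.readline()

    def close(self):
        self.fd.close()
```

Both variants keep the boundary behaviour described in the caveat: the first line is never returned, and b'' comes back when the offset falls in the last line.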

Platform: 2GHz Intel Core 2 Duo MacBook (2GB RAM)
running Mac OS X (10.4.10).

Output using Python 2.5.1 (stackless)

Approach using enum approach : 9.55798196793 : for [100] iterations
Approach using filename : 11.552863121 : for [100000] iterations
Approach using file descriptor : 5.97015094757 : for [100000] iterations
Approach using class : 4.46039891243 : for [100000] iterations

Output using Python 2.3.5 (default Python on OS X)

Approach using enum approach : 12.2886080742 : for [100] iterations
Approach using filename : 12.5682640076 : for [100000] iterations
Approach using file descriptor : 6.55952501297 : for [100000] iterations
Approach using class : 5.35413718224 : for [100000] iterations
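
Timings in this format could be collected with a small helper along the following lines. This is a sketch under assumptions, not the attached test program: report is a hypothetical name, it relies on the standard timeit module as found in a current Python 3, and the elapsed value it prints is in seconds.

```python
import timeit

def report(label, func, iterations):
    """Time `iterations` calls of func() and print one result line."""
    elapsed = timeit.timeit(func, number=iterations)
    print("Approach using %s : %s : for [%d] iterations"
          % (label, elapsed, iterations))
    return elapsed
```

For example, report("filename", lambda: getrandomline("shaks12.txt"), 100000) would time the filename-based version.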

I am attaching the test program FYI.

--
Aditya

--- Nathan Coulter
<com.gmail.kyleaschmitt at pooryorick.com> wrote:

> >  -------Original Message-------
> >  From: Tiger12506 <keridee at jayco.net>
> 
> >  Yuck. Talk about a one shot function! Of course it only reads through the
> >  file once! You only call the function once. Put a second print randline(f)
> >  at the bottom of your script and see what happens :-)
> >  
> >  JS
> >  
> 
> *sigh*
> 
> #!/bin/env python
> 
> import os
> import random
> 
> text = 'shaks12.txt'
> if not os.path.exists(text):
>   os.system('wget http://www.gutenberg.org/dirs/etext94/shaks12.txt')
> 
> def randline(f):
>     for i,j in enumerate(file(f, 'rb')):
>         if random.randint(0,i) == i:
>             line = j
>     return line
> 
> print randline(text)
> print randline(text)
> print randline(text)
> 
> -- 
> Yorick



 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: randline.py
Type: text/script
Size: 2039 bytes
Desc: 941365746-randline.py
Url : http://mail.python.org/pipermail/tutor/attachments/20070719/cf073b56/attachment.bin 

