Seek the one billionth line in a file containing 3 billion lines.

Evan Klitzke evan at yelp.com
Wed Aug 8 02:38:28 EDT 2007


On 8/7/07, Sullivan WxPyQtKinter <sullivanz.pku at gmail.com> wrote:
> I have a huge log file which contains 3,453,299,000 lines with
> different lengths. It is not possible to calculate the absolute
> position of the beginning of the one billionth line. Are there
> efficient way to seek to the beginning of that line in python?
>
> This program:
> for i in range(1000000000):
>       f.readline()
> is absolutely every slow....
>
> Thank you so much for help.

There is no fast way to do this, unless the lines are of fixed length
(in which case you can use f.seek() to move to the correct spot). The
reason is that there is no way to find the position of the billionth
line without scanning the whole file. You should split your logs into
smaller files in the future.

You might be able to do this a very tiny bit faster by using the split
utility and have it split the log file into smaller chunks (split can
split by line amounts), but since that still has to scan the file it
will will be IO bound.

-- 
Evan Klitzke <evan at yelp.com>



More information about the Python-list mailing list