[Tutor] large file

Steven D'Aprano steve at pearwood.info
Mon Jun 14 03:08:36 CEST 2010

On Mon, 14 Jun 2010 07:45:45 am Hs Hs wrote:
> hi:
> I have a very large file 15Gb. Starting from 15th line this file 
> shows the following lines:
> HWUSI-EAS1211_0001:1:1:977:20764#0
> HWUSI-EAS1211_0001:1:1:977:20764#0
> HWUSI-EAS1521_0001:1:1:978:13435#0
> HWUSI-EAS1521_0001:1:1:978:13435#0

It looks to me like your file has a *lot* of redundant information. Does 
the first part of the line "HWUSI-EAS1211_0001:1:1:" ever change? If it 
does not, then you can save approximately 65% of the file size by just 
recording it once, instead of 400 million times:

[first 14 lines]
prefix = HWUSI-EAS1211_0001:1:1:

That will bring the file down from 15GB to less than 6GB, which should 
cut both processing time and storage requirements significantly.
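
As a rough check on those figures, here is the arithmetic as a small 
sketch. It assumes every data line looks like the sample quoted above, 
ends in a single newline byte, and shares the same prefix. None of that 
is confirmed for the whole file, so treat the numbers as estimates only.

```python
# Rough estimate of the savings from storing the prefix once.
# Assumption: every line looks like this sample and shares its prefix.
sample = "HWUSI-EAS1211_0001:1:1:977:20764#0"
prefix = "HWUSI-EAS1211_0001:1:1:"

bytes_per_line = len(sample) + 1             # +1 for the newline
fraction_saved = len(prefix) / float(bytes_per_line)
new_size_gb = 15 * (1 - fraction_saved)      # 15GB is the size given in the question
approx_lines = 15e9 / bytes_per_line         # roughly how many lines there are

print("fraction saved: %.1f%%" % (fraction_saved * 100))
print("new size: about %.1f GB" % new_size_gb)
print("lines: about %.0f million" % (approx_lines / 1e6))
```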

> Every two lines are part of one readgroup. I want to add two
> variables to every line. First variable goes to all lines with odd
> numbers. Second variable should be appended to all even number lines.

How are these variables calculated? You need some way of automatically 
calculating them, perhaps by looking them up in a database. I don't 
know how you calculate them, so I will invent two simple stubs:

def suffix1(line):
    # Calculate the first suffix variable.
    return " RG:Z:2301"

def suffix2(line):
    # Calculate the second suffix variable.
    return " RG:Z:2302"

> Since I cannot read the entire file, I wanted to cat the file

Of course you can read the file; you just can't read it ALL AT ONCE. 
There's no need to use cat. Just read the file line by line and do 
something with each line:

infile = open("myfile", "r")
outfile = open("output.sam", "w")
# Skip over the first 14 lines; the data starts on line 15.
for i in range(14):
    next(infile)  # Read and discard one line.

suffixes = [suffix1, suffix2]  # Store the function objects.
n = 0
# Process the lines.
for line in infile:
    line = line.strip()
    # Which suffix function do we want to call?
    suffix = suffixes[n]
    outfile.write(line + suffix(line) + '\n')
    n = 1 - n  # n -> 1, 0, 1, 0, 1, 0, ...

# Close both files so the output is flushed to disk.
infile.close()
outfile.close()
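
The same loop can also be written as a reusable function. This is only 
a sketch, assuming Python 2.7 or 3 (for the two-file "with" statement). 
The name add_suffixes and the header_lines parameter are my own 
inventions, and the two suffix functions are still the placeholder 
stubs from above. Like the loop above, it does not copy the header 
lines to the output.

```python
from itertools import cycle, islice

def suffix1(line):
    # Placeholder stub: suffix for odd-numbered data lines.
    return " RG:Z:2301"

def suffix2(line):
    # Placeholder stub: suffix for even-numbered data lines.
    return " RG:Z:2302"

def add_suffixes(in_path, out_path, header_lines=14):
    """Append alternating suffixes to each data line of in_path.

    The first header_lines lines are skipped and not written out.
    Only one line is held in memory at a time.
    """
    # "with" closes both files even if an error occurs part-way through.
    with open(in_path, "r") as infile, open(out_path, "w") as outfile:
        # islice skips the header lazily, without loading the file.
        data = islice(infile, header_lines, None)
        # cycle yields suffix1, suffix2, suffix1, suffix2, ... forever.
        suffixes = cycle([suffix1, suffix2])
        for line in data:
            line = line.rstrip("\n")
            suffix = next(suffixes)
            outfile.write(line + suffix(line) + "\n")
```

It would be called as add_suffixes("myfile", "output.sam"). Using 
cycle() replaces the n = 1 - n bookkeeping, and would extend to three 
or more suffixes without further changes.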


Steven D'Aprano
