[Tutor] sorting a 2 GB file - I shrunk it and turned it around

Scott Melnyk melnyk at gmail.com
Tue Jan 25 17:26:52 CET 2005


Thanks for the thoughts so far.  After posting, I have been thinking
about how to pare down the file (much of the info in the big file was
not relevant to the question at hand).

After the first couple of responses I was even more motivated to
shrink the file so I would not have to set up a DB. This test will be
run only now, and later to verify against another test set, so the DB
setup seemed like more work than it might be worth.

I was able to reduce my file down to about 160 MB by paring out
every line not directly related to what I want, using some simple
regular expressions and a couple of tests for inclusion.

The format, and what info is compared against what, differs from my
original examples, as I believe this version is clearer.


My queries are named by lines such as:
ENSE00001387275.1|ENSG00000187908.1|ENST00000339871.1
ENSE is an exon       ENSG is the gene     ENST is a transcript

They all have the above format; they differ only in the numbers
following ENSE, ENSG, or ENST.
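
A regular expression along these lines should pull the three IDs
apart (the exact pattern is just my guess at the format):

import re

# exon|gene|transcript, e.g.
# ENSE00001387275.1|ENSG00000187908.1|ENST00000339871.1
query_pat = re.compile(r'(ENSE\d+\.\d+)\|(ENSG\d+\.\d+)\|(ENST\d+\.\d+)')

m = query_pat.search('ENSE00001387275.1|ENSG00000187908.1|ENST00000339871.1')
if m:
    exon, gene, transcript = m.groups()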

Each query is for a different exon.  For background: each gene has
many exons, and in this dataset there are different versions of which
exons are in each gene.  These different collections are the
transcripts, e.g. ENST00000339871.1.

In short, a transcript is a version of a gene here:
transcript 1 may be formed of exons a, b, and c
transcript 2 may contain exons a, b, and d



The other lines (results) are of the format:
hg17_chainMm5_chr7_random range=chr10:124355404-124355687 5'pad=...    44  0.001
hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pad=0 3'pa...    44  0.001

"hg17_chainMm5_chr7_random range=chr10:124355404-124355687" is the
important part here from "5'pad" on is not important at this point
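
Assuming the chain name and the range never contain spaces, a pattern
like this should grab just the part that matters:

import re

# capture everything up to (but not including) the " 5'pad" field
result_pat = re.compile(r"^(hg17_\S+ range=\S+)")

line = "hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pad=0 3'pa...    44  0.001"
m = result_pat.match(line)
if m:
    result = m.group(1)   # "hg17_chainMm5_chr14 range=chr10:124355392-124355530"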


What I am now trying to do is make a list of any results that
appear in more than one transcript.

##########
FILE SAMPLE:

This is the number 1  query tested.
Results for scoring against Query=
ENSE00001387275.1|ENSG00000187908.1|ENST00000339871.1
 are: 

hg17_chainMm5_chr7_random range=chr10:124355404-124355687 5'pad=...    44  0.001
hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pad=0 3'pa...    44  0.001
hg17_chainMm5_chr7 range=chr10:124355391-124355690 5'pad=0 3'pad...    44  0.001
hg17_chainMm5_chr6 range=chr10:124355389-124355690 5'pad=0 3'pad...    44  0.001
hg17_chainMm5_chr7 range=chr10:124355388-124355687 5'pad=0 3'pad...    44  0.001
hg17_chainMm5_chr7_random range=chr10:124355388-124355719 5'pad=...    44  0.001

....

This is the number 3  query tested.
Results for scoring against Query=
ENSE00001365999.1|ENSG00000187908.1|ENST00000339871.1
 are: 

hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pad=0 3'pa...    60  2e-08
hg17_chainMm5_chr7 range=chr10:124355391-124355690 5'pad=0 3'pad...    60  2e-08
hg17_chainMm5_chr6 range=chr10:124355389-124355690 5'pad=0 3'pad...    60  2e-08
hg17_chainMm5_chr7 range=chr10:124355388-124355687 5'pad=0 3'pad...    60  2e-08

##############

I would like to generate a file by looking for any results (the
hg17_etc lines) that occur in more than one transcript (taken from the
query line, e.g. ENSE00001365999.1|ENSG00000187908.1|ENST00000339871.1).


So if
hg17_chainMm5_chr7_random range=chr10:124355404-124355687
shows up again later in the file, I want to know, and I want to record
where it is used more than once; otherwise I will ignore it.

I am thinking of a regular expression to capture the transcript ID,
followed by something that captures each of the results, writes to
another file any time a result appears more than once, and ties the
transcript IDs to the results somehow; roughly the sketch below.
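
Something like this is what I have in mind (the file names are
placeholders; a dictionary mapping each result to the set of
transcripts it appears under does the bookkeeping):

import re

query_pat  = re.compile(r'ENSE\d+\.\d+\|ENSG\d+\.\d+\|(ENST\d+\.\d+)')
result_pat = re.compile(r'^(hg17_\S+ range=\S+)')

seen = {}          # result -> set of transcript IDs it appears under
transcript = None  # transcript of the query block we are currently inside

for line in open('pared_down.txt'):            # placeholder file name
    m = query_pat.search(line)
    if m:
        transcript = m.group(1)
        continue
    m = result_pat.match(line)
    if m and transcript is not None:
        seen.setdefault(m.group(1), set()).add(transcript)

out = open('shared_results.txt', 'w')          # placeholder file name
for result, transcripts in sorted(seen.items()):
    if len(transcripts) > 1:                   # in more than one transcript
        out.write('%s\t%s\n' % (result, ', '.join(sorted(transcripts))))
out.close()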

Any suggestions?
I agree that if I had more time and was going to be doing more of
this, the DB would be the way to go.
As an aside, I have not looked into SQLite; I am hoping to avoid a DB
right now, since I'd have to get the sysadmin to give me permission to
install something again, etc.  Whereas I am hoping to get this
together in a reasonably short script.

However, I will look at it later (it could be helpful for other things
for me).

Thanks again to all,  
Scott

