[Tutor] vcf_files and strings

Mon Oct 10 04:04:50 CEST 2011

Anna Olofsson wrote:
> Hi,
> 
> I'm a beginner at Python and would really like some help in how to
> extract information from a vcf file.
> 
> The attached file consists of a lot of information on mutations, this
> one though is just 2 rows and 10 columns (the real one has a lot more
> rows).

What do you mean by a VCF file? On my computer, a VCF file is an 
electronic business card, which tries to open in an Address Book 
application (which obviously fails).

I don't know how to interpret the contents of your VCF file. After 
opening it in a hex editor, I can *guess* that it is a tab-separated 
file: each row takes one line, with the columns separated by tab 
characters. Column 7 appears to be a great big ugly blob with sub-fields 
separated by semi-colons. Am I right? Can you link us to a description 
of the vcf file format?

> I want to extract the mRNA ID only if the mutation is missense. These
> two rows (mutations) that I have attached happens to be missense but
> how do I say that I'm not interested in the mutations that's not
> missense (they might be e.g. synonymous).   Also, how do I say that if
> a mutation starts with a # symbol I don't want to include it
> (sometimes the chr starts with a hash).

What chr? Where is the mutation? I'm afraid your questions are assuming 
familiarity with your data that we don't have.

> vcf file: 2 rows, 10 columns.
> col 0                         col 1            col 2
> col 3              col 4      col5            col6
> col7                                     col8
> col9 chromosome          position           .
> Reference       ALT      position          .          some statistics
> and the ID:s         not important        not important
> 
> The important column is 7 where the ID is, i.e.
> refseq.functionalClass=missense. It's a missense mutation, so then I
> want to extract refseq.name=NM_003137492, or I want to extract only
> the ID, which in this case is NM_003137492.

This is what I *think* you want to do. Am I right?

* read each line of the file
* for each line, split on tabs
* extract the 7th column and split it on semi-colons
* inspect the refseq.functionalClass field
* if it matches, extract the ID from the refseq.name and store it in a 
list for later

(I have completely ignored the part about the #, because I don't 
understand what you mean by it.)

Here's some code to do it:

ids = []
f = open('vcf_file.vcf', 'r')

for row in f:
     columns = row.split('\t')  # Split on tabs
     data = columns[7]  # Huge ugly blob of data
     values = data.split(';')  # Split on semi-colons
     if values[25] == "refseq.functionalClass=missense":
         name_chunk = values[28]  # looks like "refseq.name=..."
         a, b = name_chunk.split("=")
         if a != "refseq.name":
             raise ValueError('expected refseq.name but got %s' % a)
         ids.append(b)

f.close()
print(ids)

Does this help?

-- 
Steven