[Tutor] vcf_files and strings

Hs Hs ilhs_hs at yahoo.com
Tue Oct 11 19:16:35 CEST 2011

VCF - Variant Call Format

VCF files are nothing special but tab delim files describing the genetic mutations, friquencies and other base information (bases here mean ATGC pertaining to DNA). 

These files are generated by variety of genome sequence data analysis pipelines.  MIT and Haplotype Mapping Project consortium developed this format. 

Nothing special except that a genetics or any biologist will understand this - nothing special!



From: Steven D'Aprano <steve at pearwood.info>
To: tutor at python.org
Sent: Sunday, October 9, 2011 10:04 PM
Subject: Re: [Tutor] vcf_files and strings

Anna Olofsson wrote:
> Hi,
> I'm a beginner at Python and would really like some help in how to
> extract information from a vcf file.
> The attached file consists of a lot of information on mutations, this
> one though is just 2 rows and 10 columns (the real one has a lot more
> rows).

What do you mean by a VCF file? On my computer, a VCF file is an electronic business card, which tries to open in an Address Book application (which obviously fails).

I don't know how to interpret the contents of your VCF file. After opening it in a hex editor, I can *guess* that it is a tab-separated file: each row takes one line, with the columns separated by tab characters. Column 7 appears to be a great big ugly blob with sub-fields separated by semi-colons. Am I right? Can you link us to a description of the vcf file format?

> I want to extract the mRNA ID only if the mutation is missense. These
> two rows (mutations) that I have attached happens to be missense but
> how do I say that I'm not interested in the mutations that's not
> missense (they might be e.g. synonymous).   Also, how do I say that if
> a mutation starts with a # symbol I don't want to include it
> (sometimes the chr starts with a hash).

What chr? Where is the mutation? I'm afraid your questions are assuming familiarity with your data that we don't have.

> vcf file: 2 rows, 10 columns.
> col 0                         col 1            col 2
> col 3              col 4      col5            col6
> col7                                     col8
> col9 chromosome          position           .
> Reference       ALT      position          .          some statistics
> and the ID:s         not important        not important
> The important column is 7 where the ID is, i.e.
> refseq.functionalClass=missense. It's a missense mutation, so then I
> want to extract refseq.name=NM_003137492, or I want to extract only
> the ID, which in this case is NM_003137492.

This is what I *think* you want to do. Am I right?

* read each line of the file
* for each line, split on tabs
* extract the 7th column and split it on semi-colons
* inspect the refseq.functionalClass field
* if it matches, extract the ID from the refseq.name and store it in a list for later

(I have completely ignored the part about the #, because I don't understand what you mean by it.)

Here's some code to do it:

ids = []
f = open('vcf_file.vcf', 'r')

for row in f:
    columns = row.split('\t')  # Split on tabs
    data = columns[7]  # Huge ugly blob of data
    values = data.split(';')  # Split on semi-colons
    if values[25] == "refseq.functionalClass=missense":
        name_chunk = values[28]  # looks like "refseq.name=..."
        a, b = name_chunk.split("=")
        if a != "refseq.name":
            raise ValueError('expected refseq.name but got %s' % a)


Does this help?

-- Steven
Tutor maillist  -  Tutor at python.org
To unsubscribe or change subscription options:
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/tutor/attachments/20111011/5bb6ed05/attachment.html>

More information about the Tutor mailing list