[Tutor] vcf_files and strings
Steven D'Aprano
steve at pearwood.info
Mon Oct 10 04:04:50 CEST 2011
Anna Olofsson wrote:
> Hi,
>
> I'm a beginner at Python and would really like some help in how to
> extract information from a vcf file.
>
> The attached file consists of a lot of information on mutations, this
> one though is just 2 rows and 10 columns (the real one has a lot more
> rows).
What do you mean by a VCF file? On my computer, a VCF file is an
electronic business card, which tries to open in an Address Book
application (which obviously fails).
I don't know how to interpret the contents of your VCF file. After
opening it in a hex editor, I can *guess* that it is a tab-separated
file: each row takes one line, with the columns separated by tab
characters. Column 7 appears to be a great big ugly blob with sub-fields
separated by semi-colons. Am I right? Can you link us to a description
of the vcf file format?
> I want to extract the mRNA ID only if the mutation is missense. These
> two rows (mutations) that I have attached happens to be missense but
> how do I say that I'm not interested in the mutations that's not
> missense (they might be e.g. synonymous). Also, how do I say that if
> a mutation starts with a # symbol I don't want to include it
> (sometimes the chr starts with a hash).
What chr? Where is the mutation? I'm afraid your questions are assuming
familiarity with your data that we don't have.
> vcf file: 2 rows, 10 columns.
> col 0 col 1 col 2
> col 3 col 4 col5 col6
> col7 col8
> col9 chromosome position .
> Reference ALT position . some statistics
> and the ID:s not important not important
>
> The important column is 7 where the ID is, i.e.
> refseq.functionalClass=missense. It's a missense mutation, so then I
> want to extract refseq.name=NM_003137492, or I want to extract only
> the ID, which in this case is NM_003137492.
This is what I *think* you want to do. Am I right?
* read each line of the file
* for each line, split on tabs
* extract the 7th column and split it on semi-colons
* inspect the refseq.functionalClass field
* if it matches, extract the ID from the refseq.name and store it in a
list for later
(I have completely ignored the part about the #, because I don't
understand what you mean by it.)
Here's some code to do it:
ids = []
f = open('vcf_file.vcf', 'r')
for row in f:
columns = row.split('\t') # Split on tabs
data = columns[7] # Huge ugly blob of data
values = data.split(';') # Split on semi-colons
if values[25] == "refseq.functionalClass=missense":
name_chunk = values[28] # looks like "refseq.name=..."
a, b = name_chunk.split("=")
if a != "refseq.name":
raise ValueError('expected refseq.name but got %s' % a)
ids.append(b)
f.close()
print(ids)
Does this help?
--
Steven
More information about the Tutor
mailing list