[Tutor] Parse files and create sqlite3 db
jarod_v6 at libero.it
jarod_v6 at libero.it
Wed Dec 3 10:36:36 CET 2014
Dear all
I aattach the file I want to parse and I wanto to create a sqlite database.
The problem is I'm not able to create in the right way because there is some logic problems on my script. I have this code:
import sqlite
import sqlite3
import os
# Creates or opens a file called mydb with a SQLite3 DB
db = sqlite3.connect('Fastqcd_completo.db')
cursor = db.cursor()
create_table_sql = """
create table fastqc_summary (
fileid varchar,
module varchar,
status varchar,
total int,
duplicate varchar
);"""
cursor.execute(create_table_sql)
create_table_sql2="""
create table fastqc_details (
id serial primary key,
fileid varchar,
module varchar,
col1 varchar,
ocols varchar
);
"""
cursor.execute(create_table_sql2)
db.commit()
for root, dirs, files in os.walk("/home/mauro/Desktop/LAVORO_CRO/2014/Statitica_RNAseqalign/FASTQC_completo/fastqcdecembre/"): # walk a r
for name in files:
if (name == "fastqc_data.txt"):
fileid = name # use string slicing here if you only want part of the
with open(os.path.join(root,name),"r") as p: # automatically close the file when done
for i in p:
line =i.strip()
if "Filename" in line:
fileid = line.split()[1]
if "Total Sequence" in line:
total = line.split()[2]
if "Total Duplicate" in line:
dup = line.split()[3]
if (line[:2] == ">>" and line[:12] != ">>END_MODULE"):
module = line[2:-5] # grab module name
status = line[-4:] # and overall status pass/warn/fail
sql = "insert into fastqc_summary(fileid,module,status,total,duplicate) values(?,?,?,?,?);"
data = (fileid,module,status,total,dup)
cursor.execute(sql,data)
elif (line[:2] != ">>" and line[:2] != "##"): # grab details under each module
cols = line.split("\t")
col1 = cols[0]
ocols = "|".join(cols[1:])
sql = "insert into fastqc_details(fileid,module,col1,ocols) values(?,?,?,?);"
data = (fileid,module,col1,ocols)
cursor.execute(sql,data)
db.commit()
So the problem is how to excatct only some parts of the files. In red are the point of the problems. The say the Filename are not defined On the file atached I want to take this part and use for create the database:
##FastQC 0.10.1
>>Basic Statistics pass
#Measure Value
Filename R05_CTTGTA_L004_R1_001.fastq.gz
File type Conventional base calls
Encoding Sanger / Illumina 1.9
Total Sequences 27868496
Filtered Sequences 0
Sequence length 50
%GC 50
>>END_MODULE
How can I resolve this problem? I need to use the number of rows?
bw,
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/tutor/attachments/20141203/736763eb/attachment-0001.html>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: fastqc_data.txt
URL: <http://mail.python.org/pipermail/tutor/attachments/20141203/736763eb/attachment-0001.txt>
More information about the Tutor
mailing list