[Tutor] Parse files and create sqlite3 db

jarod_v6 at libero.it jarod_v6 at libero.it
Wed Dec 3 10:36:36 CET 2014


Dear all
I aattach the file I want to parse and I wanto to create a sqlite database.
The problem is I'm not able to create in the right way because there is some logic problems on my script. I have  this code:


import sqlite
import sqlite3
import os

# Creates or opens a file called mydb with a SQLite3 DB
db = sqlite3.connect('Fastqcd_completo.db')
cursor = db.cursor()
create_table_sql = """
create table fastqc_summary (
fileid varchar,
module varchar,
status varchar,
total int,
duplicate varchar
);"""
cursor.execute(create_table_sql)
create_table_sql2="""
create table fastqc_details (
id serial primary key,
fileid varchar,
module varchar,
col1 varchar,
ocols varchar
);
"""
cursor.execute(create_table_sql2)
db.commit()

for root, dirs, files in os.walk("/home/mauro/Desktop/LAVORO_CRO/2014/Statitica_RNAseqalign/FASTQC_completo/fastqcdecembre/"): # walk a r
    for name in files:
        if (name == "fastqc_data.txt"):
            fileid = name # use string slicing here if you only want part of the
 
            with open(os.path.join(root,name),"r") as p: # automatically close the file when done
                for i in p:
                    line =i.strip()
                    
                    if "Filename" in line:
                        fileid = line.split()[1]
                    if "Total Sequence" in line:
                        total = line.split()[2]
                        
                    if "Total Duplicate" in line:
                        dup = line.split()[3]
                                        
                    if (line[:2] == ">>" and line[:12] != ">>END_MODULE"):
                        module = line[2:-5] # grab module name
                        status = line[-4:] # and overall status pass/warn/fail
                        sql = "insert into fastqc_summary(fileid,module,status,total,duplicate) values(?,?,?,?,?);"
                        data = (fileid,module,status,total,dup)
                        cursor.execute(sql,data)
                    elif (line[:2] != ">>" and line[:2] != "##"): # grab details under each module
                        cols = line.split("\t")
                        col1 = cols[0]
                        ocols = "|".join(cols[1:])
                        sql = "insert into fastqc_details(fileid,module,col1,ocols) values(?,?,?,?);"
                        data = (fileid,module,col1,ocols)
                        cursor.execute(sql,data)
db.commit()


So the problem is how to excatct only some parts of the files. In red are the point of the problems. The say the Filename are not defined On the file atached I want to take this part and use for create the database:
##FastQC    0.10.1
>>Basic Statistics    pass
#Measure    Value    
Filename    R05_CTTGTA_L004_R1_001.fastq.gz    
File type    Conventional base calls    
Encoding    Sanger / Illumina 1.9    
Total Sequences    27868496    
Filtered Sequences    0    
Sequence length    50    
%GC    50    
>>END_MODULE
How can I resolve this problem? I need to use the number of rows?
bw,



-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/tutor/attachments/20141203/736763eb/attachment-0001.html>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: fastqc_data.txt
URL: <http://mail.python.org/pipermail/tutor/attachments/20141203/736763eb/attachment-0001.txt>


More information about the Tutor mailing list