[Tutor] Parse files and create sqlite3 db

Alan Gauld alan.gauld at btinternet.com
Wed Dec 3 11:24:07 CET 2014


On 03/12/14 09:36, jarod_v6 at libero.it wrote:
> Dear all
> I aattach the file I want to parse and I wanto to create a sqlite database.
> The problem is I'm not able to create in the right way because there is
> some logic problems on my script. I have  this code:

Do you get an error message? If so please send a cut n paste of the full 
error text.

> import sqlite
> import sqlite3

You only need the second of these imports

> import os
>
> # Creates or opens a file called mydb with a SQLite3 DB
> db = sqlite3.connect('Fastqcd_completo.db')
> cursor = db.cursor()
> create_table_sql = """
> create table fastqc_summary (
> fileid varchar,
> module varchar,
> status varchar,
> total int,
> duplicate varchar
> );"""
 > cursor.execute(create_table_sql)

This looks fine except its normal to have a unique key somewhere.
But if you are confident that you will have uniqueness in your
data you don't strictly need it.

> create_table_sql2="""
> create table fastqc_details (
> id serial primary key,

But this is odd. I can't find anything about a serial keyword for 
SQLite. The normal format of this would be

id INTEGER PRIMARY KEY

which makes id an auto-incrementing unique value.

> fileid varchar,
> module varchar,
> col1 varchar,
> ocols varchar
> );
> """
> cursor.execute(create_table_sql2)
> db.commit()

The other potential issue is that you are creating these tables each 
time you run the script. But if you run the script a second time the 
tables willalready exist and the CREATES will fail. You should either 
drop the tables at the top of the script or use the

CREATE TABLE IF NOT EXISTS <table> ...

format of create. Then if the tab;le already exists your data will
be appended. (If you want it overwritten use the drop table technique)

> for root, dirs, files in
> os.walk("/home/mauro/Desktop/LAVORO_CRO/2014/Statitica_RNAseqalign/FASTQC_completo/fastqcdecembre/"):
> # walk a r
>      for name in files:
>          if (name == "fastqc_data.txt"):

You are searching for a specific name. Could there be multiple such 
files or are you just using wa;lk to find one? If the latter it would be 
better to exit the os.walk loop once you find it and then process the 
file. You need to store the root and filename first of course.

>              fileid = name # use string slicing here if you only want
> part of the
>
>     with open(os.path.join(root,name),"r") as p: # automatically close
> the file when done
>                  for i in p:
>                      line =i.strip()
>
>                      if "Filename" in line:
>                          fileid = line.split()[1]
>                      if "Total Sequence" in line:
>                          total = line.split()[2]
>
>                      if "Total Duplicate" in line:
>                          dup = line.split()[3]
>
>                      if (line[:2] == ">>" and line[:12] != ">>END_MODULE"):
>                          module = line[2:-5] # grab module name
>                          status = line[-4:] # and overall status
> pass/warn/fail
>                          sql = "insert into
> fastqc_summary(fileid,module,status,total,duplicate) values(?,?,?,?,?);"
>                          data = (fileid,module,status,total,dup)
>                          cursor.execute(sql,data)
>                      elif (line[:2] != ">>" and line[:2] != "##"): #
> grab details under each module
>                          cols = line.split("\t")
>                          col1 = cols[0]
>                          ocols = "|".join(cols[1:])
>                          sql = "insert into
> fastqc_details(fileid,module,col1,ocols) values(?,?,?,?);"
>                          data = (fileid,module,col1,ocols)
>                          cursor.execute(sql,data)
> db.commit()
>
>
> So the problem is how to excatct only some parts of the files. In red
> are the point of the problems. The say the Filename are not defined

What says that? It's true that you don't have a variable Filename 
defined. But I don't see Filename in your code anywhere either. So I 
don't understand the error message. Can you post it in full please?

> the file atached I want to take this part and use for create the database:
> ##FastQC    0.10.1
>  >>Basic Statistics    pass
> #Measure    Value
> Filename    R05_CTTGTA_L004_R1_001.fastq.gz
> File type    Conventional base calls
> Encoding    Sanger / Illumina 1.9
> Total Sequences    27868496
> Filtered Sequences    0
> Sequence length    50
> %GC    50
>  >>END_MODULE


-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos




More information about the Tutor mailing list