finding homopolymers in both directions
Lee Sander
lesande at gmail.com
Tue Aug 3 13:34:48 EDT 2010
Hi,
Suppose I have a string such as this
'aabccccccefggggghiiijkr'
I would like to print out all the positions that are flanked by a run
of identical symbols.
For example, for the input above I would like the following output:
3 b -1 aa
3 b 1 cccccc
10 e -1 cccccc
11 f 1 ggggg
17 h 1 iii
17 h -1 ggggg
21 j -1 iii
where the first column is the position of interest, the second column
is the symbol at that position, the third column is 1 if the run comes
after that position and -1 if it comes before, and the last column is
the run itself.
I can do this easily in the forward direction (code shown below), but
it is not clear to me how to do it backwards.
I would really appreciate it if someone can help with this problem.
I feel like a regex solution would be possible but I am not too good
with regex.
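For reference, the regex idea could be sketched roughly as follows. This
is only a sketch under some assumptions: positions are 1-based, a run
means two or more identical adjacent symbols, the sequence is a single
line (the '.' in the pattern does not match a newline), and the name
flanked_positions is made up for illustration. The pattern (.)\1+
matches a symbol followed by one or more copies of itself, so the
symbols just outside each match are the flanked positions.

```python
import re

# A symbol followed by one or more copies of itself: a run of length >= 2.
RUN = re.compile(r"(.)\1+")

def flanked_positions(seq):
    """Return sorted (position, symbol, direction, run) tuples.

    position is 1-based; direction is 1 if the run comes after the
    symbol and -1 if the run comes before it.
    (flanked_positions is a hypothetical name used for illustration.)
    """
    hits = []
    for m in RUN.finditer(seq):
        start, end = m.span()  # 0-based indices, end is exclusive
        if start > 0:
            # Symbol just before the run: 0-based start - 1, 1-based start.
            hits.append((start, seq[start - 1], 1, m.group()))
        if end < len(seq):
            # Symbol just after the run: 0-based end, 1-based end + 1.
            hits.append((end + 1, seq[end], -1, m.group()))
    return sorted(hits)

for hit in flanked_positions('aabccccccefggggghiiijkr'):
    print("%s %s %s %s" % hit)
```

Under the rule as stated, this also reports the j at position 21,
because it directly follows iii.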
The code for the forward direction is as follows:
def homopolymericSites(Seq):
    Seq = Seq.upper()
    i = 0
    len_seq = len(Seq)
    while i < len_seq:
        bi = Seq[i]
        k = 1
        # extend k to the end of the homopolymer that starts at i
        while i + k < len_seq and Seq[i + k] == bi:
            k += 1
        if k > 1:  # homopolymer length
            i = i + k
            if i == len_seq:
                break  # the run ends the sequence, so nothing follows it
            # Seq[i] is the character right after the homopolymer
            id_of_chr_which_proceeds_homopolymer = Seq[i]  # note not i+1
            # +1 to convert it to 1-index notation
            pos_of_chr_which_proceeds_homopolymer = i + 1
            id_of_homopolymer = Seq[i - 1]
            length_of_homopolymer = k
            print("%s\t%s/%s\t%s" % (pos_of_chr_which_proceeds_homopolymer,
                                     id_of_chr_which_proceeds_homopolymer,
                                     id_of_homopolymer,
                                     length_of_homopolymer))
        else:
            i += 1
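
One possible way to get the backward direction without writing a second
scanner is to run the same forward scan over the reversed string and map
the positions back. The following is only a sketch of that idea, not a
tested solution: the function names are made up for illustration,
positions are 1-based, and the sequence is not upper-cased so that the
output resembles the example above.

```python
def sites_after_runs(seq):
    """Forward scan: (position, symbol, run) for each symbol that
    directly follows a run of two or more identical symbols.
    Positions are 1-based. (Hypothetical helper name.)"""
    out = []
    i, n = 0, len(seq)
    while i < n:
        k = 1
        # Extend k to the end of the run that starts at i.
        while i + k < n and seq[i + k] == seq[i]:
            k += 1
        if k > 1 and i + k < n:
            # seq[i + k] is the symbol right after the run.
            out.append((i + k + 1, seq[i + k], seq[i] * k))
        i += k
    return out

def sites_before_runs(seq):
    """Backward scan, done as a forward scan over the reversed string."""
    n = len(seq)
    # 1-based position p in the reversed string is n - p + 1 in the original.
    return [(n - pos + 1, sym, run)
            for pos, sym, run in sites_after_runs(seq[::-1])]

def flanked_sites(seq):
    """Combine both directions: -1 if the run comes before, 1 if after."""
    rows = [(pos, sym, -1, run) for pos, sym, run in sites_after_runs(seq)]
    rows += [(pos, sym, 1, run) for pos, sym, run in sites_before_runs(seq)]
    return sorted(rows)

for row in flanked_sites('aabccccccefggggghiiijkr'):
    print("%s %s %s %s" % row)
```

For the example string this prints the rows for b (position 3, both
directions), e (10), f (11), h (17, both directions) and j (21).
Reversing the string costs one extra copy, which should not matter for
sequences that fit in memory.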