find overlapping lines & output times observed
Oscar Benjamin
oscar.j.benjamin at gmail.com
Mon May 6 16:32:33 EDT 2013
On 6 May 2013 19:39, Linsey Raaijmakers <lm.raaijmakers at gmail.com> wrote:
> I have a file like this:
> action start end
> 50 5321 5321
> 7 5323 5347
> 12 5339 5351
> 45 5373 5373
> 45 5420 5420
> 25 5425 5425
[snip]
your code below suggests that your file also has an "apex" column. If
that's correct what do you what is it and what do you want to do with
it?
[snip]
> I have a script now that identifies overlap between two actions (see bottom page), but how can I change this so that it outputs all possible combinations?
I looked at that code and I don't think it does what you describe. Are
you sure that it works?
> My desired output would be:
>
> action times observed apex
> 50 5 5321, 5451, 5533, 5634, 5740
> 50,45 1 5533;5533
> 7 4 5347, 5689, 5688, 5845
> 7,25 2 5347;5425, 5689;5682
> 7,25,26 1 5689;5682;5690
>
> CODE:
>
> from collections import Counter
> f = open('and.txt','r');
>
> action_list = []
> onset_list = []
> apex_list = []
> offset_list = []
> action_counter = 0
> combination_list = []
>
>
> for line in f:
> fields = line.split("\t")
> for col in fields:
> action = fields[0]
> onset = fields[1]
> apex = fields[2]
> offset = fields[3]
The above code suggests that the file has four columns. Also since
you're not actually using the loop variable "col" you can just delete
the "for" line and unindent the rest. In fact the easiest way to do
all of this is just:
action, onset, apex, offset = line.split()
>
> action_list.append(action)
> onset_list.append(onset)
> apex_list.append(apex)
> offset_list.append(offset)
>
> action_cnvrt = map(int, action_list)
> c = Counter(action_cnvrt)
>
> filtered = list(set(action_list))
There's no need to convert this back to a list if you're just going to
iterate over it again with map.
> filtered_cnvrt = map(int, filtered)
>
> for a in filtered_cnvrt:
> action_count = str(a)+"\t"+str(c[a])
> print action_count
The above counts the number of times each event occurs which is one
part of your desired output.
>
> for i in range (0,len(onset_list)):
> combination_list.append(action_list[i])
> for j in range(0,len(apex_list)):
> if i != j:
> if onset_list[j]>= onset_list[i] and apex_list[j] <= apex_list[i]:
> print action_list[j]+","+action_list[i]+'\t'+onset_list[j]+'\t'+apex_list[j]+'\t'+onset_list[i]+'\t'+apex_list[i]
What is combination_list for? It should just end up containing the
same thing as action_list if I understand correctly.
It's generally better in Python to loop directly over things rather
than using indices so, instead of something like:
for i in range(len(onset_list)):
print offset_list[i] - onset_list[i]
you should do something like
for offset, onset in zip(offset_list, onset_list):
print offset - onset
and your code will be a lot clearer.
The algorithm you are using is to loop over all events and then to
loop over all other events comparing all pairs of events. This will
not scale very well if you want to look at a large file or to compare
simultaneous occurrences of more than two events.
It looks as if your input data is ordered by the onset column. Is that
the case? If so then you can use an algorithm that just loops once
over all the events. The way the algorithm works is that you store
which events are currently active and loop through the events keeping
track of the start time of the most recently added event adding and
removing events as they start and stop. In pseudocode:
now = start of first event
active = [first event]
for next_starting in events (not including first):
next_ending = event that ends soonest from active
while start of next_starting > end of next_ending:
report active events from now to end of next_ending
now = end of next_ending
remove next_ending from active
report active events from now until start of next_starting
now = start of next_starting
add next_starting to active
And some more code to deal with what happens when you get to the end
of the list of events...
The report steps probably mean adding to a Counter or dict to remember
that the currently active events were active during each particular
time window.
Oscar
More information about the Python-list
mailing list