[Baypiggies] Fwd: manipulating lists question

Martin Falatic martin at falatic.com
Thu Dec 5 12:21:05 CET 2013


One can leave that as an exercise for the reader. :-)

I'm not sure why this field gets a ';' while field 1 gets a ',', nor is it
clear whether these field lists are supposed to be deduped, ordered, or
neither. Consider the case where you have three sets of data for 1302 and
two of them vary ONLY in field 28. If you then try to reconstruct the
original data set from the merged one, you end up with a somewhat mangled
thing, because you can no longer tell which value belonged to which set.
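
To make that concrete, here is a rough sketch of the failure mode. The rows,
the field values ('NM_A', 'x28_a', ...) and the helper join_unique() are all
made up for illustration; they are not from Vikram's data or anyone's posted
code. It assumes "merge" means "join the distinct values with a delimiter":

```python
# Three hypothetical records for key '1302'. The first two differ ONLY in
# the last field (standing in for field 28); all values are placeholders.
rows = [
    ["1302", "NM_A", "x28_a"],
    ["1302", "NM_A", "x28_b"],
    ["1302", "NM_B", "x28_c"],
]


def join_unique(values, sep):
    """Join the distinct values in first-seen order (hypothetical helper)."""
    seen = []
    for v in values:
        if v not in seen:
            seen.append(v)
    return sep.join(seen)


merged = [
    rows[0][0],
    join_unique([r[1] for r in rows], ","),
    join_unique([r[2] for r in rows], ";"),
]
print(merged)  # ['1302', 'NM_A,NM_B', 'x28_a;x28_b;x28_c']
```

Two labels but three values: nothing in the merged record says which label
went with which value, so the three original rows can't be rebuilt.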

I take it this is an effort to compress the original data set to a more
manageable size for output / internal representation, which suggests
deduping isn't desirable. It also suggests that you should simply include
every instance for field 1 (which we already do) and field 28, and hope
none of the other fields vary.

On that note I'll throw this idea out there: given key field 0 as
identical for n sets of data, for every subsequent field [1:] if the item
is a str or int, consider it duplicated for all n sets. If the item is a
list then it must have exactly n elements (in the order the n sets were
parsed). That way if another field is found to vary unexpectedly, it'll
simply become a list of n elements (many of which might be the same).
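
Something along these lines, as a minimal sketch rather than a finished
solution. The function name collapse() is my own invention, and it assumes
every record for a given key has the same number of fields (zip() would
silently truncate to the shortest one otherwise):

```python
from collections import OrderedDict


def collapse(records):
    """Group records by field 0. For each later field, keep a single value
    if all n records agree, otherwise keep a list of the n values in the
    order the records were parsed."""
    groups = OrderedDict()
    for rec in records:
        groups.setdefault(rec[0], []).append(rec)

    result = OrderedDict()
    for key, recs in groups.items():
        merged = []
        # zip(*...) walks the records column by column (fields 1, 2, ...).
        for column in zip(*(r[1:] for r in recs)):
            if all(v == column[0] for v in column):
                merged.append(column[0])      # same in all n sets
            else:
                merged.append(list(column))   # varies: exactly n elements
        result[key] = merged
    return result


x = [['cat', 'NM123', 12], ['cat', 'NM234', 12], ['dog', 'NM56', 65]]
d = collapse(x)
print(d['cat'])  # [['NM123', 'NM234'], 12]
print(d['dog'])  # ['NM56', 65]
```

Because a varying field always keeps one value per set, in order, the
original n sets can be rebuilt from the collapsed entry.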

You can always take that and stringify the elements and lists for storage
or whatever. The idea is that your internal data representation is a set
of lists/strs/ints within dictionary entries, which is much easier to work
with.
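
For the stringify step, a sketch could look like the following. The name
stringify() and the sample entry are hypothetical, and the choice of ';' as
the separator is only an assumption:

```python
def stringify(fields, sep=";"):
    """Flatten one collapsed entry to plain strings: lists are joined with
    `sep`, everything else is passed through str()."""
    out = []
    for item in fields:
        if isinstance(item, list):
            out.append(sep.join(str(v) for v in item))
        else:
            out.append(str(item))
    return out


# Made-up collapsed entry: field 1 varies across two sets, the rest do not.
entry = [['NM_080679.2', 'NM_080680.2'], 'COL11A2', 12]
print(stringify(entry))  # ['NM_080679.2;NM_080680.2', 'COL11A2', '12']
```

Note that this is lossy if a value already contains the separator (some of
the fields in the real data do contain ';'), so it is best kept as an
output step only, with the dictionary of lists as the working form.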


 - Marty


On Thu, December 5, 2013 03:06, Vikram K wrote:
> Good catch. All the other elements remain the same except this one.
> Element 28 needs to be changed (in the merged/collapsed list) so that
> when we fuse or merge two elements of the larger list into one then
> Element 28 of the new element is (just combine whatever is present in
> element 28 in both the lists keeping a ';' as delimiter):
>
> '1302:NM_080680.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC';
> '1302:NM_080679.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC'
>
>
>
>
> On Thu, Dec 5, 2013 at 5:51 AM, Martin Falatic <martin at falatic.com>
> wrote:
>
>
>> My solution works for the first three elements as stated, but what you
>> do with the rest of the elements is tricky if they differ for a given
>> key.
>>
>> For 1302 all the fields in the slice [3:] match each other *except*
>> element 28:
>> '1302:NM_080679.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC'
>> '1302:NM_080680.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC'
>>
>>
>> Does this potentially happen with other elements at times? At this
>> point you're faced with either discarding data or mangling data
>> together. The "collapse" just takes the last [3:] slice encountered (for
>> that remainder of data). Is that acceptable?
>>
>> - Marty
>>
>>
>>
>> On Thu, December 5, 2013 02:33, Vikram K wrote:
>>
>>> In the example i have given, the second and third elements of the
>>> larger list (comp[7] and comp[8]) have a 1:1 mapping after the second
>>> element. So i would like to keep the first element as it is and then
>>> collapse or merge the second and third elements (comp[7] and comp[8])
>>> into a single element:
>>>
>>>
>>>
>>> >>> comp[6]
>>> ['6558', 'NM_001046.2', 'SLC12A2', '6037226', '2', 'chr5',
>>> '127502453', '127502454', 'het-ref', 'snp', 'A', 'T', 'A', '185',
>>> '113', '184', '112', 'VQHIGH', 'VQHIGH', '', '', '', '', '259974',
>>> '9', '6', '6', '15', '6558:NM_001046.2:SLC12A2:CDS:MISSENSE',
>>> '6558:NM_001046.2:SLC12A2:CDS:NO-CHANGE', 'PFAM:PF01490:Aa_trans',
>>> '', '', '', '0.99', '2', '0.99', '0.998', '1.01', '1.000', '0.5',
>>> '0.46', '0.5', '1', '18', '18', '19', 'ref-identical;onlyA', 'snp',
>>> '0.072', '-1', 'SQHIGH']
>>>
>>>
>>>
>>> >>> comp[7]
>>> ['1302', 'NM_080679.2', 'COL11A2', '6525172', '2', 'chr6',
>>> '33271374', '33271376', 'het-ref', 'del', 'GT', '', 'GT', '542',
>>> '542', '458', '458', 'VQHIGH', 'VQHIGH', '', '', '', '', '71150',
>>> '34', '106', '106', '140',
>>> '1302:NM_080679.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC',
>>> '1302:NM_080679.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;1302:NM_080680.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;1302:NM_080681.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;6257:NM_021976.3:RXRB:CDS:NO-CHANGE',
>>> '', '', '', '', '0.95', '2', '0.98', '0.998', '0.99', '1.000', '0.46',
>>> '0.42', '0.5', '0', '102', '102', '102', 'ref-identical;onlyA', 'del',
>>> '0.990', '6', 'SQHIGH']
>>>
>>>
>>>
>>> >>> comp[8]
>>> ['1302', 'NM_080680.2', 'COL11A2', '6525172', '2', 'chr6',
>>> '33271374', '33271376', 'het-ref', 'del', 'GT', '', 'GT', '542',
>>> '542', '458', '458', 'VQHIGH', 'VQHIGH', '', '', '', '', '71150',
>>> '34', '106', '106', '140',
>>> '1302:NM_080680.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC',
>>> '1302:NM_080679.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;1302:NM_080680.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;1302:NM_080681.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;6257:NM_021976.3:RXRB:CDS:NO-CHANGE',
>>> '', '', '', '', '0.95', '2', '0.98', '0.998', '0.99', '1.000', '0.46',
>>> '0.42', '0.5', '0', '102', '102', '102', 'ref-identical;onlyA', 'del',
>>> '0.990', '6', 'SQHIGH']
>>>
>>>
>>>
>>> After collapsing comp[7] and comp[8] i  get:
>>>
>>>
>>>
>>> >>> collapsed = ['1302', 'NM_080679.2,NM_080680.2', 'COL11A2',
>>> '6525172', '2', 'chr6', '33271374', '33271376', 'het-ref', 'del',
>>> 'GT', '', 'GT', '542', '542', '458', '458', 'VQHIGH', 'VQHIGH', '',
>>> '', '', '', '71150', '34', '106', '106', '140',
>>> '1302:NM_080680.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC',
>>> '1302:NM_080679.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;1302:NM_080680.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;1302:NM_080681.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;6257:NM_021976.3:RXRB:CDS:NO-CHANGE',
>>> '', '', '', '', '0.95', '2', '0.98', '0.998', '0.99', '1.000', '0.46',
>>> '0.42', '0.5', '0', '102', '102', '102', 'ref-identical;onlyA', 'del',
>>> '0.990', '6', 'SQHIGH']
>>>
>>>
>>>
>>> So in my larger list, after the modification, comp[6] is the first
>>> element and collapsed the second element.
>>>>>>
>>>
>>>
>>> On Thu, Dec 5, 2013 at 5:22 AM, Martin Falatic <martin at falatic.com>
>>> wrote:
>>>
>>>
>>>
>>>> Ah, genetics! Intriguing...
>>>>
>>>>
>>>>
>>>> Do you need anything beyond the third elements of each list? Does
>>>> the third element always map 1:1 with the first, or could it vary?
>>>> If so, what then?
>>>>
>>>> To refer to the simplified example, could you have this?
>>>> x = [['cat', 'NM123', 12], ['cat', 'NM234', 43], ['dog', 'NM56', 65]]
>>>>
>>>>
>>>> If so, what is the expected output?
>>>>
>>>>
>>>>
>>>> - Marty
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, December 5, 2013 02:11, Vikram K wrote:
>>>>
>>>>
>>>>> i am having some difficulty in applying this to my actual problem
>>>>>  although i love the dictionary method. Imagine the following
>>>>> three lists are the first, second and third elements of a larger
>>>>> list:
>>>>>
>>>>>
>>>>> >>> comp[6]
>>>>> ['6558', 'NM_001046.2', 'SLC12A2', '6037226', '2', 'chr5',
>>>>> '127502453', '127502454', 'het-ref', 'snp', 'A', 'T', 'A', '185',
>>>>> '113', '184', '112', 'VQHIGH', 'VQHIGH', '', '', '', '', '259974',
>>>>> '9', '6', '6', '15', '6558:NM_001046.2:SLC12A2:CDS:MISSENSE',
>>>>> '6558:NM_001046.2:SLC12A2:CDS:NO-CHANGE', 'PFAM:PF01490:Aa_trans',
>>>>> '', '', '', '0.99', '2', '0.99', '0.998', '1.01', '1.000', '0.5',
>>>>> '0.46', '0.5', '1', '18', '18', '19', 'ref-identical;onlyA', 'snp',
>>>>> '0.072', '-1', 'SQHIGH']
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> >>> comp[7]
>>>>> ['1302', 'NM_080679.2', 'COL11A2', '6525172', '2', 'chr6',
>>>>> '33271374', '33271376', 'het-ref', 'del', 'GT', '', 'GT', '542',
>>>>> '542', '458', '458', 'VQHIGH', 'VQHIGH', '', '', '', '', '71150',
>>>>> '34', '106', '106', '140',
>>>>> '1302:NM_080679.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC',
>>>>> '1302:NM_080679.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;1302:NM_080680.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;1302:NM_080681.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;6257:NM_021976.3:RXRB:CDS:NO-CHANGE',
>>>>> '', '', '', '', '0.95', '2', '0.98', '0.998', '0.99', '1.000',
>>>>> '0.46', '0.42', '0.5', '0', '102', '102', '102',
>>>>> 'ref-identical;onlyA', 'del', '0.990', '6', 'SQHIGH']
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> >>> comp[8]
>>>>> ['1302', 'NM_080680.2', 'COL11A2', '6525172', '2', 'chr6',
>>>>> '33271374', '33271376', 'het-ref', 'del', 'GT', '', 'GT', '542',
>>>>> '542', '458', '458', 'VQHIGH', 'VQHIGH', '', '', '', '', '71150',
>>>>> '34', '106', '106', '140',
>>>>> '1302:NM_080680.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC',
>>>>> '1302:NM_080679.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;1302:NM_080680.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;1302:NM_080681.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;6257:NM_021976.3:RXRB:CDS:NO-CHANGE',
>>>>> '', '', '', '', '0.95', '2', '0.98', '0.998', '0.99', '1.000',
>>>>> '0.46', '0.42', '0.5', '0', '102', '102', '102',
>>>>> 'ref-identical;onlyA', 'del', '0.990', '6', 'SQHIGH']
>>>>>
>>>>>
>>>>>
>>>>>>>>
>>>>>
>>>>> ------
>>>>> Can we apply the dictionary method to the problem where the key of
>>>>> the dictionary is the first element of the three smaller lists
>>>>> ('6558', '1302', '1302')? The second and third elements of the
>>>>> larger list (starting with '1302') need to be collapsed into a
>>>>> single element, based on their second element ('NM_080679.2' and
>>>>> 'NM_080680.2'), in a way similar to how we had tackled the toy
>>>>> problem:
>>>>>
>>>>> x = [['cat', 'NM123', 12], ['cat', 'NM234', 12], ['dog', 'NM56', 65]]
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Dec 5, 2013 at 4:18 AM, Michiel Overtoom
>>>>> <motoom at xs4all.nl>
>>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> On Dec 5, 2013, at 10:09, Vikram K wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> another option could have been to obtain a dictionary like so:
>>>>>>> {'dog': ['NM56', 65], 'cat': ['NM123,NM234', 12]}
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Oh, in that case the code can become somewhat simpler:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> x = [['cat', 'NM123', 12], ['cat', 'NM234', 12], ['dog', 'NM56', 65]]
>>>>>>
>>>>>> d = {}
>>>>>> for key, label, quant in x:
>>>>>>     if key in d:
>>>>>>         # Key seen before: append this label to the existing one.
>>>>>>         d[key][0] += ", " + label
>>>>>>     else:
>>>>>>         # First time we see this key: start a new entry.
>>>>>>         d[key] = [label, quant]
>>>>>>
>>>>>> print(d)
>>>>>>
>>>>>>
>>>>>> I agree with Michael that the problem is somewhat
>>>>>> underspecified, but it's a starting point.
>>>>>>
>>>>>> Greetings,
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> "If you don't know, the thing to do is not to get scared, but to
>>>>>>  learn." - Ayn Rand
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> _______________________________________________
>>>>> Baypiggies mailing list
>>>>> Baypiggies at python.org
>>>>> To change your subscription options or unsubscribe:
>>>>> https://mail.python.org/mailman/listinfo/baypiggies
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>



