[Tutor] please help formating

Kent Johnson kent37 at tds.net
Fri May 25 12:45:31 CEST 2007


kumar s wrote:
> hi group,
> 
> I have data obtained from another student (over 100K
> lines) that looks like this:
> (39577484, 39577692) [['NM_003750']]
> (107906, 108011) [['NM_002443']]
> (113426, 113750) [['NM_138634', 'NM_002443']]
> (106886, 106991) [['NM_138634', 'NM_002443']]
> (100708, 100742) [['NM_138634', 'NM_002443']]
> (35055935, 35056061) [['NM_002313', 'NM_001003407',
> 'NM_001003408']]
> 
> I know that the two items in () form a tuple, and the
> following [[]] is a list of lists. I was told that the tuples
> were keys and the lists were their values in a dictionary.
> 
> how can I parse this into a neat structure that looks
> like this:
> 39577484, 39577692 \t NM_003750
> 107906, 108011 \t NM_002443
> 113426, 113750 \t  NM_138634,NM_002443
> 106886, 106991 \t  NM_138634,NM_002443
> 100708, 100742 \t  NM_138634,NM_002443
> 35055935, 35056061 \t
> NM_002313,NM_001003407,NM_001003408

How about this? It assumes the line wrap at the end of your sample was
added by the mail program; if the wrap is in the data itself, parsing is
a little harder:

data = '''(39577484, 39577692) [['NM_003750']]
(107906, 108011) [['NM_002443']]
(113426, 113750) [['NM_138634', 'NM_002443']]
(106886, 106991) [['NM_138634', 'NM_002443']]
(100708, 100742) [['NM_138634', 'NM_002443']]
(35055935, 35056061) [['NM_002313', 'NM_001003407', 'NM_001003408']]
'''.splitlines()

import re
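for line in data:
    # Capture the two numbers and everything inside the double brackets.
    match = re.match(r'\((\d+), (\d+)\) \[\[(.*)\]\]', line)
    if match:
        # Use a new name so the input list 'data' is not rebound mid-loop.
        t1, t2, names = match.group(1, 2, 3)
        names = names.replace("'", "").replace(' ', '')
        print('%s %s\t%s' % (t1, t2, names))
    else:
        print('no match: %s' % line)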
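
If the line wrap really is in the data, one possible approach is to glue
continuation lines back onto their record before matching. The sketch
below is written for Python 3 and is only an illustration: the names
RECORD_START and join_wrapped are made up here, and it assumes every
record starts with "(digits, digits)" and that any other non-blank line
continues the previous record.

```python
import re

# Assumption: a new record always begins with "(digits, digits)".
RECORD_START = re.compile(r'\(\d+, \d+\)')

def join_wrapped(lines):
    """Yield complete records, gluing continuation lines onto their record."""
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if RECORD_START.match(line):
            # A new record begins; emit the one collected so far.
            if current is not None:
                yield current
            current = line
        elif current is not None:
            # Not a record start, so treat it as a wrapped continuation.
            current += ' ' + line
        # A continuation line before any record start is dropped.
    if current is not None:
        yield current

wrapped = """(100708, 100742) [['NM_138634', 'NM_002443']]
(35055935, 35056061) [['NM_002313', 'NM_001003407',
'NM_001003408']]
""".splitlines()

records = list(join_wrapped(wrapped))
for record in records:
    print(record)
```

Each joined record is then a single line that the regular expression
above can match.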


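Since each line is a tuple followed by a list of lists written as Python
literals, another option is to let Python read the literals instead of
using a regular expression. This is a sketch for Python 3 using
ast.literal_eval (available from Python 2.6 on); format_line is a
made-up helper name, and it assumes the only ") " in a line is the one
that ends the tuple. It prints the comma between the two numbers, as in
the output you asked for.

```python
import ast

def format_line(line):
    """Turn "(a, b) [['x', 'y']]" into "a, b<TAB>x,y"."""
    # Split once at the end of the tuple; assumes the first ") " in the
    # line is the one separating the tuple from the list.
    key_text, value_text = line.split(') ', 1)
    start, end = ast.literal_eval(key_text + ')')
    names = ast.literal_eval(value_text)[0]  # unwrap the outer list
    return '%d, %d\t%s' % (start, end, ','.join(names))

print(format_line("(113426, 113750) [['NM_138634', 'NM_002443']]"))
```

literal_eval only accepts literals, so unlike eval it will not run
arbitrary code found in the file.
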
Note the format of the data here is different from what you showed in 
your post last night...

Kent

