[Tutor] please help formatting
Kent Johnson
kent37 at tds.net
Fri May 25 12:45:31 CEST 2007
kumar s wrote:
> hi group,
>
> i have a data obtained from other student(over 100K)
> lines that looks like this:
> (39577484, 39577692) [['NM_003750']]
> (107906, 108011) [['NM_002443']]
> (113426, 113750) [['NM_138634', 'NM_002443']]
> (106886, 106991) [['NM_138634', 'NM_002443']]
> (100708, 100742) [['NM_138634', 'NM_002443']]
> (35055935, 35056061) [['NM_002313', 'NM_001003407',
> 'NM_001003408']]
>
> I know that first two items in () are tuples, and the
> next [[]] a list of list. I was told that the tuples
> were keys and the list was its value in a dictionary.
>
> how can I parse this into a neat structure that looks
> like this:
> 39577484, 39577692 \t NM_003750
> 107906, 108011 \t NM_002443
> 113426, 113750 \t NM_138634,NM_002443
> 106886, 106991 \t NM_138634,NM_002443
> 100708, 100742 \t NM_138634,NM_002443
> 35055935, 35056061 \t
> NM_002313,NM_001003407,NM_001003408
How about this (assuming the line wrap at the end was introduced by the
mail program; if it is in the data itself, parsing is a little harder):
data = '''(39577484, 39577692) [['NM_003750']]
(107906, 108011) [['NM_002443']]
(113426, 113750) [['NM_138634', 'NM_002443']]
(106886, 106991) [['NM_138634', 'NM_002443']]
(100708, 100742) [['NM_138634', 'NM_002443']]
(35055935, 35056061) [['NM_002313', 'NM_001003407', 'NM_001003408']]
'''.splitlines()
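import re
for line in data:
    # Capture the two numbers and everything inside the double brackets
    match = re.match(r'\((\d+), (\d+)\) \[\[(.*)\]\]', line)
    if match:
        t1, t2, names = match.group(1, 2, 3)
        # Strip the quotes and spaces, leaving a comma-separated list
        names = names.replace("'", "").replace(' ', '')
        print('%s %s\t%s' % (t1, t2, names))
    else:
        print('no match: %s' % line)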
Note the format of the data here is different from what you showed in
your post last night...
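If the line wrap does turn out to be in the data itself, one possible approach is to glue each continuation line back onto its record before matching. The following is only a sketch under an assumption that is not confirmed by the sample: every record starts with '(' and a wrapped continuation line never does. The helper names join_wrapped and format_record are made up for illustration.

```python
import re

# One complete record: "(start, end) [[...names...]]"
RECORD = re.compile(r'\((\d+), (\d+)\) \[\[(.*)\]\]')

def join_wrapped(lines):
    """Glue continuation lines onto the record they belong to.

    Assumption: a record always starts with '(' and a wrapped
    continuation line never does.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('(') or not records:
            records.append(line)
        else:
            records[-1] += ' ' + line
    return records

def format_record(record):
    """Return 'start end<TAB>name1,name2,...', or None if no match."""
    match = RECORD.match(record)
    if not match:
        return None
    t1, t2, names = match.group(1, 2, 3)
    names = names.replace("'", "").replace(' ', '')
    return '%s %s\t%s' % (t1, t2, names)

wrapped = """(100708, 100742) [['NM_138634', 'NM_002443']]
(35055935, 35056061) [['NM_002313', 'NM_001003407',
'NM_001003408']]
""".splitlines()

for record in join_wrapped(wrapped):
    print(format_record(record))
```

Joining first keeps the regular expression unchanged, so the same pattern handles both wrapped and unwrapped input. A line that still does not match comes back as None and can be reported separately.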
Kent