[Tutor] Can't transform a list of tokens into a text
Eduardo Vieira
eduardo.susan at gmail.com
Thu Jul 16 04:33:13 CEST 2009
Hello, I have a file that resulted from a POS-tagging program. After
some transformations, I want to restore it to its normal form.
So I used sed to remove the POS tags and now have something like this:
--- Example begins
No
,
thanks.+ at + # I inserted this + at + to mark paragraphs, because the
POS-tagger doesn't keep paragraph marks
He
said
,
"
No
,
thanks
.
"
OK
?
--- Example ends
--- And I want to transform it into this:
No, thanks.
He said, "No, thanks." OK?
--- End example
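The transformation above can be sketched as a small state machine. This is only a sketch under assumptions, not the code from this post: the name join_tokens, the CLOSERS and OPENERS sets, and the rule that straight double quotes alternate between opening and closing within a paragraph are all hypothetical choices made for illustration.

```python
PARA_MARK = '+ at +'           # paragraph marker used in the example above
CLOSERS = set('!%),.:;?]}')    # assumed: tokens that attach to the previous token
OPENERS = set('([{')           # assumed: tokens that attach to the next token


def join_tokens(tokens):
    """Join one-token-per-line tagger output back into running text."""
    paragraphs = []
    parts = []          # pieces of the paragraph being built
    in_quote = False    # True between an opening '"' and its closing '"'
    glue = True         # True: the next token gets no leading space

    for token in tokens:
        end_para = PARA_MARK in token
        token = token.replace(PARA_MARK, '').strip()
        if token:
            if token == '"':
                # A closing quote attaches like punctuation,
                # an opening quote is spaced like a word.
                attach = in_quote
                in_quote = not in_quote
                next_glue = in_quote   # no space right after an opening quote
            else:
                attach = token in CLOSERS
                next_glue = token in OPENERS
            if not (glue or attach):
                parts.append(' ')
            parts.append(token)
            glue = next_glue
        if end_para:
            # Close the paragraph and reset the per-paragraph state.
            paragraphs.append(''.join(parts))
            parts = []
            in_quote = False
            glue = True
    if parts:
        paragraphs.append(''.join(parts))
    return '\n'.join(paragraphs)


tokens = ['No', ',', 'thanks.+ at +',
          'He', 'said', ',', '"', 'No', ',', 'thanks', '.', '"', 'OK', '?']
print(join_tokens(tokens))
# No, thanks.
# He said, "No, thanks." OK?
```

Because each token is spaced using only state carried forward from earlier tokens, the loop never needs to look at the following element.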
I tried different approaches, and dealing with the quote marks is what
is defeating my attempts.
I originally had this code:
import cStringIO
##import string
myoutput = cStringIO.StringIO()
f = open(r'C:\mytools\supersed\tradunew.txt')
lista = [line.strip() for line in f]
punct = '!#$%)*,-.:;<=>?@/\\]^_`}~”…'
for i, item in enumerate(lista):
    if item == '"' and lista[i + 1] not in punct:
        myoutput.write(item)
        spacer = True
    elif '+ at +' in item:
        donewline = item.replace('+ at +','\n ')
        myoutput.write(donewline)
    elif item not in punct and lista[i + 1] in punct:
        myoutput.write(item)
    elif item in punct and lista[i + 1] in punct:
        myoutput.write(item)
    elif item in punct and lista[i + 1] == '"' and spacer:
        myoutput.write(item)
        spacer = False
    elif item not in punct and lista[i + 1] == '"' and spacer:
        myoutput.write(item)
        spacer = False
    elif item in '([{“':
        myoutput.write(item)
    else:
        myoutput.write(item + " ")
newlist = myoutput.getvalue().splitlines()
myoutput.close()
f = open(r'C:\mytools\supersed\traducerto-k.txt', 'w')
for line in newlist:
    f.write(line.lstrip()+'\n')
f.close()
#===
To post it here, I wrote this self-contained version, but it gives me an
error. I don't know why I don't get an error with the code above, which is
essentially the same:
# -*- coding: cp1252 -*-
result = ''
lista = [
    'No', ',', 'thanks.+ at +',
    'He', 'said', ',', '"', 'no', ',', 'thanks', '.', '"', 'OK', '?', 'Hi']
punct = '!#$%)*,-.:;<=>?@/\\]^_`}~”…'
for i, item in enumerate(lista):
    if item == '"' and lista[i + 1] not in punct:
        result +=item
        spacer = True
    elif '+ at +' in item:
        donewline = item.replace('+ at +','\n ')
        result += donewline
    elif item not in punct and lista[i + 1] in punct:
        result += item
    elif item in punct and lista[i + 1] in punct:
        result += item
    elif item in punct and lista[i + 1] == '"' and spacer:
        result += item
        spacer = False
    elif item not in punct and lista[i + 1] == '"' and spacer:
        result += item
        spacer = False
    elif item in '([{“':
        result += item
    else:
        result += (item + " ")
print result
#==
The error is this:
Traceback (most recent call last):
  File "<string>", line 244, in run_nodebug
  File "C:\mytools\jointags-v4.py", line 17, in <module>
    elif item not in punct and lista[i + 1] in punct:
IndexError: list index out of range
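A likely reading of this traceback: lista[i + 1] is evaluated for the last item, where there is no next element. In the list posted here the last token is 'Hi', which is not in punct, so the third test goes on to read lista[i + 1]. The file-based version may escape this only by accident, for instance if the last line of the file contains the + at + marker, so that the second branch matches before any look-ahead happens (an assumption, since the file is not shown). A minimal sketch of a guarded look-ahead follows; next_item is a hypothetical helper and the shortened, ASCII-only punct string is used just for the demonstration.

```python
def next_item(seq, i, default=None):
    """Return seq[i + 1], or default when i is the last index."""
    if i + 1 < len(seq):
        return seq[i + 1]
    return default


lista = ['OK', '?', 'Hi']
punct = '!#$%)*,-.:;<=>?@/\\]^_`}~'   # ASCII subset, for the demo only

result = ''
for i, item in enumerate(lista):
    nxt = next_item(lista, i)
    # None marks "no next token". An empty string would not work as the
    # marker here, because '' in punct is True for any string.
    if nxt is not None and nxt in punct:
        result += item            # no space before following punctuation
    else:
        result += item + ' '
result = result.rstrip()          # drop the space left after the last token
```

With the three tokens above, result ends up as 'OK? Hi', and the loop no longer indexes past the end of the list.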
I'm using Python 2.6.2 with the PyScripter IDE.
I have tried so many variations that I'm not sure what I'm doing any more.
I'm just trying to avoid some post-processing with sed again.
Thankful,
Eduardo