[Tutor] Can't transform a list of tokens into a text

Eduardo Vieira eduardo.susan at gmail.com
Thu Jul 16 04:33:13 CEST 2009


Hello, I have a file that was produced by a POS-tagging program.
After some transformations, I want to restore it to its normal form.
So I used sed to remove the POS tags and now have something like this:

--- Example begins
No
,
thanks.+ at +  # I inserted this + at + to mark paragraphs, because the
POS tagger doesn't keep paragraph marks
He
said
,
"
No
,
thanks
.
"
OK
?
--- Example ends
--- And I want to transform it into this:
No, thanks.
He said, "No, thanks." OK?
--- End example
I tried different approaches, and handling the quote marks is what
keeps defeating my attempts.

I originally had this code:
import cStringIO
myoutput = cStringIO.StringIO()
f = open(r'C:\mytools\supersed\tradunew.txt')
lista = [line.strip() for line in f]
punct = '!#$%)*,-.:;<=>?@/\\]^_`}~”…'
for i, item in enumerate(lista):
    if item == '"' and lista[i + 1] not in punct:
        myoutput.write(item)
        spacer = True
    elif '+ at +' in item:
        donewline = item.replace('+ at +','\n ')
        myoutput.write(donewline)
    elif item not in punct and lista[i + 1] in punct:
        myoutput.write(item)
    elif item in punct and lista[i + 1] in punct:
        myoutput.write(item)
    elif item in punct and lista[i + 1] == '"' and spacer:
        myoutput.write(item)
        spacer = False
    elif item not in punct and lista[i + 1] == '"' and spacer:
        myoutput.write(item)
        spacer = False
    elif item in '([{“':
        myoutput.write(item)
    else:
        myoutput.write(item + " ")

newlist = myoutput.getvalue().splitlines()
myoutput.close()

f = open(r'C:\mytools\supersed\traducerto-k.txt', 'w')
for line in newlist:
    f.write(line.lstrip()+'\n')
f.close()

#===
I wrote this version to post on this list, but it gives me an error.
I don't know why I don't get an error with the code above, which is
essentially the same:
# -*- coding: cp1252 -*-

result = ''
lista = [
'No', ',', 'thanks.+ at +',
'He', 'said', ',', '"', 'no', ',', 'thanks', '.', '"', 'OK', '?', 'Hi']

punct = '!#$%)*,-.:;<=>?@/\\]^_`}~”…'
for i, item in enumerate(lista):

    if item == '"' and lista[i + 1] not in punct:
        result +=item
        spacer = True
    elif '+ at +' in item:
        donewline = item.replace('+ at +','\n ')
        result += donewline
    elif item not in punct and lista[i + 1] in punct:
        result += item
    elif item in punct and lista[i + 1] in punct:
        result += item
    elif item in punct and lista[i + 1] == '"' and spacer:
        result += item
        spacer = False
    elif item not in punct and lista[i + 1] == '"' and spacer:
        result += item
        spacer = False
    elif item in '([{“':
        result += item
    else:
        result += (item + " ")

print result

#==
The error is this:
Traceback (most recent call last):
  File "<string>", line 244, in run_nodebug
  File "C:\mytools\jointags-v4.py", line 17, in <module>
    elif item not in punct and lista[i + 1] in punct:
IndexError: list index out of range
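
The traceback points at the lookahead lista[i + 1], which has nothing to
read once i is the index of the last item ('Hi' in the list above). A
minimal sketch of that situation follows, together with one possible guard.
The helper with_next and the shortened token list are hypothetical and not
part of the script above, and PUNCT is an ASCII-only set rather than the
original string.

```python
# ASCII part of punct from the post, stored as a set. With a plain string,
# `x in punct` is a substring test, so '' in punct would be True.
PUNCT = set('!#$%)*,-.:;<=>?@/\\]^_`}~')


def with_next(tokens):
    """Return (token, following_token) pairs; the last token gets None."""
    return zip(tokens, tokens[1:] + [None])


tokens = ['OK', '?', 'Hi']

try:
    tokens[len(tokens)]      # what lista[i + 1] asks for on the last item
    lookahead_failed = False
except IndexError:
    lookahead_failed = True

pairs = list(with_next(tokens))
# None is never a member of the set, so the last pair needs no special case.
next_is_punct = [nxt in PUNCT for _, nxt in pairs]
```

Pairing each token with its successor up front means the loop body never
indexes past the end of the list.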

I'm using Python 2.6.2 with the PyScripter IDE.
I have tried so many variations that I'm not sure what I'm doing any more.
I'm just trying to avoid more post-processing with sed.
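
One direction that might avoid the lookahead altogether is to remember,
after each token, whether the next one should be glued on, and to toggle a
flag on every straight double quote. The sketch below assumes that quotes
are balanced within a paragraph and that + at + always ends a paragraph.
The names join_tokens, PARA_MARK, CLOSING and OPENING are hypothetical, and
the sketch has only been checked against the small example from this message.

```python
PARA_MARK = '+ at +'                          # paragraph marker from the example
CLOSING = set('!#$%)*,-.:;<=>?@/\\]^_`}~')    # attach to the previous token
OPENING = set('([{')                          # attach to the next token


def join_tokens(tokens):
    """Join one-token-per-line input back into running text (rough sketch)."""
    lines = []          # finished paragraphs
    parts = []          # pieces of the paragraph being built
    glue_next = True    # True: write the next token without a leading space
    quote_open = False  # toggled on every straight double quote

    for token in tokens:
        end_paragraph = PARA_MARK in token
        token = token.replace(PARA_MARK, '')
        if token == '"':
            if not quote_open and not glue_next:
                parts.append(' ')            # space before an opening quote
            parts.append(token)
            glue_next = not quote_open       # opening quote hugs what follows
            quote_open = not quote_open
        elif token:
            if not glue_next and token not in CLOSING:
                parts.append(' ')
            parts.append(token)
            glue_next = token in OPENING
        if end_paragraph:
            lines.append(''.join(parts))
            parts = []
            glue_next = True
            quote_open = False
    if parts:
        lines.append(''.join(parts))
    return '\n'.join(lines)


sample = ['No', ',', 'thanks.+ at +', 'He', 'said', ',', '"', 'No', ',',
          'thanks', '.', '"', 'OK', '?']
print(join_tokens(sample))
```

Because the state is carried forward instead of read from lista[i + 1], the
last token needs no special handling.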

Thankful,

Eduardo

