[Tutor] Reconstructing phrases from tagged text

Thu Jul 16 22:56:47 CEST 2009

Hello, I'm scratching my head over the problem of transforming a
POS-tagged text back into normal text. My biggest problem is that I can't
get the quotes to be placed properly.
I had tried a different approach, but was advised on the nltk list to
use this example instead. It does not solve the quoting problem either...

#== Here's the sample code ===
# -*- coding: cp1252 -*-

input = """Titelman        NOM     <unknown>
riu-se  V+P     rir
em      PRP     em
seu     ADJ     seu
típico  ADJ     típico
modo    NOM     modo
grosseiro       ADJ     grosseiro
e       CONJ    e
respondeu       V       responder
:       SENT    :
"       QUOTE   "
Você    P       você
prende  V       prender
a       DET     a
criminosos      NOM     criminoso
e       CONJ    e
a       DET     a
pessoas NOM     pessoa
más     ADJ     mau
,       VIRG    ,
mas     CONJ    mas
eu      P       eu
detenção        NOM     detenção
só      ADJ     só
aos     PRP+DET a
anabatistas     NOM     <unknown>
,       VIRG    ,
quem    PR      quem
levam   V       levar
vidas   NOM     vida
boas    ADJ     bom
e       CONJ    e
nunca   ADV     nunca
fazem   V       fazer
uso     NOM     uso
da      PRP+DET de
violência       NOM     violência
"       QUOTE   "
.       SENT    .
Ouvi    V       ouvir
dizer   V       dizer
que     CONJSUB que
o       DET     o
servidor        ADJ     servidor
público NOM     público
lhe     P       lhe
deu     V       dar
uma     DET     uma
boa     ADJ     bom
resposta        NOM     resposta
:       SENT    :
"       QUOTE   "
Se      P       se
eu      P       eu
detenção        NOM     detenção
a       PRP     a
toda    ADJ     todo
a       DET     a
gente   NOM     gente
má      ADJ     mau
e       CONJ    e
você    P       você
prende  V       prender
a       PRP     a
toda    ADJ     todo
a       DET     a
gente   NOM     gente
boa     ADJ     bom
,       VIRG    ,
quem    PR      quem
ficará  V       ficar
?       SENT    ?
"       QUOTE   "
+ at +     V       <unknown>
+ at +     V       <unknown>
–    V       <unknown>
Sabes   NOM     <unknown>
?       SENT    ?
"""

PUNCT = '!#$%)*,-.:;<=>?@/\\]^_`}~”…'
LINEBREAK = '+ at +'

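# Keep only the token column; the last two columns are the tag and the lemma.
# rsplit keeps a token that itself contains spaces (the line-break marker) whole.
tokens = [line.rsplit(None, 2)[0] for line in input.splitlines()]

output = []

for t in tokens:
    if t == LINEBREAK:
        output.append('\n')
    else:
        # Put a space before every token except punctuation.
        if t not in PUNCT:
            output.append(' ')
        output.append(t)

print(''.join(output))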

#==
It outputs this:
Titelman riu-se em seu típico modo grosseiro e respondeu: " Você
prende a criminosos e a pessoas más, mas eu detenção só aos
anabatistas, quem levam vidas boas e nunca fazem uso da violência ".
Ouvi dizer que o servidor público lhe deu uma boa resposta: " Se eu
detenção a toda a gente má e você prende a toda a gente boa, quem
ficará? "

 – Sabes?

You see, the quotes are separated from the quoted text by spaces.
My old code, which didn't do the job as well, is the (hairy) one below.
I had already removed the tags with "sed".
import cStringIO

myoutput = cStringIO.StringIO()
f = open(r'C:\mytools\supersed\tradunew.txt')
lista = [line.strip() for line in f]
f.close()
punct = '!#$%)*,-.:;<=>?@/\\]^_`}~”…'
spacer = False  # set to True once an opening quote has been written
for i, item in enumerate(lista):
    # Look one token ahead; past the end use a space (not in punct),
    # so the last item does not raise IndexError.
    nxt = lista[i + 1] if i + 1 < len(lista) else ' '

    if item == '"' and nxt not in punct:
        myoutput.write(item)
        spacer = True
    elif '+ at +' in item:
        donewline = item.replace('+ at +', '\n ')
        myoutput.write(donewline)
    elif item not in punct and nxt in punct:
        myoutput.write(item)
    elif item in punct and nxt in punct:
        myoutput.write(item)
    elif item in punct and nxt == '"' and spacer:
        myoutput.write(item)
        spacer = False
    elif item not in punct and nxt == '"' and spacer:
        myoutput.write(item)
        spacer = False
    elif item in '([{“':
        myoutput.write(item)
    else:
        myoutput.write(item + " ")

newlist = myoutput.getvalue().splitlines()
myoutput.close()

# Write the result, dropping the leading space left after each newline.
f = open(r'C:\mytools\supersed\traducerto-k.txt', 'w')
for line in newlist:
    f.write(line.lstrip() + '\n')
f.close()

So a text like this:
--- Example begins
No
,
thanks.+ at +  # I inserted this + at + to mark paragraphs, because the
POS tagger doesn't keep paragraph marks
He
said
,
"
No
,
thanks
.
"
OK
?
--- Example ends
--- And I want to transform it into this:
No, thanks.
He said, "No, thanks." OK?
--- End example

Thanks.
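
One possible way to get the spacing shown in the example above is to
remember whether the next straight quote opens or closes a quotation: an
opening quote sticks to the token after it, a closing quote sticks to the
token before it. What follows is only a sketch under assumptions: the
tokens are already split one per item, the quotes are balanced, and the
names detokenize, NO_SPACE_BEFORE and NO_SPACE_AFTER are made up for
illustration. It is written for Python 3.

```python
# Hypothetical helper names; the marker is copied as it appears in the post.
NO_SPACE_BEFORE = set('!%),.:;?]}”…')   # tokens that attach to the previous token
NO_SPACE_AFTER = set('([{“')            # tokens that attach to the next token
LINEBREAK = '+ at +'                    # paragraph marker


def detokenize(tokens):
    """Join tokens into text, attaching straight quotes to the quoted words."""
    out = []
    inside_quote = False   # True between an opening and a closing '"'
    glue_next = True       # True when the next token gets no leading space
    for tok in tokens:
        if tok == LINEBREAK:
            out.append('\n')
            glue_next = True            # no space at the start of a line
            continue
        if tok == '"':
            if inside_quote:
                out.append(tok)         # closing quote: no space before it
                glue_next = False
            else:
                if not glue_next:
                    out.append(' ')     # opening quote: space before it...
                out.append(tok)
                glue_next = True        # ...but none after it
            inside_quote = not inside_quote
            continue
        # Set membership is an exact match, so a multi-character token is
        # never mistaken for punctuation the way a substring test could be.
        if not glue_next and tok not in NO_SPACE_BEFORE:
            out.append(' ')
        out.append(tok)
        glue_next = tok in NO_SPACE_AFTER
    return ''.join(out)


tokens = ['No', ',', 'thanks', '.', '+ at +',
          'He', 'said', ',', '"', 'No', ',', 'thanks', '.', '"', 'OK', '?']
print(detokenize(tokens))
# No, thanks.
# He said, "No, thanks." OK?
```

The quote state is a simple toggle, so an unbalanced quote in the input
would flip every quote after it. Resetting inside_quote at each paragraph
marker might limit the damage, but that is a guess about the data.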