[Tutor] "words", tags, "nonwords" in xml/text files
rio
vagemulo at yahoo.es
Wed May 24 09:31:06 CEST 2006
I'm developing an application to do interlineal (an extreme type of
literal) translations of natural-language texts and XML. Here's an example
of a text:
'''Para eso son los amigos. Para celebrar <i>las gracias</i> del otro.'''
and the expected translation, with all of the original tags, whitespace,
etc. intact:
'''For that are the friends. For toCelebrate <i>the graces</i> ofThe
other.<p>'''
I was unable to find (in HTMLParser, string, or unicode) a way to define
words as a series of letters (including non-ASCII character sets) outside of
an XML tag and whitespace/punctuation, so I wrote the code below to create a
list of the words, nonwords, and XML tags in a text. My intuition tells
me that it's an awful lot of code to do a simple thing, but it's the best I
could come up with. I foresee several problems:
- It currently requires that the entire string (or file) be processed into
memory. If I should want to process a large file line by line, a tag which
spans more than one line would be ignored. (That's assuming I would not be
able to store state information in the function, which is something I've
not yet learned how to do.)
- HTML comments may not be supported. (I'm not really sure about this.)
- It may be very slow, as it indexes instead of iterating over the string.
What can I do to overcome these issues? Am I reinventing the wheel? Should
I be using re?
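(For comparison, here is a sketch of what an re-based version might look
like. The pattern is my own guess: \w with re.UNICODE stands in for the
explicit charset parameter, so it also matches digits and the underscore,
and a stray '<' that is never closed gets silently dropped.)

```python
# -*- coding: utf-8 -*-
import re

# One pattern with three alternatives -- tag, word, nonword -- yields the
# tokens in document order and preserves every matched character, so
# joining the result reproduces the input (stray unclosed '<' excepted).
TOKEN = re.compile(r'<[^>]*>|\w+|[^<\w]+', re.UNICODE)

def split_re(alltext):
    '''Return the words, tags, and nonwords of alltext, in order.'''
    return [m.group(0) for m in TOKEN.finditer(alltext)]
```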
thanks,
brian
********************************
# -*- coding: utf-8 -*-
# html2list.py
def split(alltext, charset='ñÑçÇáÁéÉíÍóÓúÚ'):
    '''builds a list of the words, tags, and nonwords in a text.'''
    # in = string; out = list of words, nonwords, html tags.
    length = len(alltext)
    str2list = []
    url = []        # characters of the tag being collected
    word = []       # characters of the word being collected
    nonword = []    # characters of the nonword being collected
    i = 0
    if alltext[i] == '<':
        url.append(alltext[i])
    elif alltext[i].isalpha() or alltext[i] in charset:
        word.append(alltext[i])
    else:
        nonword.append(alltext[i])
    i += 1
    while i < length:
        if url:
            if alltext[i] == '>': # end tag
                url.append(alltext[i])
                str2list.append("".join(url))
                url = []
                i += 1
                if i >= length: # the tag closed at the very end of the text
                    break
                if alltext[i].isalpha() or alltext[i] in charset:
                    # start word
                    word.append(alltext[i])
                else: # start nonword
                    nonword.append(alltext[i])
            else:
                url.append(alltext[i])
        elif word:
            if alltext[i].isalpha() or alltext[i] in charset: # continue word
                word.append(alltext[i])
            elif alltext[i] == '<': # start tag
                str2list.append("".join(word))
                word = []
                url.append(alltext[i])
            else: # start nonword
                str2list.append("".join(word))
                word = []
                nonword.append(alltext[i])
        elif nonword:
            if alltext[i].isalpha() or alltext[i] in charset: # start word
                str2list.append("".join(nonword))
                nonword = []
                word.append(alltext[i])
            elif alltext[i] == '<': # start tag
                str2list.append("".join(nonword))
                nonword = []
                url.append(alltext[i])
            else: # continue nonword
                nonword.append(alltext[i])
        else:
            print 'error',
        i += 1
    if nonword:
        str2list.append("".join(nonword))
    if url:
        str2list.append("".join(url))
    if word:
        str2list.append("".join(word))
    return str2list
## example:
text = '''El aguardiente de caña le quemó la garganta y devolvió la
botella con una mueca.
No se me ponga feo, doctor. Esto mata los bichos de las tripas dijo
Antonio José Bolívar, pero no pudo seguir hablando.'''
print split(text)
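(And for the line-by-line worry above: a generator can hold state between
lines, which would let a tag spanning a line break come through whole.
This is a sketch under my own assumptions -- only tags are carried across
lines, a word cut by a newline is not rejoined, and a stray '<' that is
never closed comes out as one trailing token.)

```python
# -*- coding: utf-8 -*-
import re

def tokens(lines):
    '''Yield words, tags, and nonwords from an iterable of lines.
    A tag left open at the end of one line is carried over to the
    next line, so tags spanning lines are kept whole.'''
    # [^>]* may include newlines, so a carried-over tag still matches.
    token = re.compile(r'<[^>]*>|\w+|[^<\w]+', re.UNICODE)
    pending = ''                      # an unfinished '<...' tag
    for line in lines:
        text = pending + line
        pending = ''
        pos = 0
        while pos < len(text):
            m = token.match(text, pos)
            if m is None:             # a '<' with no '>' yet
                pending = text[pos:]  # carry it into the next line
                break
            yield m.group(0)
            pos = m.end()
    if pending:                       # unclosed tag at end of input
        yield pending
```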
___________________________________________________________