[Tutor] "words", tags, "nonwords" in xml/text files
rio
vagemulo at yahoo.es
Wed May 24 09:31:06 CEST 2006
I'm developing an application to do interlineal (an extreme type of
literal) translations of natural-language texts and XML. Here's an example
of a text:
'''Para eso son los amigos. Para celebrar <i>las gracias</i> del otro.'''
and the expected translation, with all of the original tags, whitespace,
etc. intact:
'''For that are the friends. For toCelebrate <i>the graces</i> ofThe
other.<p>'''
I was unable to find (in HTMLParser, string, or unicode) a way to define
words as a series of letters (including non-ASCII character sets) outside of
an XML tag and whitespace/punctuation, so I wrote the code below to create a
list of the words, nonwords, and XML tags in a text. My intuition tells
me that it's an awful lot of code to do a simple thing, but it's the best I
could come up with. I foresee several problems:
- It currently requires that the entire string (or file) be processed into
memory. If I should want to process a large file line by line, a tag which
spans more than one line would be ignored. (That's assuming I would not be
able to store state information in the function, which is something I've
not yet learned how to do.)
- HTML comments may not be supported. (I'm not really sure about this.)
- It may be very slow, as it indexes instead of iterating over the string.
What can I do to overcome these issues? Am I reinventing the wheel? Should
I be using re?
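(For comparison, here is a sketch of what an re-based version might look
like. The pattern is my own guess: \w with re.UNICODE stands in for the
explicit charset parameter, so it also matches digits and the underscore,
and a stray '<' that is never closed gets silently dropped.)

```python
# -*- coding: utf-8 -*-
import re

# One pattern with three alternatives -- tag, word, nonword -- yields the
# tokens in document order and preserves every matched character, so
# joining the result reproduces the input (stray unclosed '<' excepted).
TOKEN = re.compile(r'<[^>]*>|\w+|[^<\w]+', re.UNICODE)

def split_re(alltext):
    '''Return the words, tags, and nonwords of alltext, in order.'''
    return [m.group(0) for m in TOKEN.finditer(alltext)]
```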
thanks,
brian
********************************
# -*- coding: utf-8 -*-
# html2list.py
def split(alltext, charset='ñÑçÇáÁéÉíÍóÓúÚ'):
    '''builds a list of the words, tags, and nonwords in a text.'''
    # in = string; out = list of words, nonwords, html tags.
    length = len(alltext)
    str2list = []
    url = []        # characters of the tag being collected
    word = []       # characters of the word being collected
    nonword = []    # characters of the nonword being collected
    i = 0
    if alltext[i] == '<':
        url.append(alltext[i])
    elif alltext[i].isalpha() or alltext[i] in charset:
        word.append(alltext[i])
    else:
        nonword.append(alltext[i])
    i += 1
    while i < length:
        if url:
            if alltext[i] == '>': # end tag
                url.append(alltext[i])
                str2list.append("".join(url))
                url = []
                i += 1
                if i >= length: # the tag closed at the very end of the text
                    break
                if alltext[i].isalpha() or alltext[i] in charset:
                    # start word
                    word.append(alltext[i])
                else: # start nonword
                    nonword.append(alltext[i])
            else:
                url.append(alltext[i])
        elif word:
            if alltext[i].isalpha() or alltext[i] in charset: # continue word
                word.append(alltext[i])
            elif alltext[i] == '<': # start tag
                str2list.append("".join(word))
                word = []
                url.append(alltext[i])
            else: # start nonword
                str2list.append("".join(word))
                word = []
                nonword.append(alltext[i])
        elif nonword:
            if alltext[i].isalpha() or alltext[i] in charset: # start word
                str2list.append("".join(nonword))
                nonword = []
                word.append(alltext[i])
            elif alltext[i] == '<': # start tag
                str2list.append("".join(nonword))
                nonword = []
                url.append(alltext[i])
            else: # continue nonword
                nonword.append(alltext[i])
        else:
            print 'error',
        i += 1
    if nonword:
        str2list.append("".join(nonword))
    if url:
        str2list.append("".join(url))
    if word:
        str2list.append("".join(word))
    return str2list
## example:
text = '''El aguardiente de caña le quemó la garganta y devolvió la
botella con una mueca.
No se me ponga feo, doctor. Esto mata los bichos de las tripas dijo
Antonio José Bolívar, pero no pudo seguir hablando.'''
print split(text)
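(And for the line-by-line worry above: a generator can hold state between
lines, which would let a tag spanning a line break come through whole.
This is a sketch under my own assumptions -- only tags are carried across
lines, a word cut by a newline is not rejoined, and a stray '<' that is
never closed comes out as one trailing token.)

```python
# -*- coding: utf-8 -*-
import re

def tokens(lines):
    '''Yield words, tags, and nonwords from an iterable of lines.
    A tag left open at the end of one line is carried over to the
    next line, so tags spanning lines are kept whole.'''
    # [^>]* may include newlines, so a carried-over tag still matches.
    token = re.compile(r'<[^>]*>|\w+|[^<\w]+', re.UNICODE)
    pending = ''                      # an unfinished '<...' tag
    for line in lines:
        text = pending + line
        pending = ''
        pos = 0
        while pos < len(text):
            m = token.match(text, pos)
            if m is None:             # a '<' with no '>' yet
                pending = text[pos:]  # carry it into the next line
                break
            yield m.group(0)
            pos = m.end()
    if pending:                       # unclosed tag at end of input
        yield pending
```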
___________________________________________________________