[Tutor] "words", tags, "nonwords" in xml/text files

Wed May 24 11:53:21 CEST 2006

rio wrote:
> I'm developing an application to do interlineal (an extreme type of
> literal) translations of natural language texts and xml. Here's an example
> of a text:
> 
> '''Para eso son los amigos. Para celebrar <i>las gracias</i> del otro.'''
> 
> and the expected translation with all of the original tags, whitespace,
> etc intact:
> 
> '''For that are the friends. For toCelebrate <i>the graces</i> ofThe
> other.<p>'''
> 
> I was unable to find (in htmlparser, string or unicode) a way to define
> words as a series of letters (including non-ascii char sets) outside of an
> xml tag and whitespace/punctuation, so I wrote the code below to create a
> list of the words, nonwords, and xml tags in a text. My intuition tells
> me that it's an awful lot of code to do a simple thing, but it's the best I
> could come up with. I foresee several problems:
> 
> -it currently requires that the entire string (or file) be read into
> memory. If I should want to process a large file line by line, a tag that
> spans more than one line would be ignored. (That's assuming I would not be
> able to store state information in the function, which is something I've
> not yet learned how to do.)
> -html comments may not be supported. (I'm not really sure about this.)
> -it may be very slow, as it indexes into the string instead of iterating over it.
> 
> What can I do to overcome these issues? Am I reinventing the wheel? Should
> I be using re?
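
On the "should I be using re?" question, here is a minimal sketch of a
regex-based tokenizer, not the poster's code and not a full xml parser. It
assumes Python 3 (where patterns on str are Unicode-aware), treats any
<...> run or <!-- ... --> comment as a tag, and defines a word as a run of
letters. The names tokenize, interlinear and glossary are made up for
illustration.

```python
import re

# Order matters: tags and comments are tried first, then words, then the rest.
TOKEN_RE = re.compile(
    r"(?P<tag><!--.*?-->|<[^>]*>)"             # an HTML comment or any <...> tag
    r"|(?P<word>[^\W\d_]+)"                    # a run of letters, including non-ASCII
    r"|(?P<nonword>(?:(?![^\W\d_])[^<])+|<)",  # whitespace, punctuation, digits, stray '<'
    re.DOTALL,
)


def tokenize(text):
    """Return a list of (kind, text) pairs that together cover the whole input."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


def interlinear(text, glossary):
    """Replace each word found in the glossary; copy everything else unchanged."""
    out = []
    for kind, tok in tokenize(text):
        out.append(glossary.get(tok, tok) if kind == "word" else tok)
    return "".join(out)


sample = "Para eso son los amigos. Para celebrar <i>las gracias</i> del otro."
tokens = tokenize(sample)

# Hypothetical glossary, only large enough for the sample sentence.
glossary = {
    "Para": "For", "eso": "that", "son": "are", "los": "the",
    "amigos": "friends", "celebrar": "toCelebrate", "las": "the",
    "gracias": "graces", "del": "ofThe", "otro": "other",
}
translated = interlinear(sample, glossary)
print(translated)
```

Because every character lands in exactly one token, joining the token
texts gives back the original string, so tags, whitespace and punctuation
survive the translation untouched.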

You should probably be using sgmllib. Here is an example that is pretty 
close to what you are doing:
http://diveintopython.org/html_processing/index.html
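
A note for anyone reading this on Python 3: sgmllib was removed there, and
html.parser.HTMLParser in the standard library is the closest relative. The
sketch below is an assumption about how the same idea might look with it,
not the code from the link above. It keeps state in a parser object, so
input can be fed line by line and a tag that spans two lines is still
recognised. The class name TokenCollector is made up for illustration.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Collect ("tag", text) and ("text", text) pairs while being fed chunks."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; out of handle_data,
        # so the text chunks are passed through as they appear in the input.
        super().__init__(convert_charrefs=False)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as it was written.
        self.tokens.append(("tag", self.get_starttag_text()))

    def handle_endtag(self, tag):
        # The parser lowercases tag names, so the end tag is rebuilt, not copied.
        self.tokens.append(("tag", "</%s>" % tag))

    def handle_comment(self, data):
        self.tokens.append(("tag", "<!--%s-->" % data))

    def handle_data(self, data):
        self.tokens.append(("text", data))


# The <i ...> tag is deliberately split across two lines.
lines = ["Para celebrar <i\n", 'class="x">las gracias</i> del otro.\n']

collector = TokenCollector()
for line in lines:
    collector.feed(line)  # an incomplete tag is buffered until the rest arrives
collector.close()

for kind, text in collector.tokens:
    print(kind, repr(text))
```

The "text" chunks would still need to be split into words and nonwords
before translating; entities are dropped in this sketch because
handle_entityref and handle_charref are not overridden.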

Kent