[Tutor] "words", tags, "nonwords" in xml/text files
Kent Johnson
kent37 at tds.net
Wed May 24 11:53:21 CEST 2006
rio wrote:
> I'm developing an application to do interlineal (an extreme type of
> literal) translations of natural language texts and xml. Here's an example
> of a text:
>
> '''Para eso son los amigos. Para celebrar <i>las gracias</i> del otro.'''
>
> and the expected translation with all of the original tags, whitespace,
> etc intact:
>
> '''For that are the friends. For toCelebrate <i>the graces</i> ofThe
> other.<p>'''
>
> I was unable to find (in htmlparser, string or unicode) a way to define
> words as a series of letters (including non-ascii char sets) outside of an
> xml tag and whitespace/punctuation, so I wrote the code below to create a
> list of the words, nonwords, and xml tags in a text. My intuition tells
> me that it's an awful lot of code to do a simple thing, but it's the best I
> could come up with. I foresee several problems:
>
> - It currently requires that the entire string (or file) be read into
> memory. If I wanted to process a large file line by line, a tag that
> spans more than one line would be ignored. (That assumes I could not
> store state information in the function, which is something I have
> not yet learned how to do.)
> - HTML comments may not be supported. (I'm not really sure about this.)
> - It may be very slow, as it indexes into the string instead of iterating
> over it.
>
> What can I do to overcome these issues? Am I reinventing the wheel? Should
> I be using re?
You should probably be using sgmllib. Here is an example that is pretty
close to what you are doing:
http://diveintopython.org/html_processing/index.html
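For readers on Python 3, where sgmllib no longer exists, here is a rough sketch of the same idea using html.parser.HTMLParser as a stand-in. The names Tokenizer, tokenize and TEXT_RUN are made up for illustration and do not come from the original code. The parser keeps its own state between feed() calls, so a tag that spans several lines is held back until it is complete, and comments are reported through their own callback. Two caveats: end tags are rebuilt from the lowercased tag name, and character references such as &amp; are decoded by default, so the output is not byte-for-byte identical to the input in those cases.

```python
import re
from html.parser import HTMLParser

# A run of letters (Unicode-aware) is a "word"; any other run is a "nonword".
TEXT_RUN = re.compile(r"(?P<word>[^\W\d_]+)|(?P<nonword>[\W\d_]+)")


class Tokenizer(HTMLParser):
    """Collect (kind, text) pairs, where kind is 'word', 'nonword' or 'tag'."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as written.
        self.tokens.append(("tag", self.get_starttag_text()))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are stored once, not as open + close.
        self.tokens.append(("tag", self.get_starttag_text()))

    def handle_endtag(self, tag):
        # Assumption: rebuilding the end tag from its name is good enough.
        self.tokens.append(("tag", "</%s>" % tag))

    def handle_comment(self, data):
        self.tokens.append(("tag", "<!--%s-->" % data))

    def handle_data(self, data):
        # Split the text between tags into letter runs and everything else.
        for match in TEXT_RUN.finditer(data):
            self.tokens.append((match.lastgroup, match.group()))


def tokenize(chunks):
    """Feed an iterable of strings (for example, lines of a file)."""
    parser = Tokenizer()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.tokens


if __name__ == "__main__":
    for token in tokenize(["Para celebrar <i>las gracias</i> del otro."]):
        print(token)
```

Because tokenize() accepts any iterable of strings, an open file object can be passed directly to process a large file line by line. Text is handed over as soon as it arrives, so a run of whitespace or punctuation that crosses a line boundary may come back as two adjacent nonword tokens.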
Kent