[Tutor] text to xml

Paul Tremblay phthenry@earthlink.net
Mon Jun 9 23:34:02 2003


On Tue, Jun 10, 2003 at 11:41:08AM +1200, Tom Brownlee wrote:
> 
> hello there,
> 
> ive posted before about parsing a text file (a tertiary course outline) and 
> having a output file of the course outline in xml format.

Well, I am *very* interested in doing just this myself. I had written a
pretty involved parser (several different modules) before my hard drive
went haywire on me, and I lost a week of very intense work.

For now, I use the docutils utilities to convert my text to XML. I have
just written a module that extends docutils so that it can handle
nested inline markup.

You may want to have a look at docutils. It allows for very complex
markup. (http://docutils.sf.net/docutils-snapshot.tgz)
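
As a rough sketch of what that looks like from Python, assuming docutils
is installed and that its publish_string function and "xml" writer
behave as documented. The function name text_to_docutils_xml and the
sample outline are made up for illustration, and the syntax is Python 3:

```python
try:
    from docutils.core import publish_string
except ImportError:  # docutils is a third-party package and may be missing
    publish_string = None

# Made-up sample in reStructuredText; underlined lines become section titles.
SAMPLE = """\
COURSE AIMS
===========

To introduce programming.

ASSESSMENT
==========

Two assignments and an exam.
"""


def text_to_docutils_xml(text):
    """Return the docutils XML for a reStructuredText string."""
    if publish_string is None:
        raise RuntimeError("docutils is not installed")
    output = publish_string(source=text, writer_name="xml")
    # publish_string normally returns encoded bytes; decode for convenience
    if isinstance(output, bytes):
        output = output.decode("utf-8")
    return output
```

Each heading should come back wrapped in a <section> element with a
<title> child, which is the structure you would then reshape into your
own tags.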
 

The downside of doing this is that you will have to write an XSLT
stylesheet to change the XML that docutils outputs into the XML you
want. The advantage is that you will be using a powerful tool that is
already developed and that allows you to extend it as you want to.
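
If XSLT is more than you want to take on, the same kind of reshaping
can be roughed out in plain Python with the standard library. This is a
different technique from a stylesheet, shown only to make the idea
concrete. The sample below is hand-written to resemble docutils output
(an assumption, heavily trimmed), and tag_name, reshape and the
"CourseOutline" root are hypothetical names. The syntax is Python 3:

```python
import re
import xml.etree.ElementTree as ET

# Hand-written stand-in for docutils XML output (assumed shape, trimmed).
SAMPLE = """<document>
<section><title>COURSE AIMS</title>
<paragraph>To introduce programming.</paragraph></section>
<section><title>ASSESSMENT</title>
<paragraph>Two assignments and an exam.</paragraph></section>
</document>"""


def tag_name(title):
    """Turn a heading into a legal XML element name (hypothetical helper)."""
    name = re.sub(r"\W+", "_", title.strip()).strip("_")
    # element names may not be empty or start with a digit
    if not name or name[0].isdigit():
        name = "_" + name
    return name


def reshape(docutils_xml, root_name="CourseOutline"):
    """Rename each section after its title and keep its paragraph text."""
    source = ET.fromstring(docutils_xml)
    root = ET.Element(root_name)
    for section in source.iter("section"):
        title = section.findtext("title", default="untitled")
        element = ET.SubElement(root, tag_name(title))
        paragraphs = section.findall("paragraph")
        element.text = "\n".join(p.text or "" for p in paragraphs)
    return ET.tostring(root, encoding="unicode")
```

Calling reshape(SAMPLE) gives one element per heading, for example a
COURSE_AIMS element holding the paragraph text. Nested sections are
flattened here, which a real stylesheet would probably handle more
carefully.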

Hope this helps

Paul

> 
> ive gone away and come up with the following python code, but it seems to 
> lack something, and im not to sure what it is.
> 
> ive attached my code with the course outline for you (people) to play with.
> 
> what i have to do is convert all major headings (uppercase) and 
> sub-headings into xml tags and leave the rest of the text as is.
> 
> WHAT AM I MISSING?????  to me it seemed feasible to work but it doesnt 
> quite get there.
> 
> thankyou for your help
> 
> tom brownlee
> 

> import re
> 
> from xml.dom.minidom import Document
> 
> def main(arg):
>    #input argument txt file
>    try:
>        f = open(arg)
>        #open the argument file
>    except IOError:
>        print "cannot open file"
>        #if not possible then print error and stop, since f is undefined
>        return
> 
>    #create a new parent Document
>    newdocument = Document()
>    #create the root tag (XML element names cannot contain spaces)
>    rootElement = newdocument.createElement("CourseOutline")
>    #add the lower level tag to the parent Document
>    newdocument.appendChild(rootElement)
> 
>    tagSequence = re.compile(r"(^\d+)\t+")
>    while 1:
>        line = f.readline()
>        if len(line) == 0:
>            break
>        # remove end of line marker
>        #s = line[:len(line)-1]
>        s = line
> 
>        target = tagSequence.search(s)
>        if target:
>            #flag = 1
>            s2 = re.search("\t", s) # find tab position within the string s
>            # get the text after the tab (span() gives the tab start and end position)
>            result = s[s2.span()[1]:].strip()
>            # element names cannot contain whitespace, so join the words with underscores
>            newElement = newdocument.createElement("_".join(result.split()))
>            rootElement.appendChild(newElement)
> 
>    # get the contents of the buffer (the set of tags) produced by toxml()
>    x = newdocument.toxml()
>    f.close()
>    out = open('CourseOutlineXML.xml', 'w')
>    out.write(x)
>    out.close()
>    print x
> 
> if __name__ == '__main__':
>    main("PR301CourseOutline2003.txt")
> 
> 
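
Looking at the code itself: it only creates an element when a line
matches the heading pattern, so the text between headings never reaches
the document, and the heading text is used directly as the element
name. Here is a minimal sketch of one way around both, assuming (as
your regular expression suggests) that a heading is a number, one or
more tabs, then the heading text. I have not run it against your
outline. The names outline_to_xml, "section", "number" and "title" are
my own inventions, and the heading is stored in an attribute rather
than as the tag name so that any characters are legal. Python 3 syntax:

```python
import re
from xml.dom.minidom import Document

# Assumed heading layout: "1<TAB>HEADING" or "1.2<TAB>Sub-heading"
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\t+(.+)$")


def outline_to_xml(lines, root_name="CourseOutline"):
    """Wrap each numbered heading in a <section> and keep the body text."""
    doc = Document()
    root = doc.createElement(root_name)
    doc.appendChild(root)
    current = root  # text before the first heading is attached to the root
    for line in lines:
        line = line.rstrip("\n")
        match = HEADING.match(line)
        if match:
            current = doc.createElement("section")
            current.setAttribute("number", match.group(1))
            current.setAttribute("title", match.group(2).strip())
            root.appendChild(current)
        elif line.strip():
            # ordinary text: keep it as is, inside the current section
            current.appendChild(doc.createTextNode(line + "\n"))
    return doc.toxml()
```

Since a file object yields its lines when iterated, you could pass the
opened outline file straight in as the lines argument and write the
returned string to your output file.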



-- 

************************
*Paul Tremblay         *
*phthenry@earthlink.net*
************************