[Tutor] XML data reading

Michael Langford mlangford.cs03 at gtalumni.org
Thu Dec 20 23:12:39 CET 2007


SAX is the simplest to get started with. Here is a simple example. See
http://docs.python.org/lib/content-handler-objects.html for more info
on the methods of ContentHandler. Keeping a list or dict of the tags
you're processing, as I do in the following example, will keep
"tag-specific code" down to a minimum. In theory, you'd just have to
change the list you initialize the object with and the endElement
method to handle a new tag.

from xml import sax
from xml.sax.handler import ContentHandler


class myhandler(ContentHandler):
    def __init__(self, tagsToChirpOn=None):
        ContentHandler.__init__(self)
        self.last = ""  # name of the most recently opened tag
        self.info = ""  # text collected for that tag so far
        if tagsToChirpOn is None:
            self.chirpTags = []
        else:
            self.chirpTags = tagsToChirpOn

    # Then you define a startElement method;
    # it is called for each open tag the parser sees.
    def startElement(self, name, attr):
        self.last = name
        self.info = ""
        if name in self.chirpTags:
            print("starting %s tag" % name)

    # Then you define a characters method, which is called
    # (possibly several times) with pieces of the text inside
    # the tag until all of it has been delivered.
    def characters(self, content):
        if self.last in self.chirpTags:
            self.info += content

    # Then, if you need an action to happen when an end tag
    # is hit, you write an endElement method.
    def endElement(self, name):
        """Called at </closetag>."""
        if len(self.info) > 0:
            print("In tag %s was data{{%s}}" % (self.last, self.info))
        if name in self.chirpTags:
            print("Now leaving the %s tag" % name)
        # Reset so that whitespace following the end tag is not
        # collected and reported as belonging to this tag.
        self.last = ""
        self.info = ""


if __name__ == "__main__":
    document = """
    <xml>
        <foo>line 1
           bars are fun
        </foo>
        <bar>line 2
           dogs don't like celery
        </bar>
        <baz>
           121309803124.12
        </baz>
    </xml>"""
    hand = myhandler(["bar", "baz"])
    sax.parseString(document, hand)
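
The "dict of tags" idea mentioned before the example could look roughly like the sketch below. It is only an illustration: DictHandler, parse_with and the converter dict are names made up here, and it assumes the tags you track contain plain text rather than other tracked tags. Supporting a new tag then means adding one entry to the dict.

```python
from xml import sax
from xml.sax.handler import ContentHandler


class DictHandler(ContentHandler):
    """Hypothetical handler: converts the text of each tag named in a dict."""

    def __init__(self, converters):
        ContentHandler.__init__(self)
        self.converters = converters  # tag name -> function applied to its text
        self.current = None           # tracked tag currently open, if any
        self.chunks = []              # pieces of text received so far
        self.results = {}             # tag name -> list of converted values

    def startElement(self, name, attrs):
        if name in self.converters:
            self.current = name
            self.chunks = []

    def characters(self, content):
        # characters() may be called several times for one run of text
        if self.current is not None:
            self.chunks.append(content)

    def endElement(self, name):
        if name == self.current:
            text = "".join(self.chunks).strip()
            value = self.converters[name](text)
            self.results.setdefault(name, []).append(value)
            self.current = None


def parse_with(converters, document):
    """Parse a document and return {tag: [converted values]}."""
    handler = DictHandler(converters)
    sax.parseString(document, handler)
    return handler.results


if __name__ == "__main__":
    doc = b"<data><bar>line 2</bar><baz> 121309803124.12 </baz></data>"
    print(parse_with({"bar": str, "baz": float}, doc))
```

Tags that are not in the dict (such as foo in the example document) are simply ignored, so the handler class itself never changes.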

	
You often need to build a state machine or some other stateful
tracking system to make SAX parsers do complicated things, but the
above is good enough for most things involving data. If you use the
startElement method to create a new object, the characters method to
populate it, and the endElement method to submit the object to a
larger data structure, you can very easily build objects out of XML
data from any source. I most recently used SAX parsers to parse REST
data from Amazon.
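
A minimal sketch of that create/populate/submit pattern follows. The Item class, the item/title/price tag names and build_items are all invented for illustration; they are not taken from any real Amazon response.

```python
from xml import sax
from xml.sax.handler import ContentHandler


class Item(object):
    """Hypothetical record type; the field names are made up for this sketch."""

    def __init__(self):
        self.title = ""
        self.price = ""


class ItemBuilder(ContentHandler):
    FIELDS = ("title", "price")

    def __init__(self):
        ContentHandler.__init__(self)
        self.items = []    # the larger data structure
        self.item = None   # object currently being built
        self.field = None  # field currently being filled in

    def startElement(self, name, attrs):
        if name == "item":
            self.item = Item()  # start tag creates a new object
        elif self.item is not None and name in self.FIELDS:
            self.field = name

    def characters(self, content):
        if self.field is not None:
            # text may arrive in several pieces, so append to the field
            old = getattr(self.item, self.field)
            setattr(self.item, self.field, old + content)

    def endElement(self, name):
        if name == self.field:
            self.field = None
        elif name == "item" and self.item is not None:
            self.items.append(self.item)  # end tag submits the object
            self.item = None


def build_items(document):
    """Parse a document and return the list of Item objects found in it."""
    handler = ItemBuilder()
    sax.parseString(document, handler)
    return handler.items


if __name__ == "__main__":
    doc = (b"<items><item><title>Celery</title><price>1.50</price></item>"
           b"<item><title>Bars</title><price>2</price></item></items>")
    for it in build_items(doc):
        print("%s costs %s" % (it.title, it.price))
```

The item and field attributes are the "stateful tracking" part: they record where in the document the parser currently is.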

urllib2 and SAX parsers are formidable, quick technologies for
simple parsing needs. Look into BeautifulSoup as well:
http://www.crummy.com/software/BeautifulSoup/
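
As for pairing urllib2 with SAX: sax.parse accepts a file-like object, so a response from urllib2.urlopen can be handed to it directly. The sketch below stays offline and uses an in-memory stream as a stand-in for the response; TagCounter, count_tags and the feed/entry tags are made-up names.

```python
import io
from xml import sax
from xml.sax.handler import ContentHandler


class TagCounter(ContentHandler):
    """Hypothetical handler that counts how often each tag is opened."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


def count_tags(stream):
    """Parse XML from a filename or file-like object; return tag counts."""
    handler = TagCounter()
    sax.parse(stream, handler)
    return handler.counts


if __name__ == "__main__":
    # Stand-in for a network response. With urllib2 you would pass the
    # object returned by urllib2.urlopen(url) instead of this BytesIO.
    fake_response = io.BytesIO(b"<feed><entry/><entry/></feed>")
    print(count_tags(fake_response))
```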

                        --Michael

On Dec 20, 2007 4:15 PM, Lockhart, Luke <LockhartL at ripon.edu> wrote:
>
> Hello all,
>
>  So I'm a very novice Python programmer. I've done stuff up to the
> intermediate level in Microsoft flavors of BASIC and C++, but now I'm a
> Linux man and trying to realize my overly ambitious programming dreams with
> Python, mainly because I have friends who use it and because it has
> libraries that in general are very good at doing what I want to do.
>
>  Now, the program I'm working on would hypothetically store and read all
> data as XML, and yes I'm aware of the performance drawbacks and I'm willing
> to live with them. But I just can't figure out the various libraries that
> Python uses to read XML, and my friend's code doesn't help much.
>
>  Basically, what I really would like to do, without going into a lot of
> detail, is be able to read various tags into classes and arrays until the
> entire file has been read, then remove the file from memory. I first tried
> to use the basic Python XML libraries, and then my friend recommended SAX -
> but so far as I can tell, either method requires numerous lines of code to
> support one new tag. Is this what I'm going to have to do, or is there a
> simpler way?
>
>  Thanks in advance,
>  Luke



-- 
Michael Langford
Phone: 404-386-0495
Consulting: http://www.RowdyLabs.com

