how to use structured markup tools

Uche Ogbuji uche.ogbuji at gmail.com
Tue Mar 22 20:58:10 EST 2005


On Sat, 2005-03-19 at 00:14 -0800, Sean McIlroy wrote: 
> I'm dealing with XML files in which there are lots of tags of the
> following form: <a><b>x</b><c>y</c></a> (all of these letters are being
> used as 'metalinguistic variables') Not all of the tags in the file are
> of that form, but that's the only type of tag I'm interested in. (For
> the insatiably curious, I'm talking about a conversation log from MSN
> Messenger.) What I need to do is to pull out all the x's and y's in a
> form I can use. In other words, from...
> 
> .
> .
> <a><b>x1</b><c>y1</c></a>
> .
> .
> <a><b>x2</b><c>y2</c></a>
> .
> .
> <a><b>x3</b><c>y3</c></a>
> .
> .
> 
> ...I would like to produce, for example,...
> 
> [ (x1,y1), (x2,y2), (x3,y3) ]
> 
> Now, I'm aware that there are extensive libraries for dealing with
> marked-up text, but here's the thing: I think I have a reasonable
> understanding of python, but I use it in a lisplike way, and in
> particular I only know the rudiments of how classes work. So here's
> what I'm asking for:
> 
> Can anybody give me a rough idea how to come to grips with the problem
> described above? Or even (dare to dream) example code? Any help will be
> very much appreciated.

There are many tools you can use to get this done in Python.  Here's a
recipe using Amara ( http://www.xml.com/pub/a/2005/01/19/amara.html )

DOC = """\
<matrix>
<a><b>x1</b><c>y1</c></a>
<a><b>x2</b><c>y2</c></a>
<a><b>x3</b><c>y3</c></a>
</matrix>
"""

from amara import binderytools

matrix = []
for row in binderytools.pushbind(u'a', string=DOC):
    matrix.append((unicode(row.b), unicode(row.c)))

print matrix

Which outputs:

[(u'x1', u'y1'), (u'x2', u'y2'), (u'x3', u'y3')]

If your matrix actually has a variable or previously unknown number of
columns (e.g. <a><b>x1</b><c>y1</c><d>z1</d></a> ), the following
version of the for loop is a more general solution:

for row in binderytools.pushbind(u'a', string=DOC):
    matrix.append(tuple([ unicode(e) for e in row.xml_xpath(u'*') ]))

Same output, of course.  I even tested it for you in Amara 0.9.4.  And
what the heck, while I was there, I added it to the demos.

You can make things even more obfuscated^H^H^H^H^H^H^H^H^H^Hterse using
further lambda or list comp tricks, but I leave that as an exercise for
the perverse ;-)


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Use CSS to display XML, part 2 - http://www-128.ibm.com/developerworks/edu/x-dw-x-xmlcss2-i.html
Writing and Reading XML with XIST - http://www.xml.com/pub/a/2005/03/16/py-xml.html
Introducing the Amara XML Toolkit - http://www.xml.com/pub/a/2005/01/19/amara.ht
Be humble, not imperial (in design) - http://www.adtmag.com/article.asp?id=10286
Querying WordNet as XML - http://www.ibm.com/developerworks/xml/library/x-think29.html
Packaging XSLT lookup tables as EXSLT functions - http://www.ibm.com/developerworks/xml/library/x-tiplook2.html




More information about the Python-list mailing list