how to use structured markup tools
Uche Ogbuji
uche.ogbuji at gmail.com
Tue Mar 22 20:58:10 EST 2005
On Sat, 2005-03-19 at 00:14 -0800, Sean McIlroy wrote:
> I'm dealing with XML files in which there are lots of tags of the
> following form: <a><b>x</b><c>y</c></a> (all of these letters are being
> used as 'metalinguistic variables') Not all of the tags in the file are
> of that form, but that's the only type of tag I'm interested in. (For
> the insatiably curious, I'm talking about a conversation log from MSN
> Messenger.) What I need to do is to pull out all the x's and y's in a
> form I can use. In other words, from...
>
> .
> .
> <a><b>x1</b><c>y1</c></a>
> .
> .
> <a><b>x2</b><c>y2</c></a>
> .
> .
> <a><b>x3</b><c>y3</c></a>
> .
> .
>
> ...I would like to produce, for example,...
>
> [ (x1,y1), (x2,y2), (x3,y3) ]
>
> Now, I'm aware that there are extensive libraries for dealing with
> marked-up text, but here's the thing: I think I have a reasonable
> understanding of python, but I use it in a lisplike way, and in
> particular I only know the rudiments of how classes work. So here's
> what I'm asking for:
>
> Can anybody give me a rough idea how to come to grips with the problem
> described above? Or even (dare to dream) example code? Any help will be
> very much appreciated.
There are many tools you can use to get this done in Python. Here's a
recipe using Amara ( http://www.xml.com/pub/a/2005/01/19/amara.html ):
DOC = """\
<matrix>
<a><b>x1</b><c>y1</c></a>
<a><b>x2</b><c>y2</c></a>
<a><b>x3</b><c>y3</c></a>
</matrix>
"""
from amara import binderytools
matrix = []
for row in binderytools.pushbind(u'a', string=DOC):
    matrix.append((unicode(row.b), unicode(row.c)))
print matrix
Which outputs:
[(u'x1', u'y1'), (u'x2', u'y2'), (u'x3', u'y3')]
If your matrix actually has a variable or previously unknown number of
columns (e.g. <a><b>x1</b><c>y1</c><d>z1</d></a> ), the following
version of the for loop is a more general solution:
for row in binderytools.pushbind(u'a', string=DOC):
    matrix.append(tuple([ unicode(e) for e in row.xml_xpath(u'*') ]))
Same output, of course. I even tested it for you in Amara 0.9.4. And
what the heck, while I was there, I added it to the demos.
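If Amara isn't installed, roughly the same extraction can be sketched with
xml.etree.ElementTree from the standard library. This is only an
illustration, not part of the Amara recipe: it is written for a current
Python (print as a function, plain str instead of unicode), and the helper
name rows_as_tuples is made up for the example. It takes every child of
each <a> element, so it covers the variable-column case as well.

```python
import xml.etree.ElementTree as ET

DOC = """\
<matrix>
<a><b>x1</b><c>y1</c></a>
<a><b>x2</b><c>y2</c></a>
<a><b>x3</b><c>y3</c></a>
</matrix>
"""

def rows_as_tuples(xml_text, row_tag="a"):
    """Return one tuple per <row_tag> element, holding each child's text.

    Hypothetical helper for illustration; not an Amara or ElementTree API.
    """
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so rows are found at any nesting depth.
    # Iterating over a row yields its direct children in document order.
    # A child with no text content contributes None to the tuple.
    return [tuple(child.text for child in row) for row in root.iter(row_tag)]

matrix = rows_as_tuples(DOC)
print(matrix)
```

For the sample document this prints
[('x1', 'y1'), ('x2', 'y2'), ('x3', 'y3')]. One difference to keep in mind:
ET.fromstring builds the whole tree in memory, whereas pushbind, as I
understand it, hands back matching subtrees one at a time, which matters
more for a large log file than for this toy input.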
You can make things even more obfuscated^H^H^H^H^H^H^H^H^H^Hterse using
further lambda or list comp tricks, but I leave that as an exercise for
the perverse ;-)
--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com
Use CSS to display XML, part 2 - http://www-128.ibm.com/developerworks/edu/x-dw-x-xmlcss2-i.html
Writing and Reading XML with XIST - http://www.xml.com/pub/a/2005/03/16/py-xml.html
Introducing the Amara XML Toolkit - http://www.xml.com/pub/a/2005/01/19/amara.ht
Be humble, not imperial (in design) - http://www.adtmag.com/article.asp?id=10286
Querying WordNet as XML - http://www.ibm.com/developerworks/xml/library/x-think29.html
Packaging XSLT lookup tables as EXSLT functions - http://www.ibm.com/developerworks/xml/library/x-tiplook2.html