How to read between xml tags?

Wed Mar 10 00:28:15 EST 2004

I have a news corpus that looks like the following.  I
want to do a statistical survey of the words used in
the news report per se.  So, I must not consider those
words in the XML tags.

I know that we can use sgmllib to strip the SGML
tags.  But what I want is this:

1. The read operation must either read a full tag or
ignore the tag.

2. If the read operation starts between an opening tag
and its closing tag (for example <TEXT> and </TEXT>),
then it must read the whole thing between those 2
tags all at once.

How can I achieve this please?

<DOC id="XIN19910101.0052" type="story">
<HEADLINE>
This is the news headline
</HEADLINE>
<DATELINE>
March 09, 2004
</DATELINE>
<TEXT>

Here comes the first paragraph. There might be more
than one new line characters ('\n') in each paragraph.


And here is the second paragraph.


This is the third paragraph. Please note that the news
articles do not necessarily have the same number of
paragraphs.

</TEXT>
</DOC>
<DOC id="XIN19910101.0053" type="story">
<HEADLINE>
This is another news report
</HEADLINE>
<DATELINE>
......
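
For what it's worth, here is a minimal sketch of one way the two requirements above might be met. It uses a regular expression rather than sgmllib, and it assumes the whole corpus fits in memory and that <TEXT> blocks are never nested. The names TEXT_BLOCK, text_blocks, paragraphs and word_counts are made up for illustration, and the sample string is a shortened copy of the corpus excerpt.

```python
import re
from collections import Counter

# Matches one <TEXT>...</TEXT> block. DOTALL lets '.' span newlines,
# and the non-greedy '.*?' stops at the first closing tag.
TEXT_BLOCK = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def text_blocks(corpus):
    """Return the body of every <TEXT> element, each as one whole string."""
    return [m.group(1).strip() for m in TEXT_BLOCK.finditer(corpus)]


def paragraphs(block):
    """Split a block on blank lines and rejoin lines wrapped within a paragraph."""
    parts = re.split(r"\n\s*\n", block)
    return [" ".join(p.split()) for p in parts if p.strip()]


def word_counts(corpus):
    """Count words inside <TEXT> blocks only; tags and headlines are skipped."""
    counts = Counter()
    for block in text_blocks(corpus):
        # Assumption: a "word" is a run of ASCII letters or apostrophes.
        counts.update(re.findall(r"[A-Za-z']+", block.lower()))
    return counts


# Shortened, hypothetical sample in the same shape as the corpus excerpt.
sample = """<DOC id="XIN19910101.0052" type="story">
<HEADLINE>
This is the news headline
</HEADLINE>
<TEXT>

Here comes the first
paragraph.


And here is the second paragraph.

</TEXT>
</DOC>
"""

if __name__ == "__main__":
    for para in paragraphs(text_blocks(sample)[0]):
        print(para)
    print(word_counts(sample).most_common(3))
```

Because each match is taken as a whole, a tag is either consumed completely or never seen, and everything between the two tags arrives in one piece. For a corpus too large to load at once, the same idea would need a streaming reader instead.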
