How to read between XML tags?

Wed Mar 10 06:28:15 CET 2004

I have a news corpus that looks like the sample below.  I
want to do a statistical survey of the words used in
the news reports themselves, so I must not count the
words inside the XML tags.

I know that we can use sgmllib to strip the SGML
tags.  But what I want is this:

1. The read operation must either read a full tag or
ignore the tag.

2. If the read operation reads between <P> and </P>,
then it must read everything between those two
tags all at once.

How can I achieve this please?

<DOC id="XIN19910101.0052" type="story">
This is the news headline
March 09, 2004
Here comes the first paragraph. There might be more
than one newline character ('\n') in each paragraph.
And here is the second paragraph.
This is the third paragraph. Please note that the news
articles do not necessarily have the same number of
<DOC id="XIN19910101.0053" type="story">
This is another news report
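
One possible way to meet the two requirements above, offered only as a
rough sketch: read the whole file into a string and let a regular
expression pull out each <P>...</P> block in one piece, so a tag is
either matched whole or skipped. This assumes each paragraph is wrapped
in <P> and </P> and that the file fits in memory. The names paragraphs
and word_counts are made up for illustration, and the sample string is a
hypothetical stand-in for the real corpus. It uses the re module rather
than sgmllib.

```python
import re
from collections import Counter

# Matches one whole <P>...</P> element; DOTALL lets "." span newlines,
# so a paragraph broken over several lines is still read all at once.
PARAGRAPH = re.compile(r"<P>(.*?)</P>", re.DOTALL | re.IGNORECASE)
# Matches any complete tag, so a tag is never read only in part.
TAG = re.compile(r"<[^>]+>")
# A deliberately simple notion of "word" (assumption for this sketch).
WORD = re.compile(r"[A-Za-z']+")


def paragraphs(text):
    """Yield the text of each <P>...</P> block with whitespace collapsed."""
    for match in PARAGRAPH.finditer(text):
        body = TAG.sub(" ", match.group(1))  # drop any tags nested inside
        yield " ".join(body.split())


def word_counts(text):
    """Count lowercase words that occur inside paragraphs only."""
    counts = Counter()
    for para in paragraphs(text):
        counts.update(word.lower() for word in WORD.findall(para))
    return counts


# Hypothetical sample; the real corpus layout may differ.
sample = """<DOC id="XIN19910101.0052" type="story">
This is the news headline
<P>
Here comes the first paragraph. There might be
more than one newline in it.
</P>
<P>And here is the second paragraph.</P>
</DOC>"""

for para in paragraphs(sample):
    print(para)
print(word_counts(sample).most_common(3))
```

Note that this counts only the words inside <P> blocks, so the headline
and the attributes of the DOC tag are left out. If the headline should be
counted as well, it would need its own pattern.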

