How to read between xml tags?

Anthony Liu antonyliu2002 at yahoo.com
Wed Mar 10 00:28:15 EST 2004


I have a news corpus that looks like the sample below.  I
want to do a statistical survey of the words used in
the news reports themselves, so I must not count the
words inside the XML tags.

I know that we can use sgmllib to strip the SGML
tags.  But what I want is this:

1. The read operation must either read a full tag or
ignore the tag.

2. If the read operation reads between <P> and </P>,
then it must read everything between those two
tags all at once.

How can I achieve this please?


<DOC id="XIN19910101.0052" type="story">
<HEADLINE>
This is the news headline
</HEADLINE>
<DATELINE>
March 09, 2004
</DATELINE>
<TEXT>
<P>
Here comes the first paragraph. There might be more
than one newline character ('\n') in each paragraph.
</P>
<P>
And here is the second paragraph.
</P>
<P>
This is the third paragraph. Please note that the news
articles do not necessarily have the same number of
paragraphs.
</P>
</TEXT>
</DOC>
<DOC id="XIN19910101.0053" type="story">
<HEADLINE>
This is another news report
</HEADLINE>
<DATELINE>
......
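
One possible way to meet the two requirements above, offered as a rough
sketch rather than a definitive answer: match each <P>...</P> block as a
single unit with a regular expression, so a tag is either consumed whole or
never seen, and count words only in the captured text. The names
iter_paragraphs and word_counts, the word pattern, and the shortened sample
are illustrative assumptions, and the sketch assumes the corpus text fits in
memory and that <P> tags carry no attributes and are not nested.

```python
import re
from collections import Counter

# Matches one whole <P>...</P> block; DOTALL lets "." span newlines,
# so a paragraph is always captured in a single piece.
PARAGRAPH_RE = re.compile(r"<P>(.*?)</P>", re.DOTALL | re.IGNORECASE)

# Assumed definition of a "word": runs of ASCII letters and apostrophes.
WORD_RE = re.compile(r"[A-Za-z']+")


def iter_paragraphs(text):
    """Yield the text of each <P>...</P> block with whitespace collapsed."""
    for match in PARAGRAPH_RE.finditer(text):
        yield " ".join(match.group(1).split())


def word_counts(text):
    """Count lowercased words that occur inside <P>...</P> blocks only."""
    counts = Counter()
    for paragraph in iter_paragraphs(text):
        counts.update(word.lower() for word in WORD_RE.findall(paragraph))
    return counts


# Shortened, made-up sample in the same shape as the corpus.
sample = """<DOC id="XIN19910101.0052" type="story">
<HEADLINE>
This is the news headline
</HEADLINE>
<TEXT>
<P>
Here comes the first paragraph.
It spans two lines.
</P>
<P>
And here is the second paragraph.
</P>
</TEXT>
</DOC>
"""

paragraphs = list(iter_paragraphs(sample))
counts = word_counts(sample)
print(paragraphs)
print(counts.most_common(3))
```

Because only the text captured between <P> and </P> is tokenized, the tag
names, the attributes, and the HEADLINE and DATELINE text never reach the
counter. For a real file, the string passed to word_counts would come from
reading the corpus file instead of the inline sample.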
