Finding keywords

Vlastimil Brom vlastimil.brom at gmail.com
Tue Mar 8 14:51:12 EST 2011


2011/3/8 Cross <X at x.tv>:
> On 03/08/2011 06:09 PM, Heather Brown wrote:
>>
>> The keywords are an attribute in a tag called <meta>, in the section
>> called
>> <head>. Are you having trouble parsing the xhtml to that point?
>>
>> Be more specific in your question, and somebody is likely to chime in.
>> Although
>> I'm not the one, if it's a question of parsing the xhtml.
>>
>> DaveA
>
> I know meta tags contain keywords but they are not always reliable. I can
> parse xhtml to obtain keywords from meta tags; but how do I verify them? To
> obtain reliable keywords, I have to parse the plain text obtained from the
> URL.
>
> Cross
>

Hi,
if you need to extract meaningful keywords in the data mining sense
using natural language processing, it can become quite a complex
task, depending on the requirements; the NLTK toolkit may help with
some approaches [ http://www.nltk.org/ ].
One possibility would be to filter out the very common but less
meaningful words ("stopwords") and extract the most frequent words
from the remainder, e.g. (with some simplifications/hacks in the
interactive mode):

>>> import re, urllib2, nltk
>>> page_src = urllib2.urlopen("http://www.python.org/doc/essays/foreword/").read().decode("utf-8")
>>> page_plain = nltk.clean_html(page_src).lower()
>>> stopwords = set(nltk.corpus.stopwords.words("english"))  # build the set once, not per word
>>> txt_filtered = nltk.Text(word for word in re.findall(r"(?u)\w+", page_plain) if word not in stopwords)
>>> frequency_dist = nltk.FreqDist(txt_filtered)
>>> [(word, freq) for (word, freq) in frequency_dist.items() if freq > 2]
[(u'python', 39), (u'abc', 11), (u'code', 10), (u'c', 7),
(u'language', 7), (u'programming', 7), (u'unix', 7), (u'foreword', 5),
(u'new', 5), (u'would', 5), (u'1st', 4), (u'book', 4), (u'ed', 4),
(u'features', 4), (u'many', 4), (u'one', 4), (u'programmer', 4),
(u'time', 4), (u'use', 4), (u'community', 3), (u'documentation', 3),
(u'early', 3), (u'enough', 3), (u'even', 3), (u'first', 3), (u'help',
3), (u'indentation', 3), (u'instance', 3), (u'less', 3), (u'like', 3),
(u'makes', 3), (u'personal', 3), (u'programmers', 3), (u'readability',
3), (u'readable', 3), (u'write', 3)]
>>>

Another possibility would be to extract certain parts of speech (e.g.
nouns, adjectives, verbs) using nltk.pos_tag(input_txt) etc., along
the lines of the first sketch below; for more convoluted html code,
BeautifulSoup might be used to obtain the plain text (second sketch
below), and there are likely many other options.
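A rough sketch of the part-of-speech variant, continuing the session
above (untested here; it assumes the relevant tagger data has been
installed via nltk.download(), and that FreqDist.items() returns the
samples sorted by decreasing frequency, as in the NLTK versions of
that time):

>>> tokens = re.findall(r"(?u)\w+", page_plain)
>>> tagged = nltk.pos_tag(tokens)  # list of (word, tag) pairs
>>> nouns = [word for (word, tag) in tagged if tag.startswith("NN")]  # NN* marks nouns in the treebank tagset
>>> nltk.FreqDist(nouns).items()[:10]  # the ten most frequent nouns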
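And a minimal sketch of the BeautifulSoup variant, replacing the
nltk.clean_html() step (assuming the BeautifulSoup 3 package of that
time; note that the contents of script and style elements would still
have to be filtered out separately):

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(page_src)
>>> text_nodes = soup.findAll(text=True)  # all text nodes, script/style contents included
>>> page_plain = " ".join(text_nodes).lower()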

hth,
  vbr


