Finding keywords
Terry Reedy
tjreedy at udel.edu
Tue Mar 8 16:00:59 EST 2011
On 3/8/2011 2:00 PM, Matt Chaput wrote:
> On 08/03/2011 8:58 AM, Cross wrote:
>> I know meta tags contain keywords but they are not always reliable. I
>> can parse XHTML to obtain keywords from meta tags, but how do I verify
>> them? To obtain reliable keywords, I have to parse the plain text
>> obtained from the URL.
This, of course, is a problem for all search engines, especially given
'search optimization' games.
> I think maybe what the OP is asking about is extracting key words from a
> text, i.e. a short list of words that characterize the text. This is an
> information retrieval problem, not really a Python problem.
>
> One simple way to do this is to calculate word frequency histograms for
> each document in your corpus, and then for a given document, select
> words that are frequent in that document but infrequent in the corpus as
> a whole. Whoosh does this.
I believe Google does something like this also. I have seen a claim that
Google only looks at the first x words, hence the advice 'Make sure your
target keywords are in the first x words.' You, of course, can and
should process entire docs.
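
A minimal sketch of the frequency comparison Matt describes, assuming
Python 3 and only the standard library. This is not Whoosh's
implementation; the names tokenize and keywords are made up for
illustration, and the smoothed log weighting is just one common way to
score "frequent here, rare elsewhere".

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def keywords(doc, corpus, top_n=5):
    """Rank words that are frequent in doc but rare across corpus."""
    # Word frequency histogram for the document of interest.
    doc_counts = Counter(tokenize(doc))

    # Document frequency: how many corpus documents contain each word.
    doc_freq = Counter()
    for text in corpus:
        doc_freq.update(set(tokenize(text)))

    n_docs = len(corpus)
    scores = {}
    for word, count in doc_counts.items():
        # Smoothed inverse document frequency (an assumed weighting):
        # a word that appears in every document gets a weight of zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq[word]))
        scores[word] = count * idf

    # Highest-scoring words first.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the python ate the python food",
]
print(keywords(corpus[2], corpus, top_n=3))
```

Note that the whole document is tokenized, not just its first x words,
and that a word like 'the', which occurs in every document, scores zero
however often it appears.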
--
Terry Jan Reedy