[scikit-learn] Supervised prediction of multiple scores for a document
Amirouche Boubekki
amirouche.boubekki at gmail.com
Sun Jun 3 17:03:08 EDT 2018
Hello,
I started a natural language processing project a few weeks ago called
wikimark <https://github.com/amirouche/wikimark/> (the code is all in
wikimark.py
<https://github.com/amirouche/wikimark/blob/master/wikimark.py#L1>)
Given a text, it returns a dictionary scoring the input against the vital
articles categories
<https://en.wikipedia.org/api/rest_v1/page/html/Wikipedia%3AVital_articles%2FLevel%2F5>,
e.g.:
out = wikimark("""Peter Hintjens wrote about the relation between
technology and culture. Without using the scientific tone of a
state-of-the-art review of Anthropocene anthropology, he gives a fair
amount of food for thought. According to Hintjens, technology is doomed
to become cheap. As a matter of fact, intelligence tools will become
more and more accessible, which will trigger a revolution to rebalance
forces in society.""")

for category, score in out:
    print('{} ~ {}'.format(category, score))
The above program would output something like this:
Art ~ 0.1
Science ~ 0.5
Society ~ 0.4
Except not everything went as planned: in the above example the scores
sum to 1, but I could not achieve that at all.
I am using gensim to compute vectors of paragraphs (doc2vec) and then
submit those vectors to svm.SVR in a one-vs-all strategy, i.e. a document
is scored 1 if it is in that subcategory and zero otherwise. At prediction
time, the input goes through the same doc2vec pipeline. The program scores
*each paragraph* against the SVR models of the Wikipedia vital article
subcategories and gets a value between 0 and 1 for *each paragraph*. I sum
those values, group by subcategory, and then I have a score per category
for the input document.
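The sum-and-group-by step can be sketched as follows (a minimal sketch
with made-up paragraph scores; the subcategory names and the CATEGORY_OF
mapping are hypothetical stand-ins for the vital-articles hierarchy):

```python
from collections import defaultdict

# Hypothetical mapping from subcategory to its parent vital-article category.
CATEGORY_OF = {'Painting': 'Art', 'Physics': 'Science', 'Law': 'Society'}

# One SVR prediction per (paragraph, subcategory) pair; values made up.
paragraph_scores = [
    {'Painting': 0.10, 'Physics': 0.60, 'Law': 0.30},
    {'Painting': 0.05, 'Physics': 0.40, 'Law': 0.55},
]

def score_document(paragraph_scores):
    """Sum the per-paragraph SVR outputs, grouped by parent category."""
    totals = defaultdict(float)
    for scores in paragraph_scores:
        for subcategory, score in scores.items():
            totals[CATEGORY_OF[subcategory]] += score
    return dict(totals)

print(score_document(paragraph_scores))
```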
It somewhat works. I put a web UI online at https://sensimark.com where
you can test it. You can also access the full API directly, e.g.:
https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&all=1
The output JSON document is a list of category dictionaries, where the
prediction key is associated with the average of the "prediction" values
of the subcategories. If you replace &all=1 with &top=5 you might get a
different set of top categories, e.g.:
https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&top=10
or
https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&top=5
I wrote "prediction" with double quotes because the value you see is the
result of a formula. Since the predictions I get are rather small, between
0 and 0.015, I apply the following formula:

value = math.exp(prediction)
magic = ((value * 100) - 110) * 100

so that the values spread between -200 and 200. Maybe this is a symptom
that my model doesn't work at all.
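Made runnable, the rescaling is just (a direct transcription of the two
lines above, with the import added):

```python
import math

def rescale(prediction):
    """Rescale a small SVR prediction with the exp-based formula."""
    value = math.exp(prediction)
    return ((value * 100) - 110) * 100

# exp is monotonic, so the rescaling preserves the ranking of predictions.
print(rescale(0.0))    # -1000.0
print(rescale(0.015))
```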
Still, the top 10 results are almost always near each other (try with BBC
<http://www.bbc.com/> articles on https://sensimark.com). It is only when
a regression model is disqualified with a score of 0 that the results are
easy to understand. Sadly, I don't have an example at hand to support that
claim; you will have to believe me.
Looking at the machine learning map
<http://scikit-learn.org/stable/tutorial/machine_learning_map/>, I just
figured that my problem might be a classification problem, except I don't
really want to know *the* class of new documents; I want to know what the
different subjects dealt with in the document are, based on a hierarchical
corpus. I don't want to guess a hierarchy! I want to know how the
document's content is spread over the different categories and
subcategories.
I quickly read about multinomial regression; is it something you
recommend I use? Or do you have something else in mind?
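One thing that drew me to multinomial regression is that it outputs, by
construction, per-class probabilities that sum to 1. A minimal sketch
with scikit-learn's LogisticRegression (which uses a multinomial
objective for multi-class data by default); the vectors and labels here
are made up, and in practice X would be the doc2vec vectors:

```python
from sklearn.linear_model import LogisticRegression

# Made-up 2-D "document vectors" and their category labels.
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.5, 0.5]]
y = ['Art', 'Art', 'Science', 'Science', 'Society']

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns one probability per class, and they sum to 1.
for category, proba in zip(clf.classes_, clf.predict_proba([[0.6, 0.4]])[0]):
    print('{} ~ {:.2f}'.format(category, proba))
```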
Also, it seems I should benchmark / evaluate my model against LDA.
I am rather a noob in terms of data science and my math skills are not so
fresh. I am mostly looking for ideas on which algorithms, fine-tuning,
and data science practices I should follow that don't involve writing my
own algorithm.
Thanks in advance!