[scikit-learn] Supervised prediction of multiple scores for a document
Amirouche Boubekki
amirouche.boubekki at gmail.com
Tue Jul 3 07:46:43 EDT 2018
I made a rendering of the result online https://sensimark.com/
On Sun, Jun 3, 2018 at 11:22 PM, Sebastian Raschka <mail at sebastianraschka.com>
wrote:
> sorry, I had a copy & paste error, I meant "LogisticRegression(...,
> multi_class='multinomial')" and not "LogisticRegression(...,
> multi_class='ovr')"
>
> > On Jun 3, 2018, at 5:19 PM, Sebastian Raschka <mail at sebastianraschka.com>
> wrote:
> >
> > Hi,
> >
> >> I quickly read about multinomial regression; is it something you
> recommend I use? Or do you have something else in mind?
> >
> > Multinomial regression (or Softmax Regression) should give you results
> somewhat similar to a linear SVC (or logistic regression with OvO or OvR).
> The theoretical difference is that Softmax regression assumes that the
> classes are mutually exclusive, which is probably not the case in your
> setting since e.g., an article could be both "Art" and "Science" to some
> extent. Here is a quick summary of softmax regression, if useful:
> https://sebastianraschka.com/faq/docs/softmax_regression.html. In
> scikit-learn, you can use it via LogisticRegression(..., multi_class='ovr').
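For illustration, a minimal sketch (toy data, not from the thread) of multinomial logistic regression in scikit-learn; `predict_proba` returns per-class probabilities that sum to 1:

```python
# Minimal sketch on toy data: softmax (multinomial) logistic regression.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
clf = LogisticRegression(multi_class='multinomial', solver='lbfgs',
                         max_iter=1000)
clf.fit(X, y)

proba = clf.predict_proba(X[:1])  # shape (1, 3); each row sums to 1.0
```

(In recent scikit-learn versions, multinomial is the default behavior and the `multi_class` argument is being phased out.)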
> >
> > However, spontaneously, I would say that Latent Dirichlet Allocation
> could be a better choice in your case. I.e., fit the model on the corpus
> for a specified number of topics (e.g., 10, but depends on your dataset, I
> would experiment a bit here), look at the top words in each topic and then
> assign a topic label to each topic. Then, for a given article, you can
> assign e.g., the top X labeled topics.
> >
> > Best,
> > Sebastian
> >
> >
> >
> >
> >> On Jun 3, 2018, at 5:03 PM, Amirouche Boubekki <
> amirouche.boubekki at gmail.com> wrote:
> >>
> >> Hello,
> >>
> >> I started a natural language processing project a few weeks ago called
> wikimark (the code is all in wikimark.py)
> >>
> >> Given a text it wants to return a dictionary scoring the input against
> vital articles categories, e.g.:
> >>
> >> out = wikimark("""Peter Hintjens wrote about the relation between
> technology and culture. Without using the scientific tone of a
> state-of-the-art review of Anthropocene anthropology, he gives a fair
> amount of food for thought. According to Hintjens, technology is doomed to
> become cheap. As a matter of fact, intelligence tools will become more and
> more accessible, which will trigger a revolution to rebalance forces in
> society.""")
> >>
> >> for category, score in out:
> >>     print('{} ~ {}'.format(category, score))
> >>
> >> The above program would output something like this:
> >>
> >> Art ~ 0.1
> >> Science ~ 0.5
> >> Society ~ 0.4
> >>
> >> Except that not everything went as planned. Note that in the above
> example the total is equal to 1, but I could not achieve that at all.
> >>
> >> I am using gensim to compute vectors of paragraphs (doc2vec) and then
> submit those vectors to svm.SVR in a one-vs-all strategy, i.e. a document is
> scored 1 if it is in that subcategory and zero otherwise. At prediction
> time, the input goes through the same doc2vec pipeline. The program scores
> each paragraph against the SVR models of the wikipedia vital article
> subcategories and gets a value between 0 and 1 per paragraph. I compute
> the sum grouped by subcategory, and then I have a score per category for
> the input document.
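If I read the pipeline right, it is roughly the following (a sketch with illustrative names, not from wikimark.py; TF-IDF stands in for the doc2vec embedding):

```python
# Sketch of the described pipeline: one SVR per subcategory, trained
# one-vs-all, then per-paragraph scores summed per top-level category.
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVR

corpus = {  # subcategory -> training paragraphs (toy data)
    "Art/Painting": ["the painter mixed oil colors on the canvas"],
    "Science/Physics": ["the particle decayed inside the detector"],
}
vec = TfidfVectorizer().fit(p for ps in corpus.values() for p in ps)

# One-vs-all targets: 1 for paragraphs inside the subcategory, 0 outside.
models = {}
for sub in corpus:
    X, y = [], []
    for other, paragraphs in corpus.items():
        for p in paragraphs:
            X.append(p)
            y.append(1.0 if other == sub else 0.0)
    models[sub] = SVR().fit(vec.transform(X), y)

def score(document):
    # Score each paragraph against every model, sum per top-level category.
    scores = defaultdict(float)
    for paragraph in document.split("\n"):
        v = vec.transform([paragraph])
        for sub, model in models.items():
            scores[sub.split("/")[0]] += float(model.predict(v)[0])
    return dict(scores)
```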
> >>
> >> It somewhat works. I made a web UI, online at
> https://sensimark.com, where you can test it. You can directly access the
> full API, e.g.
> https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&all=1
> >>
> >> The output JSON document is a list of category dictionaries, where the
> prediction key is associated with the average of the "predictions" of the
> subcategories. If you replace &all=1 with &top=5 you might get different
> top categories, e.g.
> https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&top=10
> >>
> >> or
> >>
> >>
> https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&top=5
> >>
> >> I wrote "prediction" with double quotes because the value you see is
> the result of a formula. Since the predictions I get are rather small,
> between 0 and 0.015, I apply the following formula:
> >> value = math.exp(prediction)
> >> magic = ((value * 100) - 110) * 100
> >>
> >> This is meant to spread the values between -200 and 200. Maybe this is
> a symptom that my model doesn't work at all.
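A simpler alternative (my assumption, not what wikimark.py does) that makes the per-category scores sum to 1 is to normalize them directly, falling back to a softmax only when the plain sum is degenerate:

```python
# Normalize raw per-category scores so they sum to 1 (an assumption,
# not the original formula); softmax handles non-positive totals.
import math

def normalize(scores):
    total = sum(scores.values())
    if total <= 0:  # degenerate: fall back to a softmax
        exps = {k: math.exp(v) for k, v in scores.items()}
        total = sum(exps.values())
        return {k: v / total for k, v in exps.items()}
    return {k: v / total for k, v in scores.items()}

raw = {"Art": 0.003, "Science": 0.015, "Society": 0.012}
normalized = normalize(raw)  # values sum to 1.0
```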
> >>
> >> Still, the top 10 results are almost always near each other (try with
> BBC articles on https://sensimark.com). It is only when a regression
> model is disqualified with a score of 0 that the results are simple to
> understand. Sadly, I don't have an example at hand to support that claim;
> you will have to believe me.
> >>
> >> Looking at the machine learning map, I just figured that my problem
> might be a classification problem, except that I don't really want to know
> the class of new documents; I want to know which subjects are dealt with
> in the document, based on a hierarchical corpus.
> >> I don't want to guess a hierarchy! I want to know how the document
> content spreads over the different categories and subcategories.
> >>
> >> I quickly read about multinomial regression; is it something you
> recommend I use? Or do you have something else in mind?
> >>
> >> Also, it seems I should benchmark / evaluate my model against LDA.
> >>
> >> I am rather a noob in terms of data science and my math skills are not
> so fresh. I am mostly looking for ideas about which algorithms, fine tuning
> and data science practices I should follow that don't involve writing my
> own algorithm.
> >>
> >> Thanks in advance!
> >> _______________________________________________
> >> scikit-learn mailing list
> >> scikit-learn at python.org
> >> https://mail.python.org/mailman/listinfo/scikit-learn
> >
>