[scikit-learn] Supervised prediction of multiple scores for a document

Sebastian Raschka mail at sebastianraschka.com
Sun Jun 3 17:19:32 EDT 2018


Hi,

> I quickly read about multinomial regression; is it something you recommend I use? Or do you have something else in mind?

Multinomial regression (or softmax regression) should give you results somewhat similar to a linear SVC (or logistic regression with OvO or OvR). The theoretical difference is that softmax regression assumes that the classes are mutually exclusive, which is probably not the case in your setting since, e.g., an article could be both "Art" and "Science" to some extent. Here is a quick summary of softmax regression, if useful: https://sebastianraschka.com/faq/docs/softmax_regression.html. In scikit-learn, you can use it via LogisticRegression(..., multi_class='multinomial', solver='lbfgs'); multi_class='ovr' would fit one-vs-rest models instead, and the liblinear solver does not support the multinomial option.
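
To illustrate the mutual-exclusivity point, here is a minimal sketch in plain Python (no scikit-learn involved). The category names come from your example; the raw scores are made-up values, and the sigmoid stands in for independent one-vs-rest classifiers before any normalization. Softmax forces the classes to compete for a total of 1, whereas independent sigmoids let several categories score high at once.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(score):
    """Independent per-class probability, as in a one-vs-rest setup."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical raw scores for three categories (illustrative values only).
categories = ["Art", "Science", "Society"]
raw_scores = [0.2, 1.5, 1.1]

softmax_probs = softmax(raw_scores)
ovr_probs = [sigmoid(s) for s in raw_scores]

for name, p_soft, p_ovr in zip(categories, softmax_probs, ovr_probs):
    print("{} ~ softmax {:.2f}, one-vs-rest {:.2f}".format(name, p_soft, p_ovr))

# Softmax probabilities always add up to 1; independent sigmoids need not.
print("softmax total:", round(sum(softmax_probs), 6))
print("one-vs-rest total:", round(sum(ovr_probs), 6))
```

If you want scores that add up to 1 as in your "Art ~ 0.1, Science ~ 0.5, Society ~ 0.4" example, the softmax column is the one that behaves that way by construction.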

However, off the top of my head, I would say that Latent Dirichlet Allocation could be a better choice in your case. That is, fit the model on the corpus for a specified number of topics (e.g., 10, but this depends on your dataset, so I would experiment a bit here), look at the top words in each topic, and then assign a label to each topic. Then, for a given article, you can assign, e.g., the top X labeled topics.
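
A rough sketch of that LDA workflow in scikit-learn could look like the following. The toy corpus, the choice of 2 topics, and the placeholder topic labels are all assumptions for illustration; on a real corpus you would inspect the top words and pick the labels by hand.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus, for illustration only.
corpus = [
    "painting sculpture museum gallery artist painting",
    "museum artist gallery exhibition sculpture",
    "physics experiment theory particle laboratory",
    "laboratory experiment particle physics measurement",
    "artist painting gallery physics experiment theory",
]

# LDA works on raw word counts (bag of words).
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

# Number of topics is a guess here; experiment with it on real data.
n_topics = 2
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
doc_topics = lda.fit_transform(counts)

# Show the top words per topic so that a human can label each topic.
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}
top_words = []
for topic_idx, weights in enumerate(lda.components_):
    best = weights.argsort()[::-1][:4]
    words = [index_to_word[int(i)] for i in best]
    top_words.append(words)
    print("topic {}: {}".format(topic_idx, ", ".join(words)))

# Placeholder labels; replace them after looking at the top words.
topic_labels = ["topic {}".format(i) for i in range(n_topics)]

# Score a new document: transform() returns a topic mixture for it.
new_doc = ["the gallery showed a painting about a physics experiment"]
mixture = lda.transform(vectorizer.transform(new_doc))[0]

# Keep the top X labeled topics (here X = 2).
ranked = sorted(zip(topic_labels, mixture), key=lambda pair: pair[1], reverse=True)
for label, share in ranked[:2]:
    print("{} ~ {:.2f}".format(label, share))
```

The topic mixture returned for a document is normalized, so the per-topic shares add up to 1, which is close to the kind of output you described.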

Best,
Sebastian




> On Jun 3, 2018, at 5:03 PM, Amirouche Boubekki <amirouche.boubekki at gmail.com> wrote:
> 
> Héllo,
> 
> I started a natural language processing project a few weeks ago called wikimark (the code is all in wikimark.py)
> 
> Given a text it wants to return a dictionary scoring the input against vital articles categories, e.g.:
> 
> out = wikimark("""Peter Hintjens wrote about the relation between technology and culture. Without using a scientifical tone of state-of-the-art review of the anthroposcene antropology, he gives a fair amount of food for thought. According to Hintjens, technology is doomed to become cheap. As matter of fact, intelligence tools will become more and more accessible which will trigger a revolution to rebalance forces in society.""") 
> 
> for category, score in out: 
>     print('{} ~ {}'.format(category, score))
> 
> The above program would output something like this:
> 
> Art ~ 0.1 
> Science ~ 0.5 
> Society ~ 0.4
> 
> Except not everything went as planned. Mind the fact that in the above example the total is equal to 1, but I could not achieve that at all.
> 
> I am using gensim to compute vectors of paragraphs (doc2vec) and then submit those vectors to svm.SVR in a one-vs-all strategy, i.e., a document is scored 1 if it is in that subcategory and 0 otherwise. At prediction time, the input goes through the same doc2vec pipeline. The program scores each paragraph against the SVR models of the Wikipedia vital article subcategories and gets a value between 0 and 1 for each paragraph. I compute the sum, group by subcategory, and then I have a score per category for the input document.
> 
> It somewhat works. I made a web UI that you can find and test at https://sensimark.com. You can also directly access the
> full API, e.g. https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&all=1
> 
> The output JSON document is a list of category dictionaries where the prediction key is associated with the average of the "prediction" of the subcategories. If you replace &all=1 by &top=5 you might get something else as top categories, e.g. https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&top=10
> 
> or 
> 
> https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&top=5
> 
> I wrote "prediction" with double quotes because the value you see is the result of a formula. Since the predictions I get are rather small, between 0 and 0.015, I apply the following formula:
> value = math.exp(prediction)
> magic = ((value * 100) - 110) * 100
> 
> This is meant to spread the values between -200 and 200. Maybe this is a symptom that my model doesn't work at all.
> 
> Still, the top 10 results are almost always near each other (try with BBC articles on https://sensimark.com). It is only when a regression model is disqualified with a score of 0 that the results are simple to understand. Sadly, I don't have an example at hand to support that claim. You have to believe me.
> 
> Looking at the machine learning map, I just figured that my problem might be a classification problem, except I don't really want to know the class of new documents. I want to know which subjects are dealt with in the document, based on a hierarchical corpus.
> I don't want to guess a hierarchy! I want to know how the document content is spread over the different categories or subcategories.
> 
> I quickly read about multinomial regression; is it something you recommend I use? Or do you have something else in mind?
> 
> Also, it seems I should benchmark / evaluate my model against LDA.
> 
> I am rather new to data science and my math skills are not so fresh. I am mostly looking for ideas on which algorithms, fine-tuning steps, and data science practices I should follow that do not involve writing my own algorithm.
> 
> Thanks in advance!
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


