# [scikit-learn] Using logistic regression with count proportions data

Sean Violante sean.violante at gmail.com
Mon Oct 10 10:04:45 EDT 2016

```
Sorry, yes, there was a misunderstanding.

I meant that for each feature configuration you should pass in two rows (one
for the positive cases and one for the negative), with the sample weight
being the corresponding count for that configuration and class.

The total count matters because one feature combination might occur 10 times
and another 1000 times; proportions alone lose that information.

On Mon, Oct 10, 2016 at 3:48 PM, Raphael C <drraph at gmail.com> wrote:

> On 10 October 2016 at 12:22, Sean Violante <sean.violante at gmail.com>
> wrote:
> > no ( but please check !)
> >
> > sample weights should be the counts for the respective label (0/1)
> >
> > [ I am actually puzzled about the glm help file - proportions loses how
> > often an input data 'row' was present relative to the other - though you
> > could do this by repeating the row 'n' times]
>
> I think we might be talking at cross purposes.
>
> I have a matrix X where each row is a feature vector. I also have an
> array y where y[i] is a real number between 0 and 1. I would like to
> build a regression model that predicts the y values given the X rows.
>
> Now each y[i] value in fact comes from simply counting the number of
> positive labelled elements in a particular set (set i) and dividing by
> the number of elements in that set.  So I can easily fit this into the
> model given by the R package glm by replacing each y[i] value by a
> pair of "Number of positives" and "Number of negatives" (this is case
> 2 in the docs I quoted) or using case 3 which asks for the y[i] plus
> the total number of elements in set i.
>
> I don't see how a single integer for sample_weight[i] would cover this
> information but I am sure I must have misunderstood.  At best it seems
> it could cover the number of positive values but this is missing half
> the information.
>
> Raphael
>
> >
> > On Mon, Oct 10, 2016 at 1:15 PM, Raphael C <drraph at gmail.com> wrote:
> >>
> >> How do I use sample_weight for my use case?
> >>
> >> In my case, is "y" an array of 0s and 1s, and sample_weight then an
> >> array of real numbers between 0 and 1, where I should make sure to set
> >> sample_weight[i] = 0 when y[i] = 0?
> >>
> >> Raphael
> >>
> >> On 10 October 2016 at 12:08, Sean Violante <sean.violante at gmail.com>
> >> wrote:
> >> > should be the sample_weight parameter in fit
> >> >
> >> >
> >> > http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
> >> >
> >> > On Mon, Oct 10, 2016 at 1:03 PM, Raphael C <drraph at gmail.com> wrote:
> >> >>
> >> >> I just noticed this about the glm package in R.
> >> >> http://stats.stackexchange.com/a/26779/53128
> >> >>
> >> >> "
> >> >> The glm function in R allows 3 ways to specify the formula for a
> >> >> logistic regression model.
> >> >>
> >> >> The most common is that each row of the data frame represents a single
> >> >> observation and the response variable is either 0 or 1 (or a factor
> >> >> with 2 levels, or other variable with only 2 unique values).
> >> >>
> >> >> Another option is to use a 2 column matrix as the response variable
> >> >> with the first column being the counts of 'successes' and the second
> >> >> column being the counts of 'failures'.
> >> >>
> >> >> You can also specify the response as a proportion between 0 and 1,
> >> >> then specify another column as the 'weight' that gives the total
> >> >> number that the proportion is from (so a response of 0.3 and a weight
> >> >> of 10 is the same as 3 'successes' and 7 'failures')."
> >> >>
> >> >> Either of the last two options would do for me.  Does scikit-learn
> >> >> support either of these last two options?
> >> >>
> >> >> Raphael
> >> >>
> >> >> On 10 October 2016 at 11:55, Raphael C <drraph at gmail.com> wrote:
> >> >> > I am trying to perform regression where my dependent variable is
> >> >> > constrained to be between 0 and 1. This constraint comes from the
> >> >> > fact that it represents a count proportion, that is, counts in some
> >> >> > category divided by a total count.
> >> >> >
> >> >> > In the literature it seems that one common way to tackle this is to
> >> >> > use logistic regression. However, it appears that in scikit learn
> >> >> > logistic regression is only available as a classifier
> >> >> >
> >> >> >
> >> >> > (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
> >> >> > Is that right?
> >> >> >
> >> >> > Is there another way to perform regression using scikit learn where
> >> >> > the dependent variable is a count proportion?
> >> >> >
> >> >> > Thanks for any help.
> >> >> >
> >> >> > Raphael
> >> >> _______________________________________________
> >> >> scikit-learn mailing list
> >> >> scikit-learn at python.org
> >> >> https://mail.python.org/mailman/listinfo/scikit-learn
```
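
The suggestion at the top of this message (two rows per feature configuration, weighted by the count for each class) can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `expand_proportions`, the toy data, and the choice of a large `C` to approximate an unpenalized fit like R's `glm` are not from the thread. The only scikit-learn features relied on are `LogisticRegression` and the `sample_weight` argument of `fit`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def expand_proportions(X, y_prop, totals):
    """Turn (proportion, total) pairs into weighted 0/1 rows.

    Hypothetical helper: each feature row is emitted twice, once with
    label 1 weighted by the number of positives and once with label 0
    weighted by the number of negatives.
    """
    X = np.asarray(X, dtype=float)
    y_prop = np.asarray(y_prop, dtype=float)
    totals = np.asarray(totals, dtype=float)

    n_pos = y_prop * totals   # e.g. 0.3 * 10 -> 3 successes
    n_neg = totals - n_pos    # e.g. 10 - 3  -> 7 failures

    X2 = np.vstack([X, X])
    y2 = np.concatenate([np.ones(len(X), dtype=int),
                         np.zeros(len(X), dtype=int)])
    w2 = np.concatenate([n_pos, n_neg])

    # Drop rows whose count is zero; they carry no information.
    keep = w2 > 0
    return X2[keep], y2[keep], w2[keep]


# Made-up data: three feature configurations seen 10, 100 and 1000 times.
X = np.array([[0.0], [1.0], [2.0]])
y_prop = np.array([0.3, 0.5, 0.9])
totals = np.array([10, 100, 1000])

X2, y2, w2 = expand_proportions(X, y_prop, totals)

# Large C weakens the default L2 penalty (an assumption made here so the
# fit is closer to an unpenalized glm; it is not required).
model = LogisticRegression(C=1e6, max_iter=1000)
model.fit(X2, y2, sample_weight=w2)

# Predicted proportion of positives for each original configuration.
pred = model.predict_proba(X)[:, 1]
print(pred)
```

Because the weights are raw counts rather than proportions, the configuration seen 1000 times influences the fit far more than the one seen 10 times, which is the point about the total count made above.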
