<div dir="ltr"><div>Yes, the output CSR representation requires:</div><div>1 (dtype) value per entry</div><div>1 int32 per entry</div><div>1 int32 per row</div><div><br></div><div>The intermediate COO representation requires:</div><div>1 (dtype) value per entry</div><div>2 int32 per entry</div><div><br></div>So as long as the conversion from COO to CSR is done over the whole data at once, both representations are held in memory together, and the peak will be roughly 5x the input size, which is exactly what you are experiencing.<div><br></div><div>The CategoricalEncoder currently available in the development version of scikit-learn does not have this problem, but might be slower because it handles non-integer categories. It may also disappear soon and be merged into OneHotEncoder (see PR #10523).</div><div><br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On 6 February 2018 at 13:53, Sarah Wait Zaranek <span dir="ltr"><<a href="mailto:sarah.zaranek@gmail.com" target="_blank">sarah.zaranek@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_default" style="font-size:small">Yes, of course.  What I mean is that I start out with 19 Gigs or so (initial matrix size), it balloons to 100 Gigs *within the encoder function*, and it returns 28 Gigs (sparse one-hot matrix size).  These numbers aren't exact, but you can see my point.</div><div class="gmail_default" style="font-size:small"><br>Cheers,<br>Sarah</div></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Feb 5, 2018 at 9:50 PM, Joel Nothman <span dir="ltr"><<a href="mailto:joel.nothman@gmail.com" target="_blank">joel.nothman@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">OneHotEncoder will not magically reduce the size of your input. 
It will necessarily use more memory than the input data as long as we are storing the results in scipy.sparse matrices. The sparse representation will be less expensive than the dense representation, but it won't be less expensive than the input.</div><div class="m_8196565409696887413HOEnZb"><div class="m_8196565409696887413h5"><div class="gmail_extra"><br><div class="gmail_quote">On 6 February 2018 at 13:24, Sarah Wait Zaranek <span dir="ltr"><<a href="mailto:sarah.zaranek@gmail.com" target="_blank">sarah.zaranek@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_default" style="font-size:small">Hi Joel -</div><div class="gmail_default" style="font-size:small"><br></div><div class="gmail_default" style="font-size:small">I am also seeing a huge memory overhead when calling the OneHotEncoder.  I have worked around it by splitting my matrix into 4-5 smaller matrices (by columns), encoding each one, and then concatenating the results.  But I am seeing upwards of 100 Gigs of overhead. Should I file a bug report, or is this to be expected?</div><div class="gmail_default" style="font-size:small"><br>Cheers,<br>Sarah</div></div><div class="m_8196565409696887413m_-6563035992571184875HOEnZb"><div class="m_8196565409696887413m_-6563035992571184875h5"><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Feb 5, 2018 at 1:05 AM, Sarah Wait Zaranek <span dir="ltr"><<a href="mailto:sarah.zaranek@gmail.com" target="_blank">sarah.zaranek@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_default" style="font-size:small">Great.  
Thank you for all your help.</div><div class="gmail_default" style="font-size:small"><br></div><div class="gmail_default" style="font-size:small">Cheers,<br>Sarah</div></div><div class="m_8196565409696887413m_-6563035992571184875m_6274150491252608567HOEnZb"><div class="m_8196565409696887413m_-6563035992571184875m_6274150491252608567h5"><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Feb 5, 2018 at 12:56 AM, Joel Nothman <span dir="ltr"><<a href="mailto:joel.nothman@gmail.com" target="_blank">joel.nothman@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">If you specify n_values=[list_of_vals_for_column1, list_of_vals_for_column2], you should be able to engineer the output you want.</div><div class="m_8196565409696887413m_-6563035992571184875m_6274150491252608567m_2939353897352423284HOEnZb"><div class="m_8196565409696887413m_-6563035992571184875m_6274150491252608567m_2939353897352423284h5"><div class="gmail_extra"><br><div class="gmail_quote">On 5 February 2018 at 16:31, Sarah Wait Zaranek <span dir="ltr"><<a href="mailto:sarah.zaranek@gmail.com" target="_blank">sarah.zaranek@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_default" style="font-size:small">If I use the n+1 approach, then I get the correct matrix, except with extra columns of zeros:</div><div class="gmail_default" style="font-size:small"><br></div><div class="gmail_default" style="font-size:small"><div class="gmail_default">>>> test</div><div class="gmail_default">array([[0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1.],</div><div class="gmail_default">       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.],</div><div class="gmail_default">       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.],</div><div class="gmail_default">       [0., 
1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.]])</div><div><br></div></div></div><div class="m_8196565409696887413m_-6563035992571184875m_6274150491252608567m_2939353897352423284m_4766008475484608384HOEnZb"><div class="m_8196565409696887413m_-6563035992571184875m_6274150491252608567m_2939353897352423284m_4766008475484608384h5"><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Feb 5, 2018 at 12:25 AM, Sarah Wait Zaranek <span dir="ltr"><<a href="mailto:sarah.zaranek@gmail.com" target="_blank">sarah.zaranek@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_default" style="font-size:small">Hi Joel -</div><div class="gmail_default" style="font-size:small"><br></div><div class="gmail_default" style="font-size:small">Conceptually, that makes sense.  But when I assign n_values, I can't make the result match what you get when you don't specify them. See below.  I used the number of unique levels per column.</div><div class="gmail_default" style="font-size:small"><br></div><div class="gmail_default" style="font-size:small"><div class="gmail_default">>>> enc = OneHotEncoder(sparse=False)</div><div class="gmail_default">>>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])</div><div class="gmail_default">>>> test</div><div class="gmail_default">array([[0., 0., 1., 1., 0., 0., 0., 0., 1.],</div><div class="gmail_default">       [0., 1., 0., 0., 1., 1., 0., 0., 0.],</div><div class="gmail_default">       [1., 0., 0., 0., 1., 0., 1., 0., 0.],</div><div class="gmail_default">       [0., 1., 0., 1., 0., 0., 0., 1., 0.]])</div><div class="gmail_default">>>> enc = OneHotEncoder(sparse=False, n_values=[3,2,4])</div><div class="gmail_default">>>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])</div><div class="gmail_default">>>> test</div><div class="gmail_default">array([[0., 0., 0., 1., 0., 0., 0., 1., 
1.],</div><div class="gmail_default">       [0., 1., 0., 0., 0., 2., 0., 0., 0.],</div><div class="gmail_default">       [1., 0., 0., 0., 0., 1., 1., 0., 0.],</div><div class="gmail_default">       [0., 1., 0., 1., 0., 0., 0., 1., 0.]])</div></div><div class="gmail_default" style="font-size:small"><br></div><div class="gmail_default" style="font-size:small">Cheers,<br>Sarah</div></div><div class="gmail_extra"><br><div class="gmail_quote"><span>On Mon, Feb 5, 2018 at 12:02 AM, Joel Nothman <span dir="ltr"><<a href="mailto:joel.nothman@gmail.com" target="_blank">joel.nothman@gmail.com</a>></span> wrote:<br></span><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span><div dir="ltr"><div class="gmail_extra">If each input column is encoded as a value from 0 to (number of possible values for that column - 1), then n_values for that column should be the highest value + 1, which is also the number of levels per column. Does that make sense?</div><div class="gmail_extra"><br></div><div class="gmail_extra">Actually, I've realised there's a somewhat slow and unnecessary bit of code in the one-hot encoder: the step where the COO matrix is converted to CSR. I suspect this was done because most of our ML algorithms perform better on CSR, or else to maintain backwards compatibility with an earlier implementation.</div></div>
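<div>The rule above (n_values for a column is its highest value + 1) can be sketched in plain Python. This is only an illustration on the toy data from this thread, not scikit-learn's implementation; the helper names n_values_per_column and one_hot_dense are hypothetical.</div>

```python
# Toy data from this thread: 4 samples, 3 integer-coded categorical columns.
X = [[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]]


def n_values_per_column(rows):
    """Return highest value + 1 for each column (hypothetical helper)."""
    return [max(col) + 1 for col in zip(*rows)]


def one_hot_dense(rows, n_values):
    """Dense one-hot encoding: input column j gets a block of n_values[j] output columns."""
    # Starting output index of each column's block.
    offsets = [sum(n_values[:j]) for j in range(len(n_values))]
    width = sum(n_values)
    out = []
    for row in rows:
        encoded = [0.0] * width
        for j, value in enumerate(row):
            encoded[offsets[j] + value] = 1.0
        out.append(encoded)
    return out


n_values = n_values_per_column(X)
encoded = one_hot_dense(X, n_values)
print(n_values)         # [8, 3, 4]
print(len(encoded[0]))  # 15
```

<div>With these values the output has 8 + 3 + 4 = 15 columns, including all-zero columns for category values that never occur in the data (for example, values 2 to 6 in the first column).</div>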

<br></span><span>______________________________<wbr>_________________<br>

scikit-learn mailing list<br>

<a href="mailto:scikit-learn@python.org" target="_blank">scikit-learn@python.org</a><br>

<a href="https://mail.python.org/mailman/listinfo/scikit-learn" rel="noreferrer" target="_blank">https://mail.python.org/mailma<wbr>n/listinfo/scikit-learn</a><br>

<br></span></blockquote></div><br></div>

</blockquote></div><br></div>

</div></div><br>

<br></blockquote></div><br></div>

</div></div><br>

<br></blockquote></div><br></div>

</div></div></blockquote></div><br></div>

</div></div><br>

<br></blockquote></div><br></div>

</div></div><br>

<br></blockquote></div><br></div>

</div></div><br>

<br></blockquote></div><br></div>
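<div>As a rough sketch of the memory accounting at the top of this thread, the per-entry costs can be turned into a back-of-the-envelope estimate. The function names are hypothetical, and the sketch assumes that the input and output values have the same byte size, that indices are int32, and that the COO and CSR copies are both alive at the peak.</div>

```python
INT32_BYTES = 4


def coo_bytes(nnz, value_bytes):
    """COO: one value plus a row index and a column index per stored entry."""
    return nnz * (value_bytes + 2 * INT32_BYTES)


def csr_bytes(nnz, n_rows, value_bytes):
    """CSR: one value and one column index per entry, plus a row-pointer array."""
    return nnz * (value_bytes + INT32_BYTES) + (n_rows + 1) * INT32_BYTES


def peak_ratio(n_rows, n_cols, value_bytes):
    """Peak (COO + CSR) memory divided by the dense input size (hypothetical helper)."""
    nnz = n_rows * n_cols  # one-hot output stores one entry per input cell
    input_bytes = nnz * value_bytes
    peak = coo_bytes(nnz, value_bytes) + csr_bytes(nnz, n_rows, value_bytes)
    return peak / input_bytes


# Assumed example shape: 1,000,000 rows, 50 categorical columns, 4-byte values.
print(round(peak_ratio(1_000_000, 50, 4), 2))  # 5.02
```

<div>With 4-byte values this gives about 5x the input size, in line with the estimate above. With 8-byte values the same arithmetic gives a smaller ratio, so the exact factor depends on the dtypes involved.</div>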