[scikit-learn] Fwd: Loading file in libsvm format

Thu Sep 8 14:40:26 EDT 2016

---------- Forwarded message ----------
From: klo uo <klonuo at gmail.com>
Date: Thu, Sep 8, 2016 at 8:25 PM
Subject: Loading file in libsvm format
To: scikit-learn-general at lists.sourceforge.net

Hi,

I produced a file in libsvm format:

    <label> <index1>:<value1> <index2>:<value2> ...

with this content:

    6284 576:1 884:1 2482:1 4279:1 5765:1 184552:1 661512:1 699842:1
    2259 1669:1 5711528:6
    2822 5765159:1
    ...

Each label is a document_id, and each index:value pair is a term_id and its term count.

The file has about 83K labels and 40K unique terms (1.2M index:value
pairs overall).
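
Counts like these can be checked directly from the file without sklearn. Below is a minimal sketch, not a definitive implementation: `summarize_libsvm` is a hypothetical helper name, the three sample lines above stand in for the real file, and it assumes plain `index:value` tokens (no `qid:` fields). Besides the row and distinct-index counts, it also reports the largest index, which can be far larger than the number of distinct indices when the term ids are not contiguous.

```python
# Sample rows copied from the message; the real file would be read from disk.
SAMPLE = """\
6284 576:1 884:1 2482:1 4279:1 5765:1 184552:1 661512:1 699842:1
2259 1669:1 5711528:6
2822 5765159:1
"""


def summarize_libsvm(lines):
    """Return (row count, distinct index count, largest index).

    Hypothetical helper: assumes every token after the label has the
    form index:value with an integer index.
    """
    n_rows = 0
    indices = set()
    for line in lines:
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        n_rows += 1
        # The first token is the label; the rest are index:value pairs.
        for pair in line.split()[1:]:
            index, _value = pair.split(":", 1)
            indices.add(int(index))
    max_index = max(indices) if indices else 0
    return n_rows, len(indices), max_index


if __name__ == "__main__":
    n_rows, n_unique, max_index = summarize_libsvm(SAMPLE.splitlines())
    print("rows:", n_rows)
    print("distinct indices:", n_unique)
    print("largest index:", max_index)
```

For the real file, the same function could be called as `summarize_libsvm(open('libsim.txt'))`.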

When I load this file in sklearn:

    from sklearn.datasets import load_svmlight_file
    X, y = load_svmlight_file('libsim.txt')

I get X with shape (82448, 6092168).

I don't see any reason why I am getting 6M features.
Can someone explain?

Thanks