[scikit-learn] merging the predicted labels with original dataframe

Tom Augspurger tom.augspurger88 at gmail.com
Thu Jul 20 12:19:47 EDT 2017


Something like

    your_df['prediction'] = pd.Series(clf.predict(X_test), index=X_test.index)

should handle all the alignment.
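
One caveat with the line above: in the snippet quoted below, X_test appears to be a SciPy sparse matrix, which has no .index attribute. A minimal sketch of one way around that is to pass the DataFrame's index labels through the split together with X and Y. The toy data, the `rt` feature, the LogisticRegression stand-in for the MLPClassifier, and the names `idx_train`, `idx_test` and `scored` are all assumptions made for illustration, not part of the original code.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-ins for M, X and Y from the original post (values are made up).
M = pd.DataFrame({
    "rt": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "act": ["OBSERVED", "BLOCKED"] * 5,
})
X = csr_matrix(M[["rt"]].values)  # sparse features, so there is no .index
Y = M["act"].values

# Split the index labels together with X and Y so that the test rows
# can still be identified after the split.
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, Y, M.index.to_numpy(), test_size=0.2, random_state=0
)

# Any classifier works here; LogisticRegression is only a small stand-in.
clf = LogisticRegression().fit(X_train, y_train)

# Assigning a Series aligns on index labels, not on position, so each
# prediction lands on the row it was computed for.
M["prediction"] = pd.Series(clf.predict(X_test), index=idx_test)

# Rows used for training have no prediction (NaN); select only the scored rows.
scored = M.loc[idx_test]
```

With this, `scored` holds the original columns next to the predicted label for the test rows only, which is probably the frame you want for the follow-up investigation.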

On Thu, Jul 20, 2017 at 11:04 AM, Ruchika Nayyar <ruchika.work at gmail.com>
wrote:

> The original dataset contains both training and testing rows; I have
> predictions only on the testing dataset. If I do what you suggest,
> will it preserve indexing?
>
> Thanks,
> Ruchika
>
>
> On Thu, Jul 20, 2017 at 11:37 AM, Julio Antonio Soto de Vicente <
> julio at esbet.es> wrote:
>
>> Hi Ruchika,
>>
>> The predictions output by all sklearn models are just 1-d NumPy
>> arrays, so it should be trivial to add them to any existing DataFrame:
>>
>> your_df["prediction"] = clf.predict(X_test)
>>
>> --
>> Julio
>>
>> On 20 Jul 2017, at 17:23, Ruchika Nayyar <ruchika.work at gmail.com>
>> wrote:
>>
>> Hi Scikit-learn Users,
>>
>> I am analyzing some proxy logs to use Machine learning to classify the
>> events recorded as either "OBSERVED" or "BLOCKED". This is a little snippet
>> of my code:
>> The input file is a csv with tokenized string fields.
>>
>> **************
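>> # imports assumed by this snippet (tts is an alias for train_test_split)
>> import pandas as pd
>> from scipy.sparse import csr_matrix, hstack
>> from sklearn.feature_extraction.text import TfidfVectorizer
>> from sklearn.model_selection import train_test_split as tts
>> from sklearn.neural_network import MLPClassifier
>> from sklearn.preprocessing import StandardScaler
>>
>> # load the file
>> M = pd.read_csv("output100k.csv").fillna('')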
>>
>> # define the fields to use
>> min_df = 0.001
>> max_df = .7
>> TxtCols = ['request__tokens', 'requestClientApplication__tokens',
>>            'destinationZoneURI__tokens','cs-categories__tokens',
>>            'fileType__tokens', 'requestMethod__tokens','tcp_status1',
>>            'app','tcp_status2','dhost'
>>           ]
>> NumCols = ['rt', 'out', 'in', 'time-taken','rt_length', 'dt_length']
>>
>> # vectorize the fields
>> TfidfModels = [TfidfVectorizer(min_df=min_df, max_df=max_df).fit(M[t]) for t in TxtCols]
>>
>> # define the columns of sparse matrix
>> X = hstack([m.transform(M[n].fillna('')) for m, n in zip(TfidfModels, TxtCols)] +
>>            [csr_matrix(pd.to_numeric(M[n]).fillna(-1).values).T for n in NumCols])
>>
>> # target variable
>> Y = M.act.values
>>
>> ## Define train/test parts and scale them
>> X_train, X_test, y_train, y_test = tts(X, Y, test_size=0.2)
>> scaler = StandardScaler(with_mean=False, with_std=True)
>> scaler.fit(X_train)
>> X_train=scaler.transform(X_train)
>> X_test=scaler.transform(X_test)
>>
>>
>> # define the model and train
>> clf = MLPClassifier(activation='logistic', solver='lbfgs').fit(X_train, y_train)
>> # use the model to predict on X_test and convert into a data frame
>> df=pd.DataFrame(clf.predict(X_test))
>>
>> **
>>
>> 199845  OBSERVED
>> 199846  OBSERVED
>>
>> [199847 rows x 1 columns]>
>>
>> **
>>
>> Now at the end I have a DataFrame with 20K entries with just one column
>> "Label". How do I connect it to the main dataframe M, since I want to do
>> some investigations on this outcome?
>>
>> Any help?
>>
>> Thanks,
>> Ruchika
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn