[scikit-learn] merging the predicted labels with original dataframe
Ruchika Nayyar
ruchika.work at gmail.com
Thu Jul 20 12:30:24 EDT 2017
Hi Tom
This was also the first thing that came to my mind, but I thought that since
your_df is X_train+X_test,
it may complain that the values do not match the given indices.
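
A minimal sketch to check this, under stated assumptions: the toy DataFrame and the DummyClassifier are stand-ins for the real data and the MLP, not code from this thread. Because X in the original snippet is a sparse matrix with no .index of its own, the sketch splits the row labels alongside X and y instead of relying on X_test.index. Assigning a plain array of a different length to a column raises a ValueError in pandas, whereas a Series carrying the test labels is aligned by index:

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in for the full frame (hypothetical data, not the proxy logs).
your_df = pd.DataFrame(
    {"rt": np.arange(10), "act": ["OBSERVED", "BLOCKED"] * 5},
    index=np.arange(100, 110),  # non-default index on purpose
)
X = your_df[["rt"]].to_numpy()  # plain array: carries no row labels
y = your_df["act"].to_numpy()

# Split the row labels together with X and y so test rows stay traceable.
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, your_df.index.to_numpy(), test_size=0.2, random_state=0
)

# Stand-in model; any fitted classifier with .predict would do here.
clf = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Column assignment aligns on index labels: test rows receive a prediction,
# every other row is filled with NaN instead of raising an error.
your_df["prediction"] = pd.Series(clf.predict(X_test), index=idx_test)

print(your_df[your_df["prediction"].notna()])
```

Rows that were only used for training end up with NaN in the new column, so the test rows can be selected afterwards with notna().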
Thanks,
Ruchika
On Thu, Jul 20, 2017 at 12:19 PM, Tom Augspurger
<tom.augspurger88 at gmail.com> wrote:
> Something like
>
> your_df['prediction'] = pd.Series(clf.predict(X_test),
>                                   index=X_test.index)
>
> should handle all the alignment.
>
> On Thu, Jul 20, 2017 at 11:04 AM, Ruchika Nayyar <ruchika.work at gmail.com>
> wrote:
>
>> The original dataset contains both training and testing data; I have
>> predictions only on the testing dataset. If I do what you suggest,
>> will it preserve indexing?
>>
>> Thanks,
>> Ruchika
>>
>>
>> On Thu, Jul 20, 2017 at 11:37 AM, Julio Antonio Soto de Vicente <
>> julio at esbet.es> wrote:
>>
>>> Hi Ruchika,
>>>
>>> The predictions returned by all sklearn models are just 1-d NumPy
>>> arrays, so it should be trivial to add them to any existing DataFrame:
>>>
>>> your_df["prediction"] = clf.predict(X_test)
>>>
>>> --
>>> Julio
>>>
>>> On 20 Jul 2017, at 17:23, Ruchika Nayyar <ruchika.work at gmail.com>
>>> wrote:
>>>
>>> Hi Scikit-learn Users,
>>>
>>> I am analyzing some proxy logs to use Machine learning to classify the
>>> events recorded as either "OBSERVED" or "BLOCKED". This is a little snippet
>>> of my code:
>>> The input file is a csv with tokenized string fields.
>>>
>>> **************
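>>> # imports assumed by the snippet below (not shown in the original post;
>>> # "tts" is taken to be an alias for train_test_split)
>>> import pandas as pd
>>> from scipy.sparse import csr_matrix, hstack
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.model_selection import train_test_split as tts
>>> from sklearn.neural_network import MLPClassifier
>>> from sklearn.preprocessing import StandardScaler
>>>
>>> # load the file
>>> M = pd.read_csv("output100k.csv").fillna('')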
>>>
>>> # define the fields to use
>>> min_df = 0.001
>>> max_df = .7
>>> TxtCols = ['request__tokens', 'requestClientApplication__tokens',
>>> 'destinationZoneURI__tokens','cs-categories__tokens',
>>> 'fileType__tokens', 'requestMethod__tokens','tcp_status1',
>>> 'app','tcp_status2','dhost'
>>> ]
>>> NumCols = ['rt', 'out', 'in', 'time-taken','rt_length', 'dt_length']
>>>
>>> # vectorize the fields
>>> TfidfModels = [TfidfVectorizer(min_df=min_df, max_df=max_df).fit(M[t])
>>>                for t in TxtCols]
>>>
>>> # define the columns of sparse matrix
>>> X = hstack(
>>>     [m.transform(M[n].fillna('')) for m, n in zip(TfidfModels, TxtCols)]
>>>     + [csr_matrix(pd.to_numeric(M[n]).fillna(-1).values).T for n in NumCols]
>>> )
>>>
>>> # target variable
>>> Y = M.act.values
>>>
>>> ## Define train/test parts and scale them
>>> X_train, X_test, y_train, y_test = tts(X, Y, test_size=0.2)
>>> scaler = StandardScaler(with_mean=False, with_std=True)
>>> scaler.fit(X_train)
>>> X_train=scaler.transform(X_train)
>>> X_test=scaler.transform(X_test)
>>>
>>>
>>> # define the model and train
>>> clf = MLPClassifier(activation='logistic',
>>>                     solver='lbfgs').fit(X_train, y_train)
>>> # use the model to predict on X_test and convert into a data frame
>>> df=pd.DataFrame(clf.predict(X_test))
>>>
>>> **
>>>
>>> 199845 OBSERVED
>>> 199846 OBSERVED
>>>
>>> [199847 rows x 1 columns]>
>>>
>>> **
>>>
>>> Now at the end I have a DataFrame with 20K entries and just one column,
>>> "Label". How do I connect it to the main DataFrame M, since I want to do
>>> some investigations on this outcome?
>>>
>>> Any help?
>>>
>>> Thanks,
>>> Ruchika
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn