[scikit-learn] Text classification of large dataset
Ranjana Girish
ranjanagirish30 at gmail.com
Tue Dec 19 09:38:12 EST 2017
Hi all,

I am doing text classification: I have around 10 million documents to classify into around 7,000 categories.

Below is the code I am using:
# Importing the libraries
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
import re
from sklearn.feature_extraction.text import CountVectorizer
import random
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
from sklearn import feature_selection
from scipy.sparse import csr_matrix
from scipy import sparse
import sys
from sklearn import preprocessing
import numpy as np
import pickle

sys.setrecursionlimit(200000000)
random.seed(20000)

trainset1 = pd.read_csv("trainsetgrt500sample10.csv", encoding="ISO-8859-1")
trainset2 = pd.read_csv("trainsetlessequal500.csv", encoding="ISO-8859-1")
dataset = pd.concat([trainset1, trainset2])
dataset = dataset.dropna()

# Keep letters only, then strip digits and lower-case
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(r'[^a-zA-Z]', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(r'[\d]', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.lower()

del trainset1
del trainset2

# Remove stopwords and collapse repeated whitespace
stop = stopwords.words('english')
lemmatizer = WordNetLemmatizer()
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(
    r'\b(' + r'|'.join(stop) + r')\b\s*', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(r'\s\s+', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].apply(word_tokenize)

# Lemmatize each token once per part-of-speech tag
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
POS_LIST = [NOUN, VERB, ADJ, ADV]
for tag in POS_LIST:
    dataset['ProductDescription'] = dataset['ProductDescription'].apply(
        lambda x: list(set([lemmatizer.lemmatize(item, tag) for item in x])))
dataset['ProductDescription'] = dataset['ProductDescription'].apply(lambda x: " ".join(x))

countvec = CountVectorizer(min_df=0.00008)
documenttermmatrix = countvec.fit_transform(dataset['ProductDescription'])
documenttermmatrix.shape
column = countvec.get_feature_names()

filename1 = 'columnnamessample10mastermerge.sav'
pickle.dump(column, open(filename1, 'wb'))

y_train = dataset['classpath'].tolist()
labels_train = preprocessing.LabelEncoder()
labels_train.fit(y_train)
y1_train = labels_train.transform(y_train)

del dataset
del countvec
del column

clf = MultinomialNB()
# Fit on the encoded labels (the originally posted code fit on the raw
# y_train and left y1_train unused)
model = clf.fit(documenttermmatrix, y1_train)
filename2 = 'modelnaivebayessample10withfs.sav'
pickle.dump(model, open(filename2, 'wb'))
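As a sanity check on where the memory goes: the matrix CountVectorizer returns is a scipy CSR sparse matrix, which stores just three arrays (values, column indices, row pointers), and its footprint can be read off directly. A minimal sketch, with a toy matrix standing in for documenttermmatrix:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy document-term matrix standing in for `documenttermmatrix`
X = csr_matrix(np.array([[0, 2, 0, 1],
                         [1, 0, 0, 0],
                         [0, 0, 3, 0]]))

# CSR storage = non-zero values + their column indices + one row pointer per row
footprint_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(X.shape, X.nnz, footprint_bytes)
```

The sparse matrix itself is rarely the problem at this scale; estimators that require dense input (GaussianNB, for instance, does not accept sparse matrices) are usually what exhausts memory.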
I am using a system with 128 GB RAM.

As I was unable to train on all 10 million documents, I did stratified sampling, which reduced the training set to 2.3 million documents. I was still unable to train on those 2.3 million: I got a memory error with random forest (n_estimators=30), Naive Bayes, and SVM.

I am stuck. Can anyone tell me whether there is a memory leak in my code, and how to use a system with 128 GB RAM effectively?
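For reference, one pattern I have seen suggested for this situation is out-of-core learning: a stateless HashingVectorizer (no vocabulary held in memory) combined with a classifier that supports partial_fit, so the corpus is vectorized and fitted one chunk at a time. A sketch with a toy corpus standing in for chunks read via pd.read_csv(..., chunksize=...); the chunk contents and label names are made up:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stateless vectorizer; alternate_sign=False keeps features non-negative,
# which MultinomialNB requires
vec = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = MultinomialNB()

# partial_fit needs the full label set up front
classes = np.array(["electronics", "kitchen"])

# Toy chunks standing in for pd.read_csv(..., chunksize=100000)
chunks = [
    (["cheap usb cable", "steel water bottle"], ["electronics", "kitchen"]),
    (["hdmi cable gold plated", "glass bottle set"], ["electronics", "kitchen"]),
]

for texts, labels in chunks:
    X = vec.transform(texts)            # vectorize only this chunk
    clf.partial_fit(X, labels, classes=classes)

pred = clf.predict(vec.transform(["usb hdmi adapter"]))
```

Since the hashed matrix for each chunk is discarded after partial_fit, peak memory is bounded by the chunk size rather than the corpus size.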
Thanks
Ranjana