[scikit-learn] Text classification of large dataset
Ranjana Girish
ranjanagirish30 at gmail.com
Tue Dec 19 09:38:12 EST 2017
Hi all,

I am doing text classification: I have around 10 million documents to classify into around 7,000 categories.

Below is the code I am using:
# Importing the libraries
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
import re
from sklearn.feature_extraction.text import CountVectorizer
import random
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
from sklearn import feature_selection
from scipy.sparse import csr_matrix
from scipy import sparse
import sys
from sklearn import preprocessing
import numpy as np
import pickle

sys.setrecursionlimit(200000000)
random.seed(20000)

trainset1 = pd.read_csv("trainsetgrt500sample10.csv", encoding="ISO-8859-1")
trainset2 = pd.read_csv("trainsetlessequal500.csv", encoding="ISO-8859-1")
dataset = pd.concat([trainset1, trainset2])
dataset = dataset.dropna()

# Keep letters only, then strip digits and lower-case
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(r'[^a-zA-Z]', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(r'[\d]', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.lower()

del trainset1
del trainset2

# Remove stopwords and collapse repeated whitespace
stop = stopwords.words('english')
lemmatizer = WordNetLemmatizer()
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(
    r'\b(' + r'|'.join(stop) + r')\b\s*', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(r'\s\s+', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].apply(word_tokenize)

# Lemmatize each token once per part-of-speech tag
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
POS_LIST = [NOUN, VERB, ADJ, ADV]
for tag in POS_LIST:
    dataset['ProductDescription'] = dataset['ProductDescription'].apply(
        lambda x: list(set([lemmatizer.lemmatize(item, tag) for item in x])))
dataset['ProductDescription'] = dataset['ProductDescription'].apply(lambda x: " ".join(x))

countvec = CountVectorizer(min_df=0.00008)
documenttermmatrix = countvec.fit_transform(dataset['ProductDescription'])
documenttermmatrix.shape
column = countvec.get_feature_names()

filename1 = 'columnnamessample10mastermerge.sav'
pickle.dump(column, open(filename1, 'wb'))

y_train = dataset['classpath'].tolist()
labels_train = preprocessing.LabelEncoder()
labels_train.fit(y_train)
y1_train = labels_train.transform(y_train)

del dataset
del countvec
del column

clf = MultinomialNB()
# Fit on the encoded labels (the originally posted code fit on the raw
# y_train and left y1_train unused)
model = clf.fit(documenttermmatrix, y1_train)
filename2 = 'modelnaivebayessample10withfs.sav'
pickle.dump(model, open(filename2, 'wb'))
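As a sanity check on where the memory goes: the matrix CountVectorizer returns is a scipy CSR sparse matrix, which stores just three arrays (values, column indices, row pointers), and its footprint can be read off directly. A minimal sketch, with a toy matrix standing in for documenttermmatrix:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy document-term matrix standing in for `documenttermmatrix`
X = csr_matrix(np.array([[0, 2, 0, 1],
                         [1, 0, 0, 0],
                         [0, 0, 3, 0]]))

# CSR storage = non-zero values + their column indices + one row pointer per row
footprint_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(X.shape, X.nnz, footprint_bytes)
```

The sparse matrix itself is rarely the problem at this scale; estimators that require dense input (GaussianNB, for instance, does not accept sparse matrices) are usually what exhausts memory.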
I am using a system with 128 GB RAM.

As I was unable to train on all 10 million documents, I did stratified sampling, which reduced the training set to 2.3 million documents. I was still unable to train on those 2.3 million: I got a memory error with random forest (n_estimators=30), Naive Bayes, and SVM.

I am stuck. Can anyone tell me whether there is a memory leak in my code, and how to use a system with 128 GB RAM effectively?
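For reference, one pattern I have seen suggested for this situation is out-of-core learning: a stateless HashingVectorizer (no vocabulary held in memory) combined with a classifier that supports partial_fit, so the corpus is vectorized and fitted one chunk at a time. A sketch with a toy corpus standing in for chunks read via pd.read_csv(..., chunksize=...); the chunk contents and label names are made up:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stateless vectorizer; alternate_sign=False keeps features non-negative,
# which MultinomialNB requires
vec = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = MultinomialNB()

# partial_fit needs the full label set up front
classes = np.array(["electronics", "kitchen"])

# Toy chunks standing in for pd.read_csv(..., chunksize=100000)
chunks = [
    (["cheap usb cable", "steel water bottle"], ["electronics", "kitchen"]),
    (["hdmi cable gold plated", "glass bottle set"], ["electronics", "kitchen"]),
]

for texts, labels in chunks:
    X = vec.transform(texts)            # vectorize only this chunk
    clf.partial_fit(X, labels, classes=classes)

pred = clf.predict(vec.transform(["usb hdmi adapter"]))
```

Since the hashed matrix for each chunk is discarded after partial_fit, peak memory is bounded by the chunk size rather than the corpus size.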
Thanks
Ranjana