By Mehdi CHEBBAH
Table Of Contents

- Introduction
- Working environment
  - Anaconda
  - Spyder
  - Scikit-learn
  - Pandas
- Building the model
  - Phase I: Data-set preparation
  - Phase II: Training
  - Phase III: Validation
    - The confusion matrix
    - The accuracy
  - Phase IV: Trial
- Conclusion
- Bibliography & Webography
In what follows, we will try to build a model that classifies comments (in Arabic) on a given product into 3 classes: positive, negative, and mixed.
To perform the classification, we will use the Naïve Bayes algorithm, a probabilistic classifier based on Bayes' theorem. This method is among the classic methods most used for sentiment analysis and, more generally, natural language processing.
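As a quick reminder (this is the standard formulation, not anything specific to our data-set), Naïve Bayes scores a class $c$ for a document $d = (w_1, \dots, w_n)$ using Bayes' theorem together with the "naïve" assumption that words are independent given the class:

$$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)} \propto P(c) \prod_{i=1}^{n} P(w_i \mid c)$$

The classifier simply predicts the class with the highest score.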
Anaconda is a Python distribution offering many features. For example, it makes it possible to install libraries and use them in your programs, and it also ships with software that helps developers set up a complete development environment quickly.
Spyder (named Pydee in its first versions) is a development environment for Python. Free (MIT license) and cross-platform, it integrates many libraries for scientific use, such as Matplotlib, NumPy, SciPy, and IPython.
Scikit-learn is a free Python library for machine learning. It is developed by many contributors, especially in the academic world by French higher education and research institutes like Inria and Télécom Paris. It includes functions for estimating random forests, logistic regressions, classification algorithms, and support vector machines. It is designed to harmonize with other free Python libraries, notably NumPy and SciPy.
Pandas is a Python library for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. Pandas is free software under the BSD license.
The data-set used in this analysis is available here.
It contains 100K (99,999) reviews of different products; the dataset combines reviews of hotels, books, movies, products, and some airlines.
It has three classes (mixed, negative, and positive). Most labels were mapped from the reviewers' ratings, with 3 being mixed, above 3 positive, and below 3 negative. Each row has a label and a text separated by a tab (TSV).
The text (review) was cleaned by removing Arabic numerals and non-Arabic characters. There are no duplicate reviews in the dataset.
First, we want to import the data-set:
```python
import pandas as pd

# Load the data-set (tab-separated: one label and one text per row)
dataset = pd.read_csv('ar_reviews_100k.tsv', sep='\t')
```
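As a quick sanity check, we can peek at the data and the class balance (assuming the file's header names the columns `label` and `text`, as the rest of the code suggests):

```python
# Show the first rows and the distribution of the three classes
print(dataset.head())
print(dataset['label'].value_counts())
```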
For the analysis to be of better quality, we need to do another round of data cleaning in which we eliminate the stop words that would skew our analysis (or degrade the results).
A stop word is an insignificant word in a text. It is so common that it is useless to include it in the analysis.
Examples of stop words: أنا, كان, منذ, حتى, غير, و
To remove these words we will use a list of Arabic stop words available here.
```python
# Load the stop-word list (one word per line, no header) as a set
stopwords = set(pd.read_csv('ar_stopwords.txt', header=None)[0])

# Drop every stop word from each review
dataset['text'] = dataset['text'].apply(
    lambda x: ' '.join(word for word in x.split() if word not in stopwords))
```
Then we will split the data-set into Features (X) and Labels (y):
```python
# Column 1 holds the review text, column 0 the label
X = dataset.iloc[:, 1].values
y = dataset.iloc[:, 0].values
```
Then we split our data-set into a training set (70%) and a test set (30%):
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
```
Finally, we convert the text into numerical features using TF-IDF vectorization:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The larger max_features is, the better the performance,
# but the training phase consumes more RAM
v = TfidfVectorizer(max_features=5000)
X_train = v.fit_transform(X_train).toarray()
X_test = v.transform(X_test).toarray()
```
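To get a feel for what the vectorizer produced, we can inspect the shape of the resulting matrix and a few entries of the learned vocabulary (`get_feature_names_out` requires scikit-learn ≥ 1.0; older versions use `get_feature_names`):

```python
# Roughly 70,000 rows (reviews) by 5,000 columns (TF-IDF features)
print(X_train.shape)
print(v.get_feature_names_out()[:10])
```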
Now that the data-set is prepared, we can easily build our model by running the following code:
```python
from sklearn.naive_bayes import MultinomialNB

# Train a multinomial Naïve Bayes classifier on the TF-IDF features
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
```
Note:
We could try a different Naïve Bayes variant, for example the Gaussian one, but in practice the variant that gives the best accuracy in our case is the multinomial one.
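A minimal sketch of such a comparison (GaussianNB works on the dense arrays we already built; everything else stays the same):

```python
from sklearn.naive_bayes import GaussianNB

# Same pipeline, Gaussian variant: each TF-IDF feature is modeled
# as a normal distribution per class
g_classifier = GaussianNB()
g_classifier.fit(X_train, y_train)
print(g_classifier.score(X_test, y_test))
```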
We can now evaluate our model on the held-out test set by running the following code:
```python
y_pred = classifier.predict(X_test)
```
Then we can compute different measures of the model's performance, for example:
```python
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred, labels=['Positive', 'Mixed', 'Negative'])
```
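To make the matrix easier to read, we can wrap it in a labeled DataFrame (with scikit-learn's convention, rows are true classes and columns are predicted classes, in the order given by the `labels` argument):

```python
# Label the confusion matrix for readability
labels = ['Positive', 'Mixed', 'Negative']
print(pd.DataFrame(cm, index=labels, columns=labels))
```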
```python
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
```
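Accuracy alone can hide per-class behavior; scikit-learn's `classification_report` adds precision, recall, and F1-score for each of the three classes:

```python
from sklearn.metrics import classification_report

# Precision, recall, and F1-score per class
print(classification_report(y_test, y_pred))
```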
You can also test the model manually on real comments.
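A minimal sketch of such a trial (the comment below is made up for illustration; any new text must go through the same vectorizer `v` before prediction):

```python
# Made-up example comment: "a wonderful, high-quality product"
comment = 'منتج رائع وذو جودة عالية'
print(classifier.predict(v.transform([comment]).toarray()))
```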
We can further increase the accuracy of our model by eliminating more stop words and by increasing the number of features used (we used 5000 in this model).
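A sketch of that experiment, assuming we kept the raw text splits aside in hypothetical variables `X_train_raw` and `X_test_raw` before vectorization (in the code above, `X_train` and `X_test` were overwritten by their TF-IDF versions); note that `MultinomialNB` also accepts sparse matrices, so `toarray()` can be skipped here:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# X_train_raw / X_test_raw are hypothetical: the raw text splits
# kept aside before vectorization
for n in (5000, 10000, 20000):
    vec = TfidfVectorizer(max_features=n)
    Xtr = vec.fit_transform(X_train_raw)  # sparse matrix; MultinomialNB handles it
    Xte = vec.transform(X_test_raw)
    clf = MultinomialNB().fit(Xtr, y_train)
    print(n, clf.score(Xte, y_test))
```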
For this model to be really useful, we need to invest in dialectal Arabic, since most comments on social networks are written in dialect, which limits the use of the model.
We could also try to build a model that analyzes the sentiment of comments containing emojis, since they are widely used in comments.