Done by CHEBBAH Mehdi & HAMMAS Ali Cherif
What is NLP?
Data pre-processing in NLP
  Theoretically
  In Practice
NLP analysis types
  Machine Learning based (statistics)
    Supervised methods
    Unsupervised methods
  Deep Learning based
    CNN
    RNN
Building and testing the models
Comparison between models
Usefulness of Spark
NLP (Natural Language Processing) is a branch of artificial intelligence that deals with the processing of written language. In French it is known as TALN (Traitement automatique du langage naturel) or TLN. In short, it covers everything related to human language and its processing by automated tools.
NLP can be divided into two main parts: NLU (Natural Language Understanding) and NLG (Natural Language Generation).
The first concerns the "comprehension" of text: taking a text as input and extracting data from it. This type is widely used in:
The second is about generating text from data, that is, building coherent sentences automatically. This type is widely used in:
Several data sources can be used in an NLP process, for example web scraping, social networks, databases, and real-time data (streaming).
The pre-processing depends on the source of the data (and therefore on its quality), but overall there are three phases:
Processing missing values: This is a very important phase in the preparation of data for all types of models (NLP or other). There are several approaches to handle missing values without deleting the affected records (deleting them can bias the model).
Annotating the data: This phase is usually done by humans (several people read the data and classify it according to predefined classes), or by unsupervised (or semi-supervised) Machine Learning algorithms.
Data cleansing: This phase depends on the data sources, the data quality, and the objective of the analysis. Depending on the case, the following treatments may or may not be applied:
First, we create a Spark context.
import findspark
findspark.init('/opt/spark')

from pyspark import SparkContext
sc = SparkContext("local", "NLP App")
We used findspark to initialize a Spark environment inside the conda environment, then initialized a SparkContext.
Now we can load the data sets and the stop-word list:
dataset_path = '/path/to/datasets/files/folder/'
stopwords_path = '/path/to/stopwords/file/folder/'
# Each line of the data set is "<document>\t<label>"
data = sc.textFile(dataset_path + "*.txt").map(lambda line: line.split("\t"))
# The stop-word list is small, so it is collected to the driver
stopwords = sc.textFile(stopwords_path + "english").collect()
Then we prepare the data for pre-processing (we separate the documents from the annotations).
documents = data.map(lambda line: line[0])
labels = data.map(lambda line: line[1])
The pre-processing of the data follows this scheme. First, lowercase the text, replace punctuation with spaces, and split each document into words.
import re

def lower_clean_str(x):
    # Lowercase the text and replace every punctuation character with a space
    punc = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    lowercased_str = x.lower()
    for ch in punc:
        lowercased_str = lowercased_str.replace(ch, ' ')
    return lowercased_str.strip()

# Collapse runs of spaces, then split each document into a list of words
documents = documents.map(lambda line: re.sub(" +", ' ', lower_clean_str(line)).split(" "))
Remove stop words.
def removeStopWords(words, stopwords):
    return [x for x in words if x not in stopwords]

documents = documents.map(lambda line: removeStopWords(line, stopwords))
documents.take(2)
# Results:
# [['slow','moving','aimless','movie','distressed','drifting','young','man'],
#  ['sure','lost','flat','characters','audience','nearly','half','walked']]
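The two cleaning steps above can also be checked locally, without Spark, on a single line of text. The sketch below is only an illustration: the sample sentence, the tiny stop-word set, and the helper names `tokenize` and `remove_stop_words` are assumptions, not part of the original pipeline.

```python
import re

PUNCTUATION = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

def lower_clean_str(text):
    # Lowercase, then replace each punctuation character with a space
    lowered = text.lower()
    for ch in PUNCTUATION:
        lowered = lowered.replace(ch, " ")
    return lowered.strip()

def tokenize(text):
    # Collapse runs of spaces before splitting into words
    return re.sub(" +", " ", lower_clean_str(text)).split(" ")

def remove_stop_words(words, stopwords):
    return [w for w in words if w not in stopwords]

sample = "A slow-moving, aimless movie!"  # hypothetical review line
stopwords = {"a", "the", "about"}          # tiny stand-in for the real list
tokens = remove_stop_words(tokenize(sample), stopwords)
print(tokens)  # ['slow', 'moving', 'aimless', 'movie']
```

Using a set for the stop words makes each membership test constant time, which may matter when the list is large.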
Apply TF-IDF.
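from pyspark.mllib.feature import HashingTF, IDF

# Term frequencies computed with the hashing trick
hashingTF = HashingTF()
tf = hashingTF.transform(documents)

# tf is reused for fitting and transforming, so cache it
tf.cache()
idf = IDF().fit(tf)
tfidf = idf.transform(tf)

# Ignore terms that appear in fewer than 2 documents
idfIgnore = IDF(minDocFreq=2).fit(tf)
tfidfIgnore = idfIgnore.transform(tf)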
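To make the TF-IDF step more concrete, here is a minimal pure-Python sketch of the same idea that runs without Spark. It assumes the smoothed formula idf = log((m + 1) / (df + 1)) documented for MLlib's IDF, where m is the number of documents and df the number of documents containing the term. The function names, the CRC32 hash (a stand-in, not the hash function Spark uses), and the sample documents are illustrative assumptions.

```python
import math
import zlib
from collections import Counter

def hashing_tf(tokens, num_features=1 << 20):
    """Map tokens to a sparse {index: count} vector with the hashing trick."""
    counts = Counter()
    for tok in tokens:
        # CRC32 is only a deterministic stand-in for Spark's hash function
        counts[zlib.crc32(tok.encode("utf-8")) % num_features] += 1
    return dict(counts)

def fit_idf(tf_vectors, min_doc_freq=0):
    """Compute idf = log((m + 1) / (df + 1)) for every index seen."""
    m = len(tf_vectors)
    doc_freq = Counter()
    for vec in tf_vectors:
        doc_freq.update(vec.keys())  # each document counts once per index
    return {
        idx: math.log((m + 1) / (df + 1)) if df >= min_doc_freq else 0.0
        for idx, df in doc_freq.items()
    }

def tfidf(tf_vector, idf):
    """Weight each term frequency by its inverse document frequency."""
    return {idx: count * idf.get(idx, 0.0) for idx, count in tf_vector.items()}

docs = [["slow", "movie"], ["slow", "characters"], ["slow"]]  # toy corpus
tf_vectors = [hashing_tf(d) for d in docs]
idf = fit_idf(tf_vectors)
weighted = [tfidf(v, idf) for v in tf_vectors]
print(weighted)
```

In this toy corpus the word "slow" appears in every document, so its weight is zero: a term that occurs everywhere does not help to tell documents apart.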
Prepare the data structures required to create the model: for the training phase, the model takes an RDD of LabeledPoints as input.
from pyspark.mllib.regression import LabeledPoint

# Key both RDDs by row index so that features and labels can be joined
tfidfWithIndexes = tfidfIgnore.zipWithIndex().map(lambda x: (x[1], x[0]))
labelsWithIndexes = labels.zipWithIndex().map(lambda x: (x[1], x[0]))
labelsWithIndexes.take(5)
# After the join, each element is (index, (features, label))
trainingData = tfidfWithIndexes.join(labelsWithIndexes).map(lambda x: LabeledPoint(x[1][1], x[1][0]))
# 70% of the data for training, 30% for testing
training, test = trainingData.randomSplit([0.7, 0.3])
There are two main categories of NLP analysis, depending on the algorithm used: Machine Learning based (statistical) methods and Deep Learning based methods.
Note:
In this work, the practical part uses a statistical model based on supervised algorithms. The Deep Learning methods are only explained theoretically.
The Machine Learning based (statistical) category is the most widely used because of its simplicity. It contains two approaches, used for different objectives:
Supervised methods require the data set to be annotated. They are used for sentiment analysis, information extraction, text classification, etc. The most commonly used algorithms are:
Unsupervised methods do not require annotated data sets. They are used for morphology, sentence segmentation, text classification, lexical disambiguation, translation, etc.
Recently, Deep Learning has become one of the most widely used approaches for complex learning problems: it not only pushes past the limits of the statistical models seen previously, but also sometimes gives excellent results, depending on the context in which it is used.
Although a lot of research has been done so far, two approaches are widely used for NLP: CNN and RNN.
CNN
The CNN (Convolutional Neural Network) is a type of artificial neural network (ANN) which, as its name implies, is a set of neurons (with associated weights) organized in layers. The main difference between a CNN and a plain ANN is that, whereas the ANN passes the whole output of one layer to the next through weighted sums and activation functions, the CNN applies filters to the input data to extract features.
The idea of this approach is first to split the sentences into words, which are then transformed into a matrix of word embeddings of dimension d (the input matrix). The input matrix is then cut into several regions, and the different filters are applied to the corresponding sub-matrices. Then comes a crucial step called "pooling", which transforms the resulting matrices so that they have a predefined size. There are two main reasons for this step, one of which is to obtain an output of fixed size whatever the size of the input matrix. At the end comes the final phase, a classifier based on the extracted features.
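The convolution and pooling steps described above can be sketched in pure Python. This is a minimal illustration under stated assumptions: the embeddings and filters are tiny hand-written matrices (in a real CNN they are learned), the pooling is max-over-time pooling, a ReLU is applied after each filter, and the function names are hypothetical.

```python
def conv1d_text(embeddings, filt):
    """Slide a filter covering len(filt) consecutive words over the sentence."""
    height = len(filt)
    feature_map = []
    for start in range(len(embeddings) - height + 1):
        window = embeddings[start:start + height]
        # Element-wise product of the filter and the window, summed up
        total = sum(
            w * x
            for f_row, e_row in zip(filt, window)
            for w, x in zip(f_row, e_row)
        )
        feature_map.append(max(0.0, total))  # ReLU activation
    return feature_map

def max_over_time(feature_map):
    """Pooling: keep one value per filter, whatever the sentence length."""
    return max(feature_map)

def extract_features(embeddings, filters):
    return [max_over_time(conv1d_text(embeddings, f)) for f in filters]

# Toy embeddings of dimension d = 2 (one row per word)
short_sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
long_sentence = short_sentence + [[0.0, 0.0], [2.0, 1.0]]
filters = [
    [[1.0, 0.0], [0.0, 1.0]],              # covers 2 words
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],  # covers 3 words
]

short_features = extract_features(short_sentence, filters)
long_features = extract_features(long_sentence, filters)
print(short_features, long_features)
```

Both sentences produce a feature vector with one entry per filter, which is what allows a fixed-size classifier to sit on top of inputs of varying length.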
In general, CNNs are effective at extracting semantic clues from the global context, but they have difficulty preserving sequential order and modeling long-distance contextual information. Recurrent models are better suited to this type of learning and are discussed below.
RNN
RNNs (Recurrent Neural Networks) are neural networks specifically designed to perform well on sequential data, which gives them a big advantage for NLP. They are based on a concept called sequential memory: learning things in order, a mechanism that we humans use a lot in our lives, and RNNs are one of the most efficient ways to model it.
For example, suppose we ask someone to recite the alphabet in normal order:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
They will have no difficulty doing it. If instead we give them a letter in the middle and ask them to complete the sequence, they will hesitate at first but will then recite the rest very quickly, because they learned the alphabet as a sequence.
A more difficult challenge is to recite the alphabet in reverse order.
Z Y X W V U T S R Q P O N M L K J I H G F E D C B A
This becomes much harder even though everyone knows the alphabet: the mere fact that the usual order is not respected makes the task difficult and sometimes even impossible. The same applies to RNNs.
To include this concept in a neural network, it is enough to take a simple ANN and, in each layer, add an arc that connects the output back to the input. This way, the data of the previous state is added to the data of the current state.
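The recurrent connection described above can be sketched in a few lines of pure Python. This is a minimal scalar version, assuming the common update rule h_t = tanh(w_x * x_t + w_h * h_(t-1) + b). The function name and the weight values are illustrative assumptions; real RNNs use weight matrices that are learned during training.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a one-unit recurrent layer over a sequence and return all states."""
    h = 0.0  # initial state: nothing has been seen yet
    states = []
    for x in inputs:
        # The previous state h is fed back in together with the current input
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

forward = rnn_forward([1.0, 2.0, 3.0])
backward = rnn_forward([3.0, 2.0, 1.0])
print(forward[-1], backward[-1])
```

The same three inputs presented in a different order lead to a different final state, which is how the network captures the order of a sequence. Setting `w_h` to 0 removes the feedback arc, and the final state then depends only on the last input.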
The main advantage of the Recurrent Neural Network is therefore its ability to give meaning to word sequences, and so to identify the subject and the context of a sentence. A good example of an application is a chat-bot: the model can understand what the user wants from the sentence entered, and then select the most suitable answer to what was asked.
We will build three models: Naive Bayes, SVM, and Random Forest.
from pyspark.mllib.classification import NaiveBayes
# The second parameter is the smoothing parameter (here 5.0)
model1 = NaiveBayes.train(training, 5.0)
predictionAndLabel_NB = test.map(lambda p: (model1.predict(p.features), p.label))
from pyspark.mllib.classification import SVMWithSGD
model2 = SVMWithSGD.train(training, iterations=100)
predictionAndLabel_SVM = test.map(lambda p: (model2.predict(p.features), p.label))
from pyspark.mllib.tree import RandomForest
model = RandomForest.trainClassifier(training, numClasses=2, numTrees=5, categoricalFeaturesInfo={}, featureSubsetStrategy="auto", maxDepth=4, maxBins=32)
# The tree model predicts on a whole RDD, so predictions are zipped with the labels afterwards
predictions = model.predict(test.map(lambda x: x.features))
predictionAndLabel_RF = test.map(lambda lp: lp.label).zip(predictions)
Note:
The parameters used to create the models were chosen so that the accuracy of each model is as high as possible.
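To illustrate what the smoothing parameter passed to NaiveBayes.train does, here is a minimal pure-Python sketch of a multinomial Naive Bayes classifier with additive smoothing. It is a simplified stand-in that works on word lists rather than TF-IDF vectors, not MLlib's implementation; the function names and the toy data are assumptions.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, smoothing=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = {word for doc in docs for word in doc}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for label, n_docs in class_counts.items():
        # Smoothing adds a pseudo-count to every word of the vocabulary,
        # so a word never seen in a class keeps a non-zero probability
        denom = sum(word_counts[label].values()) + smoothing * len(vocab)
        log_prior = math.log(n_docs / len(labels))
        log_lik = {
            w: math.log((word_counts[label][w] + smoothing) / denom)
            for w in vocab
        }
        model[label] = (log_prior, log_lik)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior score."""
    def score(label):
        log_prior, log_lik = model[label]
        # Words outside the training vocabulary are ignored
        return log_prior + sum(log_lik[w] for w in doc if w in log_lik)
    return max(model, key=score)

docs = [["great", "movie"], ["great", "acting"],
        ["slow", "movie"], ["boring", "slow"]]  # toy reviews
labels = [1, 1, 0, 0]                            # 1 = positive, 0 = negative
nb = train_nb(docs, labels, smoothing=5.0)
print(predict_nb(nb, ["great"]), predict_nb(nb, ["slow", "boring"]))
```

A larger smoothing value pulls the word probabilities of all classes closer together, so the prediction relies less on rare words.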
We can compare the accuracy of the models as follows:
def accuracy(predictionAndLabel):
    # Fraction of test examples whose prediction matches the label
    return 1.0 * predictionAndLabel.filter(lambda pl: pl[0] == pl[1]).count() / test.count()

# NB
print('model accuracy {}'.format(accuracy(predictionAndLabel_NB)))
# Results: model accuracy 0.7817982456140351

# SVM
print('model accuracy {}'.format(accuracy(predictionAndLabel_SVM)))
# Results: model accuracy 0.7872807017543859

# RF
print('model accuracy {}'.format(accuracy(predictionAndLabel_RF)))
# Results: model accuracy 0.4868421052631579
On this data set, the most suitable models are clearly Naïve Bayes and SVM, with an accuracy of about 78% and 79% respectively, while Random Forest reaches only about 49%.
The use of Spark has made a big difference in two respects:
Data structures (RDD) are easily shared across threads on the computer (or across the cluster, if used in a network), which speeds up the work.
The API provides data structures, Machine Learning and model performance calculation algorithms, and powerful data mining methods.