NLP with Spark

Sentiment analysis in comments

By CHEBBAH Mehdi & HAMMAS Ali Cherif

Table of contents

  1. What is NLP?
  2. Data pre-processing in NLP
  3. NLP analysis types
  4. Building and testing the models
  5. Comparison between models
  6. Usefulness of Spark

What is NLP?

NLP (Natural Language Processing) is a branch of artificial intelligence that deals particularly with the processing of written language. It is also known by the French name TALN (Traitement automatique du langage naturel), or TLN. In short, it covers everything related to human language and its processing by automated tools.

NLP can be divided into two main parts: NLU (Natural Language Understanding) and NLG (Natural Language Generation).

 

Data pre-processing in NLP

Theoretically:

There are several data sources that can be used in the NLP process: web scraping, social networks, databases, real-time (streaming) data, etc.

The pre-processing we apply depends on the source of the data (and therefore on its quality), but globally it proceeds in three phases.

In practice:

First, we create a Spark session.

We use findspark to locate the Spark installation from within the conda environment, then initialize a SparkContext, as in the sketch below.
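A minimal sketch of this initialization (the application name and the `local[*]` master URL are illustrative assumptions, not taken from the original code):

```python
# Locate the local Spark installation from the conda environment,
# then create a SparkContext.
import findspark
findspark.init()  # makes the pyspark package importable

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("sentiment-analysis").setMaster("local[*]")
sc = SparkContext(conf=conf)
```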

Now we can import the data sets.
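For example, assuming the comments are stored in a tab-separated text file where each line holds a label followed by the comment (the path and layout are assumptions):

```python
# Hypothetical layout: "<label>\t<comment text>" per line.
raw = sc.textFile("data/comments.tsv")
print(raw.count(), "documents loaded")
```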

Then we prepare the data for pre-processing: we separate the documents from their annotations.
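Under the same assumed file layout, the separation could look like this:

```python
# Split each line into (label, document); the separator and field order
# are assumptions about the file layout.
pairs = raw.map(lambda line: line.split("\t", 1))
labels = pairs.map(lambda p: float(p[0]))
documents = pairs.map(lambda p: p[1])
```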

  1. The pre-processing of the data follows this scheme:

    1. Convert the documents to lowercase.
    2. Remove punctuation.
    3. Remove extra white space.
    4. Tokenize (into words).
  2. Remove stop words.

  3. Apply TF-IDF.

  4. Prepare the data structures required to build the model: for the training phase, the model takes as input an RDD of LabeledPoint objects (see the sketch after this list).
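A sketch of this pipeline with MLlib's RDD-based API, reusing the `labels` and `documents` RDDs assumed above (the stop-word list and `numFeatures` value are illustrative):

```python
import re
from pyspark.mllib.feature import HashingTF, IDF
from pyspark.mllib.regression import LabeledPoint

# Illustrative stop-word list; use a complete list for your language.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "in", "it"}

def tokenize(doc):
    doc = doc.lower()                    # 1. lowercase
    doc = re.sub(r"[^\w\s]", " ", doc)   # 2. strip punctuation
    words = doc.split()                  # 3-4. collapse spaces, tokenize
    return [w for w in words if w not in STOP_WORDS]  # remove stop words

tokens = documents.map(tokenize)

# TF-IDF: hash each term into a fixed-size feature space, then reweight
# term frequencies by inverse document frequency.
tf = HashingTF(numFeatures=10000).transform(tokens)
tf.cache()
tfidf = IDF().fit(tf).transform(tf)

# Pair each feature vector with its label for the training phase.
training_data = labels.zip(tfidf).map(lambda lv: LabeledPoint(lv[0], lv[1]))
```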

 

NLP analysis types

There are two main categories of NLP approaches (depending on the analysis algorithm): Machine Learning (statistical) methods and Deep Learning methods.

Note:

In this work we use, in practice, a statistical model based on supervised algorithms; Deep Learning methods are only explained theoretically.

 

Machine Learning based (statistics)

This category of algorithms is the most widely used because of its simplicity. We can distinguish two approaches, used for different objectives:

Supervised methods

These methods require the data set to be annotated. They are used for sentiment analysis, information extraction, text classification, etc. Commonly used algorithms include Naïve Bayes and SVM (both used later in this work).

Unsupervised methods

These methods do not require the data sets to be annotated. They are used for morphology, sentence segmentation, text classification, lexical disambiguation, translation, etc.

 

Deep Learning based

Recently, Deep Learning has become one of the most widely used approaches for solving complex learning problems: it not only extends the limits of the models seen previously (statistical models) but can also give excellent results, depending on the context in which it is used.

Although a lot of research has been established so far, two approaches are widely used for NLP: CNNs and RNNs.

CNN

A CNN (Convolutional Neural Network) is a type of artificial neural network (ANN): as its name implies, a set of neurons (representing weights) organized in layers. The main difference is that, unlike a fully connected ANN, a CNN applies convolutional filters to the input data to extract local features.

The idea of this approach is first to split the sentences into words, which are then transformed into a matrix of word embeddings (the input matrix) of dimension d. The input matrix is then cut into several regions, and the different filters are applied to the corresponding sub-matrices. Next, a crucial step called "pooling" is applied, which transforms the resulting matrices to a predefined size. There are two main reasons for this process:

  1. To give the output matrix a fixed size
  2. To reduce the size of the output matrix

whatever the size of the input matrix. At the end, the final phase is a classifier built on the extracted features.
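Since Deep Learning is not implemented in this work, here is only a toy illustration of the pooling idea: reducing a variable-length feature map to a fixed number of values (the function name and chunking strategy are assumptions of this sketch):

```python
import numpy as np

def max_pool(feature_map, output_size):
    # Split the map into output_size chunks and keep the max of each,
    # so the output length is fixed whatever the input length.
    chunks = np.array_split(np.asarray(feature_map), output_size)
    return [float(c.max()) for c in chunks]

print(max_pool([0.1, 0.7, 0.3, 0.9, 0.2, 0.5], 2))  # -> [0.7, 0.9]
```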

In general, CNNs are efficient because they can extract semantic clues from the global context, but they have difficulty preserving sequential order and modeling long-distance contextual information. Recurrent models are better suited to this type of learning; they are discussed below.

RNN

RNNs (Recurrent Neural Networks) are neural networks specifically designed to perform very well on sequential data, which gives them a big advantage for NLP. They are based on a concept called sequential memory: learning things through a mechanism that we humans use a lot in our everyday lives.

For example, if we ask someone to recite the alphabet in its normal order:

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

they won't have any difficulty doing it. But if we give them a letter in the middle and ask them to continue the sequence from there, they will struggle at first, then recite the rest very quickly, because they learned the alphabet as a sequence.

A more difficult challenge is to recite the alphabet in reverse order:

Z Y X W V U T S R Q P O N M L K J I H G F E D C B A

This becomes much more difficult even though everyone knows the letters; the mere fact that the sequence is not respected makes the task hard, sometimes impossible. The same applies to RNNs.

To include this concept in a neural network, it is enough to take a simple ANN and, in each layer, add an arc connecting the output back to the input. This way, the data of the previous state is combined with the data of the current state.
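As a purely theoretical illustration (no RNN is trained in this work), one recurrent step could be sketched as follows; all names and dimensions are assumptions:

```python
import numpy as np

def rnn_step(x, h_prev, W_x, W_h, b):
    # The previous hidden state h_prev is fed back and combined with the
    # current input x: this is what the recurrent arc implements.
    return np.tanh(W_x @ x + W_h @ h_prev + b)

# Toy dimensions: input size 3, hidden size 4.
rng = np.random.default_rng(0)
W_x, W_h, b = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)
h = np.zeros(4)
for x in rng.normal(size=(5, 3)):   # a sequence of 5 inputs
    h = rnn_step(x, h, W_x, W_h, b)
```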

The main advantage of recurrent neural networks is therefore the ability to give meaning to word sequences, so as to know precisely the subject and context of a sentence. The best example of where this model can be applied is a chat-bot: it can easily understand what the user wants from the input sentence, and the model can then choose the most suitable answer to what was asked.

 

Building and testing the models

We will build three models.

Note:

The parameters used when creating the models were chosen so that the accuracy of each model is as good as possible.
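As a sketch, here is how two of the models (Naïve Bayes and SVM, the ones named in the comparison below) could be trained with MLlib; the train/test split ratio and the parameters shown are illustrative assumptions, not the tuned values:

```python
from pyspark.mllib.classification import NaiveBayes, SVMWithSGD

# Hold out part of the data for evaluation; the 80/20 split is an
# illustrative assumption.
train, test = training_data.randomSplit([0.8, 0.2], seed=42)

nb_model = NaiveBayes.train(train, lambda_=1.0)      # additive smoothing
svm_model = SVMWithSGD.train(train, iterations=100)  # linear SVM
```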

 

Comparison between models

We can compare the accuracy of the models as follows:
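A sketch of the accuracy computation, reusing the models and test set assumed above:

```python
# Accuracy = fraction of test points whose predicted label matches the
# true label.
def accuracy(model, data):
    preds = data.map(lambda p: (model.predict(p.features), p.label))
    return preds.filter(lambda pl: pl[0] == pl[1]).count() / data.count()

print("Naive Bayes:", accuracy(nb_model, test))
print("SVM:        ", accuracy(svm_model, test))
```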

We can clearly see that, on this data set, the best-performing models are Naïve Bayes and SVM, both with an accuracy of 78%.

 

Usefulness of Spark

The use of Spark has made a big difference in terms of: