Written by CHEBBAH Mehdi
Topic modeling (or topic extraction) is a Natural Language Processing (NLP) technique that allows machines to automatically extract meaning from text by identifying recurrent abstract themes, or topics, generally represented by their most relevant keywords.
In this article, I'll present some interesting libraries that implement different topic extraction techniques, explain the techniques each one implements, and go over the advantages and disadvantages of each implementation. Then I'll present a real-life use case of topic modeling.
The first library on our list is Scikit-learn, an open-source machine learning library that supports supervised and unsupervised learning. It implements two topic modeling algorithms: Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF). The pros and cons of its implementation of these algorithms are presented here, followed by a short usage sketch:
Advantages
Disadvantages
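To give a feel for scikit-learn's API, here is a minimal sketch of both algorithms on a toy corpus; the documents and parameters are made up purely for illustration:

```python
# Topic extraction with scikit-learn's LDA and NMF on a tiny toy corpus.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

documents = [
    "machine learning extracts topics from large collections of text",
    "neural networks learn dense word embeddings from raw text",
    "topic models group documents by recurring abstract themes",
]

def show_topics(model, feature_names, top_n=5):
    # Print the most heavily weighted keywords of each topic.
    for idx, weights in enumerate(model.components_):
        keywords = [feature_names[i] for i in weights.argsort()[::-1][:top_n]]
        print(f"Topic {idx}: {', '.join(keywords)}")

# LDA works on raw term counts.
counts = CountVectorizer(stop_words="english")
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts.fit_transform(documents))
show_topics(lda, counts.get_feature_names_out())

# NMF is usually fitted on TF-IDF weights instead.
tfidf = TfidfVectorizer(stop_words="english")
nmf = NMF(n_components=2, random_state=0)
nmf.fit(tfidf.fit_transform(documents))
show_topics(nmf, tfidf.get_feature_names_out())
```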
Gensim is a free, open-source Python library for representing documents as semantic vectors, as efficiently and painlessly as possible. It implements three topic extraction techniques: LDA, Latent Semantic Analysis (LSA), and Hierarchical Dirichlet Process (HDP). The advantages and disadvantages are presented here, followed by a short usage sketch:
Advantages
Disadvantages
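For comparison, here is a minimal Gensim sketch using LDA on a toy, pre-tokenized corpus (the parameters are illustrative); LsiModel and HdpModel follow the same dictionary/corpus workflow:

```python
# LDA with Gensim: build a dictionary, a bag-of-words corpus, then train.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [
    ["machine", "learning", "extracts", "topics", "from", "text"],
    ["neural", "networks", "learn", "word", "embeddings"],
    ["topic", "models", "group", "documents", "by", "themes"],
]

dictionary = Dictionary(texts)                             # token <-> id mapping
corpus = [dictionary.doc2bow(tokens) for tokens in texts]  # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=0)

# Each topic is described by its most probable words.
for topic_id, description in lda.print_topics():
    print(topic_id, description)
```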
The next option is Numpy ML, a growing collection of machine learning models, algorithms, and tools written exclusively in NumPy and the Python standard library. The only topic modeling algorithm implemented in this library is LDA. This implementation has its own strengths and drawbacks:
Advantages
Disadvantages
Familia is an open-source toolkit for industrial topic modeling. It implements LDA and SentenceLDA. Here are the pros and cons:
Advantages
Disadvantages
Top2Vec is a Python library designed to learn jointly embedded topic, document, and word vectors. It is the only implementation of the Top2Vec algorithm presented in this paper. It is a promising approach that uses deep learning techniques to solve this problem. Here are the benefits and drawbacks of this approach, followed by a short usage sketch:
Advantages
Disadvantages
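Here is a minimal sketch of how Top2Vec is typically used; the 20 Newsgroups corpus is only sample data, since the model needs a reasonably large collection of documents to embed and cluster well:

```python
# Top2Vec learns jointly embedded document and word vectors, then finds topics
# as dense regions of that space; the number of topics is inferred automatically.
from sklearn.datasets import fetch_20newsgroups
from top2vec import Top2Vec

newsgroups = fetch_20newsgroups(subset="all",
                                remove=("headers", "footers", "quotes"))

model = Top2Vec(documents=newsgroups.data, speed="learn", workers=4)

print("Topics found:", model.get_num_topics())
topic_words, word_scores, topic_nums = model.get_topics()
print("Keywords of topic 0:", topic_words[0][:10])
```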
The BERTopic library implements the BERTopic model (presented in this paper), which leverages BERT embeddings and class-based TF-IDF (c-TF-IDF) to create dense clusters, allowing for easily interpretable topics while keeping important words in the topic descriptions. This model has many features that arguably make it the strongest on this list, but it also has drawbacks worth weighing before using it. A short usage sketch follows the pros and cons:
Advantages
Disadvantages
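A minimal BERTopic sketch looks like the following; the 20 Newsgroups corpus is again just sample data, and the default settings download a sentence-transformers embedding model:

```python
# BERTopic: embed documents, cluster them, and describe each cluster with
# class-based TF-IDF (c-TF-IDF) keywords.
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="all",
                          remove=("headers", "footers", "quotes")).data

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head())  # overview of discovered topics
print(topic_model.get_topic(0))             # c-TF-IDF keywords of topic 0
```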
text2vec is an R package that provides an efficient framework with a concise API for text analysis and natural language processing (NLP). This library implements the LDA and LSA algorithms. The pros and cons of this implementation are presented below:
Advantages
Disadvantages
One use case that I personally came across was when we were building a new recommender system for scientific papers. The system was based on a graph in which papers, authors, and topics are nodes, and the relations between them are edges. It works as follows:
To implement this system, we had to use a topic modeling model to extract the topics from the papers, then use those topics to build the graph. We decided to go with Top2Vec for this task, and here is how we used it:
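The original code isn't reproduced here, but the idea looks roughly like the sketch below. The paper records, field names, and the choice of networkx for the graph are illustrative assumptions, not the actual system:

```python
# Extract a dominant topic per paper with Top2Vec, then link papers, authors,
# and topics as nodes in a graph (hypothetical data; networkx for illustration).
import networkx as nx
from top2vec import Top2Vec

papers = [
    {"id": "p1", "abstract": "...", "authors": ["A. Author"]},
    # ... a real corpus would contain thousands of papers
]

# Learn topics from the paper abstracts.
model = Top2Vec(documents=[p["abstract"] for p in papers],
                speed="learn", workers=4)

# Find the dominant topic of each paper.
topic_nums, topic_scores, _, _ = model.get_documents_topics(
    doc_ids=list(range(len(papers))))

# Papers, authors, and topics are nodes; the relations between them are edges.
graph = nx.Graph()
for paper, topic_num in zip(papers, topic_nums):
    graph.add_node(paper["id"], kind="paper")
    graph.add_node(f"topic_{topic_num}", kind="topic")
    graph.add_edge(paper["id"], f"topic_{topic_num}", relation="about")
    for author in paper["authors"]:
        graph.add_node(author, kind="author")
        graph.add_edge(author, paper["id"], relation="wrote")
```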
This was just one simple real-world use case of topic modeling techniques that I have personally faced. There are plenty of other use cases that every NLP engineer will come across at least once in their career.
If you are interested in an in-depth look at topic modeling algorithms, you can read these papers: