Topic Modeling Libraries

Written by CHEBBAH Mehdi


Topic modeling (or topic extraction) is a technique in Natural Language Processing (NLP) that allows a machine to automatically extract meaning from text by identifying recurrent abstract themes, or topics, generally represented by their most relevant keywords.

In this article, I'll present some interesting libraries that implement different topic extraction techniques. I'll explain the techniques each one implements, along with the advantages and disadvantages of each implementation, and then I'll present a real-life use-case of topic modeling.

Scikit-learn

The first library on our list is Scikit-learn, an open-source machine learning library that supports both supervised and unsupervised learning. It implements two topic modeling algorithms: Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF). Both implementations come with their own pros and cons.
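
Here is a minimal sketch of both algorithms on a toy corpus (the documents and hyperparameters are illustrative only):

```python
# Toy example: LDA on raw term counts, NMF on TF-IDF weights
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stock markets fell sharply", "investors fear rising inflation"]

# LDA is a probabilistic model and expects raw term counts
counts = CountVectorizer(stop_words="english")
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts.fit_transform(docs))

# NMF factorizes a (typically TF-IDF weighted) document-term matrix
tfidf = TfidfVectorizer(stop_words="english")
nmf = NMF(n_components=2, random_state=0)
nmf.fit(tfidf.fit_transform(docs))

# Print the 5 most relevant keywords of each topic
for name, model, vec in [("LDA", lda, counts), ("NMF", nmf, tfidf)]:
    words = vec.get_feature_names_out()
    for i, component in enumerate(model.components_):
        print(name, "topic", i, [words[j] for j in component.argsort()[:-6:-1]])
```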

Gensim

Gensim is a free, open-source Python library for representing documents as semantic vectors, as efficiently and painlessly as possible. It implements three topic extraction techniques: LDA, Latent Semantic Analysis (LSA), and the Hierarchical Dirichlet Process (HDP). Each implementation has its own advantages and disadvantages.
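
Here is a minimal sketch of the three models on a toy, pre-tokenized corpus (all values are illustrative):

```python
# Toy example: LDA, LSA, and HDP in gensim
from gensim.corpora import Dictionary
from gensim.models import LdaModel, LsiModel, HdpModel

texts = [["cat", "sat", "mat"], ["dogs", "cats", "pets"],
         ["stock", "markets", "fell"], ["investors", "fear", "inflation"]]

dictionary = Dictionary(texts)                    # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]   # bag-of-words vectors

lda = LdaModel(corpus, num_topics=2, id2word=dictionary)
lsa = LsiModel(corpus, num_topics=2, id2word=dictionary)
hdp = HdpModel(corpus, id2word=dictionary)        # infers the number of topics

for model in (lda, lsa, hdp):
    print(model.print_topics(num_topics=2))
```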

Numpy ML

The next option is Numpy ML, a growing collection of machine learning models, algorithms, and tools written exclusively in NumPy and the Python standard library. The only topic modeling algorithm implemented in this library is LDA. This implementation has its own features and inconveniences.
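
To give a feel for what an LDA implemented in pure NumPy looks like, here is a tiny collapsed-Gibbs sketch of the same technique. Note that this is my own illustration, not numpy-ml's API:

```python
import numpy as np

def gibbs_lda(docs, V, T=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of word-id lists; V: vocabulary size; T: number of topics."""
    rng = np.random.default_rng(seed)
    n_dt = np.zeros((len(docs), T))   # document-topic counts
    n_tw = np.zeros((T, V))           # topic-word counts
    n_t = np.zeros(T)                 # total words per topic
    z = [rng.integers(T, size=len(d)) for d in docs]  # random initial topics
    for d, doc in enumerate(docs):    # seed the count tables
        for i, w in enumerate(doc):
            n_dt[d, z[d][i]] += 1; n_tw[z[d][i], w] += 1; n_t[z[d][i]] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]           # remove the current assignment
                n_dt[d, t] -= 1; n_tw[t, w] -= 1; n_t[t] -= 1
                # full conditional p(z = t | everything else)
                p = (n_dt[d] + alpha) * (n_tw[:, w] + beta) / (n_t + V * beta)
                t = rng.choice(T, p=p / p.sum())
                z[d][i] = t           # resample and restore the counts
                n_dt[d, t] += 1; n_tw[t, w] += 1; n_t[t] += 1
    return n_tw  # normalize each row to get the topic-word distributions

docs = [[0, 0, 1], [0, 1, 1], [2, 3, 3], [2, 2, 3]]  # 4 tiny docs, 4-word vocab
print(gibbs_lda(docs, V=4))
```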

Familia

Familia is an open-source toolkit for industrial topic modeling. It implements LDA and SentenceLDA. It, too, has its pros and cons.

Top2vec

Top2Vec is a Python library designed to learn jointly embedded topic, document, and word vectors. It is the only implementation of the Top2Vec algorithm presented in this paper, a promising approach that uses deep learning techniques to solve this problem. The approach has its own benefits and drawbacks.
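
The API is deliberately simple. Here is a minimal sketch, assuming `docs` is a list of raw document strings (Top2Vec needs a reasonably large corpus to find meaningful clusters):

```python
from top2vec import Top2Vec

# embedding_model can also be "doc2vec" (the default) or a
# sentence-transformers model; USE requires tensorflow_hub to be installed
model = Top2Vec(docs, embedding_model="universal-sentence-encoder")

print(model.get_num_topics())
topic_words, word_scores, topic_nums = model.get_topics()

# The documents most representative of topic 0
documents, doc_scores, doc_ids = model.search_documents_by_topic(
    topic_num=0, num_docs=5)
```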

BERTopic

This library implements the BERTopic model (presented in this paper), which leverages the BERT model and class-based TF-IDF (c-TF-IDF) to create dense clusters, allowing for easily interpretable topics while keeping important words in the topic descriptions. This model has many features that arguably make it the best model on this list, but it also has drawbacks that are worth weighing before you adopt it.
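
A minimal sketch of its workflow, again assuming `docs` is a list of raw document strings:

```python
from bertopic import BERTopic

topic_model = BERTopic()  # uses a sentence-transformers embedding model by default
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info())  # one row per topic, with sizes and names
print(topic_model.get_topic(0))      # top c-TF-IDF words of topic 0
```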

Text2vec

text2vec is an R package that provides an efficient framework with a concise API for text analysis and NLP. It implements the LDA and LSA algorithms. Like the others, these implementations have their own strengths and weaknesses.

Real-Life Use-Case Of Topic Modeling

One use-case that I personally came across was building a new recommender system for scientific papers. The system was based on a graph in which papers, authors, and topics are nodes, and the relations between them are edges.

To implement this system, we needed a topic model that extracts topics from the papers; those topics are then used to build the graph. We decided to go with Top2Vec for this task, and here is how we used it (a code sketch of the pipeline follows the list):

  1. First, the model takes the inputs (the abstracts of the papers) and embeds them using an embedding model (the USE model in our case, but any embedding model would fit here).
  2. Second, it reduces the dimensionality of the embedding space using the UMAP algorithm to create dense areas. These dense areas contain geometrically close, and therefore semantically close, words.
  3. It then clusters the results using a density-based clustering algorithm (HDBSCAN here) and calculates the centroid of each cluster (each centroid is taken as the representative vector of a topic).
  4. Finally, it gets the K words closest to the center of each cluster using the K-NN algorithm (these words are the keywords of each topic). The outputs of the model are the list of topics and keywords (used to create the graph nodes and edges related to topics) and the trained model (used later to cluster new papers into the right topics).
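
Here is a simplified sketch of that pipeline built on umap-learn and hdbscan directly (Top2Vec wraps all of these steps for you). The `doc_vectors`, `word_vectors`, and `vocab` inputs are assumed to come from the same encoder (USE in our case), and every parameter value is illustrative:

```python
import numpy as np
import umap      # pip install umap-learn
import hdbscan   # pip install hdbscan

def extract_topics(doc_vectors, word_vectors, vocab, top_k=10):
    # Step 2: reduce dimensionality to create dense areas
    reduced = umap.UMAP(n_neighbors=15, n_components=5,
                        metric="cosine").fit_transform(doc_vectors)
    # Step 3: density-based clustering (label -1 marks noise points)
    labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(reduced)
    topics = {}
    for label in set(labels) - {-1}:
        # The cluster centroid, computed back in the original embedding
        # space, is taken as the topic's representative vector
        centroid = doc_vectors[labels == label].mean(axis=0)
        # Step 4: the K words nearest to the centroid are the topic keywords
        sims = (word_vectors @ centroid) / (
            np.linalg.norm(word_vectors, axis=1) * np.linalg.norm(centroid))
        topics[label] = [vocab[i] for i in np.argsort(-sims)[:top_k]]
    return topics
```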

This was just one simple real-world use-case of topic modeling techniques that I have personally faced. There are plenty of other use-cases that every NLP engineer will run into at least once in their career.

Thank you for reading.

If you are interested in an in-depth look at topic modeling algorithms, you can read the papers referenced above.