BERTopic

BERTopic is a topic modeling approach and open-source Python library designed to discover topics in large collections of text. It combines modern neural sentence embeddings with clustering to produce coherent topics, even from short texts where classical bag-of-words topic models often struggle.

At its core, BERTopic transforms documents into dense vector embeddings using transformer models (for example, sentence-transformers). It then reduces the dimensionality of the embeddings with UMAP and clusters them with HDBSCAN to form topic groups. For each cluster, it derives a representative term list using c-TF-IDF, a class-based adaptation of TF-IDF weighting, to produce human-interpretable topic labels.

The workflow typically involves computing embeddings for documents, reducing their dimensionality and clustering them, extracting topic representations, and assigning documents to topics. The library offers utilities for labeling topics, checking topic coherence, and visualizing relationships among topics and documents.

BERTopic is language-agnostic when used with suitable multilingual transformer models, and it integrates with common Python data science stacks. It is designed to scale to large corpora and to provide interpretable topics without extensive preprocessing.

Applications include analyzing large text collections in business, research, journalism, and social media to discover themes, track topics over time, or summarize content.
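The c-TF-IDF weighting mentioned above can be sketched with the standard library alone. This is a simplified illustration of the class-based idea (concatenate each cluster's documents into one pseudo-document, then weight each term by its in-cluster frequency scaled by its rarity across clusters), not the library's exact implementation; the function name `ctfidf` and the whitespace tokenizer are illustrative assumptions.

```python
import math
from collections import Counter

def ctfidf(clusters):
    """Simplified class-based TF-IDF over pre-clustered documents.

    `clusters` maps a cluster id to its list of documents. Each term t in
    cluster c is scored as tf_{t,c} * log(1 + A / f_t), where A is the
    average number of words per cluster and f_t is the frequency of t
    across all clusters.
    """
    # One bag-of-words per cluster, from the concatenated documents.
    tf = {c: Counter(" ".join(docs).lower().split())
          for c, docs in clusters.items()}
    # f_t: total frequency of each term across all clusters.
    f = Counter()
    for counts in tf.values():
        f.update(counts)
    # A: average number of words per cluster.
    a = sum(f.values()) / len(tf)
    return {
        c: {t: freq * math.log(1 + a / f[t]) for t, freq in counts.items()}
        for c, counts in tf.items()
    }

clusters = {
    0: ["the cat sat", "the cat and the dog"],
    1: ["the stocks fell", "the markets and the stocks rallied"],
}
scores = ctfidf(clusters)
# The shared word "the" is down-weighted, so the top term per cluster
# is a cluster-specific word.
top = max(scores[0], key=scores[0].get)  # → "cat"
```

The cross-cluster term frequency `f_t` is what separates this from plain TF-IDF: terms common to every cluster (like "the" here) are penalized, so each cluster's top-scoring terms are the ones that distinguish it.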
Limitations include computational demands and sensitivity to model choice and hyperparameters; topic quality depends on embedding quality and clustering results, and tuning may be required for noisy or highly diverse corpora.
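As an illustration of the "assigning documents to topics" step: in BERTopic itself assignment comes from the clustering model, but the underlying idea of matching a document's embedding to the most similar topic representation can be sketched as a stdlib-only nearest-centroid lookup. The function names and the toy 2-D vectors below are hypothetical; real sentence embeddings have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def assign_topic(embedding, centroids):
    """Return the id of the topic whose centroid is most similar."""
    return max(centroids, key=lambda t: cosine(embedding, centroids[t]))

# Toy 2-D "embeddings": one centroid per topic, plus a new document.
centroids = {0: [1.0, 0.1], 1: [0.0, 1.0]}
doc_embedding = [0.9, 0.2]
topic = assign_topic(doc_embedding, centroids)  # → 0 (nearest centroid)
```

Because cosine similarity ignores vector magnitude, this assignment depends only on the direction of the embeddings, which is the usual convention when comparing sentence embeddings.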