gensim

Gensim is an open-source Python library for topic modeling, document similarity, and vector space modeling. It is designed to process large text corpora using memory-efficient streaming and incremental algorithms, enabling scalable exploration of semantic structure in text.
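
The streaming idea mentioned above can be illustrated without Gensim at all. The following is a minimal standard-library sketch, not Gensim's own implementation: the helper names `stream_tokens` and `count_document_frequencies` are hypothetical, and whitespace tokenization is assumed for simplicity. It shows how a statistic such as document frequency can be accumulated in a single pass while holding only one document in memory at a time.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def stream_tokens(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one tokenized document at a time (hypothetical helper).

    `lines` can be any iterable, for example an open file handle, so the
    full collection never has to be loaded into memory.
    """
    for line in lines:
        yield line.lower().split()


def count_document_frequencies(
    docs: Iterable[List[str]],
) -> Tuple[Counter, int]:
    """Count, in one pass, how many documents each token appears in."""
    doc_freq: Counter = Counter()
    num_docs = 0
    for tokens in docs:
        # set() ensures each token is counted once per document.
        doc_freq.update(set(tokens))
        num_docs += 1
    return doc_freq, num_docs


# Usage: a small in-memory list stands in for a large file on disk.
sample_lines = ["Human computer interaction", "Computer system response time"]
doc_freq, num_docs = count_document_frequencies(stream_tokens(sample_lines))
```

Because both functions work on plain iterables, the same code would run unchanged over a file opened with `open(path)`, which is the property that makes a streaming design scale to large collections.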

The library provides implementations of core topic modeling algorithms such as Latent Dirichlet Allocation (LDA), Latent Semantic Indexing (LSI), and related probabilistic models, as well as neural and statistical word embedding models such as Word2Vec, Doc2Vec, and FastText. It also contains components for creating and manipulating text corpora, dictionaries (Dictionary), and corpus formats (MmCorpus), plus transformation tools like TF-IDF, and similarity queries.

Gensim supports offline and online training, including multi-core parallelization via LdaMulticore, and streaming interfaces that process data in chunks rather than loading everything into memory. Its emphasis on memory efficiency makes it suitable for very large collections and natural language processing workflows used in research and industry.

The project was started by Radim Řehůřek and is maintained by a community of contributors. It is free and open-source software, released under the LGPL, and is available from the Python Package Index (PyPI) and the project's GitHub repository, with extensive documentation and tutorials.