PhoB

PhoB is a family of pre-trained language models for Vietnamese developed by VinAI Research. Built on the Transformer encoder architecture, PhoB models aim to provide language-specific representations that improve natural language understanding for Vietnamese compared with multilingual models trained on mixed languages.
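
As context for the architecture mentioned above: a Transformer encoder is built around scaled dot-product self-attention, in which every token's representation is recomputed as a weighted mix of all other tokens'. A minimal single-head sketch in NumPy follows; the dimensions and random weights are arbitrary illustrations, not PhoB's actual configuration.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a sequence.

    x: (n, d) token representations; wq/wk/wv: (d, d) projections.
    Each output row is a weighted combination of all value vectors,
    which is how every token's representation becomes context-dependent.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (n, n) similarities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
n, d = 4, 8                                          # toy sequence length / width
x = rng.normal(size=(n, d))
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(x, wq, wk, wv)                  # shape (4, 8)
```

In a full encoder this operation is repeated across multiple heads and layers, interleaved with feed-forward sublayers.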

The core of the PhoB family is PhoBERT, which includes base and large configurations. These models are pre-trained on large-scale Vietnamese corpora collected from diverse sources, using a subword tokenization scheme and a masked language modeling objective to learn contextual representations. They are released with pre-trained weights and tooling compatible with common deep learning frameworks, making them accessible for research and development. They have been evaluated on a range of tasks, including sentiment analysis, named entity recognition, part-of-speech tagging, and text classification, where they typically achieve strong performance relative to baseline models and multilingual alternatives. The approach has contributed to improvements in both academic research and practical deployments for Vietnamese language processing.

PhoB models are intended to serve as strong starting points for fine-tuning on downstream Vietnamese NLP tasks.

The project is maintained by VinAI Research and collaborators, with code and pre-trained weights widely used in the Vietnamese NLP community.
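
The masked language modeling objective described above can be sketched in a few lines: some tokens (in practice, subword units produced by the tokenizer) are randomly hidden, and the model is trained to recover them from the surrounding context. The `<mask>` token name and 15% masking rate below follow the common BERT-style recipe and are illustrative assumptions, not details confirmed by this page.

```python
import random

def mask_for_mlm(tokens, mask_token="<mask>", mask_prob=0.15, seed=1):
    """Corrupt a token sequence for masked language modeling.

    Returns (corrupted, labels): labels hold the original token at
    masked positions and None elsewhere, so the training loss is
    computed only where the model must reconstruct hidden input.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(mask_token)   # hide this token
            labels.append(tok)             # model must predict it
        else:
            corrupted.append(tok)
            labels.append(None)            # no prediction needed here
    return corrupted, labels

tokens = "phở là món ăn truyền thống của Việt Nam".split()
corrupted, labels = mask_for_mlm(tokens)
```

During pre-training, the model's predictions at the masked positions are scored against the labels, which pushes it to learn the contextual representations the text above describes.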
PhoB models are commonly adopted to accelerate development of domain-specific applications, enabling researchers and developers to build higher-quality Vietnamese NLP systems. See also related Vietnamese NLP resources and pre-trained models in the broader language-model ecosystem.