Onehot

One-hot encoding is a method for converting categorical data into a numerical format suitable for machine learning. In one-hot encoding, each category is represented by a binary vector that has the same length as the number of categories in the feature. The vector contains exactly one '1' (the hot position) and all other elements are '0'. The position of the '1' identifies the category.
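
The definition above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the helper name `one_hot` and the `animals` feature are assumptions introduced here, and the category order is simply the order of the list passed in.

```python
def one_hot(category, categories):
    """Return a one-hot list for `category`, given the ordered list `categories`.

    Hypothetical helper for illustration; the vector length equals the
    number of categories and exactly one position is set to 1.
    """
    if category not in categories:
        raise ValueError(f"unknown category: {category!r}")
    hot = categories.index(category)  # position of the '1'
    return [1 if i == hot else 0 for i in range(len(categories))]


# Assumed example feature with three unordered categories.
animals = ["cat", "dog", "bird"]
encoded = {name: one_hot(name, animals) for name in animals}
```

Each vector in `encoded` has length `len(animals)` and sums to 1, so the position of the single 1 is enough to recover the category.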

For example, for a feature with three categories (red, green, blue), red becomes [1, 0, 0], green [0, 1, 0], and blue [0, 0, 1].

One-hot encoded features are non-ordinal; the encoding does not imply any ordering between categories. They are widely used as input features for algorithms that require numerical input, such as linear models, tree-based methods, and neural networks. They are also used in natural language processing to represent words in a vocabulary as distinct tokens, and in classification tasks to denote the target class in a mutually exclusive setting.

Advantages of one-hot encoding include simplicity and little risk of introducing spurious ordinal relationships. Disadvantages include high dimensionality for features with large numbers of categories and sparsity, since most values are zeros. It also does not capture similarity between categories.

To address high cardinality, practitioners may use strategy variants such as embedding representations, target encoding, or hashing tricks. In statistical contexts, the approach is related to dummy coding or indicator variables. Many software packages provide one-hot encoding utilities, such as scikit-learn's OneHotEncoder or pandas get_dummies; it is important to handle unknown categories consistently between training and deployment.
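
One way to keep the handling of unknown categories consistent is to fix the category list at training time and reuse it unchanged at deployment. The sketch below illustrates that idea in plain Python. `SimpleOneHotEncoder` and its `handle_unknown` argument are hypothetical names chosen for this example, not the API of any library; the "ignore" policy, which maps an unseen category to an all-zero vector, is only loosely modelled on the behaviour scikit-learn documents for its own encoder.

```python
class SimpleOneHotEncoder:
    """Hypothetical encoder: learns categories once, reuses them afterwards."""

    def __init__(self, handle_unknown="error"):
        if handle_unknown not in ("error", "ignore"):
            raise ValueError("handle_unknown must be 'error' or 'ignore'")
        self.handle_unknown = handle_unknown
        self.categories_ = None

    def fit(self, values):
        # Sort so the column order does not depend on the order of the data.
        self.categories_ = sorted(set(values))
        self._index = {c: i for i, c in enumerate(self.categories_)}
        return self

    def transform(self, values):
        if self.categories_ is None:
            raise RuntimeError("call fit() before transform()")
        rows = []
        for value in values:
            row = [0] * len(self.categories_)
            position = self._index.get(value)
            if position is not None:
                row[position] = 1
            elif self.handle_unknown == "error":
                raise ValueError(f"unknown category: {value!r}")
            # With 'ignore', an unseen category stays an all-zero row.
            rows.append(row)
        return rows


# Fit on training data, then encode deployment data with the same columns.
train = ["red", "green", "blue", "green"]
encoder = SimpleOneHotEncoder(handle_unknown="ignore").fit(train)
deployed = encoder.transform(["green", "purple"])  # "purple" was never seen
```

Because the columns are fixed by `fit`, the deployment vectors always have the same width as the training vectors; whether an unseen value should raise an error or be encoded as all zeros is a design choice that depends on the application.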