charsets

Charsets, short for character sets, are systems that map characters used in written language to numeric codes. They enable computers to store and transmit text by representing each character as a number, typically a byte or a sequence of bytes.
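
As a small illustration (the sketches on this page use Python, but any language with Unicode support behaves the same way), the built-in ord and chr functions expose this character-to-number mapping directly:

    # Each character has an assigned number; ord looks it up, chr reverses it.
    assert ord("A") == 65          # 'A' maps to 65 in ASCII and Unicode alike
    assert ord("€") == 0x20AC      # '€' maps to 8364 (U+20AC) in Unicode
    assert chr(65) == "A"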

A charset is not the same thing as an encoding, though the two terms are often used interchangeably. An encoding specifies how those numeric codes are converted to bytes and back, while a charset defines the repertoire of characters and their assigned numbers.
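
A short sketch of the distinction: the character 'é' has one assigned number (code point U+00E9), but different encodings turn that number into different byte sequences:

    # One code point, several byte representations.
    assert ord("é") == 0xE9                        # the charset assigns the number
    assert "é".encode("latin-1")   == b"\xe9"      # one byte
    assert "é".encode("utf-8")     == b"\xc3\xa9"  # two bytes
    assert "é".encode("utf-16-be") == b"\x00\xe9"  # two bytes, big-endian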

Unicode is a universal character set that assigns a unique code point to each character across the world's scripts. To store Unicode text, encodings such as UTF-8, UTF-16, or UTF-32 are used. UTF-8 is widely adopted on the web for its ASCII compatibility and efficiency.
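
A sketch of the size trade-offs: the same text takes a different number of bytes under each Unicode encoding, and pure ASCII text is byte-for-byte identical in UTF-8:

    s = "héllo"
    assert len(s.encode("utf-8"))  == 6    # 1 byte per ASCII char, 2 for 'é'
    assert len(s.encode("utf-16")) == 12   # 2 bytes per char + 2-byte BOM
    assert len(s.encode("utf-32")) == 24   # 4 bytes per char + 4-byte BOM

    # ASCII compatibility: ASCII text encodes to the same bytes in UTF-8.
    assert "hello".encode("utf-8") == "hello".encode("ascii")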

Common legacy and regional charsets include ASCII (7-bit), ISO-8859-1 (Latin-1), Windows-1252, Shift JIS, EUC-KR, GB2312, Big5, and KOI8-R. These 8-bit or multi-byte schemes were designed for particular languages or regions and are increasingly supplanted by Unicode.
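
Most runtimes still ship codecs for these charsets. A small sketch of round-tripping text through two of them, using Python's spelling of the codec names:

    # Region-specific charsets handle their own repertoire...
    jp = "日本語".encode("shift_jis")
    ru = "Привет".encode("koi8_r")
    assert jp.decode("shift_jis") == "日本語"
    assert ru.decode("koi8_r") == "Привет"

    # ...but not each other's: Japanese has no codes in KOI8-R.
    try:
        "日本語".encode("koi8_r")
    except UnicodeEncodeError:
        pass    # raised: these characters are outside the charset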

Standards organizations maintain charsets and encodings. Unicode is defined by the Unicode Consortium and ISO/IEC 10646. The IANA registry lists charset names, and software typically selects an encoding via HTTP headers, file metadata, or content-type declarations.
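
For example, an HTTP response might declare Content-Type: text/html; charset=ISO-8859-1, and the receiver must decode the body with the named codec. A minimal sketch of extracting that parameter (the helper name is illustrative, not a standard API):

    def charset_from_content_type(header, default="utf-8"):
        # "text/html; charset=ISO-8859-1" -> "iso-8859-1"
        for param in header.split(";")[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "charset":
                return value.strip().strip('"').lower()
        return default

    assert charset_from_content_type("text/html; charset=ISO-8859-1") == "iso-8859-1"
    assert charset_from_content_type("application/json") == "utf-8"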

Practical issues include mojibake from decoding with the wrong charset, endianness differences in multi-byte encodings, and byte-order marks. Normalization and combining characters can affect text comparison and rendering across systems.
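
Each of these is easy to reproduce; a short sketch:

    import unicodedata

    # Mojibake: UTF-8 bytes decoded with the wrong charset.
    assert "é".encode("utf-8").decode("latin-1") == "Ã©"

    # Endianness: UTF-16 code units can be stored either way round;
    # a byte-order mark (U+FEFF) at the start records which was used.
    assert "A".encode("utf-16-le") == b"A\x00"
    assert "A".encode("utf-16-be") == b"\x00A"
    assert "A".encode("utf-16").startswith((b"\xff\xfe", b"\xfe\xff"))  # BOM first

    # Combining characters: two sequences that render identically
    # compare unequal until normalized to the same form.
    precomposed = "\u00e9"    # é as a single code point
    decomposed  = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT
    assert precomposed != decomposed
    assert unicodedata.normalize("NFC", decomposed) == precomposed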

Best practice is to use Unicode encodings (prefer UTF-8) for new data, declare the encoding in interfaces and documents, and normalize input to a consistent form. This improves interoperability and reduces encoding-related errors.
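
A minimal sketch of that practice, assuming UTF-8 files and NFC as the chosen normalization form (the function names are illustrative):

    import unicodedata

    def read_text(path):
        # Name the encoding explicitly rather than relying on a platform
        # default, and normalize on the way in so comparisons behave
        # consistently.
        with open(path, encoding="utf-8") as f:
            return unicodedata.normalize("NFC", f.read())

    def write_text(path, text):
        with open(path, "w", encoding="utf-8") as f:
            f.write(unicodedata.normalize("NFC", text))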
