UTF

UTF stands for Unicode Transformation Format, a family of encodings for Unicode code points. It translates the characters defined by the Unicode standard into sequences of bytes for storage, transmission, and processing. The most widely used forms are UTF-8, UTF-16, and UTF-32. All three are designed to represent the full range of Unicode code points, up to U+10FFFF, and to interoperate with existing text-processing systems.

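As a quick illustration of how the three forms differ in size, the following sketch (in Python, using its built-in codecs; the "-le" variants omit any byte order mark) encodes a single character in each form:

    ch = "€"                             # U+20AC
    print(ch.encode("utf-8").hex())      # e282ac    -> 3 bytes
    print(ch.encode("utf-16-le").hex())  # ac20      -> one 16-bit unit
    print(ch.encode("utf-32-le").hex())  # ac200000  -> one 32-bit unit
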
UTF-8 is a variable-length encoding that uses one to four bytes per code point. Code points in the ASCII range (U+0000 to U+007F) are encoded as a single byte identical to ASCII. Other code points use multi-byte sequences with distinct leading-bit patterns. UTF-8 is backward compatible with ASCII, does not require a byte order mark, and is the dominant encoding for web content and most modern data formats.

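A minimal sketch of those leading-bit patterns, written here in Python for illustration only (production code should simply call str.encode("utf-8"); this version also skips the check that rejects surrogate code points):

    def utf8_encode(cp: int) -> bytes:
        if cp <= 0x7F:        # 1 byte:  0xxxxxxx (identical to ASCII)
            return bytes([cp])
        if cp <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
            return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        if cp <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        if cp <= 0x10FFFF:    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
        raise ValueError("beyond U+10FFFF")

    assert utf8_encode(ord("A")) == "A".encode("utf-8")             # 1 byte
    assert utf8_encode(0x20AC) == "€".encode("utf-8")               # 3 bytes
    assert utf8_encode(0x1F600) == "\U0001F600".encode("utf-8")     # 4 bytes
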
UTF-16 uses 16-bit code units. Code points in the Basic Multilingual Plane (U+0000 to U+FFFF) fit in one unit; code points above U+FFFF are encoded using a pair of 16-bit units called surrogates. UTF-16 can be encoded in little-endian or big-endian byte order and may use a byte order mark to indicate endianness. It is commonly used in Windows and in some programming environments such as Java.

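The surrogate calculation itself is small; a sketch in Python (real code would use str.encode("utf-16-le") or "utf-16-be" rather than building units by hand):

    def utf16_units(cp: int) -> list[int]:
        if cp <= 0xFFFF:                 # BMP: a single 16-bit unit
            return [cp]
        v = cp - 0x10000                 # 20 remaining bits, split 10/10
        return [0xD800 | (v >> 10),      # high (lead) surrogate
                0xDC00 | (v & 0x3FF)]    # low (trail) surrogate

    print([hex(u) for u in utf16_units(0x1F600)])   # ['0xd83d', '0xde00']
    print("\U0001F600".encode("utf-16-be").hex())   # d83dde00 (big-endian, no BOM)
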
UTF-32 uses fixed 32-bit code units, with each Unicode code point mapped directly to a single 4-byte value. This makes random access simple but results in larger file sizes, so UTF-32 is less common for general text storage. It is used in some internal applications where simple indexing is important.

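The fixed width is what makes indexing trivial: the i-th code point always starts at byte offset 4*i. A small sketch using Python's built-in "utf-32-le" codec (which emits no byte order mark):

    import struct

    text = "café 😀"                          # 6 code points
    data = text.encode("utf-32-le")
    assert len(data) == 4 * len(text)         # exactly 4 bytes per code point

    cp = struct.unpack_from("<I", data, 4 * 5)[0]   # read the 6th code point directly
    print(hex(cp), chr(cp))                         # 0x1f600 😀

    # The same text in UTF-8 is smaller but has no fixed-width indexing.
    print(len(text.encode("utf-8")), len(data))     # 10 24
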
Endianness, normalization, and compatibility considerations influence how UTF forms are implemented in software and protocols. UTF encodings are standardized as part of the Unicode specification and the related ISO/IEC 10646 standard, and are used to exchange most global text across networks, filesystems, and programming languages.

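Two of those considerations are easy to demonstrate with Python's standard library; a sketch showing how endianness and the byte order mark appear in UTF-16 output, and how normalization changes the encoded bytes:

    import codecs
    import unicodedata

    s = "Δ"                                   # U+0394
    print(s.encode("utf-16-le").hex())        # 9403  (little-endian)
    print(s.encode("utf-16-be").hex())        # 0394  (big-endian)
    print(codecs.BOM_UTF16_LE.hex())          # fffe  BOM announcing little-endian
    print(codecs.BOM_UTF16_BE.hex())          # feff  BOM announcing big-endian

    # Normalization: "é" composed vs. "e" + combining acute produce
    # different UTF-8 bytes until normalized to the same form (NFC).
    composed, decomposed = "\u00e9", "e\u0301"
    print(composed.encode("utf-8").hex(), decomposed.encode("utf-8").hex())  # c3a9 65cc81
    print(unicodedata.normalize("NFC", decomposed) == composed)              # True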