UTF-8

UTF-8 is a variable-length character encoding for Unicode that has become the dominant encoding for text on the Internet. It encodes every Unicode code point using one to four bytes, with the first 128 code points identical to ASCII to preserve backward compatibility with existing text.

In UTF-8, ASCII characters use a single byte (0x00 to 0x7F). Multibyte sequences use leading bit patterns: two-byte sequences start with 110, three-byte sequences start with 1110, and four-byte sequences start with 11110, with all continuation bytes beginning 10xxxxxx. This design makes UTF-8 self-synchronizing, allows streaming processing, and avoids the byte order issues that affect fixed-width encodings.

History and usage: UTF-8 was developed in the early 1990s and adopted as part of the Unicode standard. It was designed to be compatible with ASCII, to support all Unicode code points, and to facilitate interchange across systems. It has become the de facto standard encoding for web pages, emails, databases, and programming environments, and is widely supported across platforms and languages.

Advantages and considerations: UTF-8 offers ASCII compatibility, variable length that stores English text compactly, self-synchronization, and no mandatory Byte Order Mark. However, the length of a string in bytes may differ from its number of characters, and validating or sanitizing input is important to detect and handle invalid sequences. Some languages and security contexts require strict validation to prevent certain classes of errors.

See also: Unicode, UTF-16, UTF-32, ASCII.
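The bit patterns above can be made concrete with a short sketch. The `utf8_encode` helper below is hypothetical (not part of any standard library); it encodes a single code point by hand and is checked against Python's built-in encoder:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point to UTF-8 using the leading-bit patterns."""
    if cp <= 0x7F:            # 1 byte:  0xxxxxxx (identical to ASCII)
        return bytes([cp])
    if cp <= 0x7FF:           # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:          # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of Unicode range")

# Matches the built-in encoder for each sequence length:
assert utf8_encode(ord("A")) == "A".encode("utf-8")          # 1 byte
assert utf8_encode(ord("é")) == "é".encode("utf-8")          # 2 bytes
assert utf8_encode(0x20AC) == "€".encode("utf-8")            # 3 bytes
assert utf8_encode(0x1F600) == chr(0x1F600).encode("utf-8")  # 4 bytes
```

Because every continuation byte starts with 10, a decoder that lands in the middle of a sequence can skip forward to the next byte that does not start with 10 and resynchronize — the self-synchronization property noted above.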
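The byte-length versus character-count distinction is easy to demonstrate; the sample string here is arbitrary:

```python
s = "naïve 🌍"               # 7 characters (code points)
b = s.encode("utf-8")        # 'ï' takes 2 bytes, '🌍' takes 4, the rest 1 each

print(len(s))                # number of code points: 7
print(len(b))                # number of bytes: 11
```

Code that mixes up the two — for example, truncating a UTF-8 byte buffer at a fixed byte offset — can split a multibyte sequence and produce exactly the kind of invalid input that validation must catch.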
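A minimal sketch of strict versus lenient validation, using Python's built-in error handlers (the two-byte input is deliberately malformed):

```python
data = b"\xc3\x29"   # 0xC3 announces a 2-byte sequence, but 0x29 is not a continuation byte

# Strict decoding (the default) rejects the input outright:
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)

# A lenient alternative substitutes U+FFFD for the bad byte and continues:
print(data.decode("utf-8", errors="replace"))
```

Strict rejection is the safer default in security-sensitive paths, since silently repaired input can smuggle altered byte sequences past later checks.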