tekenset

Tekenset, or character set, is the collection of characters a system can represent, including letters, digits, symbols, punctuation, and control codes. It defines the repertoire and the numeric codes assigned to each character.
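
To make "numeric codes assigned to each character" concrete, here is a minimal Python sketch. It assumes Unicode as the tekenset, because Python's `str` type exposes Unicode code points through the built-in `ord` and `chr`. The helper name `code_point_label` is hypothetical and used only for illustration.

```python
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character.

    Hypothetical helper for illustration; it relies only on the built-in ord().
    """
    return f"U+{ord(ch):04X}"


# A letter, a symbol, and a control code all have numeric codes.
for ch in ["A", "€", "\n"]:
    print(repr(ch), ord(ch), code_point_label(ch))

# chr() goes the other way: from numeric code back to character.
print(chr(65))  # A
```

The numbers shown are code points only; how they are stored as bytes is a separate question.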

Repertoire versus encoding: a tekenset specifies the characters themselves (the repertoire) and their assigned code points. An encoding maps these code points to bytes for storage or transmission. For example, ASCII defines 128 characters and uses 7-bit codes; ISO/IEC 8859-1 (Latin-1) extends ASCII to 8 bits; Windows-1252 is a common Western European 8-bit tekenset.

Unicode is widely adopted as a universal tekenset and defines a repertoire of over 140,000 characters. Its encoding forms include UTF-8, UTF-16, and UTF-32. UTF-8 is variable-length and ASCII-compatible; UTF-16 uses 16-bit code units and may require a byte order mark (BOM) to indicate endianness. Encoding choice affects interoperability and software behavior across platforms.

Practical implications: font support and rendering depend on the font having glyphs for the included characters; text processing may involve normalization and locale considerations; and selecting an encoding affects data interchange and compatibility with legacy systems. Mismatches between the encoding used to write text and the encoding assumed when reading it can lead to misinterpretation of the text, often referred to as mojibake. In practice, choosing a robust tekenset and encoding, such as Unicode with UTF-8, helps ensure broad compatibility and correct text handling.
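
The points above about encodings, the BOM, mojibake, and normalization can be sketched with Python's standard library. This is an illustrative sketch, not a recommended implementation: the sample string and all variable names are assumptions chosen for the example, and `cp1252` is Python's codec name for Windows-1252.

```python
import codecs
import unicodedata

text = "café"  # 4 code points; "é" is U+00E9

# Same code points, different byte sequences depending on the encoding.
utf8_bytes = text.encode("utf-8")        # "é" takes 2 bytes
latin1_bytes = text.encode("latin-1")    # "é" takes 1 byte (0xE9)
utf16_bytes = text.encode("utf-16")      # generic codec prepends a BOM
utf32_bytes = text.encode("utf-32-le")   # explicit byte order, so no BOM

print(len(text), len(utf8_bytes), len(latin1_bytes),
      len(utf16_bytes), len(utf32_bytes))  # 4 5 4 10 16

# UTF-8 is ASCII-compatible: pure ASCII text yields identical bytes.
ascii_compatible = "tekenset".encode("utf-8") == "tekenset".encode("ascii")

# The BOM at the start of the UTF-16 output signals the byte order.
has_bom = utf16_bytes[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Mojibake: UTF-8 bytes read back under the wrong assumption (Windows-1252).
mojibake = utf8_bytes.decode("cp1252")
print(mojibake)  # cafÃ©

# Here the wrong decoding happened to be lossless, so it can be reversed.
repaired = mojibake.encode("cp1252").decode("utf-8")

# Normalization: "e" + combining acute accent looks the same as "é",
# but only compares equal after normalizing (NFC composes the pair).
decomposed = "cafe\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(decomposed == text, composed == text)  # False True
```

The repair step works only because every byte in this example maps to a character in Windows-1252; in general, a wrong decoding can lose information, which is one reason to agree on UTF-8 up front.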