Lexing

Lexing, short for lexical analysis, is the process of converting a stream of characters into a stream of tokens, the basic units used by a compiler or interpreter to understand a programming language or data format. It is usually the first phase in language processing and operates before parsing. A lexer reads source text, matches character sequences against a set of token patterns, and emits tokens that carry a type and, when relevant, a value such as an identifier name or numeric literal.

Token kinds typically include keywords, identifiers, literals (numbers, strings), operators, punctuation, and sometimes comments or whitespace.
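
As a rough illustration of what a token carries, one common shape is a small record holding the token's kind, the matched text, and an optional decoded value; the names below are hypothetical rather than taken from any particular tool.

```python
from dataclasses import dataclass
from enum import Enum, auto


class TokenKind(Enum):
    KEYWORD = auto()
    IDENTIFIER = auto()
    NUMBER = auto()
    STRING = auto()
    OPERATOR = auto()
    PUNCTUATION = auto()


@dataclass
class Token:
    kind: TokenKind        # which category the lexeme falls into
    text: str              # the exact characters matched in the source
    value: object = None   # e.g. the numeric value of a number literal


# A line such as `count = 42` might be emitted as:
#   Token(TokenKind.IDENTIFIER, "count"), Token(TokenKind.OPERATOR, "="),
#   Token(TokenKind.NUMBER, "42", 42)
```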

Two key principles guide lexing: the longest-match rule, which selects the token that matches the most characters at the current position, and pattern priority, which resolves conflicts when multiple patterns could apply. If no pattern matches at the current input character, a lexical error is reported. Distinguishing keywords from identifiers is another common task: an identifier that matches a reserved word is treated as the corresponding keyword.
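
A minimal hand-written sketch of these rules (the pattern set, reserved words, and function names are assumptions for illustration): every pattern is tried at the current position, the longest match wins, ties go to the pattern listed first, identifiers are checked against a reserved-word table, and an unmatched character is a lexical error.

```python
import re

# Patterns listed in priority order: when two matches are the same length,
# the earlier pattern wins.
PATTERNS = [
    ("NUMBER",     re.compile(r"\d+")),
    ("IDENTIFIER", re.compile(r"[A-Za-z_]\w*")),
    ("OPERATOR",   re.compile(r"<=|>=|==|[+\-*/<>=]")),
]
KEYWORDS = {"if", "else", "while", "return"}  # hypothetical reserved words


def next_token(source, pos):
    best = None
    for kind, pattern in PATTERNS:
        m = pattern.match(source, pos)
        # Longest-match rule: keep the candidate that consumes the most characters.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (kind, m.group())
    if best is None:
        raise SyntaxError(f"lexical error: unexpected {source[pos]!r} at offset {pos}")
    kind, text = best
    # An identifier that matches a reserved word is reported as that keyword.
    if kind == "IDENTIFIER" and text in KEYWORDS:
        kind = "KEYWORD"
    return kind, text, pos + len(text)
```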

Whitespace and comments are often discarded, though some language designs emit them as tokens for specific tooling needs. Determining token boundaries relies on pattern matching, commonly implemented with regular expressions and a finite automaton.
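
Continuing the sketch above (the `#` comment syntax is an assumption), whitespace and comments can be matched like any other pattern and simply dropped rather than emitted; a generated lexer would instead compile all of the rules into a single finite automaton rather than trying each regular expression in turn.

```python
import re

SKIP = re.compile(r"(?:\s+|#[^\n]*)+")   # whitespace and `#` line comments


def tokenize(source):
    tokens, pos = [], 0
    while pos < len(source):
        skipped = SKIP.match(source, pos)
        if skipped:                       # discard rather than emit
            pos = skipped.end()
            continue
        kind, text, pos = next_token(source, pos)   # from the sketch above
        tokens.append((kind, text))
    return tokens


# tokenize("while x <= 10  # loop bound") yields
# [('KEYWORD', 'while'), ('IDENTIFIER', 'x'), ('OPERATOR', '<='), ('NUMBER', '10')]
```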

The lexer outputs a sequence of tokens to the parser, enabling syntactic analysis without direct access to the raw text. Lexers also track metadata such as line and column numbers to aid error reporting. Tools such as Lex, Flex, or ANTLR can generate lexers from lexical specifications, though many language implementations use hand-written scanners instead.
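
Line and column tracking can be layered onto the same sketch: each emitted token records the position where it starts, so later phases can point error messages back at the original source. The function name and tuple shapes below are illustrative, not a standard API.

```python
def tokenize_with_positions(source):
    tokens, pos, line, col = [], 0, 1, 1
    while pos < len(source):
        skipped = SKIP.match(source, pos)
        if skipped:                                  # skipped text still advances positions
            text, end = skipped.group(), skipped.end()
        else:
            kind, text, end = next_token(source, pos)
            tokens.append((kind, text, line, col))   # position where the token starts
        # Advance the line and column counters over the consumed characters.
        if "\n" in text:
            line += text.count("\n")
            col = len(text) - text.rfind("\n")
        else:
            col += len(text)
        pos = end
    return tokens
```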