Word breaking

Word breaking is the process of determining where a line of text may be broken so that content wraps onto a new line. It identifies permissible break points between words, between characters, and in some cases within words. In languages that separate words with spaces, line breaks typically occur at whitespace or after certain punctuation, such as a hyphen. In languages written without spaces between words, such as Chinese or Japanese, a line may generally break between most characters, subject to the script's typographic rules, for example rules that keep closing punctuation from starting a line. Hyphenation is related but distinct: it breaks a word at a syllable or morpheme boundary, usually inserting a hyphen at the end of the line. Invisible characters can also influence break points: a soft hyphen (U+00AD) marks an optional hyphenation point that is shown only if the line breaks there, and a zero-width joiner (U+200D) keeps the characters it joins together.
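As a rough illustration of permissible break points, the sketch below scans a string and reports where a line could end. It is a deliberate simplification, not an implementation of any standard: the function name `break_opportunities` is hypothetical, and it only assumes that a break is allowed after an ordinary space or tab and after a soft hyphen, while a no-break space (U+00A0) never allows one.

```python
SOFT_HYPHEN = "\u00ad"         # optional hyphenation point, normally invisible
BREAKING_SPACES = {" ", "\t"}  # U+00A0 (no-break space) is deliberately absent


def break_opportunities(text: str) -> list[int]:
    """Return every index i such that a line may end just before text[i].

    Simplified rules (an assumption for illustration only):
    - a break is allowed after a run of ordinary spaces or tabs
    - a break is allowed after a soft hyphen
    """
    positions = []
    for i in range(1, len(text)):
        prev = text[i - 1]
        if prev in BREAKING_SPACES and text[i] not in BREAKING_SPACES:
            positions.append(i)
        elif prev == SOFT_HYPHEN:
            positions.append(i)
    return positions


if __name__ == "__main__":
    sample = "hy\u00adphen ok"
    print(break_opportunities(sample))
```

A production engine would apply far more rules than this, including punctuation classes and script-specific behavior, but the shape of the result is the same: a list of positions from which a wrapping strategy then chooses.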

Break rules are implemented by line-breaking algorithms. The Unicode Line Breaking Algorithm (UAX #14) defines where breaks may occur when rendering text, while UAX #29 defines word boundaries for text segmentation, which is useful for tasks such as natural language processing. Common wrapping approaches include greedy wrapping, which fits as many words as possible on the current line and breaks at the last permissible point before the line would overflow, and optimal wrapping, which minimizes raggedness across the whole paragraph using dynamic programming. Font metrics, writing direction, and script-specific rules all affect these decisions. For display, engines also consider hard versus soft breaks, the presence of non-breaking spaces, and hyphenation dictionaries or patterns. CJK line breaking uses character-based rules that differ from those for alphabetic scripts.

In practice, word breaking affects text layout in word processors, web browsers, and typesetting systems. It influences readability, justification, and accessibility. Developers may tune behavior with typographic settings or Unicode properties, and designers must account for multilingual content and scripts with different breaking conventions.
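The greedy and optimal wrapping strategies mentioned above can be compared with a small sketch. Several things here are assumptions made for illustration: the function names `greedy_wrap` and `optimal_wrap` are hypothetical, width is measured in characters rather than with font metrics, breaks are only allowed between words, and "raggedness" is taken to be the sum of squared trailing space on every line except the last.

```python
def greedy_wrap(words: list[str], width: int) -> list[str]:
    """Fill each line with as many words as fit, then start a new line."""
    lines, current = [], ""
    for word in words:
        candidate = word if not current else current + " " + word
        # An overlong word is still placed alone on its own line.
        if len(candidate) <= width or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


def optimal_wrap(words: list[str], width: int) -> list[str]:
    """Choose breaks that minimize total squared slack (dynamic programming)."""
    n = len(words)
    best = [float("inf")] * (n + 1)  # best[i]: minimal cost of words[i:]
    next_break = [n] * (n + 1)       # next_break[i]: start of the following line
    best[n] = 0
    for i in range(n - 1, -1, -1):
        length = -1  # cancels the space counted before the first word
        for j in range(i, n):
            length += 1 + len(words[j])
            if length > width and j > i:
                break  # words[i..j] no longer fit on one line
            slack = max(width - length, 0)
            cost = 0 if j == n - 1 else slack ** 2  # last line is not penalized
            if cost + best[j + 1] < best[i]:
                best[i] = cost + best[j + 1]
                next_break[i] = j + 1
    lines, i = [], 0
    while i < n:
        j = next_break[i]
        lines.append(" ".join(words[i:j]))
        i = j
    return lines


if __name__ == "__main__":
    words = "aaa bb cc ddddd".split()
    print(greedy_wrap(words, 6))   # ['aaa bb', 'cc', 'ddddd']
    print(optimal_wrap(words, 6))  # ['aaa', 'bb cc', 'ddddd']
```

In this example the greedy result leaves a very short middle line, while the dynamic-programming version accepts a shorter first line to even out the paragraph. The trade-off is cost: the greedy pass is linear in the number of words, whereas this formulation examines every feasible line for each starting word.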