surfaceform
Surface form, also written as surfaceform, is a term used in linguistics and natural language processing for the exact textual representation of a word or phrase as it appears in running text. It is distinguished from the lemma or canonical form, which is the base or dictionary form; for instance, "ran" and "running" are surface forms of the lemma "run". A surface form includes capitalization, diacritics, punctuation, hyphenation, and spacing, and can cover single words or multiword expressions. For example, "New York" and "new york" are two surface forms of the same underlying entity, and "Barack Obama" is a surface form that may map to different representations in different contexts.
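As a rough illustration of how several surface forms can be grouped under one comparison key, the following Python sketch applies Unicode normalization, case folding, and whitespace collapsing. The function names normalize_surface_form and group_surface_forms are hypothetical, and the choice of NFKC plus case folding is an assumption made for this example; real systems may need different or stricter rules (for instance, keeping case when it distinguishes entities).

```python
import unicodedata


def normalize_surface_form(text: str) -> str:
    """Map a surface form to a comparison key (hypothetical helper)."""
    # NFKC unifies composed/decomposed diacritics and compatibility
    # characters such as the no-break space.
    text = unicodedata.normalize("NFKC", text)
    # Case folding removes capitalization differences.
    text = text.casefold()
    # Collapse runs of whitespace into single spaces and trim the ends.
    return " ".join(text.split())


def group_surface_forms(forms):
    """Group surface forms that share the same normalized key."""
    groups = {}
    for form in forms:
        groups.setdefault(normalize_surface_form(form), []).append(form)
    return groups


forms = ["New York", "new york", "NEW  YORK", "New\u00a0York"]
groups = group_surface_forms(forms)
print(groups)
```

Here all four inputs fall under the single key "new york", while the original strings are kept as the values so that the surface forms themselves are not lost. The key is only a device for comparison; it is not a lemma, since no morphological analysis is performed.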
The concept is central to many text processing tasks such as normalization, tokenization, and named entity
Limitations include ambiguity, linguistic variation, and encoding issues that affect whether two surface forms should be