rareword
Rareword is a descriptive label used in linguistics, particularly corpus linguistics, for lexical items that occur with very low frequency in a given language sample. It is not a formal category; it is applied in studies of vocabulary size, lexical diversity, and language processing. The term is broader than hapax legomena, which are words that appear only once in a sample and so form a special case of rarewords.
Measurement of rarity relies on frequency data from corpora. Common approaches define rareword by absolute counts
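As an illustration of a count-based definition, the following minimal Python sketch flags word types whose absolute count in a token list falls at or below a cutoff. The function name rare_words, the parameter max_count, and the sample sentence are hypothetical choices made for this example; no particular cutoff is standard, and a cutoff of 1 simply recovers the hapax legomena.

```python
from collections import Counter

def rare_words(tokens, max_count=1):
    """Return the word types whose absolute count is at most max_count.

    max_count is an illustrative cutoff, not an established standard.
    """
    counts = Counter(tokens)
    return {word for word, n in counts.items() if n <= max_count}

# Toy sample; a real study would use a large corpus.
tokens = "the cat sat on the mat while the dog sat nearby".split()

hapaxes = rare_words(tokens, max_count=1)   # words seen exactly once
low_freq = rare_words(tokens, max_count=2)  # a looser, broader cutoff
```

With a cutoff of 2 the result also includes "sat", which occurs twice, showing how the rareword set grows beyond the hapax legomena as the threshold is relaxed.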
Identification typically requires preprocessing steps such as tokenization, normalization, and lemmatization. Researchers use large, representative corpora
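The preprocessing steps can be sketched in Python as below. This is only an assumption-laden illustration: the function name preprocess and the regular expression are hypothetical, normalization is reduced to Unicode NFKC folding plus lowercasing, and lemmatization is left out because it normally depends on an external lexicon or library.

```python
import re
import unicodedata

# Runs of letters, optionally joined by an internal apostrophe or hyphen.
# This pattern is an illustrative choice, not a standard tokenizer.
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")

def preprocess(text):
    """Normalize and tokenize text; lemmatization is deliberately omitted."""
    normalized = unicodedata.normalize("NFKC", text).lower()
    return WORD_PATTERN.findall(normalized)

tokens = preprocess("The Cat's well-known; the cat sat.")
```

Choices made at this stage affect which items count as rare: without lemmatization, inflected forms such as "cat" and "cat's" are counted as separate types, which tends to inflate the number of low-frequency items.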
Applications include natural language processing and information retrieval, where rarewords can challenge language models, spelling, and
Examples of rarewords depend on the corpus. In general English corpora, long specialised terms like floccinaucinihilipilification