Word frequency
Word frequency refers to how often individual words occur in a body of text or across a collection of texts (a corpus). It is a basic statistic used in linguistics, corpus linguistics, and natural language processing to characterize language use, identify common vocabulary, and support modeling tasks such as search and classification.
Measurement types: absolute frequency is the raw count f(w) of a word in a corpus; relative frequency
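As an illustration of the two measures, the following minimal Python sketch counts words in a short string. It assumes relative frequency is the raw count f(w) divided by the total number of tokens N, which is the usual definition; the regex tokenizer and the function names tokenize and word_frequencies are hypothetical choices made for this example, not part of any standard.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract runs of letters/apostrophes.

    This is a deliberately simple tokenizer (an assumption for this sketch);
    real corpora usually need more careful tokenization.
    """
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Return (absolute counts, relative frequencies) for the words in text."""
    tokens = tokenize(text)
    counts = Counter(tokens)   # absolute frequency f(w)
    total = len(tokens)        # corpus size N in tokens
    # Relative frequency f(w) / N; empty input yields an empty mapping.
    relative = {w: c / total for w, c in counts.items()} if total else {}
    return counts, relative


counts, relative = word_frequencies("The cat sat on the mat. The mat was flat.")
print(counts.most_common(2))
print(relative["the"])
```

Relative frequencies make corpora of different sizes comparable, which raw counts alone do not.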
Word frequency distributions tend to follow Zipf's law: a small set of words accounts for a large
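A rank-frequency table is the usual starting point for inspecting this kind of distribution. The sketch below ranks words by descending count and reports what share of all tokens the top k words cover. The function names rank_frequency and coverage are hypothetical, and the toy sentence is far too small to exhibit Zipf's law; it only shows the mechanics.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return a list of (rank, word, count), most frequent word first.

    Ties are broken alphabetically so the output is deterministic.
    """
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]


def coverage(tokens, k):
    """Fraction of all tokens accounted for by the k most frequent words."""
    total = len(tokens)
    if total == 0:
        return 0.0
    table = rank_frequency(tokens)
    return sum(freq for _, _, freq in table[:k]) / total


tokens = "the cat and the dog and the bird saw the fish".split()
table = rank_frequency(tokens)
for rank, word, freq in table[:3]:
    print(rank, word, freq)
print(coverage(tokens, 2))
```

On a real corpus, plotting count against rank on log-log axes is a common way to check how closely the data follow a Zipf-like pattern.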
Applications: building lexicons and stop word lists; informing language models; improving information retrieval, text classification, and
Data sources and preprocessing: frequency data come from corpora such as general-purpose corpora (e.g., COCA), web
Tools: common software libraries include Python-based NLTK, spaCy, gensim, and scikit-learn; R packages such as quanteda