vocabsize
Vocabsize (vocabulary size) refers either to the number of distinct tokens a language model can represent or to the number of distinct token types in a text corpus after preprocessing. The count depends on tokenization and normalization choices, and on whether punctuation, digits, and special tokens are included. In corpus linguistics, vocabsize serves as a measure of lexical diversity; it typically grows sublinearly with corpus size, a pattern described by Heaps' law.
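As a sketch of that growth pattern, Heaps' law models vocabulary size as a power function of corpus length (the symbols follow the usual statement of the law; they are not defined in the original text):

```latex
% Heaps' law: vocabulary size V as a function of corpus length n (in tokens).
% K and \beta are corpus-dependent constants; \beta < 1 gives sublinear growth.
V(n) = K \, n^{\beta}, \qquad 0 < \beta < 1
```

For English text, β is commonly reported in the range 0.4 to 0.6, so doubling the corpus yields well under twice as many distinct word types.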
In machine learning, vocabsize is often a model parameter that determines the number of embedding vectors and the size of the output layer, since the model must produce one logit per vocabulary entry.
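A minimal sketch of this model-side role, assuming PyTorch (the variable names and sizes here are illustrative, not from the original text):

```python
import torch
import torch.nn as nn

vocab_size = 32_000   # number of distinct tokens the model can represent
embed_dim = 512       # dimensionality of each token embedding

# Input side: one embedding vector per vocabulary entry
# (vocab_size x embed_dim parameters).
embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embed_dim)

# Output side: a projection back onto the vocabulary, producing one
# logit per token (embed_dim x vocab_size weights).
output_head = nn.Linear(embed_dim, vocab_size)

token_ids = torch.tensor([[1, 42, 7]])   # a batch of token indices
hidden = embedding(token_ids)            # shape: (1, 3, embed_dim)
logits = output_head(hidden)             # shape: (1, 3, vocab_size)
```

Because both layers scale linearly with vocab_size, enlarging the vocabulary directly increases parameter count and memory use.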
Counting rules vary: some pipelines include punctuation; others remove it. Special tokens like <PAD>, <UNK>, and <EOS> may or may not be counted as part of the vocabulary, depending on convention.
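A minimal sketch of how these counting rules change the result, using a deliberately naive regex tokenizer (the example text and special-token list are illustrative):

```python
import re

text = "The cat sat. The dog sat, too!"

# Rule A: keep punctuation marks as separate tokens.
tokens_with_punct = re.findall(r"\w+|[^\w\s]", text.lower())

# Rule B: strip punctuation entirely.
tokens_no_punct = re.findall(r"\w+", text.lower())

print(len(set(tokens_with_punct)))  # 8: punctuation inflates the count
print(len(set(tokens_no_punct)))    # 5: word types only

# Whether special tokens count is another convention choice.
special_tokens = ["<PAD>", "<UNK>", "<EOS>"]
vocab = set(tokens_no_punct) | set(special_tokens)
print(len(vocab))                   # 8: word types plus special tokens
```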
Vocabsize is context-dependent and shaped by preprocessing choices. It is a practical parameter rather than an intrinsic property of a language or corpus.