Tokeniser
A tokeniser, also known as a lexical analyser or scanner, is a fundamental component in computer science, particularly in the fields of compilers and natural language processing. Its primary function is to break down a stream of characters, typically source code or text, into smaller, meaningful units called tokens. These tokens represent the basic building blocks of the input and are essential for subsequent processing stages.
In the context of programming languages, a tokeniser identifies elements such as keywords (e.g., `if`, `while`), identifiers, operators, numeric and string literals, and punctuation such as parentheses and semicolons. Whitespace and comments are usually discarded, and each token is typically recorded together with its type and the text (lexeme) it was produced from.
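As an illustration, the following is a minimal sketch of such a tokeniser in Python, built on a single regular expression with named groups. The token categories and the tiny keyword set are assumptions chosen for the example, not the grammar of any particular language.

```python
import re

# Token specification: (token type, regular expression). The categories are
# illustrative only; the order matters (e.g., KEYWORD must precede IDENT).
TOKEN_SPEC = [
    ("NUMBER",   r"\d+(\.\d+)?"),       # integer or decimal literal
    ("KEYWORD",  r"\b(?:if|while)\b"),  # example keywords (assumed set)
    ("IDENT",    r"[A-Za-z_]\w*"),      # identifiers
    ("OP",       r"[+\-*/=<>!]=?"),     # operators
    ("PUNCT",    r"[(){};,]"),          # punctuation
    ("SKIP",     r"[ \t\n]+"),          # whitespace, discarded
    ("MISMATCH", r"."),                 # anything else is an error
]

MASTER_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenise(source: str):
    """Yield (type, lexeme) pairs for the input string."""
    for match in MASTER_RE.finditer(source):
        kind = match.lastgroup
        text = match.group()
        if kind == "SKIP":
            continue                    # drop whitespace
        if kind == "MISMATCH":
            raise SyntaxError(f"Unexpected character: {text!r}")
        yield (kind, text)

if __name__ == "__main__":
    for token in tokenise("while (count < 10) { count = count + 1; }"):
        print(token)
```

Running the sketch on the sample input prints pairs such as `('KEYWORD', 'while')`, `('PUNCT', '(')`, `('IDENT', 'count')`, `('OP', '<')`, and `('NUMBER', '10')`, which are exactly the units a parser would consume next.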
In natural language processing, tokenisation involves splitting text into words, punctuation marks, or other significant elements.
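A simple word-level tokenisation might look like the sketch below, which uses a regular expression to separate words from punctuation; real NLP toolkits apply more sophisticated rules or subword schemes, so this is only a rough approximation.

```python
import re

def word_tokenise(text: str):
    """Split text into word and punctuation tokens (a deliberately simple rule)."""
    # \w+ matches runs of letters, digits, and underscores; [^\w\s] matches a
    # single punctuation character; whitespace serves only as a separator.
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenise("Tokenisers split text; they don't understand it!"))
# ['Tokenisers', 'split', 'text', ';', 'they', 'don', "'", 't', 'understand', 'it', '!']
```

Note how the apostrophe in "don't" is split into three tokens, which illustrates why practical tokenisers need additional rules for contractions, hyphenation, and similar cases.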