TextCount
TextCount is a software project that provides tools for counting text elements in strings, files, or streams. It is designed to offer Unicode-aware counting across multiple dimensions, making it useful for developers, researchers, and data scientists who need precise text length measurements for validation, analytics, or preprocessing.
TextCount focuses on measuring characters, bytes, words, sentences, and tokens. It emphasizes correctness with Unicode text. Its main features include:
- Counts for characters, bytes, words, sentences, and tokens
- Unicode-aware processing with normalization and grapheme cluster support
- Locale- and rule-aware word boundaries
- Streaming and batch processing modes
- Pluggable tokenizers and bindings for multiple programming languages (such as Python, JavaScript, Java, and Go)
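The counting dimensions listed above can give different results for the same string. The following is a minimal sketch using only the Python standard library. It does not use TextCount's own API, which is not documented here, and the function name `count_text` and its result keys are hypothetical. The character count is only an approximation that skips combining marks; full grapheme cluster segmentation follows Unicode UAX #29 and covers more cases, such as emoji sequences.

```python
import unicodedata


def count_text(text: str, form: str = "NFC") -> dict:
    """Return several length measurements for `text` (illustrative only)."""
    # Normalize first so that equivalent sequences are counted consistently.
    normalized = unicodedata.normalize(form, text)

    # Code points: what len() reports for a Python str.
    code_points = len(normalized)

    # Bytes: depends on the encoding; UTF-8 is assumed here.
    utf8_bytes = len(normalized.encode("utf-8"))

    # Rough stand-in for user-perceived characters: ignore combining marks.
    # This is an approximation, not full grapheme cluster segmentation.
    approx_chars = sum(
        1 for ch in normalized if unicodedata.combining(ch) == 0
    )

    # Whitespace-delimited words; not locale- or rule-aware.
    words = len(normalized.split())

    return {
        "code_points": code_points,
        "utf8_bytes": utf8_bytes,
        "approx_chars": approx_chars,
        "words": words,
    }


if __name__ == "__main__":
    sample = "cafe\u0301 au lait"  # "e" followed by a combining acute accent
    print(count_text(sample, form="NFC"))
    print(count_text(sample, form="NFD"))
```

With NFC the sample has 12 code points and 13 UTF-8 bytes; with NFD it has 13 code points and 14 bytes. The approximate character count is 12 in both cases, which is why normalization and grapheme awareness matter for length measurements.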
Common uses of TextCount include enforcing input length constraints in APIs and user interfaces.
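As an illustration of the input-length use case, the sketch below checks a string against both a code-point limit and a UTF-8 byte limit. It relies only on the Python standard library; the function name `within_limits` and its parameters are hypothetical and are not taken from TextCount.

```python
def within_limits(text: str, max_chars: int, max_bytes: int) -> bool:
    """Return True if `text` fits both limits.

    max_chars is measured in Unicode code points, max_bytes in UTF-8 bytes.
    """
    return len(text) <= max_chars and len(text.encode("utf-8")) <= max_bytes


if __name__ == "__main__":
    # Five ASCII letters: 5 code points, 5 bytes.
    print(within_limits("hello", max_chars=5, max_bytes=10))
    # Three emoji: 3 code points but 12 UTF-8 bytes, so the byte limit fails.
    print(within_limits("\U0001F44D" * 3, max_chars=5, max_bytes=10))
```

Checking both limits separately matters because a storage layer often constrains bytes while a user interface constrains visible characters.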
The core architecture centers on a tokenizer framework that can be extended with custom rules.
Text processing, tokenization, natural language processing, word count.