dictstr
Dictstr is a conceptual data structure and encoding format that stores strings compactly by exploiting redundancy through a shared dictionary of substrings. A dictstr representation has two parts: a dictionary that maps frequently occurring substrings to compact tokens, and, for each string, a sequence of those tokens interleaved with literals for any text the dictionary does not cover. The dictionary can be built in advance from a corpus of strings (static dictionary) or grow dynamically as new substrings appear (adaptive dictionary). Encoding a string replaces occurrences of dictionary substrings with their tokens and emits the remaining characters as literals. Decoding reverses the process: each token is looked up in the dictionary, each literal is copied through unchanged, and the results are concatenated to reconstruct the original string.
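The encoding and decoding steps described above can be illustrated with a minimal Python sketch. It assumes a static dictionary held as a list of substrings, uses the list index as the token, and picks the longest matching entry at each position (a greedy strategy chosen here for simplicity, not one prescribed by dictstr). The names encode, decode, and Item are hypothetical and serve only this illustration.

```python
from typing import List, Union

# One element of an encoded sequence: an int is a dictionary token
# (an index into the dictionary list), a str is a single literal character.
Item = Union[int, str]


def encode(text: str, dictionary: List[str]) -> List[Item]:
    """Greedy longest-match encoding against a static dictionary (assumed strategy)."""
    # Visit longer entries first so the longest match wins at each position.
    by_length = sorted(range(len(dictionary)), key=lambda i: -len(dictionary[i]))
    out: List[Item] = []
    pos = 0
    while pos < len(text):
        for idx in by_length:
            entry = dictionary[idx]
            # Skip empty entries: they would match without consuming input.
            if entry and text.startswith(entry, pos):
                out.append(idx)
                pos += len(entry)
                break
        else:
            # No dictionary entry matches here: emit one literal character.
            out.append(text[pos])
            pos += 1
    return out


def decode(items: List[Item], dictionary: List[str]) -> str:
    """Expand tokens through the dictionary and copy literals unchanged."""
    return "".join(dictionary[i] if isinstance(i, int) else i for i in items)


if __name__ == "__main__":
    dictionary = ["ERROR ", "connection ", "timeout"]
    encoded = encode("ERROR connection timeout!", dictionary)
    print(encoded)                      # [0, 1, 2, '!']
    print(decode(encoded, dictionary))  # ERROR connection timeout!
```

In this sketch a 25-character string becomes three tokens and one literal. A practical encoder would also need a compact binary layout for tokens and literals, and would likely replace the linear scan over entries with a trie or similar index.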
Dictstr is particularly useful when a collection contains many overlapping substrings, such as log files, natural-language
Performance characteristics vary with dictionary design and data characteristics. Strong redundancy yields high compression, while highly
See also dictionary encoding, Lempel–Ziv, tokenization, and trie-based data structures.