ECFP4
ECFP4, short for Extended-Connectivity Fingerprint with diameter 4, is a molecular fingerprint used to encode chemical structures for similarity searching and machine learning. It represents molecules as a fixed-length vector by encoding circular substructures around each atom up to radius 2 (diameter 4). Each atom-centered neighborhood is hashed into an identifier, and the collection of identifiers is folded into a bit vector or a count-based vector.
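Generation: Starting from every atom, the method iteratively labels environments of increasing radius (0 to 2). At each iteration, an atom's identifier is updated by hashing its previous identifier together with those of its bonded neighbors.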
Usage: ECFP4 is widely used for similarity search with the Tanimoto coefficient and as a feature representation for machine learning models.
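For bit-vector fingerprints, the Tanimoto coefficient is the number of bits set in both fingerprints divided by the number of bits set in either. A minimal sketch follows, assuming each fingerprint is given as the collection of its set bit positions; the function name `tanimoto` and the convention of returning 0.0 for two empty fingerprints are assumptions made here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 0.0  # assumed convention for two empty fingerprints
    return len(a & b) / len(a | b)


# Two shared bits (1 and 5) out of five distinct bits overall.
similarity = tanimoto({1, 5, 9, 12}, {1, 5, 7})  # 2 / 5 = 0.4
```

The value ranges from 0 (no shared bits) to 1 (identical sets of on-bits).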
Variants and software: ECFP4 is part of a family that includes ECFP6 (diameter 6) and related fingerprints.
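Notes: Because fingerprints are hashed, collisions can occur, and the choice of vector length affects both sparsity and the collision rate: shorter vectors are denser and map more distinct substructures to the same position.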
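The effect of vector length on collisions can be seen in a small sketch of folding. It assumes folding is done by taking each identifier modulo the vector length, which is one common approach but not necessarily what a given toolkit does; the function name `fold_to_bits` and the example identifiers are hypothetical.

```python
def fold_to_bits(identifiers, n_bits):
    """Fold integer identifiers into a fixed-length bit vector (modulo folding)."""
    bits = [0] * n_bits
    for ident in identifiers:
        bits[ident % n_bits] = 1  # distinct identifiers may share a position
    return bits


identifiers = [3, 11, 19, 4]  # four distinct, made-up substructure identifiers

on_bits_8 = sum(fold_to_bits(identifiers, 8))    # 3, 11, 19 all land on bit 3
on_bits_16 = sum(fold_to_bits(identifiers, 16))  # 3 and 19 still collide
on_bits_32 = sum(fold_to_bits(identifiers, 32))  # no collisions remain
```

Here four identifiers set 2, 3, and 4 bits at lengths 8, 16, and 32 respectively: the longer vector loses less information but is sparser.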