Clustering
Clustering is a task in unsupervised learning and statistics that groups objects so that those within a cluster are more similar to each other than to objects in other clusters. The goal is to discover structure, patterns, or subgroups without predefined labels.
Clustering relies on similarity or distance and supports numerical, categorical, or mixed data. Algorithms fall into
Partitioning methods assign objects to a predefined number of clusters by optimizing cohesion. Hierarchical clustering builds
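As an illustration of a partitioning method, the sketch below implements k-means (Lloyd's algorithm), which is one common choice and is not named in the text above. It alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its members. The function names, the random initialization from the data, and the stopping rule are assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Minimal k-means sketch: returns (labels, centroids).

    points is a list of equal-length tuples; k is the predefined
    number of clusters. Initial centroids are k distinct data points
    chosen at random (an assumption of this sketch).
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                # Keep the old centroid if a cluster lost all its points.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # Centroids stopped moving, so assignments are stable.
        centroids = new_centroids
    return labels, centroids
```

Calling `kmeans(points, k=2)` on a list of 2-D tuples returns one integer label per point plus the final centroids. Because the result depends on the initial centroids, practical use often repeats the run with several seeds and keeps the partition with the lowest within-cluster sum of squares.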
Model-based clustering assumes data arise from a mixture of distributions; parameters are inferred by methods such
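To make the mixture idea concrete, the sketch below fits a one-dimensional Gaussian mixture with expectation-maximization (EM). EM is a standard way to infer mixture parameters, but the sentence above is cut off in the source, so its use here is an assumption. The function names, the quantile-based initialization, and the variance floor are also choices made for this example only.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, k=2, iterations=50, min_var=1e-6):
    """EM sketch for a 1-D Gaussian mixture: returns (weights, means, variances)."""
    n = len(data)
    ordered = sorted(data)
    # Initialization (assumption): spread the means over data quantiles
    # and start every component with the overall variance.
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    overall = sum(data) / n
    var0 = max(sum((x - overall) ** 2 for x in data) / n, min_var)
    variances = [var0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            if total == 0.0:
                # Numerical underflow: fall back to equal responsibilities.
                resp.append([1.0 / k] * k)
            else:
                resp.append([d / total for d in dens])
        # M-step: re-estimate parameters from the weighted data.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0.0:
                continue  # Leave an empty component unchanged.
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)  # Floor avoids collapse to zero.
    return weights, means, variances
```

A hard clustering can be read off afterwards by assigning each point to the component with the highest responsibility. Unlike a partitioning method, the fitted model also gives a soft membership probability for every point.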
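The choice of distance measure and feature scaling strongly influences results, because features with larger numeric ranges dominate measures such as Euclidean distance unless the data are rescaled. Common metrics include Euclidean, Manhattan, cosine, and Jaccard for binary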
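The metrics listed above can be sketched in a few lines each. The function names are illustrative, and the z-score helper `standardize` is one possible scaling scheme, chosen here as an assumption because the text does not name a specific one.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def jaccard_distance(a, b):
    """Jaccard distance between two binary (0/1) vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    if either == 0:
        return 0.0  # Convention used here: two all-zero vectors are identical.
    return 1.0 - both / either


def standardize(rows):
    """Z-score each column so that no feature dominates by scale alone."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / len(col))
        for col, m in zip(columns, means)
    ]
    # Constant columns (zero standard deviation) are mapped to 0.0.
    return [
        tuple((x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds))
        for row in rows
    ]
```

For example, with rows such as `(1, 100)`, `(2, 200)`, `(3, 300)`, the second feature dominates the raw Euclidean distance purely because of its units. After `standardize`, both features contribute equally.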
Evaluation uses internal indices, such as the silhouette coefficient and Davies-Bouldin index, or external indices like
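As an example of an internal index, the sketch below computes the mean silhouette coefficient. For each point, a is the mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster; the point's score is (b - a) / max(a, b). Euclidean distance, the function name, and the convention that singleton clusters score 0 are assumptions of this example.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # Convention for singleton clusters.
            continue
        # a: mean distance to the other members of the same cluster.
        # The point's distance to itself is 0, so divide by len(own) - 1.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

Values close to 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest that points may have been assigned to the wrong cluster.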
Applications include market segmentation, image and text clustering, bioinformatics, anomaly detection, and social network analysis.
Limitations include choosing the number of clusters, sensitivity to outliers and scaling, difficulties in high-dimensional data,