andmekomplekti
A data set (andmekomplekti) is a structured collection of data records used for analysis, modeling, or reporting. In machine learning, a data set typically consists of samples (rows) and features (columns), with a possible target variable for supervised tasks.
Components include features describing attributes, samples representing instances, and, in supervised contexts, a label column. Metadata
Formats and storage: Datasets are stored in formats such as CSV, JSON, Parquet, or within databases. They
Quality and preprocessing: Datasets may contain missing values, inconsistent entries, or outliers. Preprocessing steps include cleaning,
Usage and ethics: Datasets are used for data analysis, ML model training, and benchmarking. Ethical and legal
Examples and limitations: Public datasets support research and education but can introduce bias if not representative.