Data bias

Data bias refers to systematic distortions in data that affect the analyses, predictions, and decisions derived from it. It can originate in how data are collected, labeled, stored, or processed, and often reflects real-world inequities or measurement errors. Data bias is a property not only of the data itself but of the entire data pipeline, including sampling methods, recording practices, and preprocessing steps. It can act on its own or amplify existing model biases when biased data are used to train or validate algorithms.

Common sources include sampling bias (unrepresentative samples), measurement or labeling bias (inconsistent or subjective labels), historical bias (data reflecting past inequalities), aggregation or feature selection bias, survivorship bias, and recording bias (differences in data capture across groups).
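
As a rough illustration of how sampling bias can be surfaced, the sketch below compares group shares observed in a collected sample against known reference shares; the group labels, sample counts, and reference proportions are hypothetical, and real audits would use the groups and population figures relevant to the dataset at hand.

```python
from collections import Counter

def sampling_bias_report(sample, reference_shares):
    """Compare group shares observed in a sample with known population
    shares to flag groups that are over- or under-represented."""
    counts = Counter(sample)
    total = sum(counts.values())
    report = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total if total else 0.0
        report[group] = {"observed": round(observed, 3),
                         "expected": expected,
                         "gap": round(observed - expected, 3)}
    return report

# Hypothetical survey responses and census-style reference shares.
sample = ["urban"] * 70 + ["suburban"] * 25 + ["rural"] * 5
reference = {"urban": 0.50, "suburban": 0.30, "rural": 0.20}
print(sampling_bias_report(sample, reference))
```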

Detection and evaluation involve data audits, exploratory analysis, and fairness metrics; the use of datasets with diverse representations; and documentation such as datasheets for datasets and data provenance.

Mitigation strategies include improving data collection to be representative, balancing datasets through resampling or reweighting, removing or adjusting biased features, employing debiasing or fair representation learning techniques, and incorporating monitoring and human oversight during deployment.
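
To make the audit and rebalancing steps concrete, here is a minimal sketch that computes a simple demographic parity gap and derives inverse-frequency example weights. The binary outcomes and group labels are hypothetical; fairness audits in practice typically rely on established toolkits and a wider range of metrics.

```python
from collections import Counter

def demographic_parity_difference(labels, groups, positive=1):
    """Gap in positive-outcome rates across groups: one simple
    fairness metric used in data audits."""
    rates = {}
    for g in set(groups):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        rates[g] = sum(labels[i] == positive for i in idx) / len(idx)
    ordered = sorted(rates.values())
    return ordered[-1] - ordered[0], rates

def inverse_frequency_weights(groups):
    """Per-example weights so each group contributes equally overall,
    one straightforward rebalancing strategy."""
    counts = Counter(groups)
    total, n_groups = len(groups), len(counts)
    return [total / (n_groups * counts[g]) for g in groups]

# Hypothetical audit data: binary outcomes and a sensitive group per record.
labels = [1, 1, 1, 0, 1, 0, 0, 1]
groups = ["x", "x", "x", "x", "x", "x", "y", "y"]
gap, rates = demographic_parity_difference(labels, groups)
weights = inverse_frequency_weights(groups)
print(round(gap, 3), rates, weights)
```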

Impact and implications include the potential for unfair or inaccurate predictions in hiring, lending, criminal justice, healthcare, and content moderation, as well as the risk of misinforming research and policy decisions. Addressing data bias requires explicit goals, stakeholder involvement, and ongoing evaluation.

Governance and standards emphasize risk management, transparency, and auditing. Practices such as dataset documentation, model cards, differential privacy, and responsible data collection are commonly recommended, with continuous updating as data distributions evolve.
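
As one way to picture dataset documentation in practice, the sketch below records datasheet-style provenance fields alongside a dataset; the field names and values are illustrative rather than a standard schema, and published templates such as datasheets for datasets cover many more questions.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class DatasetSheet:
    """Minimal datasheet-style record keeping provenance and known
    limitations next to the data itself."""
    name: str
    version: str
    collection_method: str
    time_range: str
    known_gaps: list = field(default_factory=list)
    intended_uses: list = field(default_factory=list)

# Hypothetical dataset description for illustration only.
sheet = DatasetSheet(
    name="loan_applications_sample",
    version="2024-06",
    collection_method="online applications only (may under-represent offline applicants)",
    time_range="2019-2023",
    known_gaps=["no records for branch-only customers"],
    intended_uses=["exploratory analysis", "not for automated credit decisions"],
)
print(json.dumps(asdict(sheet), indent=2))
```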
