dataskew

Dataskew, more commonly called data skew, is a term used in data management and distributed computing to describe an imbalance in how data are distributed across partitions, processes, or time. When data are not evenly distributed, some tasks must handle disproportionately large shares of work while others do little, leading to inefficiency and bottlenecks.

There are two related senses of the term. In statistics, skewness describes asymmetry in a data distribution, measured by a skewness coefficient. In distributed systems and data processing, data skew refers specifically to non-uniform data distribution across computing resources, which can cause hot spots and uneven workload.

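As an illustration of the statistical sense, the coefficient can be computed directly from a sample. The sketch below uses the Fisher-Pearson moment coefficient (third central moment over the cube of the standard deviation); the function name and sample values are illustrative, not taken from this article.

    def sample_skewness(values):
        # Fisher-Pearson moment coefficient: m3 / m2**1.5,
        # where m2 and m3 are the second and third central moments.
        n = len(values)
        mean = sum(values) / n
        m2 = sum((x - mean) ** 2 for x in values) / n
        m3 = sum((x - mean) ** 3 for x in values) / n
        return m3 / (m2 ** 1.5)

    # Most values are small and one is very large, so the coefficient is positive
    # (a long right tail); a roughly symmetric sample would give a value near zero.
    print(sample_skewness([1, 1, 2, 2, 3, 3, 4, 50]))
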
Causes of data skew in practice include uneven key frequencies in partitioning schemes, long-tail or bursty time-series data, skewed join keys, and design choices that favor certain values or ranges. For example, if a partitioning function assigns many records to a small number of partitions, those partitions become bottlenecks.

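To make the partitioning example concrete, the following minimal sketch hash-partitions records by key under a long-tail key distribution; the partition count, key names, and record counts are invented for illustration.

    import random
    from collections import Counter

    NUM_PARTITIONS = 8

    def partition_for(key):
        # Hash partitioning: every record with the same key lands on the same partition.
        return hash(key) % NUM_PARTITIONS

    # Long-tail key frequencies: one hot key accounts for most of the records.
    keys = ["user_0"] * 9000 + [f"user_{random.randint(1, 500)}" for _ in range(1000)]

    sizes = Counter(partition_for(k) for k in keys)
    for p in range(NUM_PARTITIONS):
        print(f"partition {p}: {sizes.get(p, 0)} records")
    # The partition that receives user_0 holds roughly nine times more records than all
    # the others combined, so whichever task processes it becomes the bottleneck.
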
The effects of data skew are increased latency and wasted resources. Some nodes or workers become overloaded while others remain underutilized, leading to longer shuffle times, slower queries, higher memory pressure, and reduced parallelism.

Detection typically involves data sampling and examination of distribution statistics. In a processing system, operators may monitor partition sizes, per-partition throughput, and skew metrics; visualizations and UI dashboards can reveal hotspots and imbalanced workloads.

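One simple skew metric of the kind mentioned above is the ratio of the largest partition to the mean partition size. The sketch below computes it from per-partition record counts; the sizes and the alert threshold are made-up examples, and real systems expose richer per-partition metrics.

    def skew_ratio(partition_sizes):
        # Largest partition relative to the mean; 1.0 means perfectly balanced.
        mean = sum(partition_sizes) / len(partition_sizes)
        return max(partition_sizes) / mean

    sizes = [1200, 1100, 1150, 9800, 1050, 1180, 1120, 1090]  # per-partition record counts (made up)
    ratio = skew_ratio(sizes)
    print(f"skew ratio: {ratio:.2f}")
    if ratio > 2.0:  # example threshold only
        print("one partition dominates; the stage runtime is bound by that partition")
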
Mitigation strategies include changing partitioning schemes (hash versus range), salting keys to distribute load more evenly, performing two-stage aggregations, bucketing data, broadcasting small tables for joins, and dynamically repartitioning or rebalancing data to reduce hotspots. The choice of strategy depends on the workload, data characteristics, and the processing framework.

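As a sketch of key salting combined with a two-stage aggregation, two of the mitigations listed above: each key is spread across several salted sub-keys, partial sums are computed per salted key, and a second pass strips the salt and merges the partials. The salt factor and records are illustrative; in a distributed engine the first stage would run in parallel across partitions, so no single task has to aggregate all records for a hot key.

    import random
    from collections import defaultdict

    SALT_FACTOR = 4  # how many sub-keys each original key is spread over (illustrative)

    records = [("user_0", 1)] * 9000 + [("user_1", 1)] * 500  # one hot key, made-up data

    # Stage 1: append a random salt so the hot key's records split into several groups,
    # which a partitioner would then spread across different workers.
    partial = defaultdict(int)
    for key, value in records:
        salted_key = (key, random.randrange(SALT_FACTOR))
        partial[salted_key] += value

    # Stage 2: strip the salt and merge the partial aggregates into final per-key totals.
    final = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        final[key] += subtotal

    print(dict(final))  # {'user_0': 9000, 'user_1': 500}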