FP16
FP16, or half-precision floating point, is a 16-bit format defined by IEEE 754-2008 (binary16). It uses 1 sign bit, 5 exponent bits, and 10 fraction bits, with a bias of 15. Exponent field values 1–30 denote normal numbers; 0 denotes zero or subnormal, and 31 denotes infinity or NaN. The maximum finite normal value is (2−2^−10)×2^15 (about 6.5504×10^4); the smallest positive normal is 2^−14. Subnormals fill the range below normal numbers, down to about 5.96×10^−8.
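The field layout can be inspected by reinterpreting the 16 bits as an integer. The short sketch below uses NumPy (an assumption, not mentioned above) to extract the three fields and to check the limits quoted above.

```python
import numpy as np

def fp16_fields(x):
    """Return (sign, exponent_field, fraction_field) of a binary16 value."""
    bits = int(np.array(x, dtype=np.float16).view(np.uint16))
    return bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF

print(fp16_fields(1.0))          # (0, 15, 0): stored exponent 15, bias 15 -> 2**0
print(np.finfo(np.float16).max)  # 65504.0 = (2 - 2**-10) * 2**15
print(np.finfo(np.float16).tiny) # ~6.10e-05 = 2**-14, smallest positive normal
print(np.float16(2.0**-24))      # ~5.96e-08, smallest positive subnormal
```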
FP16 reduces memory use and can increase throughput on hardware that supports it, at the cost of reduced precision (an 11-bit effective significand, roughly 3 decimal digits) and a much narrower dynamic range than FP32, which makes overflow, underflow, and accumulated rounding error practical concerns.
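These limits are easy to observe directly; the brief example below (again assuming NumPy) shows rounding above 2048, overflow past 65504, and underflow below the subnormal range.

```python
import numpy as np

# Rounding: with an 11-bit significand, consecutive integers are exact only
# up to 2048; 2049 rounds to the nearest representable value.
print(np.float16(2049))      # 2048.0

# Overflow: the largest finite FP16 value is 65504; anything larger becomes inf.
print(np.float16(70000.0))   # inf

# Underflow: values well below the smallest subnormal (~5.96e-08) round to zero.
print(np.float16(1e-8))      # 0.0
```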
Hardware and software support: Many GPUs provide native FP16 arithmetic, notably NVIDIA GPUs with Tensor Cores, which accelerate FP16 matrix multiply-accumulate operations; AMD and Intel GPUs and recent ARM CPUs also support FP16. On the software side, deep learning frameworks such as PyTorch and TensorFlow expose FP16 tensors and mixed-precision APIs, and compilers provide half-precision types such as __fp16 and _Float16.
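As a rough sketch of how this framework support is typically used, the snippet below (PyTorch, the placeholder model, and the input shape are all assumptions) casts a small model and its input to FP16 for inference when a CUDA device is available.

```python
import torch

model = torch.nn.Linear(1024, 1024)   # stand-in for a real network
x = torch.randn(8, 1024)

if torch.cuda.is_available():
    # Cast weights and inputs to half precision; Tensor Core GPUs execute
    # these matrix multiplies natively in FP16.
    model = model.half().cuda()
    x = x.half().cuda()

with torch.no_grad():
    y = model(x)
print(y.dtype)   # torch.float16 on a CUDA device, torch.float32 otherwise
```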
Applications: FP16 is widely used in deep learning for model inference and training where memory and bandwidth are the limiting factors: halving the storage per value fits more parameters and activations in a given amount of memory and reduces the data moved per operation. In mixed-precision training, most arithmetic runs in FP16 while an FP32 master copy of the weights is maintained and the loss is scaled to keep small gradients from underflowing. FP16 is also used in graphics and image processing, for example in HDR image formats such as OpenEXR.
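A minimal mixed-precision training loop might look like the following sketch (PyTorch and a CUDA device are assumed; the model, data, and hyperparameters are placeholders): the optimizer updates FP32 weights, the forward and backward passes run selected operations in FP16 under autocast, and a gradient scaler applies loss scaling.

```python
import torch

model = torch.nn.Linear(512, 10).cuda()            # placeholder model (requires CUDA)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()                # loss scaling for FP16 gradients
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(100):
    inputs = torch.randn(32, 512, device="cuda")    # placeholder batch
    targets = torch.randint(0, 10, (32,), device="cuda")

    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(inputs), targets)      # forward pass largely in FP16

    scaler.scale(loss).backward()   # backpropagate the scaled loss
    scaler.step(optimizer)          # unscale gradients, then take an FP32 step
    scaler.update()                 # adjust the scale factor for the next step
```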