predictiontime
Predictiontime, or inference latency, is the time from when input data becomes available to when a deployed model outputs its prediction. It is a key metric for the responsiveness of machine learning systems and is distinct from training time, which measures the model development process.
Predictiontime is usually reported as average per-instance latency and as tail metrics such as p95 or p99, the latencies below which 95% or 99% of requests complete.
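As a rough illustration of how these figures might be computed, the sketch below times a prediction function on each input and summarizes the results as a mean, p95, and p99. The names `measure_latencies`, `summarize_latencies`, `percentile`, and `predict` are hypothetical and do not refer to any particular library. The nearest-rank percentile used here is only one of several common conventions, and the timer wraps only the `predict` call, so any preprocessing done elsewhere is not counted.

```python
import math
import time
from typing import Callable, Dict, List, Sequence


def percentile(sorted_values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile of an ascending-sorted sequence, for 0 < q <= 100."""
    if not sorted_values:
        raise ValueError("no samples")
    # Nearest-rank method: the smallest value with at least q% of samples at or below it.
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize_latencies(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Return the mean and the p95/p99 tail latencies, in milliseconds."""
    ordered = sorted(latencies_ms)
    return {
        "mean_ms": sum(ordered) / len(ordered),
        "p95_ms": percentile(ordered, 95),
        "p99_ms": percentile(ordered, 99),
    }


def measure_latencies(
    predict: Callable[[object], object],
    inputs: Sequence[object],
    warmup: int = 10,
) -> List[float]:
    """Time one predict() call per input and return per-instance latencies in ms."""
    # Untimed warm-up calls, so one-off startup costs do not skew the samples.
    for x in inputs[:warmup]:
        predict(x)
    latencies_ms = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


if __name__ == "__main__":
    # Stand-in "model" (an assumption for the demo): sums the squares of a feature vector.
    def toy_predict(features):
        return sum(v * v for v in features)

    samples = [[float(i)] * 1000 for i in range(200)]
    print(summarize_latencies(measure_latencies(toy_predict, samples)))
```

The mean alone can hide slow outliers, which is why the tail percentiles are reported alongside it. A real deployment would normally time the full request path rather than an in-process call.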
Several factors influence predictiontime, including model size and architecture, numerical precision, input size and data formatting, and preprocessing.
Low predictiontime is critical for real-time or interactive applications such as live recommendations, fraud detection, and robotics.