Q-value
The Q-value is a central concept in reinforcement learning. The Q-value function Q^π(s,a) assigns to each state s and action a the expected return, i.e., the expected sum of discounted rewards, obtained by taking action a in state s and thereafter following a given policy π. Formally, Q^π(s,a) = E_π[ ∑_{t=0}^∞ γ^t R_{t+1} | S_0 = s, A_0 = a ], where γ ∈ [0,1) is the discount factor. The optimal Q-value Q*(s,a) = max_π Q^π(s,a) gives the maximum obtainable return, and the greedy policy π*(s) = argmax_a Q*(s,a) is optimal under mild conditions.
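To make the definition concrete, the following is a minimal Python sketch that estimates Q^π(s,a) by Monte Carlo rollouts: take action a once, follow π afterward, and average the discounted returns. The two-state MDP, the fixed policy, and all numbers are illustrative assumptions, not part of the source.

```python
import random

GAMMA = 0.9  # discount factor

# Illustrative two-state MDP (an assumption for this sketch).
# step(state, action) -> (next_state, reward); transitions are stochastic.
def step(state, action):
    if random.random() < 0.8:      # intended transition: action a moves toward state a
        next_state = action
    else:                          # slip to the other state
        next_state = 1 - action
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

def policy(state):
    """A fixed policy pi to evaluate: always choose action 1."""
    return 1

def estimate_q(state, action, episodes=5000, horizon=100):
    """Monte Carlo estimate of Q^pi(s, a): take `action` once, then
    follow `policy`, and average the discounted return over episodes."""
    total = 0.0
    for _ in range(episodes):
        s, a, ret, discount = state, action, 0.0, 1.0
        for _ in range(horizon):
            s, r = step(s, a)
            ret += discount * r
            discount *= GAMMA
            a = policy(s)
        total += ret
    return total / episodes

print(estimate_q(0, 1))  # approximate Q^pi(s=0, a=1)
```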
Q-learning, a model-free algorithm introduced by Watkins and Dayan, iteratively updates Q-values toward the observed reward plus the discounted value of the best action in the next state: Q(s,a) ← Q(s,a) + α [ r + γ max_{a'} Q(s',a') − Q(s,a) ], where α is the learning rate. Under standard conditions (every state-action pair is visited infinitely often and α decays appropriately), the estimates converge to Q*.
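A minimal sketch of this update in Python; the hyperparameter values, the ε-greedy behavior policy, and the dictionary-backed table are illustrative assumptions:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # illustrative hyperparameters
Q = defaultdict(float)                   # Q[(state, action)], defaults to 0

def q_update(s, a, r, s_next, actions):
    """One Q-learning step: move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    target = r + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def epsilon_greedy(s, actions):
    """Behavior policy: explore with probability EPSILON, else act greedily."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```

Because the target uses max over next actions rather than the action actually taken, Q-learning is off-policy: it learns about the greedy policy while behaving ε-greedily.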
In practice, Q-values can be stored in a lookup table for small, discrete problems; large or continuous state spaces require function approximation, for example linear models over state features or neural networks, as in deep Q-networks (DQN).
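The contrast can be sketched in a few lines of Python; the array sizes and the linear feature model are illustrative assumptions, not a prescribed implementation:

```python
import numpy as np

# Small, discrete problem: a dense Q-table (sizes are illustrative).
N_STATES, N_ACTIONS = 16, 4
q_table = np.zeros((N_STATES, N_ACTIONS))
greedy_action = int(np.argmax(q_table[3]))   # greedy lookup for state 3

# Large or continuous state space: approximate Q(s, a) with a parametric
# function, here a linear model over a feature vector phi(s).
N_FEATURES = 8
theta = np.zeros((N_FEATURES, N_ACTIONS))

def q_hat(phi, action):
    """Linear approximation: Q(s, a) is estimated as phi(s) . theta[:, a]."""
    return phi @ theta[:, action]
```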