Actor-Critic
Actor-critic methods are a class of reinforcement learning algorithms that combine two function approximators: an actor, which represents the policy, and a critic, which evaluates that policy by estimating a value function. The actor outputs a policy pi(a|s; theta) that maps states to a distribution over actions. The critic estimates a state-value function V(s; w) or an action-value function Q(s, a; w), providing the learning signal used to improve the actor.
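The two approximators are commonly implemented as small neural networks. The following is a minimal sketch in PyTorch, assuming a discrete action space; the class names, layer sizes, and dimensions are illustrative, not part of any particular published implementation.

    import torch
    import torch.nn as nn

    class Actor(nn.Module):
        """Policy network: maps a state to a distribution over actions, pi(a|s; theta)."""
        def __init__(self, state_dim, n_actions, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, n_actions),
            )

        def forward(self, state):
            return torch.distributions.Categorical(logits=self.net(state))

    class Critic(nn.Module):
        """Value network: maps a state to a scalar estimate V(s; w)."""
        def __init__(self, state_dim, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, 1),
            )

        def forward(self, state):
            return self.net(state).squeeze(-1)

    actor, critic = Actor(state_dim=4, n_actions=2), Critic(state_dim=4)
    state = torch.randn(4)            # dummy 4-dimensional state
    action = actor(state).sample()    # action drawn from pi(a|s; theta)
    value = critic(state)             # scalar estimate of V(s; w)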
During learning, the critic is trained to minimize the temporal-difference (TD) error between observed rewards and its own value estimates, while the actor is updated along the policy gradient, with the TD error acting as a one-step estimate of how much better or worse the chosen action was than expected.
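A single update from one transition (s, a, r, s') might look like the following sketch. The network shapes, learning rate, and shared optimizer are illustrative assumptions; the two losses mirror the description above, with the critic minimizing the squared TD error and the actor weighting the log-probability of the taken action by that error.

    import torch
    import torch.nn as nn

    gamma = 0.99
    actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))   # logits over 2 actions
    critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))  # scalar V(s; w)
    optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)

    def update(state, action, reward, next_state, done):
        """One actor-critic update from a single transition (s, a, r, s')."""
        value = critic(state).squeeze(-1)
        with torch.no_grad():
            next_value = critic(next_state).squeeze(-1)
            td_target = reward + gamma * (1.0 - done) * next_value
        td_error = td_target - value                              # delta = r + gamma*V(s') - V(s)

        dist = torch.distributions.Categorical(logits=actor(state))
        critic_loss = td_error.pow(2)                             # critic: squared TD error
        actor_loss = -td_error.detach() * dist.log_prob(action)   # actor: delta-weighted log-probability

        optimizer.zero_grad()
        (critic_loss + actor_loss).backward()
        optimizer.step()

    # One update from a dummy transition
    update(torch.randn(4), torch.tensor(0), torch.tensor(1.0), torch.randn(4), torch.tensor(0.0))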
On-policy variants include Advantage Actor-Critic (A2C) and its asynchronous counterpart A3C, where learning uses data collected by the policy currently being improved; in A3C, multiple workers gather this experience in parallel and apply updates asynchronously.
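The "advantage" in A2C is an estimate of how much better an action turned out than the critic's baseline V(s). A rough sketch of computing bootstrapped returns and advantages from a short on-policy rollout follows; the reward and value numbers are dummies used only for illustration.

    import torch

    def n_step_returns(rewards, bootstrap_value, gamma=0.99):
        """Discounted returns R_t for one rollout, bootstrapped with the critic's V(s_T)."""
        ret, returns = bootstrap_value, []
        for r in reversed(rewards):              # walk backwards through the rollout
            ret = r + gamma * ret
            returns.append(ret)
        return list(reversed(returns))

    rewards = [1.0, 1.0, 1.0]                    # rewards from three on-policy steps
    values = torch.tensor([2.5, 2.0, 1.5])       # critic's V(s_t) along the rollout
    returns = torch.tensor(n_step_returns(rewards, bootstrap_value=1.0))
    advantages = returns - values                # A_t = R_t - V(s_t)
    # Actor loss: -(advantages.detach() * log_probs).mean(); the critic regresses V(s_t) onto returns.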
Actor-critic methods are prominent in continuous-action domains due to their ability to learn stochastic or deterministic policies directly over continuous action spaces, where the explicit maximization over actions required by purely value-based methods is itself a hard optimization problem.
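In such domains the actor can, for example, output the parameters of a diagonal Gaussian over the action vector: sampling gives a stochastic policy, while taking the mean gives a deterministic one. A sketch follows, with illustrative dimensions and a learned, state-independent standard deviation as a simplifying assumption.

    import torch
    import torch.nn as nn

    class GaussianActor(nn.Module):
        """Stochastic policy over a continuous action space: pi(a|s) = Normal(mu(s), sigma)."""
        def __init__(self, state_dim, action_dim, hidden=64):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
            self.mu = nn.Linear(hidden, action_dim)
            self.log_std = nn.Parameter(torch.zeros(action_dim))   # learned, state-independent std

        def forward(self, state):
            h = self.body(state)
            return torch.distributions.Normal(self.mu(h), self.log_std.exp())

    actor = GaussianActor(state_dim=8, action_dim=2)
    dist = actor(torch.randn(8))
    action = dist.sample()                       # stochastic action, used for exploration
    log_prob = dist.log_prob(action).sum(-1)     # joint log-density over the action dimensions
    greedy_action = dist.mean                    # deterministic choice, e.g. at evaluation time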