State–action–reward–state–action
State–action–reward–state–action is an algorithm for learning a Markov decision process policy, used in the reinforcement learning area of machine learning. It was proposed by Rummery and Niranjan in a technical note under the name "Modified Connectionist Q-Learning". The alternative name SARSA, proposed by Rich Sutton, was mentioned only as a footnote.
This name reflects the fact that the main function for updating the Q-value depends on the current state of the agent ("S1"), the action the agent chooses ("A1"), the reward ("R2") the agent receives for choosing this action, the state ("S2") the agent enters after taking that action, and finally the next action ("A2") the agent chooses in its new state. The acronym for this quintuple (S1, A1, R2, S2, A2) is SARSA. Some authors use a slightly different convention and write the quintuple (S1, A1, R1, S2, A2), depending on which time step the reward is formally assigned. The rest of the article uses the former convention.
Algorithm
A SARSA agent interacts with the environment and updates the policy based on actions taken; hence it is known as an on-policy learning algorithm. The Q value for a state–action pair is updated by an error term, adjusted by the learning rate α:

Q(S1, A1) ← Q(S1, A1) + α [R2 + γ Q(S2, A2) − Q(S1, A1)]

where γ is the discount factor. Q values represent the possible reward received in the next time step for taking action a in state s, plus the discounted future reward received from the next state–action observation.

Watkins's Q-learning updates an estimate of the optimal state–action value function based on the maximum reward of the available actions. While SARSA learns the Q values associated with the policy it follows itself, Watkins's Q-learning learns the Q values associated with the optimal policy while following an exploration/exploitation policy.
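As an illustration of the update above, the following is a minimal tabular SARSA sketch in Python. The environment interface (reset() returning a state, step(action) returning (next_state, reward, done), an n_actions attribute) and the ChainEnv toy task are assumptions made for this example, not part of any particular library.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, n_actions, epsilon):
    """Behaviour policy: random action with probability epsilon,
    otherwise a greedy action (ties broken at random)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    values = [Q[(state, a)] for a in range(n_actions)]
    best = max(values)
    return random.choice([a for a, v in enumerate(values) if v == best])

def sarsa(env, n_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA; the env interface is an assumption of this sketch."""
    Q = defaultdict(float)  # Q[(state, action)], initialised to 0
    for _ in range(n_episodes):
        state = env.reset()                                         # S1
        action = epsilon_greedy(Q, state, env.n_actions, epsilon)   # A1
        done = False
        while not done:
            next_state, reward, done = env.step(action)             # R2, S2
            # A2 is chosen by the same epsilon-greedy behaviour policy;
            # updating towards Q(S2, A2) is what makes SARSA on-policy.
            next_action = epsilon_greedy(Q, next_state, env.n_actions, epsilon)
            target = reward if done else reward + gamma * Q[(next_state, next_action)]
            # SARSA update: Q(S1,A1) += alpha * (R2 + gamma*Q(S2,A2) - Q(S1,A1))
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q

class ChainEnv:
    """Hypothetical 5-state chain for demonstration: action 1 moves right,
    action 0 moves left (floored at state 0); reward 1 on reaching state 4."""
    n_actions = 2
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):
        self.pos = self.pos + 1 if action == 1 else max(0, self.pos - 1)
        done = self.pos == 4
        return self.pos, (1.0 if done else 0.0), done

Q = sarsa(ChainEnv())
print(max(range(2), key=lambda a: Q[(0, a)]))  # greedy action in state 0 should be 1 (right)
```

Note that the next action A2 is selected by the same ε-greedy policy before the update is applied; replacing Q(S2, A2) in the target with max over a of Q(S2, a) would turn this sketch into Q-learning.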
Some optimizations of Watkins's Q-learning may also be applied to SARSA.