Q-learning
Q-learning is a reinforcement learning algorithm that trains an agent to assign values to its possible actions based on its current state, without requiring a model of the environment. It can handle problems with stochastic transitions and rewards without requiring adaptations.
For example, in a grid maze, an agent learns to reach an exit worth 10 points. At a junction, Q-learning might assign a higher value to moving right than left if right gets to the exit faster, improving this choice by trying both directions over time.
For any finite Markov decision process, Q-learning finds an optimal policy in the sense of maximizing the expected value of the total reward over all successive steps, starting from the current state, provided it is given infinite exploration time and a partly random exploration policy.
"Q" refers to the function that the algorithm computes: the expected reward—that is, the quality—of an action taken in a given state.
Reinforcement learning
Reinforcement learning involves an agent, a set of states, and a set of actions per state. By performing an action, the agent transitions from state to state. Executing an action in a specific state provides the agent with a reward. The goal of the agent is to maximize its total reward. It does this by adding the maximum reward attainable from future states to the reward for reaching its current state, effectively influencing the current action by the potential future reward. This potential reward is a weighted sum of the expected values of the rewards of all future steps starting from the current state.
As an example, consider the process of boarding a train, in which the reward is measured by the negative of the total time spent boarding (equivalently, the cost of boarding equals the boarding time). One strategy is to enter the train doors as soon as they open, minimizing your own initial wait time. If the train is crowded, however, entry is slow after that initial action, because other passengers are fighting to exit the train as you attempt to board. An alternative strategy is to wait a moment and let the departing passengers leave first, trading a longer initial wait for a faster entry. The total boarding time, or cost, of each strategy is then (see the short sketch below):
- Enter immediately: 0 seconds wait time + 15 seconds fight time
- Wait for others to exit: 5 seconds wait time + 0 seconds fight time
Exploration reveals that the second strategy yields the higher total reward (a cost of 5 seconds rather than 15), even though its immediate reward is worse.
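As a small check of that arithmetic, here is a minimal sketch in Python; the strategy names and timing numbers simply restate the example above.

```python
# Reward of each boarding strategy, measured as negative total boarding time.
strategies = {
    "enter immediately": {"wait": 0, "fight": 15},
    "wait for others to exit": {"wait": 5, "fight": 0},
}

for name, t in strategies.items():
    reward = -(t["wait"] + t["fight"])   # higher (less negative) is better
    print(f"{name}: reward = {reward}")

# "enter immediately" scores -15, "wait for others to exit" scores -5,
# so the patient strategy is preferred despite its worse immediate reward.
```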
Algorithm
After $\Delta t$ steps into the future the agent will decide some next step. The weight for this step is calculated as $\gamma^{\Delta t}$, where $\gamma$ (the discount factor) is a number between 0 and 1 ($0 \le \gamma \le 1$). Assuming $\gamma < 1$, it has the effect of valuing rewards received earlier higher than those received later. $\gamma$ may also be interpreted as the probability to succeed at every step $\Delta t$.
The algorithm, therefore, has a function that calculates the quality of a state–action combination:
$Q : S \times A \to \mathbb{R}$
Before learning begins, $Q$ is initialized to a possibly arbitrary fixed value. Then, at each time $t$ the agent selects an action $A_t$, observes a reward $R_{t+1}$, enters a new state $S_{t+1}$, and $Q$ is updated. The core of the algorithm is a Bellman equation as a simple value iteration update, using the weighted average of the current value and the new information:
$Q^{\text{new}}(S_t, A_t) \leftarrow (1 - \alpha) \cdot Q(S_t, A_t) + \alpha \cdot \bigl( R_{t+1} + \gamma \cdot \max_{a} Q(S_{t+1}, a) \bigr)$
where $R_{t+1}$ is the reward received when moving from the state $S_t$ to the state $S_{t+1}$, and $\alpha$ is the learning rate ($0 < \alpha \le 1$).
Note that $Q^{\text{new}}(S_t, A_t)$ is the sum of three terms:
- $(1 - \alpha) \cdot Q(S_t, A_t)$: the current value, weighted by one minus the learning rate
- $\alpha \cdot R_{t+1}$: the reward to obtain if action $A_t$ is taken when in state $S_t$, weighted by the learning rate
- $\alpha \cdot \gamma \cdot \max_{a} Q(S_{t+1}, a)$: the maximum reward that can be obtained from state $S_{t+1}$, weighted by the learning rate and discount factor
For all final states $s_f$, $Q(s_f, a)$ is never updated, but is set to the reward value $r$ observed for state $s_f$. In most cases, $Q(s_f, a)$ can be taken to equal zero.
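In code, this update is a one-line assignment inside the agent's interaction loop. The following is a minimal tabular sketch in Python; the environment interface (`reset()` returning a state index, `step(state, action)` returning `(next_state, reward, done)`) and the epsilon-greedy exploration scheme are illustrative assumptions, not part of the description above.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning sketch.

    `env` is a hypothetical environment with:
      env.reset() -> initial state index (int)
      env.step(state, action) -> (next_state, reward, done)
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))        # arbitrary fixed initial values

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection: the "partly random policy".
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))

            next_state, reward, done = env.step(state, action)

            # Core update: weighted average of the current value and the new
            # information, Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a')).
            # For a terminal next state the bootstrap term is dropped, since
            # Q of a final state is taken to be zero, as noted above.
            target = reward if done else reward + gamma * np.max(Q[next_state])
            Q[state, action] = (1 - alpha) * Q[state, action] + alpha * target

            state = next_state
    return Q
```

After training, the greedy policy `np.argmax(Q[state])` selects the action with the highest learned value in each state.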
Influence of variables
Learning rate
The learning rate or step size determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing, while a factor of 1 makes the agent consider only the most recent information. In fully deterministic environments, a learning rate of $\alpha_t = 1$ is optimal. When the problem is stochastic, the algorithm converges under some technical conditions on the learning rate that require it to decrease to zero. In practice, often a constant learning rate is used, such as $\alpha_t = 0.1$ for all $t$.
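One common way to meet the "decrease to zero" condition is to let the step size shrink with the number of times each state–action pair has been updated. The sketch below uses $\alpha_t = 1/n$, which is an illustrative choice rather than one prescribed above.

```python
import numpy as np

n_states, n_actions = 10, 4
visit_counts = np.zeros((n_states, n_actions), dtype=int)

def decaying_learning_rate(state, action):
    """Step size 1/n for the n-th update of (state, action).

    The step sizes sum to infinity while their squares sum to a finite value,
    which satisfies the usual convergence conditions for stochastic problems.
    """
    visit_counts[state, action] += 1
    return 1.0 / visit_counts[state, action]
```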
Discount factor
The discount factor $\gamma$ determines the importance of future rewards. A factor of 0 will make the agent "myopic" by only considering current rewards, i.e. $R_{t+1}$ in the update rule above, while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the action values may diverge. For $\gamma = 1$, without a terminal state, or if the agent never reaches one, all environment histories become infinitely long, and utilities with additive, undiscounted rewards generally become infinite. Even with a discount factor only slightly lower than 1, Q-function learning leads to propagation of errors and instabilities when the value function is approximated with an artificial neural network. In that case, starting with a lower discount factor and increasing it towards its final value accelerates learning.
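To see why values stay finite only for $\gamma < 1$, it helps to write out the discounted return the agent is maximizing; this is the standard textbook form rather than a formula quoted from the text above.

```latex
G_t \;=\; R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots
    \;=\; \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
% If rewards are bounded by R_{\max} and 0 \le \gamma < 1, then
% |G_t| \le R_{\max} / (1 - \gamma), so the action values stay finite;
% for \gamma \ge 1 with no terminal state the series generally diverges.
```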
Initial conditions ($Q_0$)
Since Q-learning is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. High initial values, also known as "optimistic initial conditions", can encourage exploration: no matter what action is selected, the update rule will cause it to have lower values than the other alternatives, thus increasing their choice probability. The first reward $r$ can be used to reset the initial conditions. According to this idea, the first time an action is taken, the reward is used to set the value of $Q$. This allows immediate learning in case of fixed deterministic rewards. A model that incorporates reset of initial conditions (RIC) is expected to predict participants' behavior better than a model that assumes any arbitrary initial condition (AIC). RIC seems to be consistent with human behaviour in repeated binary choice experiments.
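A minimal sketch of both ideas, reusing the Q-table layout from the earlier example; the optimistic value 10.0 and the table dimensions are arbitrary illustrative choices.

```python
import numpy as np

n_states, n_actions = 10, 4

# Optimistic initial conditions: start every action value above any reward the
# environment can realistically deliver (10.0 is an arbitrary choice), so each
# update lowers the tried action's value and untried actions keep looking
# attractive, encouraging exploration.
Q = np.full((n_states, n_actions), 10.0)

# Reset-of-initial-conditions (RIC) variant: the first time (state, action) is
# tried, overwrite the arbitrary initial value with the observed reward.
seen = np.zeros((n_states, n_actions), dtype=bool)

def ric_update(state, action, reward):
    if not seen[state, action]:
        Q[state, action] = reward
        seen[state, action] = True
```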
Implementation
Q-learning at its simplest stores data in tables. This approach falters with increasing numbers of states/actions, since the likelihood of the agent visiting a particular state and performing a particular action becomes increasingly small.
Function approximation
Q-learning can be combined with function approximation. This makes it possible to apply the algorithm to larger problems, even when the state space is continuous. One solution is to use an artificial neural network as a function approximator. Another possibility is to integrate Fuzzy Rule Interpolation and use sparse fuzzy rule-bases instead of discrete Q-tables or ANNs, which has the advantage of being a human-readable knowledge representation form. Function approximation may speed up learning in finite problems, because the algorithm can generalize earlier experiences to previously unseen states.
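As one concrete (and deliberately simple) instance, Q-values can be approximated linearly, $Q(s, a) \approx w_a^\top \phi(s)$, and trained with a semi-gradient version of the same update. The feature map, dimensions, and random placeholder features below are illustrative assumptions, not taken from the text.

```python
import numpy as np

n_features, n_actions = 8, 4
alpha, gamma = 0.01, 0.9
rng = np.random.default_rng(0)

# One weight vector per action: Q(s, a) ~= w[a] . phi(s)
w = np.zeros((n_actions, n_features))

def q_values(phi):
    """Approximate Q-values of all actions for a feature vector phi."""
    return w @ phi

def semi_gradient_update(phi, action, reward, phi_next, done):
    """Semi-gradient Q-learning step for the linear approximator."""
    target = reward if done else reward + gamma * np.max(w @ phi_next)
    td_error = target - w[action] @ phi
    w[action] += alpha * td_error * phi   # gradient of w[a].phi w.r.t. w[a] is phi

# Illustrative usage with random features standing in for a real environment:
phi, phi_next = rng.standard_normal(n_features), rng.standard_normal(n_features)
semi_gradient_update(phi, action=2, reward=1.0, phi_next=phi_next, done=False)
```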
Quantization
Another technique to decrease the state/action space quantizes possible values. Consider the example of learning to balance a stick on a finger. Describing a state at a certain point in time involves the position of the finger in space, its velocity, the angle of the stick and the angular velocity of the stick. This yields a four-element vector that describes one state, i.e. a snapshot of one state encoded into four values. The problem is that infinitely many possible states are present. To shrink this space, multiple values can be assigned to a bucket: the exact distance of the finger from its starting position is then not known, but only whether it is far away or not.
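A minimal sketch of this kind of bucketing for the four-element stick-balancing state, using NumPy; the bucket edges and value ranges are arbitrary illustrative choices.

```python
import numpy as np

# Bucket edges for each of the four state variables (illustrative ranges):
# finger position, finger velocity, stick angle, stick angular velocity.
edges = [
    np.linspace(-2.0, 2.0, 9),    # position buckets
    np.linspace(-3.0, 3.0, 9),    # velocity buckets
    np.linspace(-0.5, 0.5, 9),    # angle buckets (radians)
    np.linspace(-4.0, 4.0, 9),    # angular velocity buckets
]

def discretize(state):
    """Map a continuous 4-element state to a tuple of bucket indices."""
    return tuple(int(np.digitize(x, e)) for x, e in zip(state, edges))

# Continuous snapshot -> discrete index usable as a Q-table key.
print(discretize([0.13, -1.7, 0.02, 2.4]))
```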
History
Q-learning was introduced by Chris Watkins in 1989. A convergence proof was presented by Watkins and Peter Dayan in 1992, building on Watkins' doctoral dissertation, Learning from Delayed Rewards. Eight years earlier, in 1981, the same problem, under the name of "delayed reinforcement learning", was solved by Bozinovski's Crossbar Adaptive Array (CAA). Its memory matrix $W = \|w(a, s)\|$ was the same as the Q-table that Q-learning introduced eight years later. The architecture introduced the term "state evaluation" in reinforcement learning. The crossbar learning algorithm, written in mathematical pseudocode in the paper, performs the following computation in each iteration:
- In state $s$ perform action $a$;
- Receive consequence state $s'$;
- Compute state evaluation $v(s')$;
- Update crossbar value $w'(a, s) = w(a, s) + v(s')$.
In 2014, Google DeepMind patented an application of Q-learning to deep learning, entitled "deep reinforcement learning" or "deep Q-learning," that can play Atari 2600 games at expert human levels.
Variants
Deep Q-learning
The DeepMind system used a deep convolutional neural network, with layers of tiled convolutional filters to mimic the effects of receptive fields. Reinforcement learning is unstable or divergent when a nonlinear function approximator such as a neural network is used to represent Q. This instability comes from the correlations present in the sequence of observations, the fact that small updates to Q may significantly change the policy of the agent and the data distribution, and the correlations between Q and the target values. The method can be used for stochastic search in various domains and applications.
The technique used experience replay, a biologically inspired mechanism that uses a random sample of prior actions instead of the most recent action to proceed. This removes correlations in the observation sequence and smooths changes in the data distribution. Iterative updates adjust Q towards target values that are only periodically updated, further reducing correlations with the target.
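A minimal sketch of an experience replay buffer and the uniform sampling step it enables; the transition layout and capacity are illustrative assumptions, and the network and optimizer that would consume the sampled batch are omitted.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", "state action reward next_state done")

class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions drop off automatically

    def push(self, *transition):
        self.buffer.append(Transition(*transition))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlations that a
        # stream of consecutive observations would otherwise introduce.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# Usage: store transitions as the agent acts, then train on random minibatches.
buffer = ReplayBuffer()
buffer.push([0.0, 0.1], 1, 0.5, [0.1, 0.2], False)
if len(buffer) >= 1:
    batch = buffer.sample(1)
```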