Conjugate gradient method
In mathematics, the conjugate gradient method is an algorithm for the numerical solution of particular systems of linear equations, namely those whose matrix is positive-semidefinite. The conjugate gradient method is often implemented as an iterative algorithm, applicable to sparse systems that are too large to be handled by a direct implementation or other direct methods such as the Cholesky decomposition. Large sparse systems often arise when numerically solving partial differential equations or optimization problems.
The conjugate gradient method can also be used to solve unconstrained optimization problems such as energy minimization. It is commonly attributed to Magnus Hestenes and Eduard Stiefel, who programmed it on the Z4, and extensively researched it.
The biconjugate gradient method provides a generalization to non-symmetric matrices. Various nonlinear conjugate gradient methods seek minima of nonlinear optimization problems.
Description of the problem addressed by conjugate gradients
Suppose we want to solve the system of linear equations $A x = b$ for the vector $x$, where the known $n \times n$ matrix $A$ is symmetric (i.e., $A^{\mathsf T} = A$), positive-definite (i.e., $x^{\mathsf T} A x > 0$ for all non-zero vectors $x$ in $\mathbb{R}^n$), and real, and $b$ is known as well. We denote the unique solution of this system by $x_*$.
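For concreteness, the setup can be sketched in a few lines of Julia; the problem below is made up purely for illustration (the names `A`, `b`, and `n` are arbitrary), and positive-definiteness is verified via a Cholesky factorization:
using LinearAlgebra

n = 5
M = randn(n, n)
A = Symmetric(M' * M + n * I)   # symmetric positive-definite by construction
b = randn(n)

@assert isposdef(A)             # checks positive-definiteness (Cholesky succeeds)
x_star = A \ b                  # the unique solution of A * x = b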
Derivation as a direct method
The conjugate gradient method can be derived from several different perspectives, including specialization of the conjugate direction method for optimization and variation of the Arnoldi/Lanczos iteration for eigenvalue problems. Despite differences in their approaches, these derivations share a common topic: proving the orthogonality of the residuals and the conjugacy of the search directions. These two properties are crucial to developing the well-known succinct formulation of the method.
We say that two non-zero vectors $u$ and $v$ are conjugate (with respect to $A$) if
$$u^{\mathsf T} A v = 0.$$
Since $A$ is symmetric and positive-definite, the left-hand side defines an inner product
$$\langle u, v \rangle_A := u^{\mathsf T} A v.$$
Two vectors are conjugate if and only if they are orthogonal with respect to this inner product. Being conjugate is a symmetric relation: if $u$ is conjugate to $v$, then $v$ is conjugate to $u$. Suppose that
$$P = \{ p_1, \dots, p_n \}$$
is a set of $n$ mutually conjugate vectors with respect to $A$, i.e. $p_i^{\mathsf T} A p_j = 0$ for all $i \neq j$.
Then $P$ forms a basis for $\mathbb{R}^n$, and we may express the solution $x_*$ of $A x = b$ in this basis:
$$x_* = \sum_{i=1}^{n} \alpha_i p_i.$$
Left-multiplying the problem $A x_* = b$ with the vector $p_k^{\mathsf T}$ yields
$$p_k^{\mathsf T} b = p_k^{\mathsf T} A x_* = \sum_{i=1}^{n} \alpha_i\, p_k^{\mathsf T} A p_i = \alpha_k\, p_k^{\mathsf T} A p_k,$$
and so
$$\alpha_k = \frac{p_k^{\mathsf T} b}{p_k^{\mathsf T} A p_k}.$$
This gives the following method for solving the equation $A x = b$: find a sequence of $n$ conjugate directions, and then compute the coefficients $\alpha_k$.
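As a sketch of this direct view, the following Julia snippet (illustrative only; the helper name is made up, and $A$ is assumed to be a dense symmetric positive-definite matrix) assembles the solution coefficient by coefficient from a set of mutually conjugate directions, here taken to be the eigenvectors of $A$, which are automatically $A$-conjugate when $A$ is symmetric:
using LinearAlgebra

# Solve A * x = b given n mutually A-conjugate directions (the columns of P).
function conjugate_direction_solve(A, b)
    P = eigvecs(Symmetric(A))           # eigenvectors of a symmetric matrix are A-conjugate
    x = zeros(eltype(b), length(b))
    for k in axes(P, 2)
        p = P[:, k]
        α = (p' * b) / (p' * A * p)     # α_k = p_kᵀ b / (p_kᵀ A p_k)
        x += α * p                      # accumulate the expansion of x in the basis P
    end
    return x
end
Up to rounding, the result agrees with a direct solve such as `A \ b`.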
As an iterative method
If we choose the conjugate vectors $p_k$ carefully, then we may not need all of them to obtain a good approximation to the solution $x_*$. So, we want to regard the conjugate gradient method as an iterative method. This also allows us to approximately solve systems where $n$ is so large that the direct method would take too much time.
We denote the initial guess for $x_*$ by $x_0$. Starting with $x_0$ we search for the solution, and in each iteration we need a metric to tell us whether we are closer to the solution $x_*$. This metric comes from the fact that the solution $x_*$ is also the unique minimizer of the following quadratic function:
$$f(x) = \tfrac{1}{2} x^{\mathsf T} A x - x^{\mathsf T} b, \qquad x \in \mathbb{R}^n.$$
The existence of a unique minimizer is apparent as its Hessian matrix of second derivatives is symmetric positive-definite,
$$\nabla^2 f(x) = A,$$
and the fact that the minimizer solves the initial problem follows from its first derivative,
$$\nabla f(x) = A x - b.$$
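Spelling the gradient computation out (a routine calculation, included here for completeness): differentiating $f(x) = \tfrac{1}{2} x^{\mathsf T} A x - x^{\mathsf T} b$ gives
$$\nabla f(x) = \tfrac{1}{2} \bigl(A + A^{\mathsf T}\bigr) x - b = A x - b,$$
where the last step uses the symmetry of $A$; hence $\nabla f(x) = 0$ exactly when $A x = b$.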
This suggests taking the first basis vector $p_0$ to be the negative of the gradient of $f$ at $x = x_0$. The gradient of $f$ equals $A x - b$. Starting with an initial guess $x_0$, this means we take $p_0 = b - A x_0$. The other vectors in the basis will be conjugate to the gradient, hence the name conjugate gradient method. Note that $p_0$ is also the residual provided by this initial step of the algorithm.
Let $r_k$ be the residual at the $k$-th step:
$$r_k = b - A x_k.$$
As observed above, $r_k$ is the negative gradient of $f$ at $x_k$, so the gradient descent method would move in the direction $r_k$. Here, however, we insist that the directions $p_k$ must be conjugate to each other. A practical way to enforce this is to require that the next search direction be built out of the current residual and all previous search directions. The conjugation constraint is an orthonormal-type constraint, and hence the algorithm can be viewed as an example of Gram–Schmidt orthonormalization. This gives the following expression:
$$p_k = r_k - \sum_{i < k} \frac{p_i^{\mathsf T} A r_k}{p_i^{\mathsf T} A p_i}\, p_i.$$
Following this direction, the next optimal location is given by
$$x_{k+1} = x_k + \alpha_k p_k,$$
with
$$\alpha_k = \frac{p_k^{\mathsf T} (b - A x_k)}{p_k^{\mathsf T} A p_k} = \frac{p_k^{\mathsf T} r_k}{p_k^{\mathsf T} A p_k},$$
where the last equality follows from the definition of $r_k$.
The expression for $\alpha_k$ can be derived if one substitutes the expression for $x_{k+1}$ into $f$ and minimizes it with respect to $\alpha_k$.
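Writing that substitution out (a standard one-dimensional minimization, reproduced here for clarity): with $g(\alpha) := f(x_k + \alpha p_k)$,
$$g(\alpha) = f(x_k) - \alpha\, p_k^{\mathsf T} (b - A x_k) + \tfrac{1}{2} \alpha^2\, p_k^{\mathsf T} A p_k,$$
so setting $g'(\alpha) = \alpha\, p_k^{\mathsf T} A p_k - p_k^{\mathsf T} (b - A x_k) = 0$ recovers
$$\alpha_k = \frac{p_k^{\mathsf T} (b - A x_k)}{p_k^{\mathsf T} A p_k}.$$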
The resulting algorithm
The above algorithm gives the most straightforward explanation of the conjugate gradient method. Seemingly, the algorithm as stated requires storage of all previous search directions and residual vectors, as well as many matrix–vector multiplications, and thus can be computationally expensive. However, a closer analysis of the algorithm shows that $r_i$ is orthogonal to $r_j$, i.e. $r_i^{\mathsf T} r_j = 0$ for $i \neq j$, and $p_i$ is $A$-orthogonal to $p_j$, i.e. $p_i^{\mathsf T} A p_j = 0$ for $i \neq j$. This can be regarded as saying that, as the algorithm progresses, the $p_i$ and $r_i$ span the same Krylov subspace, where the $r_i$ form the orthogonal basis with respect to the standard inner product and the $p_i$ form the orthogonal basis with respect to the inner product induced by $A$. Therefore, $x_k$ can be regarded as the projection of $x_*$ on the Krylov subspace. That is, if the CG method starts with $x_0 = 0$, then
$$x_k = \operatorname{argmin}_{y \in \mathcal{K}_k} \| x_* - y \|_A, \qquad \mathcal{K}_k = \operatorname{span}\{ b, A b, \dots, A^{k-1} b \},$$
where $\| z \|_A^2 = z^{\mathsf T} A z$.
The algorithm is detailed below for solving $A x = b$, where $A$ is a real, symmetric, positive-definite matrix. The input vector $x_0$ can be an approximate initial solution or $0$. It is a different formulation of the exact procedure described above.
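One common way to write this formulation out (the notation follows the derivation above; the stopping test is left informal):
$$r_0 := b - A x_0, \qquad p_0 := r_0, \qquad k := 0$$
repeat
$$\alpha_k := \frac{r_k^{\mathsf T} r_k}{p_k^{\mathsf T} A p_k}$$
$$x_{k+1} := x_k + \alpha_k p_k$$
$$r_{k+1} := r_k - \alpha_k A p_k$$
if $r_{k+1}$ is sufficiently small, then exit the loop
$$\beta_k := \frac{r_{k+1}^{\mathsf T} r_{k+1}}{r_k^{\mathsf T} r_k}$$
$$p_{k+1} := r_{k+1} + \beta_k p_k$$
$$k := k + 1$$
end repeat; return $x_{k+1}$ as the result.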
This is the most commonly used algorithm. The same formula for $\beta_k$ is also used in the Fletcher–Reeves nonlinear conjugate gradient method.
Restarts
We note that $x_1$ is computed by the gradient descent method applied to $x_0$. Setting $\beta_k = 0$ would similarly make $x_{k+1}$ computed by the gradient descent method from $x_k$, i.e., it can be used as a simple implementation of a restart of the conjugate gradient iterations. Restarts could slow down convergence, but may improve stability if the conjugate gradient method misbehaves, e.g., due to round-off error.
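A minimal Julia sketch of such a restart (the function name and the restart length `m` are illustrative, not part of the original description):
using LinearAlgebra

# Conjugate gradient with a periodic restart: every m iterations the search
# direction is reset to the current residual, which amounts to setting β = 0.
function cg_with_restarts(A, b, x0; m=50, atol=sqrt(eps(eltype(b))))
    x = copy(x0)
    r = b - A * x
    p = copy(r)
    rsold = r' * r
    k = 0
    while sqrt(rsold) > atol
        Ap = A * p
        α = rsold / (p' * Ap)
        @. x += α * p
        @. r -= α * Ap
        rsnew = r' * r
        if (k + 1) % m == 0
            p .= r                          # restart: discard conjugacy history (β = 0)
        else
            @. p = r + (rsnew / rsold) * p  # usual conjugate gradient update
        end
        rsold = rsnew
        k += 1
    end
    return x
end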
Explicit residual calculation
The formulas $x_{k+1} := x_k + \alpha_k p_k$ and $r_k := b - A x_k$, which both hold in exact arithmetic, make the formulas $r_{k+1} := r_k - \alpha_k A p_k$ and $r_{k+1} := b - A x_{k+1}$ mathematically equivalent. The former is used in the algorithm to avoid an extra multiplication by $A$, since the vector $A p_k$ is already computed to evaluate $\alpha_k$. The latter may be more accurate, substituting the explicit calculation for the implicit one by the recursion subject to round-off error accumulation, and is thus recommended for an occasional evaluation.
A norm of the residual is typically used for stopping criteria. The norm of the explicit residual $b - A x_{k+1}$ provides a guaranteed level of accuracy both in exact arithmetic and in the presence of rounding errors, where convergence naturally stagnates. In contrast, the implicit residual $r_{k+1} := r_k - \alpha_k A p_k$ is known to keep getting smaller in amplitude well below the level of rounding errors and thus cannot be used to determine the stagnation of convergence.
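As a small illustration of using the explicit residual for the stopping test (the helper below is hypothetical and only demonstrates the idea):
using LinearAlgebra

# Decide convergence from the explicitly recomputed residual b - A*x rather
# than the recursively updated one, since only the former stagnates at the
# attainable accuracy level.
converged(A, b, x, atol) = norm(b - A * x) <= atol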
Computation of alpha and beta
In the algorithm, $\alpha_k$ is chosen such that $r_{k+1}$ is orthogonal to $r_k$. The denominator is simplified from
$$\alpha_k = \frac{r_k^{\mathsf T} r_k}{r_k^{\mathsf T} A p_k} = \frac{r_k^{\mathsf T} r_k}{p_k^{\mathsf T} A p_k},$$
since $r_k = p_k - \beta_{k-1} p_{k-1}$. The $\beta_k$ is chosen such that $p_{k+1}$ is conjugate to $p_k$. Initially, $\beta_k$ is
$$\beta_k = - \frac{r_{k+1}^{\mathsf T} A p_k}{p_k^{\mathsf T} A p_k}.$$
Using
$$r_{k+1} = r_k - \alpha_k A p_k,$$
and equivalently
$$A p_k = \frac{1}{\alpha_k} \bigl(r_k - r_{k+1}\bigr),$$
the numerator of $\beta_k$ is rewritten as
$$r_{k+1}^{\mathsf T} A p_k = \frac{1}{\alpha_k} r_{k+1}^{\mathsf T} \bigl(r_k - r_{k+1}\bigr) = - \frac{1}{\alpha_k} r_{k+1}^{\mathsf T} r_{k+1},$$
because $r_{k+1}$ and $r_k$ are orthogonal by design. The denominator is rewritten as
$$p_k^{\mathsf T} A p_k = \bigl(r_k + \beta_{k-1} p_{k-1}\bigr)^{\mathsf T} A p_k = \frac{1}{\alpha_k} r_k^{\mathsf T} \bigl(r_k - r_{k+1}\bigr) = \frac{1}{\alpha_k} r_k^{\mathsf T} r_k,$$
using that the search directions $p_k$ are conjugate and again that the residuals are orthogonal. This gives the $\beta_k$ in the algorithm after cancelling $\alpha_k$.
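The equivalence of the two expressions for $\beta_k$ can also be checked numerically; the following illustrative Julia snippet performs one conjugate gradient step on a random symmetric positive-definite system (all names are local to the example):
using LinearAlgebra

n = 6
M = randn(n, n)
A = Symmetric(M' * M + n * I)   # random symmetric positive-definite matrix
b = randn(n)

x = zeros(n)
r = b - A * x                   # initial residual (equals b here)
p = copy(r)                     # initial search direction
α = (r' * r) / (p' * A * p)
x += α * p
r_new = r - α * (A * p)         # next residual

β_definition = -(r_new' * A * p) / (p' * A * p)   # β from the conjugacy requirement
β_algorithm  = (r_new' * r_new) / (r' * r)        # β as used in the algorithm
@assert isapprox(β_definition, β_algorithm; rtol=1e-8)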
Example code in Julia
using LinearAlgebra
"""
x = conjugate_gradient; atol=length*eps
Return the solution to `A * x = b` using the conjugate gradient method.
`A` must be a positive definite matrix or other linear operator.
`x0` is the initial guess for the solution.
`atol` is the absolute tolerance on the magnitude of the residual `b - A * x`
for convergence.
Returns the approximate solution vector `x`.
"""
function conjugate_gradient; atol=length*eps
x = copy # initialize the solution
r = b - A * x0 # initial residual
p = copy # initial search direction
r²old = r' * r # squared norm of residual
k = 0
while r²old > atol^2 # iterate until convergence
Ap = A * p # search direction
α = r²old / # step size
@. x += α * p # update solution
# Update residual:
if % 16 0 # every 16 iterations, recompute residual from scratch
r.= b.- A * x # to avoid accumulation of numerical errors
else
@. r -= α * Ap # use the updating formula that saves one matrix-vector product
end
r²new = r' * r
@. p = r + * p # update search direction
r²old = r²new # update squared residual norm
k += 1
end
return x
end
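A possible usage sketch for the function above, continuing in the same session so that `LinearAlgebra` is already loaded (the test problem is made up for illustration):
n = 100
M = randn(n, n)
A = Symmetric(M' * M + n * I)   # well-conditioned symmetric positive-definite matrix
b = randn(n)
x0 = zeros(n)

x = conjugate_gradient(A, b, x0)
@show norm(b - A * x)           # residual of the iterative solution
@show norm(x - A \ b)           # difference from a direct solve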
Example code in MATLAB
function x = conjugate_gradient(A, b, x0, tol)
    % Return the solution to `A * x = b` using the conjugate gradient method.
    % Reminder: A should be symmetric and positive definite.
    if nargin < 4
        tol = eps;
    end
    r = b - A * x0;
    p = r;
    rsold = r' * r;
    x = x0;
    while sqrt(rsold) > tol
        Ap = A * p;
        alpha = rsold / (p' * Ap);
        x = x + alpha * p;
        r = r - alpha * Ap;
        rsnew = r' * r;
        p = r + (rsnew / rsold) * p;
        rsold = rsnew;
    end
end
Numerical example
Consider the linear system $A x = b$ for a $2 \times 2$ symmetric positive-definite matrix $A$ and right-hand side $b$. We will perform two steps of the conjugate gradient method beginning with the initial guess $x_0$
in order to find an approximate solution to the system.
Solution
For reference, the exact solution of the system is $x_* = A^{-1} b$. Our first step is to calculate the residual vector $r_0$ associated with $x_0$. This residual is computed from the formula $r_0 = b - A x_0$.
Since this is the first iteration, we will use the residual vector $r_0$ as our initial search direction $p_0$; the method of selecting $p_k$ will change in further iterations.
We now compute the scalar $\alpha_0$ using the relationship
$$\alpha_0 = \frac{r_0^{\mathsf T} r_0}{p_0^{\mathsf T} A p_0}.$$
We can now compute $x_1$ using the formula
$$x_1 = x_0 + \alpha_0 p_0.$$
This result completes the first iteration, the result being an "improved" approximate solution to the system, $x_1$. We may now move on and compute the next residual vector $r_1$ using the formula
$$r_1 = r_0 - \alpha_0 A p_0.$$
Our next step in the process is to compute the scalar $\beta_0$ that will eventually be used to determine the next search direction $p_1$:
$$\beta_0 = \frac{r_1^{\mathsf T} r_1}{r_0^{\mathsf T} r_0}.$$
Now, using this scalar $\beta_0$, we can compute the next search direction $p_1$ using the relationship
$$p_1 = r_1 + \beta_0 p_0.$$
We now compute the scalar $\alpha_1$ using our newly acquired $p_1$, by the same method as that used for $\alpha_0$:
$$\alpha_1 = \frac{r_1^{\mathsf T} r_1}{p_1^{\mathsf T} A p_1}.$$
Finally, we find $x_2$ using the same method as that used to find $x_1$:
$$x_2 = x_1 + \alpha_1 p_1.$$
The result, $x_2$, is a "better" approximation to the system's solution than $x_1$ and $x_0$. If exact arithmetic were used in this example instead of limited-precision arithmetic, then the exact solution would theoretically have been reached after $n = 2$ iterations.
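As a stand-in for the worked arithmetic, the following Julia snippet carries out the same two steps on an assumed $2 \times 2$ system (the matrix, right-hand side, and initial guess are chosen purely for illustration):
using LinearAlgebra

# An assumed 2x2 symmetric positive-definite example (illustration only).
A  = [4.0 1.0; 1.0 3.0]
b  = [1.0, 2.0]
x0 = [2.0, 1.0]

r0 = b - A * x0                  # initial residual
p0 = copy(r0)                    # first search direction
α0 = (r0' * r0) / (p0' * A * p0)
x1 = x0 + α0 * p0                # first iterate

r1 = r0 - α0 * (A * p0)          # next residual
β0 = (r1' * r1) / (r0' * r0)
p1 = r1 + β0 * p0                # next search direction
α1 = (r1' * r1) / (p1' * A * p1)
x2 = x1 + α1 * p1                # second iterate

@show x2                         # for a 2x2 system, x2 matches A \ b up to rounding
@show A \ b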