Misplaced Pages

Anderson acceleration

Article snapshot taken from[REDACTED] with creative commons attribution-sharealike license. Give it a read and then ask your questions in the chat. We can research this topic together.

In mathematics, Anderson acceleration, also called Anderson mixing, is a method for the acceleration of the convergence rate of fixed-point iterations. Introduced by Donald G. Anderson, this technique can be used to find the solution to fixed point equations f ( x ) = x {\displaystyle f(x)=x} often arising in the field of computational science.

Definition

Given a function f : R n R n {\displaystyle f:\mathbb {R} ^{n}\to \mathbb {R} ^{n}} , consider the problem of finding a fixed point of f {\displaystyle f} , which is a solution to the equation f ( x ) = x {\displaystyle f(x)=x} . A classical approach to the problem is to employ a fixed-point iteration scheme; that is, given an initial guess x 0 {\displaystyle x_{0}} for the solution, to compute the sequence x i + 1 = f ( x i ) {\displaystyle x_{i+1}=f(x_{i})} until some convergence criterion is met. However, the convergence of such a scheme is not guaranteed in general; moreover, the rate of convergence is usually linear, which can become too slow if the evaluation of the function f {\displaystyle f} is computationally expensive. Anderson acceleration is a method to accelerate the convergence of the fixed-point sequence.

Define the residual g ( x ) = f ( x ) x {\displaystyle g(x)=f(x)-x} , and denote f k = f ( x k ) {\displaystyle f_{k}=f(x_{k})} and g k = g ( x k ) {\displaystyle g_{k}=g(x_{k})} (where x k {\displaystyle x_{k}} corresponds to the sequence of iterates from the previous paragraph). Given an initial guess x 0 {\displaystyle x_{0}} and an integer parameter m 1 {\displaystyle m\geq 1} , the method can be formulated as follows:

x 1 = f ( x 0 ) {\displaystyle x_{1}=f(x_{0})}
k = 1 , 2 , {\displaystyle \forall k=1,2,\dots }
m k = min { m , k } {\displaystyle m_{k}=\min\{m,k\}}
G k = [ g k m k g k ] {\displaystyle G_{k}={\begin{bmatrix}g_{k-m_{k}}&\dots &g_{k}\end{bmatrix}}}
α k = argmin α A k G k α 2 , where A k = { α = ( α 0 , , α m k ) R m k + 1 : i = 0 m k α i = 1 } {\displaystyle \alpha _{k}=\operatorname {argmin} _{\alpha \in A_{k}}\|G_{k}\alpha \|_{2},\quad {\text{where}}\;A_{k}=\{\alpha =(\alpha _{0},\dots ,\alpha _{m_{k}})\in \mathbb {R} ^{m_{k}+1}:\sum _{i=0}^{m_{k}}\alpha _{i}=1\}}
x k + 1 = i = 0 m k ( α k ) i f k m k + i {\displaystyle x_{k+1}=\sum _{i=0}^{m_{k}}(\alpha _{k})_{i}f_{k-m_{k}+i}}

where the matrix–vector multiplication G k α = i = 0 m k ( α ) i g k m k + i {\displaystyle G_{k}\alpha =\sum _{i=0}^{m_{k}}(\alpha )_{i}g_{k-m_{k}+i}} , and ( α ) i {\displaystyle (\alpha )_{i}} is the i {\displaystyle i} th element of α {\displaystyle \alpha } . Conventional stopping criteria can be used to end the iterations of the method. For example, iterations can be stopped when x k + 1 x k {\displaystyle \|x_{k+1}-x_{k}\|} falls under a prescribed tolerance, or when the residual g ( x k ) {\displaystyle g(x_{k})} falls under a prescribed tolerance.

With respect to the standard fixed-point iteration, the method has been found to converge faster and be more robust, and in some cases avoid the divergence of the fixed-point sequence.

Derivation

For the solution x {\displaystyle x^{*}} , we know that f ( x ) = x {\displaystyle f(x^{*})=x^{*}} , which is equivalent to saying that g ( x ) = 0 {\displaystyle g(x^{*})={\vec {0}}} . We can therefore rephrase the problem as an optimization problem where we want to minimize g ( x ) 2 {\displaystyle \|g(x)\|_{2}} .

Instead of going directly from x k {\displaystyle x_{k}} to x k + 1 {\displaystyle x_{k+1}} by choosing x k + 1 = f ( x k ) {\displaystyle x_{k+1}=f(x_{k})} as in fixed-point iteration, let's consider an intermediate point x k + 1 {\displaystyle x'_{k+1}} that we choose to be the linear combination x k + 1 = X k α k {\displaystyle x'_{k+1}=X_{k}\alpha _{k}} , where the coefficient vector α k A k {\displaystyle \alpha _{k}\in A_{k}} , and X k = [ x k m k x k ] {\displaystyle X_{k}={\begin{bmatrix}x_{k-m_{k}}&\dots &x_{k}\end{bmatrix}}} is the matrix containing the last m k + 1 {\displaystyle m_{k}+1} points, and choose x k + 1 {\displaystyle x'_{k+1}} such that it minimizes g ( x k + 1 ) 2 {\displaystyle \|g(x'_{k+1})\|_{2}} . Since the elements in α k {\displaystyle \alpha _{k}} sum to one, we can make the first order approximation g ( X k α k ) = g ( i = 0 m k ( α k ) i x k m k + i ) i = 0 m k ( α k ) i g ( x k m k + i ) = G k α k {\displaystyle g(X_{k}\alpha _{k})=g\left(\sum _{i=0}^{m_{k}}(\alpha _{k})_{i}x_{k-m_{k}+i}\right)\approx \sum _{i=0}^{m_{k}}(\alpha _{k})_{i}g(x_{k-m_{k}+i})=G_{k}\alpha _{k}} , and our problem becomes to find the α {\displaystyle \alpha } that minimizes G k α 2 {\displaystyle \|G_{k}\alpha \|_{2}} . After having found α k {\displaystyle \alpha _{k}} , we could in principle calculate x k + 1 {\displaystyle x'_{k+1}} .

However, since f {\displaystyle f} is designed to bring a point closer to x {\displaystyle x^{*}} , f ( x k + 1 ) {\displaystyle f(x'_{k+1})} is probably closer to x {\displaystyle x^{*}} than what x k + 1 {\displaystyle x'_{k+1}} is, so it makes sense to choose x k + 1 = f ( x k + 1 ) {\displaystyle x_{k+1}=f(x'_{k+1})} rather than x k + 1 = x k + 1 {\displaystyle x_{k+1}=x'_{k+1}} . Furthermore, since the elements in α k {\displaystyle \alpha _{k}} sum to one, we can make the first order approximation f ( x k + 1 ) = f ( i = 0 m k ( α k ) i x k m k + i ) i = 0 m k ( α k ) i f ( x k m k + i ) = i = 0 m k ( α k ) i f k m k + i {\displaystyle f(x'_{k+1})=f\left(\sum _{i=0}^{m_{k}}(\alpha _{k})_{i}x_{k-m_{k}+i}\right)\approx \sum _{i=0}^{m_{k}}(\alpha _{k})_{i}f(x_{k-m_{k}+i})=\sum _{i=0}^{m_{k}}(\alpha _{k})_{i}f_{k-m_{k}+i}} . We therefore choose

x k + 1 = i = 0 m k ( α k ) i f k m k + i {\displaystyle x_{k+1}=\sum _{i=0}^{m_{k}}(\alpha _{k})_{i}f_{k-m_{k}+i}} .

Solution of the minimization problem

At each iteration of the algorithm, the constrained optimization problem argmin G k α 2 {\displaystyle \operatorname {argmin} \|G_{k}\alpha \|_{2}} , subject to α A k {\displaystyle \alpha \in A_{k}} needs to be solved. The problem can be recast in several equivalent formulations, yielding different solution methods which may result in a more convenient implementation:

  • defining the matrices G k = [ g k m k + 1 g k m k g k g k 1 ] {\displaystyle {\mathcal {G}}_{k}={\begin{bmatrix}g_{k-m_{k}+1}-g_{k-m_{k}}&\dots &g_{k}-g_{k-1}\end{bmatrix}}} and X k = [ x k m k + 1 x k m k x k x k 1 ] {\displaystyle {\mathcal {X}}_{k}={\begin{bmatrix}x_{k-m_{k}+1}-x_{k-m_{k}}&\dots &x_{k}-x_{k-1}\end{bmatrix}}} , solve γ k = argmin γ R m k g k G k γ 2 {\displaystyle \gamma _{k}=\operatorname {argmin} _{\gamma \in \mathbb {R} ^{m_{k}}}\|g_{k}-{\mathcal {G}}_{k}\gamma \|_{2}} , and set x k + 1 = x k + g k ( X k + G k ) γ k {\displaystyle x_{k+1}=x_{k}+g_{k}-({\mathcal {X}}_{k}+{\mathcal {G}}_{k})\gamma _{k}} ;
  • solve θ k = { ( θ k ) i } i = 1 m k = argmin θ R m k g k + i = 1 m k θ i ( g k i g k ) 2 {\displaystyle \theta _{k}=\{(\theta _{k})_{i}\}_{i=1}^{m_{k}}=\operatorname {argmin} _{\theta \in \mathbb {R} ^{m_{k}}}\left\|g_{k}+\sum _{i=1}^{m_{k}}\theta _{i}(g_{k-i}-g_{k})\right\|_{2}} , then set x k + 1 = x k + g k + j = 1 m k ( θ k ) j ( x k j x k + g k j g k ) {\displaystyle x_{k+1}=x_{k}+g_{k}+\sum _{j=1}^{m_{k}}(\theta _{k})_{j}(x_{k-j}-x_{k}+g_{k-j}-g_{k})} .

For both choices, the optimization problem is in the form of an unconstrained linear least-squares problem, which can be solved by standard methods including QR decomposition and singular value decomposition, possibly including regularization techniques to deal with rank deficiencies and conditioning issues in the optimization problem. Solving the least-squares problem by solving the normal equations is generally not advisable due to potential numerical instabilities and generally high computational cost.

Stagnation in the method (i.e. subsequent iterations with the same value, x k + 1 = x k {\displaystyle x_{k+1}=x_{k}} ) causes the method to break down, due to the singularity of the least-squares problem. Similarly, near-stagnation ( x k + 1 x k {\displaystyle x_{k+1}\approx x_{k}} ) results in bad conditioning of the least squares problem. Moreover, the choice of the parameter m {\displaystyle m} might be relevant in determining the conditioning of the least-squares problem, as discussed below.

Relaxation

The algorithm can be modified introducing a variable relaxation parameter (or mixing parameter) β k > 0 {\displaystyle \beta _{k}>0} . At each step, compute the new iterate as x k + 1 = ( 1 β k ) i = 0 m k ( α k ) i x k m k + i + β k i = 0 m k ( α k ) i f ( x k m k + i ) . {\displaystyle x_{k+1}=(1-\beta _{k})\sum _{i=0}^{m_{k}}(\alpha _{k})_{i}x_{k-m_{k}+i}+\beta _{k}\sum _{i=0}^{m_{k}}(\alpha _{k})_{i}f(x_{k-m_{k}+i})\;.} The choice of β k {\displaystyle \beta _{k}} is crucial to the convergence properties of the method; in principle, β k {\displaystyle \beta _{k}} might vary at each iteration, although it is often chosen to be constant.

Choice of m

The parameter m {\displaystyle m} determines how much information from previous iterations is used to compute the new iteration x k + 1 {\displaystyle x_{k+1}} . On the one hand, if m {\displaystyle m} is chosen to be too small, too little information is used and convergence may be undesirably slow. On the other hand, if m {\displaystyle m} is too large, information from old iterations may be retained for too many subsequent iterations, so that again convergence may be slow. Moreover, the choice of m {\displaystyle m} affects the size of the optimization problem. A too large value of m {\displaystyle m} may worsen the conditioning of the least squares problem and the cost of its solution. In general, the particular problem to be solved determines the best choice of the m {\displaystyle m} parameter.

Choice of mk

With respect to the algorithm described above, the choice of m k {\displaystyle m_{k}} at each iteration can be modified. One possibility is to choose m k = k {\displaystyle m_{k}=k} for each iteration k {\displaystyle k} (sometimes referred to as Anderson acceleration without truncation). This way, every new iteration x k + 1 {\displaystyle x_{k+1}} is computed using all the previously computed iterations. A more sophisticated technique is based on choosing m k {\displaystyle m_{k}} so as to maintain a small enough conditioning for the least-squares problem.

Relations to other classes of methods

Newton's method can be applied to the solution of f ( x ) x = 0 {\displaystyle f(x)-x=0} to compute a fixed point of f ( x ) {\displaystyle f(x)} with quadratic convergence. However, such method requires the evaluation of the exact derivative of f ( x ) {\displaystyle f(x)} , which can be very costly. Approximating the derivative by means of finite differences is a possible alternative, but it requires multiple evaluations of f ( x ) {\displaystyle f(x)} at each iteration, which again can become very costly. Anderson acceleration requires only one evaluation of the function f ( x ) {\displaystyle f(x)} per iteration, and no evaluation of its derivative. On the other hand, the convergence of an Anderson-accelerated fixed point sequence is still linear in general.

Several authors have pointed out similarities between the Anderson acceleration scheme and other methods for the solution of non-linear equations. In particular:

  • Eyert and Fang and Saad interpreted the algorithm within the class of quasi-Newton and multisecant methods, that generalize the well known secant method, for the solution of the non-linear equation g ( x ) = 0 {\displaystyle g(x)=0} ; they also showed how the scheme can be seen as a method in the Broyden class;
  • Walker and Ni showed that the Anderson acceleration scheme is equivalent to the GMRES method in the case of linear problems (i.e. the problem of finding a solution to A x = x {\displaystyle A\mathbf {x} =\mathbf {x} } for some square matrix A {\displaystyle A} ), and can thus be seen as a generalization of GMRES to the non-linear case; a similar result was found by Washio and Oosterlee.

Moreover, several equivalent or nearly equivalent methods have been independently developed by other authors, although most often in the context of some specific application of interest rather than as a general method for fixed point equations.

Example MATLAB implementation

The following is an example implementation in MATLAB language of the Anderson acceleration scheme for finding the fixed-point of the function f ( x ) = sin ( x ) + arctan ( x ) {\displaystyle f(x)=\sin(x)+\arctan(x)} . Notice that:

  • the optimization problem was solved in the form γ k = argmin γ R m k g k G k γ 2 {\displaystyle \gamma _{k}=\operatorname {argmin} _{\gamma \in \mathbb {R} ^{m_{k}}}\|g_{k}-{\mathcal {G}}_{k}\gamma \|_{2}} using QR decomposition;
  • the computation of the QR decomposition is sub-optimal: indeed, at each iteration a single column is added to the matrix G k {\displaystyle {\mathcal {G}}_{k}} , and possibly a single column is removed; this fact can be exploited to efficiently update the QR decomposition with less computational effort;
  • the algorithm can be made more memory-efficient by storing only the latest few iterations and residuals, if the whole vector of iterations x k {\displaystyle x_{k}} is not needed;
  • the code is straightforwardly generalized to the case of a vector-valued f ( x ) {\displaystyle f(x)} .
f = @(x) sin(x) + atan(x); % Function whose fixed point is to be computed.
x0 = 1; % Initial guess.
k_max = 100; % Maximum number of iterations.
tol_res = 1e-6; % Tolerance on the residual.
m = 3; % Parameter m.
x = ; % Vector of iterates x.
g = f(x) - x; % Vector of residuals.
G_k = g(2) - g(1); % Matrix of increments in residuals.
X_k = x(2) - x(1); % Matrix of increments in x.
k = 2;
while k < k_max && abs(g(k)) > tol_res
    m_k = min(k, m);
    % Solve the optimization problem by QR decomposition.
     = qr(G_k);
    gamma_k = R \ (Q' * g(k));
    % Compute new iterate and new residual.
    x(k + 1) = x(k) + g(k) - (X_k + G_k) * gamma_k;
    g(k + 1) = f(x(k + 1)) - x(k + 1);
    % Update increment matrices with new elements.
    X_k = ;
    G_k = ;
    n = size(X_k, 2);
    if n > m_k
        X_k = X_k(:, n - m_k + 1:end);
        G_k = G_k(:, n - m_k + 1:end);
    end
    k = k + 1;
end
% Prints result: Computed fixed point 2.013444 after 9 iterations
fprintf("Computed fixed point %f after %d iterations\n", x(end), k);

See also

Notes

  1. This formulation is not the same as given by the original author; it is an equivalent, more explicit formulation given by Walker and Ni.

References

  1. ^ Anderson, Donald G. (October 1965). "Iterative Procedures for Nonlinear Integral Equations". Journal of the ACM. 12 (4): 547–560. doi:10.1145/321296.321305.
  2. ^ Quarteroni, Alfio; Sacco, Riccardo; Saleri, Fausto (30 November 2010). Numerical mathematics (2nd ed.). Springer. ISBN 978-3-540-49809-4.
  3. ^ Walker, Homer F.; Ni, Peng (January 2011). "Anderson Acceleration for Fixed-Point Iterations". SIAM Journal on Numerical Analysis. 49 (4): 1715–1735. CiteSeerX 10.1.1.722.2636. doi:10.1137/10078356X.
  4. ^ Fang, Haw-ren; Saad, Yousef (March 2009). "Two classes of multisecant methods for nonlinear acceleration". Numerical Linear Algebra with Applications. 16 (3): 197–221. doi:10.1002/nla.617.
  5. Evans, Claire; Pollock, Sara; Rebholz, Leo G.; Xiao, Mengying (20 February 2020). "A Proof That Anderson Acceleration Improves the Convergence Rate in Linearly Converging Fixed-Point Methods (But Not in Those Converging Quadratically)". SIAM Journal on Numerical Analysis. 58 (1): 788–810. arXiv:1810.08455. doi:10.1137/19M1245384.
  6. Eyert, V. (March 1996). "A Comparative Study on Methods for Convergence Acceleration of Iterative Vector Sequences". Journal of Computational Physics. 124 (2): 271–285. doi:10.1006/jcph.1996.0059.
  7. Broyden, C. G. (1965). "A class of methods for solving nonlinear simultaneous equations". Mathematics of Computation. 19 (92): 577–593. doi:10.1090/S0025-5718-1965-0198670-6.
  8. Ni, Peng (November 2009). Anderson Acceleration of Fixed-point Iteration with Applications to Electronic Structure Computations (PhD).
  9. ^ Oosterlee, C. W.; Washio, T. (January 2000). "Krylov Subspace Acceleration of Nonlinear Multigrid with Application to Recirculating Flows". SIAM Journal on Scientific Computing. 21 (5): 1670–1690. doi:10.1137/S1064827598338093.
  10. Pulay, Péter (July 1980). "Convergence acceleration of iterative sequences. the case of scf iteration". Chemical Physics Letters. 73 (2): 393–398. doi:10.1016/0009-2614(80)80396-4.
  11. Pulay, P. (1982). "ImprovedSCF convergence acceleration". Journal of Computational Chemistry. 3 (4): 556–560. doi:10.1002/jcc.540030413.
  12. Carlson, Neil N.; Miller, Keith (May 1998). "Design and Application of a Gradient-Weighted Moving Finite Element Code I: in One Dimension". SIAM Journal on Scientific Computing. 19 (3): 728–765. doi:10.1137/S106482759426955X.
  13. Miller, Keith (November 2005). "Nonlinear Krylov and moving nodes in the method of lines". Journal of Computational and Applied Mathematics. 183 (2): 275–287. doi:10.1016/j.cam.2004.12.032.
  14. Daniel, J. W.; Gragg, W. B.; Kaufman, L.; Stewart, G. W. (October 1976). "Reorthogonalization and stable algorithms for updating the Gram-Schmidt $QR$ factorization". Mathematics of Computation. 30 (136): 772. doi:10.1090/S0025-5718-1976-0431641-8.
Categories:
Anderson acceleration Add topic