Natural Gradient

In many problems involving parameter spaces, the conventional gradient is not sufficient or efficient enough. In optimization problems such as supervised learning and source separation, it is often more efficient to use the natural gradient when implementing the learning rule.

A recurring problem in optimization is the minimization of a quantity such as a cost function or mutual information. The usual strategy is to follow the gradient with respect to the parameters in order to locate the global minimum. This strategy is often suboptimal, and may oscillate near local minima, when the conventional gradient $$ \nabla = \left( \frac{\partial}{\partial w_1}, \frac{\partial}{\partial w_2}, \ldots, \frac{\partial}{\partial w_n} \right) $$ is used.

A substantial improvement is to use the natural gradient of the parameter space. The natural gradient gives the direction of steepest descent of the function once the geometry of the parameter space is taken into account. This is accomplished by considering a different metric tensor for the parameter space of the problem.

In conventional Euclidean space, we usually define orthonormal coordinates, such that the squared length of a small vector $$d\vec{w}$$ is $$|d\vec{w}|^2 = \sum_i (dw_i)^2 $$, where $$dw_i$$ are the components of $$d\vec{w}$$, i.e. its projections onto each axis. This is because, in Euclidean space, the metric tensor $$ g_{ij} = \delta_{ij}$$ gives rise to the familiar inner product formula

$$ \vec{a} \cdot \vec{b} = \sum_i \sum_j a_i b_j \delta_{ij} = \sum_i a_i b_i$$,

and length formula $$ |\vec{w}|^2 = \sum_i (w_i)^2 $$, where $$i = 1, \ldots, n$$ (for example, $$i = 1, 2, 3$$ in three-dimensional space).

In a more general (possibly curved) n-dimensional Riemannian space, the metric tensor differs. Usually, we consider an n-dimensional parameter space $$S = \{ w \in \mathbb{R}^n \} $$ where, for example, w may be the weight vector of a neural network. For a Euclidean space, the length of a small increment of w is given by the formula above. For a curved manifold S, though, the length formula is:

$$ |dw|^2 = \sum_{i,j} g_{ij} dw_i dw_j $$, with i, j ranging from 1 to n, where $$G = (g_{ij})$$ is the Riemannian metric tensor.
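
As a small numerical illustration of the length formula, the sketch below evaluates the squared length of the same increment under the Euclidean metric and under a second metric. The matrix `G_curved` and the helper `squared_length` are hypothetical, chosen only for illustration; any symmetric positive-definite matrix would serve.

```python
import numpy as np

def squared_length(dw, G):
    """Squared length of an increment dw under metric G: sum_ij g_ij dw_i dw_j."""
    dw = np.asarray(dw, dtype=float)
    G = np.asarray(G, dtype=float)
    return float(dw @ G @ dw)

dw = np.array([0.1, 0.2])

# Euclidean case: g_ij = delta_ij, so the formula reduces to sum_i (dw_i)^2.
G_euclid = np.eye(2)

# Hypothetical non-Euclidean metric (symmetric, positive definite).
G_curved = np.array([[2.0, 0.5],
                     [0.5, 1.0]])

len_euclid = squared_length(dw, G_euclid)
len_curved = squared_length(dw, G_curved)
print(len_euclid, len_curved)
```

The same increment has a different length under each metric, which is why the direction of steepest descent also depends on the metric.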

Suppose we have also defined a function L(w) on S that we want to minimize (for example, L(w) may be the error function of a neural network model). To minimize the function efficiently, we follow its steepest slope so as to reach the minimum faster. Suppose we move by a small increment dw of fixed squared length $$|dw|^2$$. The steepest descent direction is the one that minimizes $$L(w+dw)$$ among all increments of that length.
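
This constrained minimization can be sketched as follows, assuming dw is small enough for a first-order expansion to hold and writing $$\varepsilon$$ (a symbol introduced here for illustration) for the fixed length of the step. To first order,

$$ L(w + dw) \approx L(w) + \nabla L(w)^T dw $$,

so we minimize $$ \nabla L(w)^T dw $$ subject to $$ \sum_{i,j} g_{ij} dw_i dw_j = dw^T G \, dw = \varepsilon^2 $$. Introducing a Lagrange multiplier $$\lambda$$, using the symmetry of G, and setting the derivative with respect to dw to zero gives

$$ \nabla L(w) + 2 \lambda G \, dw = 0 \quad \Rightarrow \quad dw = -\frac{1}{2\lambda} G^{-1} \nabla L(w) $$,

where $$\lambda > 0$$ is fixed by the length constraint. The step is therefore proportional to $$ -G^{-1} \nabla L(w) $$.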

It can be proved [1] that the steepest descent gradient is given by

$$- \tilde{\nabla} L(w) = -G^{-1}(w) \nabla L(w)$$, where $$\tilde{\nabla} L$$ denotes the natural gradient and $$G^{-1} = (g^{ij})$$ is the inverse of the metric $$G$$. Since $$G$$ is symmetric, its inverse is symmetric as well.
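
The following is a minimal sketch comparing conventional and natural gradient descent on a toy problem. Everything in it is an assumption made for illustration: the quadratic loss with matrix `A`, the choice of metric `G = A`, the step sizes, and the helper names `grad`, `natural_grad`, and `descend`. It is not the method of any particular paper or library.

```python
import numpy as np

# Hypothetical ill-conditioned quadratic loss L(w) = 0.5 * w^T A w, minimum at w = 0.
A = np.array([[100.0, 0.0],
              [0.0, 1.0]])

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    """Conventional gradient of the quadratic loss."""
    return A @ w

def natural_grad(w, G):
    """Natural gradient G^{-1} grad L(w), computed by solving G x = grad L(w)."""
    return np.linalg.solve(G, grad(w))

def descend(w0, direction, lr, n_steps):
    """Repeatedly step against the given direction function."""
    w = np.array(w0, dtype=float)
    for _ in range(n_steps):
        w = w - lr * direction(w)
    return w

# Assumption for this sketch: take the metric equal to the curvature matrix A.
G = A

w0 = [1.0, 1.0]
w_plain = descend(w0, grad, lr=0.015, n_steps=20)
w_natural = descend(w0, lambda w: natural_grad(w, G), lr=0.5, n_steps=20)

print("plain gradient loss:  ", loss(w_plain))
print("natural gradient loss:", loss(w_natural))
```

With this choice of metric the natural gradient at w is simply w, so every coordinate shrinks at the same rate. The conventional gradient, by contrast, flips sign along the steep axis on each step and makes slow progress along the shallow one.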