Computer Vision News - January 2020
Summary: Neural NTK

Theoretical Analysis

The model that we are interested in is the evolution of the network under gradient flow, i.e. gradient descent with an infinitesimally small learning rate. In this case, if we denote the network output on example $x_i$ at time $t$ by $u_i(t) = f(W(t), x_i)$ and let $u(t)$ be the vector concatenating the outputs across all examples, then it is not hard to verify from the definition of the derivative that

$$\frac{du(t)}{dt} = -H(t)\,\big(u(t) - y\big), \qquad \text{where} \qquad H_{ij}(t) = \Big\langle \frac{\partial f(W(t), x_i)}{\partial W}, \frac{\partial f(W(t), x_j)}{\partial W} \Big\rangle .$$

The main theoretical concept is that when the network is initialized with i.i.d. normal weights, the matrix $H(t)$ remains very close throughout training to a different matrix, denoted $H^*$. This matrix is independent of time and has a closed form,

$$H^*_{ij} = \mathbb{E}_W\Big[\Big\langle \frac{\partial f(W, x_i)}{\partial W}, \frac{\partial f(W, x_j)}{\partial W} \Big\rangle\Big] =: \ker(x_i, x_j),$$

i.e. the expectation of the above inner product over randomly initialized weights. The magic happens when we replace $H(t)$ above by $H^*$: in this case, the training dynamics become a linear system whose solution is the kernel-regression predictor

$$f^*(x) = \big(\ker(x, x_1), \ldots, \ker(x, x_n)\big)\,(H^*)^{-1}\, y .$$

The main contribution of the paper is to show that when the width is large enough, $H(t)$ is indeed close to $H^*$, and the paper gives an explicit bound on the distance of $f^*$ from the original neural network. The authors also provide a closed-form solution for $H^*$ and supply a dynamic-programming algorithm to compute it. Finally, the paper establishes a formula for $H^*_{ij}$ as a univariate recursive function that depends only on $x_i^T x_j$ and can be computed efficiently (further details and explanations are in the paper).

The bottom line of the theoretical analysis is the following: instead of training a network, choosing a learning rate, and feeding every new example forward through the network, we only need to do two things: compute $H^*$ by the recursive formula and define the (trained) network as

$$f^*(x) = \big(\ker(x, x_1), \ldots, \ker(x, x_n)\big)\,(H^*)^{-1}\, y .$$
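To make these objects concrete, here is a minimal sketch (not the paper's code) for a toy two-layer ReLU network: it builds the Gram matrix $H(t)$ from per-example gradients and then forms the kernel-regression predictor $f^*(x)$ above. The architecture, the initialization scaling, and the Monte-Carlo estimate of $H^*$ (which stands in for the paper's exact recursive formula) are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def init_params(d_in, width):
    """i.i.d. normal initialization, as assumed by the analysis."""
    W1 = rng.normal(size=(width, d_in)) / np.sqrt(d_in)
    W2 = rng.normal(size=(width,)) / np.sqrt(width)
    return W1, W2

def grad_f(params, x):
    """Gradient of the scalar output f(W, x) = W2 . ReLU(W1 x) w.r.t. all weights."""
    W1, W2 = params
    pre = W1 @ x
    dW2 = np.maximum(pre, 0.0)                 # df/dW2 = ReLU(W1 x)
    dW1 = np.outer(W2 * (pre > 0), x)          # df/dW1[k, j] = W2[k] * 1{pre[k] > 0} * x[j]
    return np.concatenate([dW1.ravel(), dW2])

def empirical_H(params, X):
    """H_ij(t) = <df(W(t), x_i)/dW, df(W(t), x_j)/dW> at the current weights W(t)."""
    G = np.stack([grad_f(params, x) for x in X])
    return G @ G.T

def mc_kernel_gram(X, d_in, width, n_init=100):
    """Monte-Carlo estimate of ker(x_i, x_j) = E_W[<df/dW(x_i), df/dW(x_j)>]
    over random initializations (the paper computes this expectation exactly)."""
    H = np.zeros((len(X), len(X)))
    for _ in range(n_init):
        H += empirical_H(init_params(d_in, width), X)
    return H / n_init

# Toy usage: kernel regression with f*(x) = (ker(x, x_1), ..., ker(x, x_n)) (H*)^{-1} y
d_in, width, n = 5, 2048, 8
X_train = rng.normal(size=(n, d_in))
y_train = rng.normal(size=n)
x_test = rng.normal(size=d_in)

gram = mc_kernel_gram(np.vstack([X_train, x_test]), d_in, width)
H_star, k_test = gram[:n, :n], gram[n, :n]
f_star = k_test @ np.linalg.solve(H_star, y_train)
print("kernel prediction at x_test:", f_star)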
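The paper's exact recursive formula for $H^*$ is not reproduced in this article, but for fully connected ReLU networks the standard layerwise NTK recursion has the following shape. This sketch is written from that standard recursion rather than copied from the paper; the depth parameter and the normalization convention are assumptions and may differ from the authors' exact formulation.

def relu_ntk(x, z, depth=3):
    """Recursive NTK value ker(x, z) for a fully connected ReLU network with
    `depth` hidden layers; only the inner products x.x, z.z and x.z are used."""
    sigma_xx, sigma_zz, sigma_xz = x @ x, z @ z, x @ z
    theta = sigma_xz                                  # NTK contribution of the input layer
    for _ in range(depth):
        norm = np.sqrt(sigma_xx * sigma_zz)
        angle = np.arccos(np.clip(sigma_xz / norm, -1.0, 1.0))
        # Gaussian expectations of ReLU(u)ReLU(v) and ReLU'(u)ReLU'(v), scaled so that
        # the diagonal entries sigma_xx and sigma_zz are preserved from layer to layer
        sigma_xz = norm * (np.sin(angle) + (np.pi - angle) * np.cos(angle)) / np.pi
        sigma_dot = (np.pi - angle) / np.pi
        theta = theta * sigma_dot + sigma_xz          # Theta^(h) = Theta^(h-1) * Sigma_dot^(h) + Sigma^(h)
    return theta

# Hypothetical usage with the training points from the previous sketch:
# H_star_exact = np.array([[relu_ntk(xi, xj) for xj in X_train] for xi in X_train])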