Particle gradient flows

Many tasks in machine learning and signal processing require minimizing a convex function of a measure:
\[ \min_{\mu \in \mathcal{M}(\Theta)} J(\mu) \]
where \(\mathcal{M}(\Theta)\) is the space of measures on a manifold \(\Theta\). We focus on the case where \(J\) is (weakly) continuous and (Fréchet) differentiable. For this optimization problem, we investigate the method that consists in discretizing the measure into particles \(\mu = \sum_{i=1}^m w_i \delta_{\theta_i}\) and taking the gradient flow of \(J\) in the parameters \((w_i,\theta_i)_i\). We show that, quite generically, this gradient flow converges to global minimizers in the many-particle limit \(m \to \infty\), provided the initial distribution of particles satisfies a "separation" property.
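To make the discretization concrete, here is a minimal sketch (not the paper's code) of gradient descent on the particle parameters for a least-squares objective \(J(\mu) = \tfrac{1}{2}\|A\mu - y\|^2\), in the spirit of sparse deconvolution. The Gaussian feature map, the spike locations, the particle count, and the step size are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Grid on which the signal is observed, and an assumed ground-truth sparse measure.
x_grid = np.linspace(0.0, 1.0, 100)
true_pos = np.array([0.3, 0.7])          # assumed spike locations
true_w = np.array([1.0, -0.5])           # assumed spike weights
sigma = 0.05                             # assumed feature width

def features(theta):
    """Gaussian features phi(theta) evaluated on the grid (illustrative choice)."""
    return np.exp(-0.5 * ((x_grid[None, :] - theta[:, None]) / sigma) ** 2)

y = true_w @ features(true_pos)          # observed signal A mu*

# Over-parameterize: m particles, spread-out initialization ("separation").
m = 50
theta = np.linspace(0.0, 1.0, m)         # particle positions
w = np.zeros(m)                          # particle weights

lr = 0.05
for it in range(5000):
    Phi = features(theta)                # shape (m, grid)
    residual = w @ Phi - y               # A mu - y
    grad_w = Phi @ residual              # dJ/dw_i
    # dJ/dtheta_i: derivative of the Gaussian feature w.r.t. its center.
    dPhi = Phi * (x_grid[None, :] - theta[:, None]) / sigma ** 2
    grad_theta = w * (dPhi @ residual)
    w -= lr * grad_w
    theta -= lr * grad_theta

print("final objective:", 0.5 * np.sum((w @ features(theta) - y) ** 2))
```

With enough particles and a spread-out initialization, the surviving particles concentrate near the true spike locations; this is only a toy instance of the discretized flow, not the experiments reported in the preprint.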
Preprint

Lenaic Chizat, Francis Bach. On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport. 2018. <hal-01798792>

Animations

We display below animated versions of the particle gradient flows shown in Section 4 of the preprint, as well as a particle gradient flow corresponding to the training of a neural network with a single hidden layer and sigmoid activation function in dimension d=2. In all these cases, the global minimizer (supported on the dotted lines in the plots) is found. The time scale is logarithmic (see the paper for details).

Sparse spikes deconvolution

Neural networks with a single hidden layer

We train a simple neural network on a synthetic, well-specified model with SGD; a minimal sketch of this setting is given below.
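The following sketch illustrates the single-hidden-layer setting under assumed choices (teacher and student sizes, learning rate, Gaussian input distribution): the hidden units \((w_i, \theta_i)\) play the role of the particles, and SGD on a well-specified teacher model stands in for the particle gradient flow. It is not the code used for the animations.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2                                   # input dimension, as in the animations

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed ground-truth (teacher) network with a few hidden units.
m_true = 3
theta_true = rng.standard_normal((m_true, d))
w_true = rng.standard_normal(m_true)

def teacher(x):
    return sigmoid(x @ theta_true.T) @ w_true

# Over-parameterized student: many particles, small random output weights.
m = 100
theta = rng.standard_normal((m, d))     # hidden-layer weights (positions)
w = 1e-3 * rng.standard_normal(m)       # output weights

lr = 0.1
for step in range(20000):
    x = rng.standard_normal(d)          # one fresh sample per SGD step
    y = teacher(x)
    h = sigmoid(theta @ x)              # hidden activations, shape (m,)
    err = h @ w - y                     # scalar residual
    # Gradients of the squared loss 0.5 * err**2 w.r.t. w and theta.
    grad_w = err * h
    grad_theta = err * (w * h * (1.0 - h))[:, None] * x[None, :]
    w -= lr * grad_w
    theta -= lr * grad_theta

# Report the test error on a held-out batch.
X_test = rng.standard_normal((1000, d))
mse = np.mean((sigmoid(X_test @ theta.T) @ w - teacher(X_test)) ** 2)
print("test MSE:", mse)
```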