An Intuitive Explanation of SGD vs Gradient Descent

I wrote this as a response to a question about how stochastic gradient descent (SGD) can outperform traditional gradient descent given its randomness.

My intuition is that we care more about making meaningful progress toward the optimum than about taking the most precise path to get there, and SGD translates data into progress more efficiently. Because the gradient's linear approximation grows less accurate as the step gets bigger, a sequence of smaller SGD steps, each computed on a subset of the data, can make as much progress as the single large step we could confidently take with full-batch gradient descent on the entire dataset.

Gradient descent performs expensive data transfer and compute to determine the exact gradient, then takes one 'perfect' step along the trajectory that was right at the beginning of the step. But that step may no longer be optimal by its end, since the linear approximation becomes less representative over larger distances.
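To make the cost concrete, here is a minimal sketch of full-batch gradient descent on a hypothetical least-squares problem. The setup, names (`X`, `y`, `full_batch_gd`), and hyperparameters are all illustrative assumptions, not from the original discussion. Every step touches the entire dataset before the weights move at all:

```python
import numpy as np

# Hypothetical least-squares problem: minimize mean((X @ w - y) ** 2).
# All names and hyperparameters here are illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.arange(1.0, 6.0)
y = X @ true_w

def full_batch_gd(X, y, lr=0.1, steps=100):
    """One exact gradient per step, computed over the entire dataset."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(X)  # full-dataset mean gradient
        w -= lr * grad
    return w

w = full_batch_gd(X, y)  # w converges toward true_w
```

Note that each of the 100 steps reads all 1000 rows, even though the direction it produces is only trusted for one (locally linear) step.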

SGD accepts that no individual small step follows the perfect trajectory, but the overall trajectory still heads the right way. Because SGD makes progress throughout its sweep over the dataset, it ends up calculating gradients from points that are, on average, closer to convergence than the gradients computed during the same sweep of gradient descent, which waits for a full pass over the data before making any progress.
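One sweep of plain SGD on the same toy least-squares problem can be sketched as follows (again, the names and the learning rate are illustrative assumptions). The weights are updated after every single example, so gradients later in the sweep are evaluated at points that have already moved toward the optimum:

```python
import numpy as np

# Same hypothetical least-squares setup as before; names are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.arange(1.0, 6.0)
y = X @ true_w

def sgd_one_epoch(X, y, w, lr=0.02):
    """One sweep over the data, updating w after every example, so each
    later gradient is evaluated at a point that has already made progress."""
    for i in rng.permutation(len(X)):
        grad = 2 * X[i] * (X[i] @ w - y[i])  # single-example gradient
        w = w - lr * grad
    return w

w = sgd_one_epoch(X, y, np.zeros(5))  # close to true_w after one sweep
```

Here a single pass over the 1000 examples takes 1000 noisy steps, whereas full-batch gradient descent would have taken exactly one step with the same amount of data read.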

Another way to look at it: from any point, calculating the trajectory for the next step is a tradeoff between data transfer and compute on one side and inaccuracy on the other. As we consider more data points when determining our next step, the marginal improvement each additional point contributes to the trajectory declines, while the error of the linear gradient estimate relative to the true function (normalized for step size against data points calculated) grows. If the error in our trajectory from calculating over a fraction of the dataset is already close to the error from the linear approximation itself, then calculating over the remainder of the dataset is wasted effort. Mini-batch SGD lets us tune this tradeoff via the batch size.
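That tunable tradeoff can be sketched on the same hypothetical problem: `batch_size` below is the knob, and the two extremes recover the algorithms discussed above (all names and hyperparameter values remain illustrative assumptions):

```python
import numpy as np

# Same hypothetical setup; batch_size tunes the data-vs-accuracy tradeoff.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.arange(1.0, 6.0)
y = X @ true_w

def minibatch_sgd(X, y, batch_size=32, lr=0.1, epochs=5):
    """batch_size=1 recovers plain SGD; batch_size=len(X) recovers
    full-batch gradient descent; values in between trade noise for compute."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w

w = minibatch_sgd(X, y)
```

With `batch_size=32`, each step averages away most of the single-example noise while still reading only a small fraction of the data per update.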

Away from the optimum, we care about progress, not the correctness of the path. The random noise in SGD's steps is even a benefit, since it helps the iterates escape local minima. So, comparing SGD to gradient descent in terms of progress per unit of data transfer and compute, SGD can often achieve progress competitive with full-batch gradient descent using less data, while being less likely to get stuck at a sub-optimal point.