Tyler Johnson

Overview

I work as an ML Research Engineer at Apple in Seattle, WA. I am an expert in scalable training algorithms for machine learning, including distributed algorithms for deep learning.

In 2018, I completed my PhD at the University of Washington, where my advisor was Carlos Guestrin. Before that, I attended the University of Michigan.

Publications

AdaScale SGD: A User-Friendly Algorithm for Distributed Training.

Tyler B. Johnson, Pulkit Agrawal, Haijie Gu, and Carlos Guestrin.

ICML, 2020 (To appear).

When using large-batch training to speed up stochastic gradient descent, learning rates must adapt to new batch sizes in order to maximize speed-ups and preserve model quality. Re-tuning learning rates is resource intensive, while fixed scaling rules often degrade model quality. We propose AdaScale SGD, an algorithm that reliably adapts learning rates to large-batch training. By continually adapting to the gradient's variance, AdaScale automatically achieves speed-ups for a wide range of batch sizes. We formally describe this quality with AdaScale's convergence bound, which maintains final objective values, even as batch sizes grow large and the number of iterations decreases. In empirical comparisons, AdaScale trains well beyond the batch size limits of popular "linear learning rate scaling" rules. This includes large-batch training with no model degradation for machine translation, image classification, object detection, and speech recognition tasks. AdaScale's qualitative behavior is similar to that of "warm-up" heuristics, but unlike warm-up, this behavior emerges naturally from a principled mechanism. The algorithm introduces negligible computational overhead and no new hyperparameters, making AdaScale an attractive choice for large-scale training in practice.
@inproceedings{johnson2020adascale,
author = {Tyler B. Johnson and Pulkit Agrawal and Haijie Gu and Carlos Guestrin},
title = {AdaScale SGD: A User-Friendly Algorithm for Distributed Training},
booktitle = {Proceedings of the 37th International Conference on Machine Learning},
year = {2020}
}
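
To make the mechanism concrete, here is a minimal NumPy sketch of a variance-based gain rule in the spirit of AdaScale. It assumes a gain of the form (variance + mu^2) / (variance / S + mu^2) computed from S per-worker gradients; the published algorithm uses smoothed, carefully debiased estimators of these quantities, so the names and details below are illustrative assumptions rather than the exact method.

import numpy as np

def adascale_gain(per_worker_grads, scale, eps=1e-12):
    # per_worker_grads: array of shape (scale, dim), one small-batch gradient per worker.
    # The gain compares gradient variance to the squared norm of the mean gradient and
    # lies in [1, scale]: near scale when variance dominates, near 1 when the mean dominates.
    mean_grad = per_worker_grads.mean(axis=0)
    mu_sq = float(np.sum(mean_grad ** 2))
    var = float(np.mean(np.sum((per_worker_grads - mean_grad) ** 2, axis=1)))
    return (var + mu_sq) / (var / scale + mu_sq + eps)

def train_step(params, per_worker_grads, base_lr_schedule, tau, scale):
    # Hypothetical large-batch step: amplify the single-batch learning rate by the gain
    # and advance the scale-invariant iteration counter tau by the same amount.
    r = adascale_gain(per_worker_grads, scale)
    lr = r * base_lr_schedule(tau)
    new_params = params - lr * per_worker_grads.mean(axis=0)
    return new_params, tau + r

Here base_lr_schedule stands for the learning-rate schedule that worked at the original batch size, indexed by the scale-invariant counter tau rather than by the large-batch iteration count.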

Scaling Machine Learning via Prioritized Optimization.

PhD Dissertation, University of Washington, 2018.

To learn from large datasets, modern machine learning applications rely on scalable training algorithms. Typically such algorithms employ stochastic updates, parallelism, or both. This work develops scalable algorithms via a third approach: prioritized optimization.

We first propose a method for prioritizing challenging tasks when training deep models. Our robust approximate importance sampling procedure (RAIS) speeds up stochastic gradient descent by sampling minibatches non-uniformly. By approximating the ideal sampling distribution using robust optimization, RAIS provides much of the benefit of exact importance sampling with little overhead and minimal hyperparameters.

In the second part of this work, we develop strategies for prioritizing optimization when solving convex problems with piecewise linear structure. Our BlitzWS working set algorithm offers unique theoretical guarantees and solves several classic machine learning problems very efficiently in practice. We also propose a closely related safe screening test, BlitzScreen, which advances the state of the art in multiple respects.

Our final contribution is a “stingy update” rule for coordinate descent. Our StingyCD algorithm prioritizes optimization variables by eliminating provably useless computation. StingyCD requires only simple changes to CD and results in significant speed-ups in practice.

@phdthesis{johnson2018dissertation,
author = {Tyler B. Johnson},
title = {Scaling Machine Learning via Prioritized Optimization},
school = {University of Washington},
year = {2018}
}

Training Deep Models Faster with Robust, Approximate Importance Sampling.

Tyler B. Johnson and Carlos Guestrin.

NeurIPS, 2018.

In theory, importance sampling speeds up stochastic gradient algorithms for supervised learning by prioritizing training instances. In practice, the cost of computing importances greatly limits the impact of importance sampling. We propose a robust, approximate importance sampling procedure (RAIS) for stochastic gradient descent. By approximating the ideal sampling distribution using robust optimization, RAIS provides much of the benefit of exact importance sampling with drastically reduced overhead. Empirically, we find RAIS-SGD and standard SGD follow similar learning curves, but RAIS moves faster through these paths, achieving speed-ups of at least 20% and sometimes more than 2x in our comparisons. RAIS requires only simple changes to SGD, and the procedure depends minimally on hyperparameters.
@inproceedings{johnson2018rais,
author = {Tyler B. Johnson and Carlos Guestrin},
title = {Training Deep Models Faster with Robust, Approximate Importance Sampling},
booktitle = {Advances in Neural Information Processing Systems 31},
year = {2018}
}
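
The sketch below shows the generic importance-sampling SGD step that RAIS builds on: sample examples non-uniformly and reweight their gradients so the minibatch gradient stays unbiased. The per-example scores and the grad_fn helper are hypothetical placeholders; RAIS's actual contribution, the robust-optimization-based approximation of the ideal sampling distribution, is not reproduced here.

import numpy as np

def sample_minibatch(scores, batch_size, rng):
    # scores: nonnegative per-example importance estimates (e.g. proxies for gradient norms).
    # Returns sampled indices and the weights that keep the minibatch gradient unbiased.
    n = len(scores)
    p = scores / scores.sum()
    idx = rng.choice(n, size=batch_size, replace=True, p=p)
    weights = 1.0 / (n * p[idx])
    return idx, weights

def importance_sampled_sgd_step(params, grad_fn, scores, batch_size, lr, rng):
    # grad_fn(params, i) returns the gradient for training example i (hypothetical helper).
    idx, weights = sample_minibatch(scores, batch_size, rng)
    grad = sum(w * grad_fn(params, i) for i, w in zip(idx, weights)) / batch_size
    return params - lr * grad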

A Fast, Principled Working Set Algorithm for Exploiting Piecewise Linear Structure in Convex Problems.

Tyler B. Johnson and Carlos Guestrin.

arXiv:1807.08046, 2018.

By reducing optimization to a sequence of smaller subproblems, working set algorithms achieve fast convergence times for many machine learning problems. Despite such performance, working set implementations often resort to heuristics to determine subproblem size, makeup, and stopping criteria. We propose BlitzWS, a working set algorithm with useful theoretical guarantees. Our theory relates subproblem size and stopping criteria to the amount of progress during each iteration. This result motivates strategies for optimizing algorithmic parameters and discarding irrelevant components as BlitzWS progresses toward a solution. BlitzWS applies to many convex problems, including training L1-regularized models and support vector machines. We showcase this versatility with empirical comparisons, which demonstrate BlitzWS is indeed a fast algorithm.
@misc{johnson2018blitzws,
author = {Tyler B. Johnson and Carlos Guestrin},
title = {A Fast, Principled Working Set Algorithm for Exploiting Piecewise Linear Structure in Convex Problems},
howpublished = {arXiv:1807.08046},
year = {2018}
}
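
The skeleton below shows the generic working set loop that BlitzWS instantiates: solve a subproblem restricted to a working set, check all components for violations, and grow the set. The solve_subproblem and violation callables are hypothetical placeholders, and the fixed growth size stands in for BlitzWS's principled rules for choosing subproblem size, makeup, and stopping criteria.

import numpy as np

def working_set_loop(num_components, solve_subproblem, violation,
                     grow_size=100, max_outer_iters=50, tol=1e-6):
    # solve_subproblem(ws) approximately solves the problem restricted to working set ws.
    # violation(x, j) measures how strongly component j (a feature, a constraint, ...) is
    # violated at the current solution x; zero means the component can remain excluded.
    ws = set(range(min(grow_size, num_components)))
    x = None
    for _ in range(max_outer_iters):
        x = solve_subproblem(sorted(ws))
        v = np.array([violation(x, j) for j in range(num_components)])
        if v.max() <= tol:
            return x  # no excluded component is violated, so x solves the full problem
        # Add the most violated components to the working set and repeat.
        ws.update(int(j) for j in np.argsort(-v)[:grow_size])
    return x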

StingyCD: Safely Avoiding Wasteful Updates in Coordinate Descent.

Tyler B. Johnson and Carlos Guestrin.

ICML, 2017.

Coordinate descent (CD) is a scalable and simple algorithm for solving many optimization problems in machine learning. Despite this fact, CD can also be very computationally wasteful. Due to sparsity in sparse regression problems, for example, often the majority of CD updates result in no progress toward the solution. To address this inefficiency, we propose a modified CD algorithm named "StingyCD." By skipping over many updates that are guaranteed to not decrease the objective value, StingyCD significantly reduces convergence times. Since StingyCD only skips updates with this guarantee, however, StingyCD does not fully exploit the problem's sparsity. For this reason, we also propose StingyCD+, an algorithm that achieves further speed-ups by skipping updates more aggressively. Since StingyCD and StingyCD+ rely on simple modifications to CD, it is also straightforward to use these algorithms with other approaches to scaling optimization. In empirical comparisons, StingyCD and StingyCD+ improve convergence times considerably for L1-regularized optimization problems.
@inproceedings{johnson2017stingycd,
author = {Tyler B. Johnson and Carlos Guestrin},
title = {StingyCD: Safely Avoiding Wasteful Updates in Coordinate Descent},
booktitle = {Proceedings of the 34th International Conference on Machine Learning},
year = {2017}
}
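
For reference, here is plain coordinate descent for the Lasso, the setting where wasted updates are most visible. The comments mark the exact condition under which an update leaves a zero coordinate unchanged; StingyCD's contribution is a cheaper cached test that detects many of these cases without forming the dot product at all, and that test is not reproduced here.

import numpy as np

def lasso_cd(X, y, lam, n_iters=100):
    # Coordinate descent for min_w 0.5 * ||y - X w||^2 + lam * ||w||_1.
    # Assumes every column of X is nonzero. For a coordinate with w[j] == 0, the update
    # below returns zero whenever |X[:, j] @ r| <= lam, so all of that work is wasted.
    n, d = X.shape
    w = np.zeros(d)
    r = y.copy()                      # residual y - X w
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iters):
        for j in range(d):
            rho = X[:, j] @ r + col_sq[j] * w[j]
            w_new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            if w_new != w[j]:
                r -= (w_new - w[j]) * X[:, j]   # keep the residual in sync
                w[j] = w_new
    return w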

Unified Methods for Exploiting Piecewise Linear Structure in Convex Optimization.

Tyler B. Johnson and Carlos Guestrin.

NIPS, 2016.

We develop methods for rapidly identifying important components of a convex optimization problem for the purpose of achieving fast convergence times. By considering a novel problem formulation—the minimization of a sum of piecewise functions—we describe a principled and general mechanism for exploiting piecewise linear structure in convex optimization. This result leads to a theoretically justified working set algorithm and a novel screening test, which generalize and improve upon many prior results on exploiting structure in convex optimization. In empirical comparisons, we study the scalability of our methods. We find that screening scales surprisingly poorly with the size of the problem, while our working set algorithm convincingly outperforms alternative approaches.
@inproceedings{johnson2016piecewise,
author = {Tyler B. Johnson and Carlos Guestrin},
title = {Unified Methods for Exploiting Piecewise Linear Structure in Convex Optimization},
booktitle = {Advances in Neural Information Processing Systems 29},
year = {2016}
}
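
To illustrate what a screening test does, the function below implements the standard sphere-based safe screening idea for the Lasso, not this paper's more general test. It assumes a dual-feasible point theta (with dual constraints |X_j^T theta| <= 1) and a radius known to contain the dual optimum, for example one derived from a duality gap; any feature whose constraint cannot be active anywhere in that ball must have zero weight at the optimum and can be discarded.

import numpy as np

def safe_screen_lasso(X, theta, radius):
    # Feature j is safe to discard if |X_j^T theta'| < 1 for every theta' within
    # distance radius of theta, which holds whenever |X_j^T theta| + radius * ||X_j|| < 1.
    correlations = np.abs(X.T @ theta)
    col_norms = np.linalg.norm(X, axis=0)
    return correlations + radius * col_norms < 1.0   # True means the feature can be dropped

After screening, the problem can be solved over the surviving columns only, e.g. X[:, ~safe_screen_lasso(X, theta, radius)].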

Blitz: A Principled Meta-Algorithm for Scaling Sparse Optimization.

Tyler B. Johnson and Carlos Guestrin.

ICML, 2015.

By reducing optimization to a sequence of small subproblems, working set methods achieve fast convergence times for many challenging problems. Despite excellent performance, theoretical understanding of working sets is limited, and implementations often resort to heuristics to determine subproblem size, makeup, and stopping criteria. We propose Blitz, a fast working set algorithm accompanied by useful guarantees. Making no assumptions on data, our theory relates subproblem size to progress toward convergence. This result motivates methods for optimizing algorithmic parameters and discarding irrelevant variables as iterations progress. Applied to L1-regularized learning, Blitz convincingly outperforms existing solvers in sequential, limited-memory, and distributed settings. Blitz is not specific to L1-regularized learning, making the algorithm relevant to many applications involving sparsity or constraints.
@inproceedings{johnson2015blitz,
author = {Tyler B. Johnson and Carlos Guestrin},
title = {Blitz: A Principled Meta-Algorithm for Scaling Sparse Optimization},
booktitle = {Proceedings of the 32nd International Conference on Machine Learning},
year = {2015}
}
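
For the L1-regularized case the abstract highlights, a natural priority score for excluded features is their KKT violation, sketched below. This is a standard violation measure, used here only to illustrate how variables might be ranked; it is not necessarily the exact priority rule Blitz uses.

import numpy as np

def lasso_violations(X, y, w, lam):
    # A zero-weight feature j is consistent with optimality only if |X_j^T (y - X w)| <= lam,
    # so the excess over lam measures how urgently it should enter the working set.
    r = y - X @ w
    return np.maximum(np.abs(X.T @ r) - lam, 0.0)

Features would then be considered in the order np.argsort(-lasso_violations(X, y, w, lam)).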

Teaching

I have enjoyed helping teach the following courses: