ETH Zürich & Microsoft study: Demystifying serverless ML training
Serverless computing is a new type of cloud computing infrastructure originally developed for web microservices and IoT applications. Because it frees model developers from concerns about capacity planning, configuration, management, maintenance, operation, and scaling of containers, virtual machines, and physical servers, serverless computing has grown in popularity with machine learning (ML) researchers in recent years.
Additionally, the benefits of serverless computing have sparked interest in its adoption for data-intensive workloads such as ETL (extract, transform, load), query processing, and ML, where it can significantly reduce costs. Riding this trend, a research team from ETH Zürich and Microsoft recently conducted a systematic comparative study of distributed ML training on serverless infrastructures (FaaS) and serverful infrastructures (IaaS), with the aim of "identify and understand the system trade-offs involved in distributed ML training with serverless infrastructures."
Serverless computing is offered by leading cloud providers through services such as AWS Lambda, Azure Functions, and Google Cloud Functions. Although researchers are increasingly choosing FaaS for ML inference, it remains unclear whether FaaS is a good choice for ML training. This "training platform as a service" paradigm is appealing to both industry and academia, and AWS now offers serverless ML training on AWS Lambda through the SageMaker and AutoGluon platforms.
The paper Towards Demystifying Serverless Machine Learning Training poses the question: when can a serverless infrastructure (FaaS) outperform a VM-based "serverful" infrastructure (IaaS) for distributed ML training?
The team summarizes their contributions as follows:
- Systematically explore the algorithm choices and system design for FaaS- and IaaS-based ML training strategies, and describe the trade-offs across a diverse range of ML models, training workloads, and infrastructure choices.
- Develop an analytical model that characterizes the trade-off between FaaS and IaaS training, and use it to speculate on the performance of configurations that future systems might adopt (a toy sketch of this kind of model follows this list).
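The paper's actual analytical model is not reproduced here, but a minimal back-of-the-envelope sketch can convey the kind of comparison such a model enables. In the Python sketch below, every function name, formula, and price is an illustrative assumption, not the paper's model:

```python
# Toy analytical FaaS-vs-IaaS trade-off model. All names, formulas, and
# prices are illustrative assumptions for exposition, NOT the paper's model.

def faas_time(iters, t_compute, t_storage_comm):
    """FaaS: functions cannot talk to each other directly, so every
    iteration pays for communication through external storage."""
    return iters * (t_compute + t_storage_comm)

def iaas_time(iters, t_compute, t_direct_comm, t_startup):
    """IaaS: VMs pay a one-off startup cost but then communicate directly."""
    return t_startup + iters * (t_compute + t_direct_comm)

def faas_cost(time_s, n_workers, gb_mem, price_gb_s=0.0000167):
    # Lambda-style billing: per GB-second, only while functions run.
    return time_s * n_workers * gb_mem * price_gb_s

def iaas_cost(time_s, n_vms, price_vm_h=0.10):
    # VM-style billing: per instance-hour, startup time included.
    return time_s / 3600 * n_vms * price_vm_h

# Example: a short job with cheap iterations favors FaaS on time
# (no VM startup), but not necessarily on cost.
t_f = faas_time(iters=20, t_compute=2.0, t_storage_comm=1.5)
t_i = iaas_time(iters=20, t_compute=2.0, t_direct_comm=0.2, t_startup=120)
print(f"FaaS: {t_f:6.0f}s  ${faas_cost(t_f, n_workers=8, gb_mem=3):.4f}")
print(f"IaaS: {t_i:6.0f}s  ${iaas_cost(t_i, n_vms=2):.4f}")
```

Even with these made-up numbers, the sketch reproduces the qualitative pattern the team reports: the FaaS run finishes sooner because it skips VM startup, yet its bill can still exceed the IaaS bill.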
The team uses LambdaML, a FaaS-based ML system prototype built on AWS Lambda, to investigate the trade-offs involved in training ML models on serverless infrastructures. With this approach, a user specifies training configurations such as data location, resources, optimization algorithm, and hyperparameters in the AWS web UI. AWS then submits the job to a serverless infrastructure that allocates resources based on user demand. Training data is partitioned and stored in AWS S3, and each "worker" (running instance) maintains a local copy of the model and uses the LambdaML library to train it. The LambdaML training pipeline has five stages: load data, compute statistics, send statistics, aggregate statistics, and update the model.
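As a concrete illustration, here is a minimal, self-contained sketch of that five-stage loop. A plain Python dict stands in for S3, and a least-squares model stands in for the real workload; the function names (`gradient`, `run_step`) are hypothetical, not LambdaML's actual API:

```python
import numpy as np

storage = {}  # stands in for S3: the only channel the FaaS workers share

def gradient(w, X, y):
    """Least-squares gradient; a stand-in for the real training workload."""
    return 2 * X.T @ (X @ w - y) / len(y)

def run_step(step, w, X, y, n_workers=4, lr=0.1):
    # Stages 1-3: each invocation loads its partition, computes its
    # gradient ("statistics"), and writes it to shared storage.
    for wid in range(n_workers):
        Xi, yi = X[wid::n_workers], y[wid::n_workers]  # 1. load data
        g = gradient(w, Xi, yi)                        # 2. compute statistics
        storage[f"grad/{step}/{wid}"] = g              # 3. send statistics
    # Stages 4-5: gradients are averaged and the model is updated.
    g_avg = np.mean([storage[f"grad/{step}/{i}"] for i in range(n_workers)],
                    axis=0)                            # 4. aggregate statistics
    return w - lr * g_avg                              # 5. update the model

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=200)
w = np.zeros(5)
for step in range(100):
    w = run_step(step, w, X, y)
print("recovered weights:", np.round(w, 2))
```

The key structural point the sketch captures is that stateless Lambda functions cannot open connections to one another, so every exchange of statistics must round-trip through shared storage.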
The researchers explore four major aspects of the LambdaML implementation: the distributed optimization algorithm, the communication channel, the communication pattern, and the synchronization protocol. They focus on two distributed optimization algorithms, distributed Stochastic Gradient Descent (SGD) and the distributed Alternating Direction Method of Multipliers (ADMM), and use a storage service such as S3 or ElastiCache as the communication channel. The team uses AllReduce and ScatterReduce as communication patterns and designs a two-phase synchronization protocol comprising a merge phase and an update phase.
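The contrast between the two communication patterns, and the merge/update split, can be sketched over a shared key-value store (again standing in for S3 or ElastiCache). The keys and function names below are illustrative assumptions; LambdaML's real implementation differs in detail:

```python
import numpy as np

store, N = {}, 4  # shared key-value store (stands in for S3), worker count

# --- merge phase: workers upload their vectors and reducers aggregate ---
def merge_allreduce(step, wid, vec):
    store[f"ar/{step}/{wid}"] = vec  # every worker uploads its full vector

def reduce_allreduce(step):
    # AllReduce through storage: a single leader merges all N vectors
    store[f"ar/{step}/out"] = sum(store[f"ar/{step}/{i}"] for i in range(N))

def merge_scatterreduce(step, wid, vec):
    for j, chunk in enumerate(np.array_split(vec, N)):
        store[f"sr/{step}/{j}/{wid}"] = chunk  # chunk j goes to reducer j

def reduce_scatterreduce(step, wid):
    # ScatterReduce: worker `wid` merges only its own chunk, so the
    # aggregation work is spread across all workers
    store[f"sr/{step}/out/{wid}"] = sum(store[f"sr/{step}/{wid}/{i}"]
                                        for i in range(N))

# --- update phase: every worker reads the merged result back ---
def update_allreduce(step):
    return store[f"ar/{step}/out"]

def update_scatterreduce(step):
    return np.concatenate([store[f"sr/{step}/out/{j}"] for j in range(N)])

# one synchronization round over toy gradients
vecs = [np.full(8, float(wid)) for wid in range(N)]
for wid in range(N):
    merge_allreduce(0, wid, vecs[wid])
    merge_scatterreduce(0, wid, vecs[wid])
reduce_allreduce(0)
for wid in range(N):
    reduce_scatterreduce(0, wid)
print(update_allreduce(0))      # sum over workers: [6. 6. ... 6.]
print(update_scatterreduce(0))  # identical result, different data movement
```

Both patterns produce the same aggregate; they differ in where the reduction work and storage traffic land, which is exactly the kind of system trade-off the study measures.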
LambdaML's performance was evaluated by comparing the design options referenced above on the Higgs, RCV1, and Cifar10 datasets. The team implemented GA-SGD (SGD with gradient averaging), MA-SGD (SGD with model averaging), and ADMM on top of LambdaML, using ElastiCache for Memcached as the external storage service.
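The distinction between GA-SGD and MA-SGD matters for FaaS because it determines how often workers must synchronize through storage. The toy sketch below, illustrative only, contrasts the two on the same least-squares problem:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = X @ np.array([2.0, -1.0, 0.5, 1.5, -0.5])
parts = [(X[i::4], y[i::4]) for i in range(4)]  # four workers' partitions

def grad(w, Xi, yi):
    return 2 * Xi.T @ (Xi @ w - yi) / len(yi)

# GA-SGD: workers synchronize every iteration by averaging gradients,
# so 50 steps means 50 communication rounds.
w = np.zeros(5)
for _ in range(50):
    w -= 0.1 * np.mean([grad(w, Xi, yi) for Xi, yi in parts], axis=0)
print("GA-SGD:", np.round(w, 2))

# MA-SGD: each worker trains locally and the models are averaged at the
# end (in practice averaging happens periodically, not just once), so
# far less data moves through the communication channel.
local_models = []
for Xi, yi in parts:
    wi = np.zeros(5)
    for _ in range(50):
        wi -= 0.1 * grad(wi, Xi, yi)
    local_models.append(wi)
print("MA-SGD:", np.round(np.mean(local_models, axis=0), 2))
```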
From the empirical results, the team concludes that FaaS can be faster than IaaS, but only in a specific regime: when the underlying workload can be made efficient both in convergence rate and in the amount of data communicated. They also note that even when FaaS is much faster, it is not much cheaper. An insight that holds across all scenarios is that even when FaaS is faster than IaaS, it is generally comparable in cost.
Overall, the results confirm that LambdaML enables a fair comparison between FaaS and IaaS systems, taking a significant step toward demystifying serverless ML training.
The paper Towards Demystifying Serverless Machine Learning Training is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang
We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.