Project Description

Parallax: Auto-Parallelizing Machine Learning Model Training
The employment of high-performance servers and GPU accelerators for training neural network models has greatly accelerated recent advances in machine learning (ML). ML frameworks such as TensorFlow, MXNet, and Caffe2 have emerged to help ML researchers train their models in a distributed fashion. However, efficiently utilizing multi-machine, multi-GPU environments is still not straightforward for researchers, because they must apply nontrivial modifications to their single-device programs, including parallel execution mechanisms and device coordination.

We introduce Parallax, a tool for automatic parallelization of neural network training in distributed environments. Parallax employs data parallelism for distributed training, using either the parameter server architecture or MPI-style collective communication primitives for synchronization. Parallax applies various optimizations to minimize the communication overhead caused by scaling out, such as aggregating intermediate data to reduce inter-machine communication time and manipulating the operation schedule to overlap communication with computation.
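
The following is a minimal sketch, not Parallax's actual API, contrasting the two synchronization styles mentioned above: averaging gradients at a central parameter server versus an MPI-style allreduce that leaves every worker with the same averaged gradient. The function names and the toy gradients are purely illustrative.

```python
# Hedged sketch of data-parallel gradient synchronization (hypothetical helpers,
# not Parallax's implementation).
import numpy as np

def parameter_server_step(worker_grads, params, lr=0.1):
    """Workers push gradients to a central server, which averages them
    and applies the update (parameter server architecture)."""
    avg_grad = np.mean(worker_grads, axis=0)   # aggregation at the server
    return params - lr * avg_grad

def allreduce_step(worker_grads, params, lr=0.1):
    """Every worker obtains the same averaged gradient, as an MPI-style
    allreduce would produce, and applies the update locally."""
    avg_grad = np.sum(worker_grads, axis=0) / len(worker_grads)
    return params - lr * avg_grad

params = np.zeros(4)
grads = [np.random.randn(4) for _ in range(3)]  # one gradient per worker
# Both styles compute the same synchronous update; they differ in where
# the aggregation happens and how the data moves between machines.
print(parameter_server_step(grads, params))
print(allreduce_step(grads, params))
```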

New Deep Learning Abstraction: SubGraph and InvokeOp
A number of recent deep learning applications involve deep neural networks whose architectures change dynamically. However, popular frameworks, including TensorFlow, Caffe2, MXNet, and CNTK, are not designed to support such structures natively, incurring substantial overhead to change the architecture of a given network on the fly. An important class of such architectures is recursive neural networks, which researchers have widely used to handle recursively or hierarchically structured data. However, deep learning frameworks with embedded control flow, such as TensorFlow, Theano, Caffe2, and MXNet, fail to represent and execute such neural networks efficiently because they lack support for recursion. In this work, we add recursion to the programming model of existing frameworks by complementing their design with recursive execution of dataflow graphs as well as additional APIs for recursive definitions. Unlike iterative implementations, which only see the topological index of each node in recursive data structures, our recursive implementation can exploit the recursive relationships between nodes for efficient, parallel execution. We present an implementation on TensorFlow and evaluation results with various recursive neural network models, showing that our recursive implementation not only conveys the recursive nature of recursive neural networks better than other implementations, but also uses the given resources more effectively to reduce training and inference time.
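
To illustrate the idea of execution that follows the recursive structure of the data, here is a plain-Python sketch of evaluating a tree-structured model. It is not the SubGraph/InvokeOp API described above; the Node class and compose function are hypothetical, and a real implementation would express this as recursive execution of dataflow graphs inside the framework.

```python
# Illustrative sketch only: recursion mirrors the tree, so independent
# subtrees could in principle be evaluated in parallel, rather than
# walking nodes by a precomputed topological index.
import numpy as np

class Node:
    def __init__(self, left=None, right=None, embedding=None):
        self.left, self.right, self.embedding = left, right, embedding

def compose(left_vec, right_vec, W):
    # Hypothetical composition: one shared weight applied to the
    # concatenation of the two child representations.
    return np.tanh(W @ np.concatenate([left_vec, right_vec]))

def evaluate(node, W):
    # Leaves return their embeddings; internal nodes compose the
    # recursively computed results of their children.
    if node.embedding is not None:
        return node.embedding
    return compose(evaluate(node.left, W), evaluate(node.right, W), W)

dim = 4
W = np.random.randn(dim, 2 * dim)
leaf = lambda: Node(embedding=np.random.randn(dim))
tree = Node(Node(leaf(), leaf()), leaf())
print(evaluate(tree, W))
```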

Pretzel: High-Performance Serverless Prediction Serving
Machine learning models are often composed of sequences of transformations. While this design makes it easy to decompose and efficiently execute individual model components at training time, serving predictions requires low latency and predictable high performance, so end-to-end and multi-model runtime optimizations are needed to meet these goals. We shed light on the problem by introducing a new system design for high-performance prediction serving. We report results showing how our system design improves performance along several dimensions with respect to current state-of-the-art approaches.
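
As a toy illustration of the setting, and not of our system's design, the sketch below expresses two models as pipelines of transformations that share a featurization stage; caching that stage stands in, very loosely, for the kind of cross-model runtime optimization the paragraph refers to. All names here (tokenize, featurize, model_a, model_b) are hypothetical.

```python
# Hedged sketch: prediction pipelines as sequences of transformations,
# with a shared stage memoized so two models reuse its result.
from functools import lru_cache

@lru_cache(maxsize=None)
def tokenize(text):                  # shared featurization stage
    return tuple(text.lower().split())

def featurize(tokens, vocab):
    return [tokens.count(w) for w in vocab]

def score(features, weights):        # model-specific final stage
    return sum(f * w for f, w in zip(features, weights))

VOCAB = ("good", "bad", "great")

def model_a(text):
    return score(featurize(tokenize(text), VOCAB), weights=[1.0, -1.0, 2.0])

def model_b(text):
    return score(featurize(tokenize(text), VOCAB), weights=[0.5, -2.0, 1.0])

text = "great good bad good"
print(model_a(text), model_b(text))  # tokenize runs once; the second call hits the cache
```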

Auto-tuning Machine Learning Configuration
Recent research on machine learning systems has shown that the performance of machine learning jobs depends heavily on their configurations. Examples of configurable knobs in machine learning systems include resource allocation, workload distribution, algorithm selection, and hyperparameters. We aim to build a system that automatically finds the best configuration for a machine learning job by applying a broad range of optimization techniques.
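
As one example of such an optimization technique, the sketch below performs a simple random search over a small configuration space. The knobs, their candidate values, and the run_job objective are all hypothetical placeholders; a real tuner would launch the ML job with each sampled configuration and measure throughput or time-to-accuracy.

```python
# Hedged sketch of configuration auto-tuning via random search
# (illustrative knobs and objective, not the system's real interface).
import random

SEARCH_SPACE = {
    "num_workers": [1, 2, 4, 8],
    "batch_size": [32, 64, 128],
    "learning_rate": [0.001, 0.01, 0.1],
}

def sample_config():
    # Pick one candidate value per knob uniformly at random.
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def run_job(config):
    # Placeholder objective standing in for an actual training run.
    return config["num_workers"] * config["batch_size"] / (1 + config["learning_rate"])

best = max((sample_config() for _ in range(20)), key=run_job)
print("best configuration found:", best)
```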

Publications
Eunji Jeong, Joo Seong Jeong, Soojeong Kim, Gyeong-In Yu, Byung-Gon Chun. Improving the Expressiveness of Deep Learning Frameworks with Recursion. To appear at EuroSys 2018.
Yunseong Lee, Alberto Scolari, Matteo Interlandi, Markus Weimer, Byung-Gon Chun. Towards High-Performance Prediction Serving Systems. SysML Conference, February 2018.
Yunseong Lee, Alberto Scolari, Matteo Interlandi, Markus Weimer, Byung-Gon Chun. Towards High-Performance Prediction Serving Systems. ML Systems Workshop at NIPS 2017, December 2017.
Soojeong Kim, Eunji Jeong, Joo Seong Jeong, Gyeong-In Yu, Hojin Park, Byung-Gon Chun. Auto-Parallelizing Deep Learning for Multi-machine, Multi-GPU Environments. Workshop on AI Systems at the Symposium on Operating Systems Principles (SOSP), October 2017.
Byung-Gon Chun, Brian Cho, Beomyeol Jeon, Joo Seong Jeong, Gunhee Kim, Joo Yeon Kim, Woo-Yeon Lee, Yunseong Lee, Markus Weimer, Youngseok Yang, Gyeong-In Yu. Dolphin: Runtime Optimization for Distributed Machine Learning. ML Systems Workshop at ICML, June 2016.
Joo Seong Jeong, Woo-Yeon Lee, Yunseong Lee, Youngseok Yang, Brian Cho, Byung-Gon Chun. Elastic Memory: Bring Elasticity Back to In-Memory Big Data Analytics. USENIX HotOS Workshop, May 2015.