Requirement: basic understanding of a deep learning platform, e.g., PyTorch or TensorFlow. (select 1 paper below)
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU;
*Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, et al. 2023. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. arXiv preprint arXiv:2303.06865. (https://arxiv.org/abs/2303.06865)*
Fast Distributed Inference Serving for Large Language Models;
*Bingyang Wu, Yinmin Zhong, Zili Zhang, Gang Huang, Xuanzhe Liu, and Xin Jin.*
Tabi: An Efficient Multi-Level Inference System for Large Language Models;
*Yiding Wang, Kai Chen, Haisheng Tan, and Kun Guo. 2023. Tabi: An Efficient Multi-Level Inference System for Large Language Models. In Proceedings of the Eighteenth European Conference on Computer Systems (EuroSys '23). Association for Computing Machinery, New York, NY, USA, 233–248. https://doi.org/10.1145/3552326.3587438*
ZeRO-infinity: breaking the GPU memory wall for extreme scale deep learning;
*Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. 2021. ZeRO-infinity: breaking the GPU memory wall for extreme scale deep learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21). Association for Computing Machinery, New York, NY, USA, Article 59, 1–14. https://doi.org/10.1145/3458817.3476205*
Requirement: hands-on experience with containers (e.g., Docker, containerd, microVMs, or runc) or familiarity with related technologies. (select 1 paper below)
IceBreaker: warming serverless functions better with heterogeneity;
*Rohan Basu Roy, Tirthak Patel, and Devesh Tiwari. 2022. IceBreaker: warming serverless functions better with heterogeneity. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '22). Association for Computing Machinery, New York, NY, USA, 753–767. https://doi.org/10.1145/3503222.3507750*
FaasCache: keeping serverless computing alive with greedy-dual caching;
*Alexander Fuerst and Prateek Sharma. 2021. FaasCache: keeping serverless computing alive with greedy-dual caching. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '21). Association for Computing Machinery, New York, NY, USA, 386–400. https://doi.org/10.1145/3445814.3446757* Code: https://github.com/djobiii2078/FAASCACHE
Catalyzer: Sub-millisecond Startup for Serverless Computing with Initialization-less Booting;
*Dong Du, Tianyi Yu, Yubin Xia, Binyu Zang, Guanglu Yan, Chenggang Qin, Qixuan Wu, and Haibo Chen. 2020. Catalyzer: Sub-millisecond Startup for Serverless Computing with Initialization-less Booting. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '20). Association for Computing Machinery, New York, NY, USA, 467–481. https://doi.org/10.1145/3373376.3378512*
Help Rather Than Recycle: Alleviating Cold Startup in Serverless Computing Through Inter-Function Container Sharing;
*Zijun Li et al. 2022. Help Rather Than Recycle: Alleviating Cold Startup in Serverless Computing Through Inter-Function Container Sharing. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). USENIX Association.*
Requirement: basic knowledge of serverless computing and a deep learning inference framework, e.g., PyTorch or TensorFlow. (select 1 paper below)
FaaSwap: SLO-Aware, GPU-Efficient Serverless Inference via Model Swapping;
*Minchen Yu et al. 2023. FaaSwap: SLO-Aware, GPU-Efficient Serverless Inference via Model Swapping. arXiv preprint arXiv:2306.03622. (https://arxiv.org/abs/2306.03622)*
Optimizing Inference Serving on Serverless Platforms;
*Ahsan Ali, Riccardo Pinciroli, Feng Yan, and Evgenia Smirni. 2022. Optimizing inference serving on serverless platforms. Proc. VLDB Endow. 15, 10 (June 2022), 2071–2084. https://doi.org/10.14778/3547305.3547313*
Tetris: Memory-efficient Serverless Inference through Tensor Sharing;
*Jie Li et al. 2022. Tetris: Memory-efficient Serverless Inference through Tensor Sharing. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). USENIX Association.*