Dean's Seminar Series: Practical Reliability Analysis of GPGPUs in the Wild: from Systems to Applications
- 24 Feb 2020 1:00 pm - 2:00 pm
- Seminar Room G12A, Clayton with VC to room B461, Building B at Caulfield
- IT research seminars
Speaker: Professor Evgenia Smirni
General Purpose Graphics Processing Units (GPGPUs) have rapidly evolved to enable energy-efficient data-parallel computing for a broad range of scientific areas. While GPUs achieve exascale performance at a stringent power budget, they are also susceptible to soft errors (faults), often caused by high-energy particle strikes, that can significantly affect application output quality. As those applications are normally long-running, investigating the characteristics of GPU errors becomes imperative to better understand the reliability of such systems. In this talk, I will present a study of the system conditions that trigger GPU soft errors, using six months of trace data collected from a large-scale, operational HPC system at Oak Ridge National Laboratory. Workload characteristics, certain GPU cards, temperature, and power consumption could be indicative of GPU faults, but it is non-trivial to exploit them for error prediction. Motivated by these observations and challenges, I will show how machine-learning-based error prediction models can capture the hidden interactions among system and workload properties. These findings raise the question: how can one better understand the resilience of applications if faults are bound to happen? To this end, I will illustrate the challenges of comprehensive fault injection in GPGPU applications and outline a novel fault injection solution that captures the error resilience profile of GPGPU applications.
Evgenia Smirni received the Diploma degree in Computer Science and Informatics from the University of Patras, Greece, in 1987 and the Ph.D. degree in Computer Science from Vanderbilt University in 1995. She is the Sidney P. Chockley Professor of Computer Science at the College of William and Mary, Williamsburg, VA, USA. Her research interests include queuing networks, stochastic modeling, Markov chains, resource allocation policies, storage systems, data centers and cloud computing, workload characterization, models for performance prediction, and reliability of distributed systems and applications. She has served as the Program co-Chair of QEST’05, ACM Sigmetrics/Performance’06, HotMetrics’10, ICPE’17, DSN’17, SRDS’19, and HPDC’19. She also served as the General co-Chair of QEST’10 and NSMC’10. She is an IEEE Fellow, an ACM Distinguished Scientist, and a member of IFIP W.G. 7.3.
Host: Dr Jiangshan Yu