After five wonderful years in the EnSuRe research group, Jeff graduated with his Ph.D. today and is moving on to greener pastures as a postdoc at Harvard. Good luck and congratulations, Dr. Zhang!
Dissertation Title:
Towards Energy-Efficient and Reliable Deep Learning Inference [Slides]
Abstract:
Deep neural networks (DNNs) have achieved or surpassed state-of-the-art results in a wide range of machine learning applications, from image, video, and text classification to speech recognition and natural language translation. Due to their growing popularity and high computational cost, specialized hardware accelerators are increasingly being deployed to boost the performance and energy efficiency of deep learning inference. A popular example is Google's Tensor Processing Unit (TPU), which uses a large systolic array (SA) at its core to speed up matrix multiplications and convolutions, achieving 30x-80x higher performance per watt than CPU- or GPU-based solutions. Energy efficiency plays an important role in the successful deployment of deep learning in real-world scenarios, e.g., warehouse-scale data centers and power-constrained mobile devices. Moreover, safety-critical applications such as autonomous driving and smart healthcare require high resilience and availability in addition to energy efficiency. For deep learning implemented in silicon, errors resulting from hardware faults or low-voltage operation can accumulate, propagate all the way to the application layer, and potentially cause misclassifications.
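As an aside, the systolic-array dataflow the TPU exploits is easy to picture in a few lines of code. The sketch below is a toy, cycle-by-cycle model of an output-stationary array, written for illustration only; the function name, the operand-skewing scheme, and the loop structure are assumptions for exposition, not anything taken from the dissertation or the TPU itself.

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy cycle-by-cycle model of an output-stationary systolic array.

    Each (i, j) cell accumulates one output element; A's rows and B's
    columns are skewed so that matching operands arrive in lockstep.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    # Cycle t delivers A[i, t - i] to row i and B[t - j, j] to column j;
    # a cell fires only when both operands are valid (the diagonal wavefront).
    for t in range(n + m + k - 2):
        for i in range(n):
            for j in range(m):
                step = t - i - j  # which partial product this cycle carries
                if 0 <= step < k:
                    C[i, j] += A[i, step] * B[step, j]
    return C

A = np.random.rand(4, 3)
B = np.random.rand(3, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

Each diagonal wavefront fires the cells whose operands have just arrived, which is why an n-by-m array finishes an (n x k)(k x m) product in n + m + k - 2 cycles: multiply-accumulates happen in parallel across the array while data simply flows between neighboring cells.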
This dissertation covers challenges and opportunities in energy-efficient and reliable deep learning inference. It focuses on hardware accelerator design but also extends its scope to inference in the cloud, thoroughly studying the trade-offs among energy efficiency, hardware/software faults, dynamic workloads, and DNN classification accuracy. More specifically, this dissertation presents: (1) ThunderVolt, a cross-layer framework that enables aggressive voltage underscaling of DNN accelerators without compromising performance, remaining resilient at timing error rates above 10%; (2) FATE, a methodology to quickly and accurately estimate timing error rates, achieving a 50x speedup over detailed gate-level timing simulation; (3) FAP, permanent-fault analysis and mitigation techniques for designing fault-tolerant systolic-array-based DNN accelerators under the high defect rates that accompany technology scaling; (4) CompAct, a hardware encoder/decoder pair that enables aggressive on-chip compression of DNN activations, reducing on-chip SRAM power by more than 60% while preserving classification accuracy; and (5) Model-Switching, a datacenter controller that improves quality of service (QoS) for machine learning serving systems under fluctuating workloads.
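To make the last contribution concrete: a model-switching controller trades accuracy for latency as load fluctuates, falling back to smaller models when traffic spikes. The following is a minimal, hypothetical sketch; the variant names, accuracy/latency numbers, and the crude M/M/1 latency estimate are all illustrative assumptions, not the dissertation's actual controller.

```python
# Hypothetical model variants: (name, accuracy, per-request latency in ms).
# Names and numbers are illustrative, not from the dissertation.
VARIANTS = [
    ("resnet50", 0.76, 40.0),   # most accurate, slowest
    ("resnet18", 0.70, 12.0),   # faster fallback
    ("mobilenet", 0.65, 5.0),   # fastest, least accurate
]

def choose_variant(arrival_rate_rps, num_replicas, slo_ms):
    """Pick the most accurate variant whose estimated latency meets the SLO.

    Uses a crude M/M/1-style estimate per replica; a real controller
    would measure tail latency and hysteresis, not just means.
    """
    per_replica_rps = arrival_rate_rps / num_replicas
    for name, acc, lat_ms in VARIANTS:      # ordered best-accuracy first
        service_rate = 1000.0 / lat_ms      # requests/s one replica sustains
        if per_replica_rps < service_rate:
            # M/M/1 mean response time: 1 / (mu - lambda), in seconds
            est_ms = 1000.0 / (service_rate - per_replica_rps)
            if est_ms <= slo_ms:
                return name, est_ms
    return VARIANTS[-1][0], float("inf")    # shed accuracy under overload

# At 400 req/s across 8 replicas with a 60 ms SLO, the controller
# skips resnet50 (saturated) and settles on resnet18.
print(choose_variant(arrival_rate_rps=400, num_replicas=8, slo_ms=60))
```

The design point this illustrates is that accuracy becomes an elastic resource: rather than queueing requests (violating latency SLOs) or over-provisioning replicas (wasting energy), the serving system degrades model quality gracefully and restores it when load subsides.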