Efficient Hardware for Deep Learning
Training and executing deep neural networks is computationally demanding. For this reason, leading companies are designing specialized chips to accelerate deep learning workloads. Our work explores new circuit- and architecture-level optimizations to improve the performance, reliability, and energy efficiency of deep learning hardware.
Several leading companies have recently released specialized chips tailored for deep learning; Google’s Tensor Processing Unit (TPU), for instance, accelerates deep neural network (DNN) inference (and training) using a systolic array, a precisely timed array containing thousands of multiply-and-accumulate (MAC) units. At the EnSuRe group, we are pursuing cutting-edge research on designing more energy-efficient and reliable hardware accelerators for ML.
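To make the idea concrete, here is a minimal functional model of such an array in Python/NumPy. It is only a sketch: the helper name `systolic_array_matmul` is ours, and the cycle-by-cycle skewing and dataflow of a real systolic array are abstracted away. All it shows is that each output element is built up from many individual MAC operations, one per processing element (PE) in a column.

```python
import numpy as np

def systolic_array_matmul(A, W):
    """Toy functional model of a weight-stationary systolic array (illustrative only).

    PE (i, j) holds weight W[i, j]; as a row of activations streams through,
    each PE performs one multiply-and-accumulate (MAC), adding its product to
    the partial sum flowing down its column. The value draining out of the
    bottom of column j is one element of A @ W.
    """
    rows, cols = W.shape
    out = np.zeros((A.shape[0], cols))
    for m, activation_row in enumerate(A):        # one input row flows through the array
        for j in range(cols):                     # one column of PEs
            psum = 0.0
            for i in range(rows):                 # partial sum flows down the column
                psum += activation_row[i] * W[i, j]   # a single MAC operation
            out[m, j] = psum
    return out

A = np.random.default_rng(0).standard_normal((3, 4))
W = np.random.default_rng(1).standard_normal((4, 5))
assert np.allclose(systolic_array_matmul(A, W), A @ W)
```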
We propose ThunderVolt, a systolic-array-based DNN accelerator that uses voltage underscaling to achieve a 2× or greater reduction in power/energy without compromising classification accuracy. ThunderVolt is based on the observation that conventional detect-and-re-execute timing speculation does not work for systolic arrays: simply re-executing an erroneous MAC operation causes it to fall out of sync with the other MACs, disrupting the precisely timed behavior of the array. Instead, ThunderVolt drops (i.e., zeros out) faulty MAC operations, enabling voltage underscaling without re-execution. Empirically, we observe that ThunderVolt can operate at high timing error rates (up to 10%) without significantly impacting accuracy.
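The sketch below is our own behavioral illustration, not ThunderVolt's actual RTL or error-detection logic; the name `thundervolt_matmul` is hypothetical. It assumes each individual MAC suffers a timing error with some probability under voltage underscaling, and lets faulty MACs contribute zero to the partial sum instead of being re-executed.

```python
import numpy as np

def thundervolt_matmul(A, W, timing_error_rate=0.1, seed=0):
    """Behavioral sketch of drop-on-error MAC handling (assumed model, not the real design).

    Each individual MAC is assumed to miss timing with probability
    `timing_error_rate` under aggressive voltage underscaling. A faulty MAC is
    dropped (its product treated as zero) rather than re-executed, so the
    systolic array's lockstep timing is never disturbed.
    """
    rng = np.random.default_rng(seed)
    products = A[:, :, None] * W[None, :, :]              # every individual MAC product
    ok = rng.random(products.shape) >= timing_error_rate  # which MACs meet timing
    return np.where(ok, products, 0.0).sum(axis=1)        # zero out faulty MACs, no re-execution

rng = np.random.default_rng(1)
A, W = rng.standard_normal((8, 16)), rng.standard_normal((16, 4))
approx = thundervolt_matmul(A, W, timing_error_rate=0.10)
print(np.abs(approx - A @ W).mean())   # a modest perturbation of the exact result
```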
In follow-on work, we address a challenge we faced with ThunderVolt: running detailed timing simulations to predict the timing error behavior of digital logic is prohibitively time-consuming. As a solution, we propose FATE, a simulation framework based on micro-architectural sampling and approximate timing simulation. FATE provides more than two orders of magnitude speedup in simulation time, enabling quick yet accurate exploration of different ThunderVolt design choices.
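The gist of the sampling idea can be sketched as follows. This is an illustration under our own assumptions: the names `estimate_timing_error_rate` and `toy_timing_model` are hypothetical, and FATE itself also relies on approximate, operand-aware timing models rather than the toy predicate used here.

```python
import numpy as np

def estimate_timing_error_rate(mac_operands, detailed_timing_model, sample_frac=0.01, seed=0):
    """Sampling sketch: evaluate the expensive timing model on a small random
    subset of MAC operations and use the sample statistics to estimate the
    array-wide timing error rate, instead of simulating every MAC in detail."""
    rng = np.random.default_rng(seed)
    n = len(mac_operands)
    idx = rng.choice(n, size=max(1, int(sample_frac * n)), replace=False)
    errors = sum(detailed_timing_model(*mac_operands[i]) for i in idx)
    return errors / len(idx)          # estimated fraction of MACs that miss timing

def toy_timing_model(a, b):
    """Hypothetical stand-in for a detailed (e.g. gate-level) timing simulation."""
    return int(abs(a * b) > 3.0)      # purely illustrative failure criterion

rng = np.random.default_rng(1)
operands = list(zip(rng.standard_normal(100_000), rng.standard_normal(100_000)))
print(estimate_timing_error_rate(operands, toy_timing_model, sample_frac=0.01))
```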
A related thread of research seeks to mitigate the impact of permanent faults on systolic arrays. Conventionally, faulty chips are discarded, resulting in yield loss. Instead, we propose Fault-aware Pruning (FAP); the idea is to set all DNN weights mapped to faulty MAC units to zero and to fine-tune the remaining DNN weights using incremental re-training to recover accuracy. Empirically, we find that FAP enables DNN accelerators to be used even at relatively high permanent fault rates with only marginal accuracy loss. This work has been further generalized to take into account different DNN mapping strategies and to adapt to process variations.
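A minimal sketch of the FAP idea follows, assuming a weight-stationary mapping in which each weight of a tile sits on one MAC unit. The helpers `fault_aware_prune` and `fine_tune` are our own illustrative names, and the "DNN" here is just a linear layer re-trained by masked gradient descent, not the networks evaluated in the paper.

```python
import numpy as np

def fault_aware_prune(W, faulty_macs):
    """Zero out weights mapped onto faulty MAC units and return the keep-mask
    so that fine-tuning never reintroduces values at faulty positions."""
    keep = ~faulty_macs
    return W * keep, keep

def fine_tune(W, keep, X, Y, lr=0.1, steps=200):
    """Toy incremental re-training of a linear layer Y ≈ X @ W: plain gradient
    steps with the gradient masked so pruned (faulty) positions stay zero."""
    for _ in range(steps):
        grad = X.T @ (X @ W - Y) / len(X)
        W -= lr * grad * keep          # only weights on healthy MACs are updated
    return W

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 16))
W_true = rng.standard_normal((16, 4))
Y = X @ W_true
faulty = rng.random(W_true.shape) < 0.05       # e.g. 5% of MAC units are faulty
W, keep = fault_aware_prune(W_true.copy(), faulty)
W = fine_tune(W, keep, X, Y)
print(np.abs(X @ W - Y).mean())                # remaining weights partially compensate
```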