The ever-increasing accuracy of artificial neural networks facilitates various applications in consumer electronic devices. Furthermore, neural processing units (NPUs) enable the real-time processing of neural networks by leveraging domain-specific hardware structures along with considerable on-chip buffers. Unfortunately, the data movement between the NPU and off-chip memory (e.g., DRAM) can no longer be ignored; hence, it is necessary to accurately account for the performance effect of off-chip memory, particularly in architecture-level simulation. In this paper, we propose a configurable, cycle-accurate NPU simulation infrastructure that captures not only the latency effect, as in analytical modeling, but also memory bandwidth utilization. Our simulator reveals that accurate simulation of off-chip memory exposes higher latency than analytical modeling predicts. Specifically, it demonstrates total execution time increases of 19.2%, 2.8%, and 16.2% for ResNet-50, YOLOv3, and BERT, respectively.
This work is supported by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2021-0-00106, AI accelerator-optimized neural network automatic generation technology and open service platform development). Hyokeun Lee is the corresponding author.