HI SYSTEMS
IRIS, the Interactive Real-Virtual Interface SoC, is proposed to realize real-time spatial computing. First, the Surface Perception Unit extracts surfaces from a trained 3D Gaussian Splatting model and exploits them to reduce the energy consumption caused by external memory access (EMA). Second, a reconfigurable ALP-based Multiply-Accumulator Array realizes FP MAC with outstanding area and energy efficiency. Lastly, the Error Direction Cache eliminates redun…
This paper presents BROCA, a low-power and low-latency mobile social agent system-on-chip with 3 key features: 1) an ABTU with ACER supports a decomposed dimension-aware compute datapath for energy-efficient response generation; 2) an ACBU with input bit-width optimization reduces the compute power of the agent feedback vocoder; 3) an LTMU enables low-latency dialogue-context RAG while retaining the conversational context. As a result, BROCA …
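The LTMU's dialogue-context RAG can be pictured as a nearest-neighbor lookup over stored utterance embeddings. A minimal sketch, where the function name, shapes, and the cosine-similarity metric are assumptions for illustration, not BROCA's actual design:

```python
import numpy as np

def retrieve_context(query_vec, memory_vecs, top_k=1):
    """Return indices of the top_k most similar stored utterance
    embeddings by cosine similarity (illustrative RAG lookup)."""
    q = query_vec / np.linalg.norm(query_vec)
    m = memory_vecs / np.linalg.norm(memory_vecs, axis=1, keepdims=True)
    scores = m @ q                      # cosine similarity per stored turn
    return np.argsort(-scores)[:top_k]  # best matches first

# Toy long-term memory: three stored dialogue-turn embeddings.
memory = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
best = retrieve_context(np.array([0.9, 0.1]), memory, top_k=1)
```

The retrieved turns would then be prepended to the generator's input, which is how RAG retains context without re-reading the whole dialogue history.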
Slim-Llama is an ASIC designed to address the high energy consumption that external memory access causes in large language models. By using binary/ternary quantization and integrating a sparsity-aware look-up table, Slim-Llama significantly improves energy efficiency. An output-reuse scheme and index-vector reordering further enhance performance, achieving up to 4.59× better benchmark energy efficiency than the previous state-of-the-art. It is…
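Ternary quantization collapses each weight to {-1, 0, +1} plus a per-tensor scale; the resulting zeros are what a sparsity-aware look-up table can skip. A minimal sketch, where the 0.7·mean(|w|) threshold is the common TWN heuristic, an assumption here (the abstract does not state Slim-Llama's exact scheme):

```python
import numpy as np

def ternarize(w, threshold=0.7):
    """Map weights to {-1, 0, +1} with a per-tensor scale.
    Threshold heuristic assumed, not taken from Slim-Llama."""
    delta = threshold * np.abs(w).mean()
    t = np.where(w > delta, 1, np.where(w < -delta, -1, 0)).astype(np.int8)
    # Scale: mean magnitude of the weights that survived as nonzero.
    nz = t != 0
    alpha = np.abs(w[nz]).mean() if nz.any() else 0.0
    return t, alpha

w = np.array([0.9, -0.8, 0.05, -0.02, 0.6])
t, alpha = ternarize(w)
```

The `t != 0` mask is exactly the kind of index information that sparsity-aware lookup or skip logic consumes: zero weights contribute nothing, so their MACs never need to execute.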
This paper presents EdgeDiff, an energy-efficient diffusion-model accelerator for mobile devices with 3 key features. 1) CRMP reorders activations for mixed precision and group quantization, achieving ~2.72× higher energy efficiency. 2) The CAA PE saves 36.6% of MAC power with a new PE structure and reduces toggling through BST. 3) The TAU and GQU minimize the hardware cost of the FP logic for group quantization. Finally, EdgeDiff demonstrates end-to-end…
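Group quantization gives each small group of activations its own scale, so a single outlier only degrades its own group rather than the whole tensor. A sketch under an assumed group size and bit width (EdgeDiff's actual parameters are not stated in the abstract):

```python
import numpy as np

def group_quantize(x, group_size=4, bits=8):
    """Quantize activations so each contiguous group shares one scale.
    Group size and bit width are illustrative assumptions."""
    qmax = 2 ** (bits - 1) - 1
    x = x.reshape(-1, group_size)
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax  # per-group scale
    scale[scale == 0] = 1.0                              # guard all-zero groups
    q = np.round(x / scale).astype(np.int8)
    return q, scale

# The 8.0 outlier only coarsens its own group of four values.
x = np.array([0.1, -0.2, 0.05, 0.15, 8.0, -4.0, 2.0, 1.0])
q, scale = group_quantize(x, group_size=4)
```

Dequantization is `q * scale`, which is why a TAU/GQU-style unit only needs lightweight FP logic per group rather than per element.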
Space-Mate is the first SLAM chip that produces a complete 3D map and accurate position in real time, which is achieved with 3 key features: 1) an Out-of-Order (OoO) SMoE router to alleviate data transactions for low latency, 2) single-skip (SS) and dual-skip (DS) heterogeneous core architecture to exploit coarse-grained sparsity caused by similar zero patterns in the same expert for high throughput and energy efficiency, and …
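The OoO SMoE router's starting point is ordinary sparse-MoE gating: score every expert but run only the top-k, so most expert weights are never fetched. A minimal gating sketch (shapes and `top_k` are illustrative assumptions, and the out-of-order scheduling itself is not shown):

```python
import numpy as np

def smoe_route(x, gate_w, top_k=2):
    """Score all experts, select the top_k, and renormalize their
    gate weights with a softmax over the selected experts only."""
    logits = gate_w @ x
    chosen = np.argsort(-logits)[:top_k]          # experts that will run
    z = np.exp(logits[chosen] - logits[chosen].max())
    weights = z / z.sum()                         # mixing weights, sum to 1
    return chosen, weights

gate_w = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]])
chosen, weights = smoe_route(np.array([2.0, 1.0]), gate_w, top_k=2)
```

Only the chosen experts' parameters need to move on-chip, which is where the data-transaction savings come from.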
Neural Radiance Field (NeRF) is an emerging computer graphics task that is used for 3D modeling and rendering in the metaverse, providing a user-friendly and immersive experience. However, it is difficult to accelerate on mobile AR/VR devices due to its memory-intensive hash encoding and extensive computational load. This paper presents NeuGPU to achieve NeRF-based instant 3D modeling and real-time rendering with 3 k…
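The memory-intensive hash encoding referred to here is the Instant-NGP-style multiresolution hash grid, where integer grid coordinates are XOR-hashed with large primes into a fixed-size feature table. A single-level 2D sketch (table size and feature width are illustrative, and NeuGPU's memory layout is not shown):

```python
import numpy as np

def hash_encode(coords, table, primes=(1, 2654435761)):
    """Look up features for integer grid vertices via spatial hashing.
    Identical coordinates always hash to the same table entry."""
    t = table.shape[0]
    idx = (coords[:, 0] * primes[0]) ^ (coords[:, 1] * primes[1])
    return table[idx % t]

rng = np.random.default_rng(0)
table = rng.standard_normal((16, 2))         # 16-entry table, 2 features each
coords = np.array([[3, 7], [3, 7], [5, 1]])  # integer grid vertices
feats = hash_encode(coords, table)
```

The memory pressure comes from the real setting: many resolution levels, tables with up to 2^19+ entries each, and random-access lookups per sample along every ray.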
A low-power and real-time 3D neural rendering processor, MetaVRain, is proposed with 3 key features: 1) a visual perception core for 1120× faster rendering, 2) 1D and 2D hybrid neural engines for 3.7× higher throughput with 2.4× higher energy efficiency during DNN inference, and 3) a modulo-based positional encoding unit to minimize the hardware cost of realizing the sinusoidal function. It finally achieves a maximum of 118 FPS whil…
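A modulo-based positional encoding unit can exploit the fact that sine and cosine are 2π-periodic: reducing the phase modulo 2π first shrinks the argument range the hardware must cover without changing the result. A sketch with an assumed band count (MetaVRain's exact formulation is not given in the abstract):

```python
import numpy as np

def positional_encode(x, num_bands=4):
    """Sinusoidal positional encoding with explicit range reduction.
    Because sin and cos are 2*pi-periodic, the np.mod step changes
    nothing numerically but bounds the argument for the sin/cos unit."""
    out = []
    for i in range(num_bands):
        phase = np.mod((2.0 ** i) * np.pi * x, 2.0 * np.pi)  # range reduction
        out.extend([np.sin(phase), np.cos(phase)])
    return np.array(out)

enc = positional_encode(0.3)
```

In hardware, a bounded argument means the sinusoid can be served by a small lookup table or low-degree polynomial instead of full-range argument reduction logic.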
We present an eDRAM-based CIM processor called DynaPlasia with a novel triple-mode 3T2C cell and a dynamically reconfigurable core architecture that enables high system efficiency for ML workloads. For ResNet-18 (ImageNet dataset), DynaPlasia achieves a system energy efficiency of 37.2TOPS/W and a compute density of 2.03TOPS/mm2 at 1.0V and 250MHz for INT4/INT5 activation/weight precision.
A low-latency and low-power dense RGB-D acquisition and 3D bounding-box extraction system-on-chip, DSPU, is proposed. The DSPU produces accurate dense RGB-D data through CNN-based monocular depth estimation and sensor fusion with a low-power ToF sensor. Furthermore, it runs a 3D point-cloud-based neural network for 3D bounding-box extraction. The architecture of the DSPU accelerates the system by alleviating the data-in…
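One simple way to fuse a CNN's relative depth with sparse ToF measurements is a least-squares scale-and-shift fit, applied densely afterwards. This 2-parameter fit is only a stand-in for DSPU's fusion, which the abstract does not detail; the function and variable names are assumptions:

```python
import numpy as np

def align_depth(relative_depth, tof_sparse, mask):
    """Fit depth = s * relative + t on pixels where ToF is valid,
    then apply the fit to every pixel of the relative depth map."""
    d = relative_depth[mask]
    A = np.stack([d, np.ones_like(d)], axis=1)   # [d, 1] design matrix
    (s, t), *_ = np.linalg.lstsq(A, tof_sparse[mask], rcond=None)
    return s * relative_depth + t

rel = np.array([[1.0, 2.0], [3.0, 4.0]])          # CNN relative depth
tof = np.array([[2.5, 0.0], [0.0, 6.5]])          # valid only under mask
mask = np.array([[True, False], [False, True]])
dense = align_depth(rel, tof, mask)
```

The appeal of this formulation is that the CNN only has to get relative structure right; the cheap, absolute ToF samples anchor the metric scale.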
An effective and high-speed 3D point-cloud-based neural network processing unit (PNNPU) is proposed using block-based point processing. It has three key features: 1) a page-based block memory management unit (PMMU) with a linked-list-based page table (LLPT) for on-chip memory footprint reduction, 2) hierarchical block-wise farthest point sampling (HFPS) and block-skipping ball query (BSBQ) for fast and efficient point proc…
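For reference, plain (non-hierarchical) farthest point sampling greedily picks the point farthest from everything chosen so far; HFPS applies the same idea block-wise to tame the O(n·k) cost. A minimal sketch of the baseline algorithm:

```python
import numpy as np

def farthest_point_sampling(points, k):
    """Greedy FPS: repeatedly pick the point whose distance to the
    nearest already-chosen point is largest."""
    n = points.shape[0]
    chosen = [0]                      # start from an arbitrary point
    dist = np.full(n, np.inf)         # distance to nearest chosen point
    for _ in range(k - 1):
        diff = points - points[chosen[-1]]
        dist = np.minimum(dist, (diff ** 2).sum(axis=1))
        chosen.append(int(dist.argmax()))
    return chosen

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.0, 5.0]])
sample = farthest_point_sampling(pts, 3)
```

FPS keeps samples spatially spread out (here it skips the near-duplicate point at [0.1, 0]), which is why PointNet++-style networks use it to pick representative centroids before ball-query grouping.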
We present an energy-efficient deep reinforcement learning (DRL) processor, OmniDRL, for DRL training on edge devices. Recently, the need for DRL training has been growing due to DRL's distinctive ability to adapt to each user. However, a massive amount of external and internal memory access limits the implementation of DRL training on resource-constrained platforms. OmniDRL proposes 4 key features that can red…
The authors propose a heterogeneous floating-point (FP) computing architecture that maximizes energy efficiency by separately optimizing exponent processing and mantissa processing. The proposed exponent-computing-in-memory (ECIM) architecture and mantissa-free-exponent-computing (MFEC) algorithm reduce the power consumption of both memory and FP MAC while resolving the limitations of previous FP computing-in-memory processors. Also, …
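Separating exponent and mantissa processing works because an FP multiply splits cleanly: exponents add (cheap integer arithmetic, the part an in-memory unit like ECIM can handle) while mantissas multiply separately. A sketch that deliberately ignores rounding, subnormals, and bit packing:

```python
import math

def split_fp_multiply(a, b):
    """Multiply two floats by handling exponents and mantissas apart:
    a = ma * 2**ea and b = mb * 2**eb, so a*b = (ma*mb) * 2**(ea+eb).
    Illustrative decomposition only, not the ECIM/MFEC datapath."""
    ma, ea = math.frexp(a)            # mantissa in [0.5, 1), integer exponent
    mb, eb = math.frexp(b)
    return math.ldexp(ma * mb, ea + eb)  # recombine mantissa and exponent

prod = split_fp_multiply(3.5, -2.25)
```

The exponent path is a pure integer add, which is far cheaper in energy than a full FP multiplier; that asymmetry is what makes optimizing the two paths separately worthwhile.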
This paper presents HNPU, an energy-efficient DNN training processor adopting algorithm-hardware co-design. The HNPU supports stochastic dynamic fixed-point representation and a layer-wise adaptive-precision searching unit for low-bit-precision training. It additionally utilizes slice-level reconfigurability and sparsity to maximize its efficiency in both DNN inference and training. Adaptive-bandwidth reconfigurab…
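Dynamic fixed-point representation gives a whole tensor one shared exponent chosen from its current range, so the integer mantissas track the data as it drifts during training. A sketch using plain round-to-nearest (HNPU's version is additionally stochastic, which this does not show):

```python
import numpy as np

def to_dynamic_fixed_point(x, bits=8):
    """Quantize a tensor to integers with a single shared exponent:
    the smallest power of two that keeps max|x| within the int range."""
    qmax = 2 ** (bits - 1) - 1
    exp = int(np.ceil(np.log2(np.abs(x).max() / qmax)))
    q = np.round(x / 2.0 ** exp).astype(np.int32)  # shared-exponent mantissas
    return q, exp

x = np.array([0.5, -1.0, 0.25, 0.9])
q, exp = to_dynamic_fixed_point(x, bits=8)
```

Because the exponent is shared per tensor (or per layer), the MAC datapath stays pure integer; only the final rescale by `2**exp` touches the exponent at all.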
Generative adversarial networks (GAN) have a wide range of applications, from image style transfer to synthetic voice generation [1]. GAN applications on mobile devices, such as face-to-Emoji conversion and super-resolution imaging, enable more engaging user interaction. As shown in Fig. 7.4.1, a GAN consists of 2 competing deep neural networks (DNN): a generator and a discriminator. The discriminator is trained, while the …
Recently, deep neural network (DNN) hardware accelerators have been reported for energy-efficient deep learning (DL) acceleration [1-6]. Most prior DNN inference accelerators are trained in the cloud using public datasets; parameters are then downloaded to implement AI [1-5]. However, local DNN learning with domain-specific and private data is required to meet various user preferences on edge or mobile devices. Since edge and …