Talks/Courses

Talks

The performance and complexity of a system-on-a-chip (SoC) are largely dominated by the system dataflow between accelerator and memory blocks through on-chip/off-chip interconnects. In this talk, the relevant dataflow simulations using SystemC-TLM are introduced, and the optimization of on-chip/off-chip interconnects using these simulations is explained. In particular, dataflow simulation results for multi-die package systems (chiplet systems) are presented to show the importance of die-to-die (inter-chiplet) interconnects.

Multi-die packages have emerged as one of the promising candidates for advanced AI devices, for example, SIMBA from NVIDIA and Rome/Matisse from AMD. This talk deals with the impact of dataflow on the performance and power consumption of chiplet-based AI devices. In detail, the relevant simulation-based dataflow optimization approaches are introduced. Our virtual platform simulator, AccTLMSim, is used to evaluate the performance and power consumption of dataflow early in the design stage (i.e., before RTL design) and with order-of-magnitude faster simulation speed compared to RTL simulation.

The performance and power consumption of a system-on-a-chip (SoC) tend to be determined by the data movement across the memory hierarchy, the so-called dataflow. In this short talk, the impact of dataflow on performance and power consumption is explained, and our simulation-based dataflow optimization approaches are introduced. Our virtual platform simulator, AccTLMSim, is used to evaluate the performance and power consumption of dataflow early in the design stage (i.e., before RTL design) and with order-of-magnitude faster simulation speed compared to RTL simulation. AccTLMSim is also being extended to compute-in-memory devices (e.g., MRAM, eFlash, FeFET) and multi-chiplet packages.

The dataflow of an accelerator SoC has a great impact on its performance and energy efficiency. Thus, it is of significant importance to simulate and optimize the dataflow early in the design, i.e., without going through RTL coding. A SystemC-TLM-based virtual platform is considered one of the most promising techniques, as it helps us explore a broad design space of algorithms and architectures a few orders of magnitude faster than conventional RTL models. In this talk, a virtual platform is proposed to design and optimize a neural network hardware accelerator. The proposed virtual platform can model not only the dataflow inside the accelerator but also the dataflow outside the accelerator in a cycle-accurate manner. The focus of this talk is on memory-centric neural network SoCs such as DRAM-connected accelerators and eFlash-based compute-in-memory (CiM) accelerators. The technical issues related to virtual-platform-based design optimization will also be discussed.
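The event-driven, cycle-accurate principle behind such virtual platforms can be illustrated with a minimal sketch in plain C++ (no SystemC; the `simulate_reads` function and all timing parameters are hypothetical toy values, not taken from the platform described above):

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

struct Event { uint64_t time; int id; };
struct Later {
    bool operator()(const Event& a, const Event& b) const { return a.time > b.time; }
};

// Model num_bursts back-to-back burst reads over a bus that accepts one
// request every issue_gap cycles and returns data mem_latency cycles later.
// Events are processed in time order from a priority queue, which is the
// essence of event-driven (as opposed to cycle-by-cycle) simulation.
uint64_t simulate_reads(int num_bursts, uint64_t issue_gap, uint64_t mem_latency) {
    std::priority_queue<Event, std::vector<Event>, Later> q;
    for (int i = 0; i < num_bursts; ++i)
        q.push({static_cast<uint64_t>(i) * issue_gap + mem_latency, i});
    uint64_t last_completion = 0;
    while (!q.empty()) {
        last_completion = q.top().time;  // completion events in time order
        q.pop();
    }
    return last_completion;  // cycle at which the final burst returns
}
```

With an issue gap of 2 cycles and a memory latency of 10 cycles, four bursts complete at cycle 16; a real virtual platform layers bus protocol, arbitration, and DRAM timing on top of exactly this kind of event queue.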

The use of the optimum dataflow is a key to the design of hardware accelerators, in particular from the perspective of energy efficiency. In this talk, a new pre-RTL simulator is proposed to evaluate millions of architectural options in the design space of neural network accelerators, including the data layout, the bus protocol, and the loop tiling. The proposed simulator is based on SystemC transaction-level modeling (TLM), which is widely used in several virtual platform simulators such as GreenSoCs and Synopsys Platform Architect. It is shown that the proposed simulator is not only fast (event-driven) but also accurate (cycle-accurate): each design point in the design space takes less than one minute to evaluate. The proposed simulator was verified against measurements from a Digilent ZYBO board (Xilinx ZYNQ), showing less than 5% error.
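As a rough illustration of what one axis of such a design space looks like, the sketch below sweeps loop-tiling options for a matrix multiply under a hypothetical on-chip buffer budget and estimates the resulting off-chip traffic (a toy output-stationary cost model for illustration only, not the cost model of the proposed simulator):

```cpp
#include <cassert>
#include <cstdint>

// Toy cost model: C[M][N] = A[M][K] * B[K][N], with a Tm x Tn tile of C
// kept on chip. A is re-read once per column tile, B once per row tile.
uint64_t dram_traffic(uint64_t M, uint64_t N, uint64_t K, uint64_t Tm, uint64_t Tn) {
    uint64_t col_tiles = (N + Tn - 1) / Tn;
    uint64_t row_tiles = (M + Tm - 1) / Tm;
    return M * K * col_tiles + K * N * row_tiles + M * N;  // words moved
}

// Exhaustively sweep tile shapes that fit an on-chip buffer budget (words),
// keeping the tiling with the least off-chip traffic.
uint64_t best_traffic(uint64_t M, uint64_t N, uint64_t K, uint64_t budget) {
    uint64_t best = ~0ULL;
    for (uint64_t Tm = 1; Tm <= M; ++Tm)
        for (uint64_t Tn = 1; Tn <= N; ++Tn) {
            if (Tm * K + K * Tn + Tm * Tn > budget) continue;  // must fit on chip
            uint64_t t = dram_traffic(M, N, K, Tm, Tn);
            if (t < best) best = t;
        }
    return best;
}
```

A cycle-accurate TLM simulator replaces this closed-form traffic count with simulated bus and DRAM transactions, but the sweep structure over tiling candidates is the same.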

The first part of this talk is about a spatial data dependence graph simulator for convolutional neural network accelerators (Jooho Wang). A spatial data dependence graph (S-DDG) is newly proposed to model an accelerator dataflow. The pre-RTL simulator based on the S-DDG helps to explore the design space in the early design phase. The simulation results show the impact of memory latency and bandwidth on a convolutional neural network (CNN) accelerator. The second part of the talk is about optimizations of the scatter network for sparse CNN accelerators (Sunwoo Kim). Sparse CNN (SCNN) accelerators tend to suffer from bus contention in their scatter networks. This part considers optimizations of the scatter network: several network topologies and arbitration algorithms are evaluated in terms of performance and area.
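One of the arbitration policies such an evaluation would typically include is round-robin arbitration; a minimal behavioral sketch follows (the class and its interface are illustrative, not the design evaluated in the talk):

```cpp
#include <cassert>
#include <cstdint>

// Round-robin arbiter sketch for a shared scatter-network port: grants the
// first requesting input at or after (last grant + 1), so no requester is
// starved under sustained contention.
class RoundRobinArbiter {
public:
    explicit RoundRobinArbiter(int n) : n_(n) {}
    // Bit i of `req` set means input i requests; returns the granted
    // input index for this cycle, or -1 if there are no requesters.
    int grant(uint32_t req) {
        for (int i = 0; i < n_; ++i) {
            int cand = (last_ + 1 + i) % n_;
            if (req & (1u << cand)) { last_ = cand; return cand; }
        }
        return -1;
    }
private:
    int n_;
    int last_ = -1;  // start so that input 0 has priority on the first cycle
};
```

A fixed-priority arbiter would always scan from input 0 instead of rotating, trading fairness for a slightly smaller circuit; comparing such choices in performance and area is exactly the kind of study the second part of the talk describes.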

Convolutional neural networks (CNNs) have emerged as the state-of-the-art technology for computer vision and many other applications. However, they pose serious design challenges to both data center and mobile devices since they involve a tremendous amount of computation and memory access. Hardware acceleration is considered one of the most viable solutions these days. In this seminar, we will introduce the recent design trends of hardware accelerators for CNNs, focusing on custom hardware optimizations.

In this talk, we introduce the design and implementation of a multi-mode multi-channel system-on-a-chip (SoC) for mobile communications. The designed transceiver SoC consists of sample rate conversion, channel selection and digital compensation, and supports both WCDMA (3G) and LTE (4G) standards. The entire design procedure, ranging from system design down to register transfer level (RTL) design, is introduced, for example, showing how to evaluate the fixed-point performance early in the design procedure. The architectures of key building blocks are briefly explained, including the digital filter, fast Fourier transform (FFT) and coordinate rotation digital computer (CORDIC), focusing on the corresponding performance-complexity trade-offs.

Application-specific design for signal processing applications tends to necessitate multi-disciplinary knowledge spanning the system, algorithm, architecture and circuit levels. In this talk, we will introduce our application-specific design approaches for various signal processing applications. In addition, we discuss several design challenges involved in system-on-a-chip (SoC) design for neural networks, regarding how to customize the on-chip bus architecture.

The demand for good SoC designers seems to be everlasting, since SoC design enables us to implement any emerging technology (e.g., IoT, artificial intelligence). The paradigm shift toward S/W-centric design does not imply that H/W becomes less important. Rather, knowledge of H/W becomes more crucial, e.g., knowledge of computer architecture. In addition to optimized H/W or S/W design, optimized on-chip communication becomes a key to successful SoC design. Hands-on programming (e.g., C/C++) and implementation experience (e.g., FPGA) are must-haves in the SoC era. Domain knowledge (e.g., image processing) and mathematics are an absolute plus, in particular for the leaders of a cross-disciplinary team.

Millimeter wave communication is emerging as a promising physical layer technology for 5G communications. A phased antenna array helps to overcome the increased signal attenuation in mmW. Antenna gain and nulling capability are highly dependent on the array configuration (e.g., antenna selection). RF/LO phase shifting (possibly assisted by a digital one) seems to be a reasonable RX architecture. Phase shift selection tends to rely on training-based sector search (plus feedback). The sector search space should consider the channel and interference conditions as well as phase error. Digital calibration of phase error (and possibly IQ imbalance) may be critical from the perspective of beamforming gain.
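The sensitivity of beamforming gain to per-element phase error can be seen from the array factor of a uniform linear array; the sketch below is a generic textbook computation (half-wavelength spacing assumed; the function name and parameters are illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <complex>

// Normalized array factor magnitude of an n-element uniform linear array
// with half-wavelength spacing, steered to angle `steer` and evaluated at
// angle `theta` (radians from broadside). An optional progressive
// per-element phase error shows how calibration errors erode the gain.
double array_factor(int n, double steer, double theta, double phase_err = 0.0) {
    const double pi = std::acos(-1.0);
    std::complex<double> sum(0.0, 0.0);
    for (int k = 0; k < n; ++k) {
        double phase = pi * k * (std::sin(theta) - std::sin(steer));
        sum += std::exp(std::complex<double>(0.0, phase + k * phase_err));
    }
    return std::abs(sum) / n;  // 1.0 in the steered direction, error-free
}
```

In the error-free case the magnitude is exactly 1.0 at the steered angle and falls off sharply elsewhere; a nonzero `phase_err` reduces the peak, which is why digital calibration matters for beamforming gain.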

Link adaptation is an adaptive way of determining a variety of system parameters, such as a modulation and coding scheme (MCS) and a transmit beamforming matrix, that suit the given channel condition best. Link adaptation is a key element of a WiFi PHY/MAC system in that its performance (e.g., goodput or energy efficiency) heavily depends on how link adaptation is carried out. MCS selection is largely categorized into two approaches: the ACK-based approach and the preamble-based approach. The ACK-based approach (e.g., Automatic Rate Fallback (ARF) from Lucent Bell Labs) simply counts the number of successive successes/failures of data transmissions and increases/decreases the MCS accordingly, whereas the preamble-based approach (e.g., exponential effective SNR mapping (EESM) from Ericsson) measures the SNR based on the received preamble and maps it into the throughput-maximizing MCS.
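The counting logic of an ARF-style controller can be sketched in a few lines (the thresholds and class interface below are illustrative, not Lucent's exact parameters):

```cpp
#include <algorithm>
#include <cassert>

// Simplified ARF-style rate control: move up one MCS after `up_n`
// consecutive successful transmissions, move down one MCS after `down_n`
// consecutive failures. Real ARF adds details such as a probe retry timer.
class ArfRateControl {
public:
    ArfRateControl(int max_mcs, int up_n = 10, int down_n = 2)
        : max_mcs_(max_mcs), up_n_(up_n), down_n_(down_n) {}
    int mcs() const { return mcs_; }
    void on_ack(bool success) {
        if (success) {
            fails_ = 0;
            if (++succ_ == up_n_) {  // sustained success: try a faster rate
                mcs_ = std::min(mcs_ + 1, max_mcs_);
                succ_ = 0;
            }
        } else {
            succ_ = 0;
            if (++fails_ == down_n_) {  // repeated failure: back off
                mcs_ = std::max(mcs_ - 1, 0);
                fails_ = 0;
            }
        }
    }
private:
    int max_mcs_, up_n_, down_n_;
    int mcs_ = 0, succ_ = 0, fails_ = 0;
};
```

The preamble-based alternative replaces these counters with a measured SNR and a lookup table from effective SNR to the throughput-maximizing MCS.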

The digital front-end (DFE) resides between the RF/analog transceiver and the digital baseband modem. The major roles of the DFE include channel selection, sample rate conversion and impairment compensation. Digital up/down conversion may be used to support multi-carrier transmission/reception such as multi-carrier WCDMA and carrier aggregation (LTE-Advanced). The DFE involves prohibitively high computational complexity, and its operating frequency is generally much higher than that of the digital baseband modem. Many sophisticated signal processing algorithms and architectures are useful to the successful implementation of the DFE, for example, polynomial interpolation (e.g., Farrow interpolation), reconfigurable digital FIR filtering (e.g., the Constant Shift Method (CSM)), and the COordinate Rotation DIgital Computer (CORDIC). Impairment compensation includes IQ mismatch correction and DC offset correction.
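CORDIC is a good example of why these algorithms suit hardware: each iteration needs only shifts, adds, and a small angle table. The sketch below shows rotation-mode CORDIC in floating point for clarity (a hardware DFE implementation would use fixed-point arithmetic; the function name and iteration count are illustrative):

```cpp
#include <cassert>
#include <cmath>

// Rotation-mode CORDIC: starting from (1, 0), rotate by `angle` (radians,
// |angle| < ~1.74) through a sequence of micro-rotations by atan(2^-i).
// In hardware the multiplies by pow2 become arithmetic shifts and the
// atan values come from a small lookup table.
void cordic_cos_sin(double angle, int iters, double* c, double* s) {
    double x = 1.0, y = 0.0, z = angle;
    double pow2 = 1.0, gain = 1.0;
    for (int i = 0; i < iters; ++i) {
        double d = (z >= 0.0) ? 1.0 : -1.0;       // rotate toward z = 0
        double xn = x - d * y * pow2;
        double yn = y + d * x * pow2;
        z -= d * std::atan(pow2);                  // atan(2^-i) table in HW
        x = xn; y = yn;
        gain *= std::sqrt(1.0 + pow2 * pow2);      // accumulated CORDIC gain
        pow2 *= 0.5;
    }
    *c = x / gain;  // compensate the gain (K ~= 1.6468 for many iterations)
    *s = y / gain;
}
```

In a fixed-point design the constant gain K is usually folded into the initial value instead of divided out at the end.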

Carrier aggregation is a key feature of 3GPP LTE that addresses the support of higher data rates and the utilization of fragmented spectrum holdings. In this talk, the relevant design challenges of terminals are discussed. The transmitter architectures are reviewed, and the minimum amount of power amplifier back-off is evaluated. In addition, several receiver architectures are compared from the perspective of design trade-offs. The radio impairments affecting the receiver performance are analyzed, and simulation results are provided.

Short Courses

This short course introduces the design and optimization of digital SoCs for digital signal processing applications, which includes hardware-software partitioning, on-chip communication and hardware-software optimization. The design lab includes the hardware and software implementation of the fast Fourier transform (FFT) on a Xilinx ZYNQ FPGA.

This short course deals with high-level design and verification of digital SoCs, focusing on computation-intensive systems. The proposed design flow is mainly characterized by architecture exploration at a high level. In more detail, based on the cycle-accurate and bit-accurate models of the system written in C++, the resulting performance and complexity metrics are evaluated in the early stage of the design flow. For example, performance metrics such as quantization error and overflow can be evaluated based on the high-level models. In the design lab, rough estimates of complexity metrics such as area and speed are derived from the high-level models.
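The kind of bit-accurate evaluation mentioned above can be sketched with a small quantization model (the format conventions and function names below are illustrative, not the course's exact code):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Round a value to a signed fixed-point format with `int_bits` integer
// bits and `frac_bits` fractional bits, saturating on overflow. This is
// the core of a bit-accurate C++ model checked against a floating-point
// golden reference.
double quantize(double v, int int_bits, int frac_bits) {
    double scale = std::ldexp(1.0, frac_bits);            // 2^frac_bits
    double hi = std::ldexp(1.0, int_bits) - 1.0 / scale;  // largest code
    double lo = -std::ldexp(1.0, int_bits);               // most negative code
    double q = std::round(v * scale) / scale;             // round-to-nearest
    return std::min(std::max(q, lo), hi);                 // saturate
}

// Signal-to-quantization-noise ratio (dB) of a quantized copy of `x`:
// a typical early-stage performance metric for word-length selection.
double sqnr_db(const std::vector<double>& x, int int_bits, int frac_bits) {
    double sig = 0.0, err = 0.0;
    for (double v : x) {
        double e = v - quantize(v, int_bits, frac_bits);
        sig += v * v;
        err += e * e;
    }
    return 10.0 * std::log10(sig / err);
}
```

Sweeping `frac_bits` over a representative stimulus and plotting SQNR is how word lengths get chosen long before any RTL exists.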

This short course deals with SoC design for digital signal processing applications. The performance and complexity metrics are introduced, and the corresponding trade-offs are explained. The proposed design flow ranges from system specification to algorithm-level, architecture-level and logic-level design. The design flow is exemplified with two digital signal processing designs: an infinite impulse response (IIR) filter and a fast Fourier transform (FFT) processor. The design lab consists of functional simulations (C++), cycle-accurate simulations (C++), bit-accurate simulations (C++), logic simulations (Verilog HDL) and logic synthesis (Design Compiler).
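A functional C++ model is the first artifact in this flow; for the IIR example it can be as small as a single biquad section (a generic direct-form-I sketch, with illustrative coefficients, serving as the golden reference for the later cycle- and bit-accurate models):

```cpp
#include <cassert>

// Floating-point functional model of a direct-form-I biquad IIR section:
// y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2].
// Cycle-accurate and bit-accurate models are verified sample-by-sample
// against this reference before any Verilog is written.
class Biquad {
public:
    Biquad(double b0, double b1, double b2, double a1, double a2)
        : b0_(b0), b1_(b1), b2_(b2), a1_(a1), a2_(a2) {}
    double step(double x) {
        double y = b0_ * x + b1_ * x1_ + b2_ * x2_ - a1_ * y1_ - a2_ * y2_;
        x2_ = x1_; x1_ = x;   // shift the input delay line
        y2_ = y1_; y1_ = y;   // shift the output delay line
        return y;
    }
private:
    double b0_, b1_, b2_, a1_, a2_;
    double x1_ = 0, x2_ = 0, y1_ = 0, y2_ = 0;
};
```

Feeding an impulse through the model recovers the coefficient sequence (for a2 = a1 = 0 it is exactly b0, b1, b2, 0, ...), a quick sanity check used before moving down to the cycle-accurate level.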