Talks/Courses

Talks

The performance and complexity of a system-on-a-chip (SoC) are largely dominated by the system dataflow between accelerator and memory blocks through on-chip/off-chip interconnects. In this talk, the relevant dataflow simulations using SystemC-TLM are introduced, and the optimization of on-chip/off-chip interconnects using these simulations is explained. In particular, dataflow simulation results for multi-die package systems (chiplet systems) are presented to show the importance of die-to-die (inter-chiplet) interconnects.

Multi-die packages have emerged as one of the promising candidates for advanced AI devices, for example, SIMBA from NVIDIA and Rome/Matisse from AMD. This talk deals with the impact of dataflow on the performance and power consumption of chiplet-based AI devices. In detail, the relevant simulation-based dataflow optimization approaches are introduced. Our virtual platform simulator, AccTLMSim, is used to evaluate the performance and power consumption of dataflow early in the design stage (i.e., before RTL design) and with order-of-magnitude faster simulation speed compared to RTL simulation.

The performance and power consumption of a system-on-a-chip (SoC) tend to be determined by the data movement across the memory hierarchy, the so-called dataflow. In this short talk, the impact of dataflow on performance and power consumption is explained, and our simulation-based dataflow optimization approaches are introduced. Our virtual platform simulator, AccTLMSim, is used to evaluate the performance and power consumption of dataflow early in the design stage (i.e., before RTL design) and with order-of-magnitude faster simulation speed compared to RTL simulation. AccTLMSim is also being extended to compute-in-memory devices (e.g., MRAM, eFlash, FeFET) and multi-chiplet packages.

The dataflow of an accelerator SoC has a great impact on its performance and energy efficiency. Thus, it is of significant importance to simulate and optimize the dataflow early in the design, i.e., without going through RTL coding. A SystemC-TLM-based virtual platform is considered one of the most promising techniques, as it helps us explore a broad design space of algorithms and architectures a few orders of magnitude faster than conventional RTL models. In this talk, a virtual platform is proposed to design and optimize a neural network hardware accelerator. The proposed virtual platform can model not only the dataflow inside the accelerator but also the dataflow outside the accelerator in a cycle-accurate manner. The focus of this talk is on memory-centric neural network SoCs such as DRAM-connected accelerators and eFlash-based compute-in-memory (CiM) accelerators. The technical issues related to virtual-platform-based design optimization will also be discussed.
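The event-driven, cycle-accurate principle behind such virtual platforms can be illustrated with a minimal sketch in plain C++ (no SystemC; the `simulate_reads` function and all timing parameters are hypothetical toy values, not taken from the platform described above):

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

struct Event { uint64_t time; int id; };
struct Later {
    bool operator()(const Event& a, const Event& b) const { return a.time > b.time; }
};

// Model num_bursts back-to-back burst reads over a bus that accepts one
// request every issue_gap cycles and returns data mem_latency cycles later.
// Events are processed in time order from a priority queue, which is the
// essence of event-driven (as opposed to cycle-by-cycle) simulation.
uint64_t simulate_reads(int num_bursts, uint64_t issue_gap, uint64_t mem_latency) {
    std::priority_queue<Event, std::vector<Event>, Later> q;
    for (int i = 0; i < num_bursts; ++i)
        q.push({static_cast<uint64_t>(i) * issue_gap + mem_latency, i});
    uint64_t last_completion = 0;
    while (!q.empty()) {
        last_completion = q.top().time;  // completion events in time order
        q.pop();
    }
    return last_completion;  // cycle at which the final burst returns
}
```

With an issue gap of 2 cycles and a memory latency of 10 cycles, four bursts complete at cycle 16; a real virtual platform layers bus protocol, arbitration, and DRAM timing on top of exactly this kind of event queue.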

The use of the optimum dataflow is a key to the design of hardware accelerators, in particular from the perspective of energy efficiency. In this talk, a new pre-RTL simulator is proposed to evaluate millions of architectural options in the design space of neural network accelerators, including the data layout, the bus protocol, and the loop tiling. The proposed simulator is based on SystemC transaction-level modeling (TLM), which is widely used in several virtual platform simulators such as GreenSoCs and Synopsys Platform Architect. It is shown that the proposed simulator is not only fast (event-driven) but also accurate (cycle-accurate): each design point in the design space takes less than one minute to evaluate. The proposed simulator was verified against measurements from a Digilent ZYBO board (Xilinx ZYNQ), showing less than 5% error.
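As a rough illustration of what one axis of such a design space looks like, the sketch below sweeps loop-tiling options for a matrix multiply under a hypothetical on-chip buffer budget and estimates the resulting off-chip traffic (a toy output-stationary cost model for illustration only, not the cost model of the proposed simulator):

```cpp
#include <cassert>
#include <cstdint>

// Toy cost model: C[M][N] = A[M][K] * B[K][N], with a Tm x Tn tile of C
// kept on chip. A is re-read once per column tile, B once per row tile.
uint64_t dram_traffic(uint64_t M, uint64_t N, uint64_t K, uint64_t Tm, uint64_t Tn) {
    uint64_t col_tiles = (N + Tn - 1) / Tn;
    uint64_t row_tiles = (M + Tm - 1) / Tm;
    return M * K * col_tiles + K * N * row_tiles + M * N;  // words moved
}

// Exhaustively sweep tile shapes that fit an on-chip buffer budget (words),
// keeping the tiling with the least off-chip traffic.
uint64_t best_traffic(uint64_t M, uint64_t N, uint64_t K, uint64_t budget) {
    uint64_t best = ~0ULL;
    for (uint64_t Tm = 1; Tm <= M; ++Tm)
        for (uint64_t Tn = 1; Tn <= N; ++Tn) {
            if (Tm * K + K * Tn + Tm * Tn > budget) continue;  // must fit on chip
            uint64_t t = dram_traffic(M, N, K, Tm, Tn);
            if (t < best) best = t;
        }
    return best;
}
```

A cycle-accurate TLM simulator replaces this closed-form traffic count with simulated bus and DRAM transactions, but the sweep structure over tiling candidates is the same.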

The first part of this talk is about a spatial data dependence graph simulator for convolutional neural network accelerators (Jooho Wang). A spatial data dependence graph (S-DDG) is newly proposed to model an accelerator dataflow. The pre-RTL simulator based on the S-DDG helps to explore the design space in the early design phase. The simulation results show the impact of memory latency and bandwidth on a convolutional neural network (CNN) accelerator. The second part of the talk is about optimizations of the scatter network for sparse CNN accelerators (Sunwoo Kim). Sparse CNN (SCNN) accelerators tend to suffer from bus contention in their scatter networks. This part considers optimizations of the scatter network: several network topologies and arbitration algorithms are evaluated in terms of performance and area.
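One of the arbitration policies such an evaluation would typically include is round-robin arbitration; a minimal behavioral sketch follows (the class and its interface are illustrative, not the design evaluated in the talk):

```cpp
#include <cassert>
#include <cstdint>

// Round-robin arbiter sketch for a shared scatter-network port: grants the
// first requesting input at or after (last grant + 1), so no requester is
// starved under sustained contention.
class RoundRobinArbiter {
public:
    explicit RoundRobinArbiter(int n) : n_(n) {}
    // Bit i of `req` set means input i requests; returns the granted
    // input index for this cycle, or -1 if there are no requesters.
    int grant(uint32_t req) {
        for (int i = 0; i < n_; ++i) {
            int cand = (last_ + 1 + i) % n_;
            if (req & (1u << cand)) { last_ = cand; return cand; }
        }
        return -1;
    }
private:
    int n_;
    int last_ = -1;  // start so that input 0 has priority on the first cycle
};
```

A fixed-priority arbiter would always scan from input 0 instead of rotating, trading fairness for a slightly smaller circuit; comparing such choices in performance and area is exactly the kind of study the second part of the talk describes.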

Convolutional neural networks (CNNs) have emerged as the state-of-the-art technology for computer vision and many other applications. However, they pose serious design challenges to both data center and mobile devices since they involve a tremendous amount of computation and memory access. Hardware acceleration is considered one of the most viable solutions these days. In this seminar, we will introduce the recent design trends of hardware accelerators for CNNs, focusing on custom hardware optimizations.

In this talk, we introduce the design and implementation of a multi-mode multi-channel system-on-a-chip (SoC) for mobile communications. The designed transceiver SoC consists of sample rate conversion, channel selection and digital compensation, and supports both WCDMA (3G) and LTE (4G) standards. The entire design procedure, ranging from system design down to register transfer level (RTL) design, is introduced, for example, showing how to evaluate the fixed-point performance early in the design procedure. The architectures of key building blocks are briefly explained, including the digital filter, fast Fourier transform (FFT) and coordinate rotation digital computer (CORDIC), focusing on the corresponding performance-complexity trade-offs.

Application-specific design for signal processing applications tends to necessitate multi-disciplinary knowledge spanning the system, algorithm, architecture and circuit levels. In this talk, we will introduce our application-specific design approaches for various signal processing applications. In addition, we discuss several design challenges involved in system-on-a-chip (SoC) design for neural networks, regarding how to customize the on-chip bus architecture.

The demand for good SoC designers seems to be everlasting, since SoC design enables us to implement any emerging technology (e.g., IoT, artificial intelligence). The paradigm shift toward S/W-centric design does not imply that H/W becomes less important. Rather, knowledge of H/W becomes more crucial, e.g., knowledge of computer architecture. In addition to optimized H/W or S/W design, optimized on-chip communication becomes a key to successful SoC design. Hands-on programming (e.g., C/C++) and implementation experience (e.g., FPGA) are must-haves in the SoC era. Domain knowledge (e.g., image processing) and mathematics are an absolute plus, in particular for the leaders of a cross-disciplinary team.

Millimeter wave communication is emerging as a promising physical layer technology for 5G communications. A phased antenna array helps to overcome the increased signal attenuation in mmW. Antenna gain and nulling capability are highly dependent on the array configuration (e.g., antenna selection). RF/LO phase shifting (possibly assisted by a digital one) seems to be a reasonable RX architecture. Phase shift selection tends to rely on training-based sector search (plus feedback). The sector search space should consider the channel and interference conditions as well as phase error. Digital calibration of phase error (and possibly IQ imbalance) may be critical from the perspective of beamforming gain.
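The sensitivity of beamforming gain to per-element phase error can be seen from the array factor of a uniform linear array; the sketch below is a generic textbook computation (half-wavelength spacing assumed; the function name and parameters are illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <complex>

// Normalized array factor magnitude of an n-element uniform linear array
// with half-wavelength spacing, steered to angle `steer` and evaluated at
// angle `theta` (radians from broadside). An optional progressive
// per-element phase error shows how calibration errors erode the gain.
double array_factor(int n, double steer, double theta, double phase_err = 0.0) {
    const double pi = std::acos(-1.0);
    std::complex<double> sum(0.0, 0.0);
    for (int k = 0; k < n; ++k) {
        double phase = pi * k * (std::sin(theta) - std::sin(steer));
        sum += std::exp(std::complex<double>(0.0, phase + k * phase_err));
    }
    return std::abs(sum) / n;  // 1.0 in the steered direction, error-free
}
```

In the error-free case the magnitude is exactly 1.0 at the steered angle and falls off sharply elsewhere; a nonzero `phase_err` reduces the peak, which is why digital calibration matters for beamforming gain.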

Link adaptation is an adaptive way of determining a variety of system parameters, such as a modulation and coding scheme (MCS) and a transmit beamforming matrix, that suit the given channel condition best. Link adaptation is a key element of a WiFi PHY/MAC system in that its performance (e.g., goodput or energy efficiency) heavily depends on how link adaptation is carried out. MCS selection is largely categorized into two approaches: the ACK-based approach and the preamble-based approach. The ACK-based approach (e.g., Automatic Rate Fallback (ARF) from Lucent Bell Labs) simply counts the number of successive successes/failures of data transmissions and increases/decreases the MCS accordingly, whereas the preamble-based approach (e.g., exponential effective SNR mapping (EESM) from Ericsson) measures the SNR based on the received preamble and maps it into the throughput-maximizing MCS.
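The counting logic of an ARF-style controller can be sketched in a few lines (the thresholds and class interface below are illustrative, not Lucent's exact parameters):

```cpp
#include <algorithm>
#include <cassert>

// Simplified ARF-style rate control: move up one MCS after `up_n`
// consecutive successful transmissions, move down one MCS after `down_n`
// consecutive failures. Real ARF adds details such as a probe retry timer.
class ArfRateControl {
public:
    ArfRateControl(int max_mcs, int up_n = 10, int down_n = 2)
        : max_mcs_(max_mcs), up_n_(up_n), down_n_(down_n) {}
    int mcs() const { return mcs_; }
    void on_ack(bool success) {
        if (success) {
            fails_ = 0;
            if (++succ_ == up_n_) {  // sustained success: try a faster rate
                mcs_ = std::min(mcs_ + 1, max_mcs_);
                succ_ = 0;
            }
        } else {
            succ_ = 0;
            if (++fails_ == down_n_) {  // repeated failure: back off
                mcs_ = std::max(mcs_ - 1, 0);
                fails_ = 0;
            }
        }
    }
private:
    int max_mcs_, up_n_, down_n_;
    int mcs_ = 0, succ_ = 0, fails_ = 0;
};
```

The preamble-based alternative replaces these counters with a measured SNR and a lookup table from effective SNR to the throughput-maximizing MCS.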

The digital front-end (DFE) resides between the RF/analog transceiver and the digital baseband modem. The major roles of the DFE include channel selection, sample rate conversion and impairment compensation. Digital up/down conversion may be used to support multi-carrier transmission/reception such as multi-carrier WCDMA and carrier aggregation (LTE-Advanced). The DFE involves prohibitively high computational complexity, and its operating frequency is generally much higher than that of the digital baseband modem. Many sophisticated signal processing algorithms and architectures are useful to the successful implementation of the DFE, for example, polynomial interpolation (e.g., Farrow interpolation), reconfigurable digital FIR filtering (e.g., the Constant Shift Method (CSM)), and the COordinate Rotation DIgital Computer (CORDIC). Impairment compensation includes IQ mismatch correction and DC offset correction.
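CORDIC is a good example of why these algorithms suit hardware: each iteration needs only shifts, adds, and a small angle table. The sketch below shows rotation-mode CORDIC in floating point for clarity (a hardware DFE implementation would use fixed-point arithmetic; the function name and iteration count are illustrative):

```cpp
#include <cassert>
#include <cmath>

// Rotation-mode CORDIC: starting from (1, 0), rotate by `angle` (radians,
// |angle| < ~1.74) through a sequence of micro-rotations by atan(2^-i).
// In hardware the multiplies by pow2 become arithmetic shifts and the
// atan values come from a small lookup table.
void cordic_cos_sin(double angle, int iters, double* c, double* s) {
    double x = 1.0, y = 0.0, z = angle;
    double pow2 = 1.0, gain = 1.0;
    for (int i = 0; i < iters; ++i) {
        double d = (z >= 0.0) ? 1.0 : -1.0;       // rotate toward z = 0
        double xn = x - d * y * pow2;
        double yn = y + d * x * pow2;
        z -= d * std::atan(pow2);                  // atan(2^-i) table in HW
        x = xn; y = yn;
        gain *= std::sqrt(1.0 + pow2 * pow2);      // accumulated CORDIC gain
        pow2 *= 0.5;
    }
    *c = x / gain;  // compensate the gain (K ~= 1.6468 for many iterations)
    *s = y / gain;
}
```

In a fixed-point design the constant gain K is usually folded into the initial value instead of divided out at the end.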

Carrier aggregation is a key feature of 3GPP LTE that addresses the support of higher data rates and the utilization of fragmented spectrum holdings. In this talk, the relevant design challenges of terminals are discussed. The transmitter architectures are reviewed, and the minimum amount of power amplifier back-off is evaluated. In addition, several receiver architectures are compared from the perspective of design trade-offs. The radio impairments affecting the receiver performance are analyzed, and simulation results are provided.

Short Courses

This short course introduces the design and optimization of digital SoCs for digital signal processing applications, which includes hardware-software partitioning, on-chip communication and hardware-software optimization. The design lab includes the hardware and software implementation of the fast Fourier transform (FFT) on a Xilinx ZYNQ FPGA.

This short course deals with high-level design and verification of digital SoCs, focusing on computation-intensive systems. The proposed design flow is mainly characterized by architecture exploration at a high level. In more detail, based on the cycle-accurate and bit-accurate models of the system written in C++, the resulting performance and complexity metrics are evaluated in the early stage of the design flow. For example, performance metrics such as quantization error and overflow can be evaluated based on the high-level models. In the design lab, rough estimates of complexity metrics such as area and speed are derived from the high-level models.
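The kind of bit-accurate evaluation mentioned above can be sketched with a small quantization model (the format conventions and function names below are illustrative, not the course's exact code):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Round a value to a signed fixed-point format with `int_bits` integer
// bits and `frac_bits` fractional bits, saturating on overflow. This is
// the core of a bit-accurate C++ model checked against a floating-point
// golden reference.
double quantize(double v, int int_bits, int frac_bits) {
    double scale = std::ldexp(1.0, frac_bits);            // 2^frac_bits
    double hi = std::ldexp(1.0, int_bits) - 1.0 / scale;  // largest code
    double lo = -std::ldexp(1.0, int_bits);               // most negative code
    double q = std::round(v * scale) / scale;             // round-to-nearest
    return std::min(std::max(q, lo), hi);                 // saturate
}

// Signal-to-quantization-noise ratio (dB) of a quantized copy of `x`:
// a typical early-stage performance metric for word-length selection.
double sqnr_db(const std::vector<double>& x, int int_bits, int frac_bits) {
    double sig = 0.0, err = 0.0;
    for (double v : x) {
        double e = v - quantize(v, int_bits, frac_bits);
        sig += v * v;
        err += e * e;
    }
    return 10.0 * std::log10(sig / err);
}
```

Sweeping `frac_bits` over a representative stimulus and plotting SQNR is how word lengths get chosen long before any RTL exists.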

This short course deals with SoC design for digital signal processing applications. The performance and complexity metrics are introduced, and the corresponding trade-offs are explained. The proposed design flow ranges from system specification to algorithm-level, architecture-level and logic-level design. The design flow is exemplified with two digital signal processing designs: an infinite impulse response (IIR) filter and a fast Fourier transform (FFT) processor. The design lab consists of functional simulations (C++), cycle-accurate simulations (C++), bit-accurate simulations (C++), logic simulations (Verilog HDL) and logic synthesis (Design Compiler).
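A functional C++ model is the first artifact in this flow; for the IIR example it can be as small as a single biquad section (a generic direct-form-I sketch, with illustrative coefficients, serving as the golden reference for the later cycle- and bit-accurate models):

```cpp
#include <cassert>

// Floating-point functional model of a direct-form-I biquad IIR section:
// y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2].
// Cycle-accurate and bit-accurate models are verified sample-by-sample
// against this reference before any Verilog is written.
class Biquad {
public:
    Biquad(double b0, double b1, double b2, double a1, double a2)
        : b0_(b0), b1_(b1), b2_(b2), a1_(a1), a2_(a2) {}
    double step(double x) {
        double y = b0_ * x + b1_ * x1_ + b2_ * x2_ - a1_ * y1_ - a2_ * y2_;
        x2_ = x1_; x1_ = x;   // shift the input delay line
        y2_ = y1_; y1_ = y;   // shift the output delay line
        return y;
    }
private:
    double b0_, b1_, b2_, a1_, a2_;
    double x1_ = 0, x2_ = 0, y1_ = 0, y2_ = 0;
};
```

Feeding an impulse through the model recovers the coefficient sequence (for a2 = a1 = 0 it is exactly b0, b1, b2, 0, ...), a quick sanity check used before moving down to the cycle-accurate level.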