Software/IPs

Virtual-Platform Simulators

This is a SystemC-TLM based pre-RTL simulator for a CNN SoC consisting of multiple accelerator cores. The simulator models include accelerator cores, processor cores, AMBA (AXI/ACE) crossbar buses, and DRAM/HBM memories; the accelerator cores are either DMA-based or cache-based. The simulator runs orders of magnitude faster than RTL simulators while estimating the performance and power of the SoC, and it helps designers optimize the SoC dataflow, such as the data movement between accelerator cores and DRAM over the on-chip buses. The simulator was validated against the Xilinx ZYNQ, showing that performance can be estimated within a 5% error. It can also be plugged into commercial virtual-platform simulators such as Synopsys Platform Architect.
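To make the modeling style concrete, the following is a minimal SystemC-TLM sketch of a DMA-based accelerator core issuing a blocking read to a memory target. The module names, the 64-byte burst, and the 100 ns memory latency are illustrative assumptions and are not taken from the simulator itself.

```cpp
// Minimal SystemC-TLM sketch: a DMA-based accelerator core reads one burst
// from a memory model over a blocking transport call. All names and numbers
// are illustrative, not the simulator's actual components.
#include <systemc>
#include <tlm>
#include <tlm_utils/simple_initiator_socket.h>
#include <tlm_utils/simple_target_socket.h>

struct DmaAccel : sc_core::sc_module {
  tlm_utils::simple_initiator_socket<DmaAccel> socket;
  SC_CTOR(DmaAccel) : socket("socket") { SC_THREAD(run); }

  void run() {
    unsigned char buf[64] = {};
    tlm::tlm_generic_payload trans;
    sc_core::sc_time delay = sc_core::SC_ZERO_TIME;
    trans.set_command(tlm::TLM_READ_COMMAND);
    trans.set_address(0x1000);                      // tile base address (assumed)
    trans.set_data_ptr(buf);
    trans.set_data_length(64);                      // one burst worth of data
    trans.set_streaming_width(64);
    trans.set_byte_enable_ptr(nullptr);
    trans.set_response_status(tlm::TLM_INCOMPLETE_RESPONSE);
    socket->b_transport(trans, delay);              // bus + memory delay accumulates here
    wait(delay);                                    // consume the annotated delay
  }
};

struct SimpleMem : sc_core::sc_module {
  tlm_utils::simple_target_socket<SimpleMem> socket;
  SC_CTOR(SimpleMem) : socket("socket") {
    socket.register_b_transport(this, &SimpleMem::b_transport);
  }
  void b_transport(tlm::tlm_generic_payload& trans, sc_core::sc_time& delay) {
    delay += sc_core::sc_time(100, sc_core::SC_NS); // assumed DRAM access latency
    trans.set_response_status(tlm::TLM_OK_RESPONSE);
  }
};

int sc_main(int, char*[]) {
  DmaAccel accel("accel");
  SimpleMem mem("mem");
  accel.socket.bind(mem.socket);
  sc_core::sc_start();
  return 0;
}
```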

This is a SystemC-TLM based pre-RTL simulator for a wireless modem SoC. In addition to the conventional stream-based architecture, the simulator helps designers evaluate bus-based architectures with both DMA-based (AXI) and cache-based (ACE) accelerators. The simulator models include FFT, channel estimation, MIMO (including QRD), and LDPC blocks. The digital baseband modem for WiFi6 (IEEE 802.11ax) was chosen as the target system, although the simulator is generally applicable to other standards such as 5G. The simulator is intended to be used alongside state-of-the-art EDA tools such as Mentor Catapult HLS and Synopsys ASIP Designer.
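As a rough illustration of the architecture choices being compared, the sketch below enumerates a WiFi6 receive chain and tags each block with a coupling style. The block names are from the description above, but the particular AXI/ACE assignment is an assumption for illustration only, not a configuration taken from the tool.

```cpp
// Illustrative enumeration of the modem processing chain and the two coupling
// styles the simulator compares; the assignment shown is an assumption.
#include <iostream>
#include <string>
#include <vector>

enum class Coupling { DmaAxi, CacheAce };   // DMA-based (AXI) vs cache-based (ACE)

struct Block {
  std::string name;
  Coupling coupling;
};

int main() {
  // WiFi6 (IEEE 802.11ax) receive chain modeled by the simulator
  std::vector<Block> rx_chain = {
    {"FFT",              Coupling::DmaAxi},
    {"ChannelEstimator", Coupling::DmaAxi},
    {"MIMO_QRD",         Coupling::CacheAce},
    {"LDPC",             Coupling::DmaAxi},
  };
  for (const auto& b : rx_chain)
    std::cout << b.name << " -> "
              << (b.coupling == Coupling::DmaAxi ? "AXI DMA" : "ACE cache") << "\n";
  return 0;
}
```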

This is a SystemC-TLM based simulation model that can be integrated into a well-known pre-RTL virtual-platform simulator, Platform Architect from Synopsys. The model includes a SystemC-TLM based cycle-accurate model of eFlash-based compute-in-memory (CiM) circuitry. Plugged into Platform Architect, it helps to estimate the performance impact of CiM circuit parameters such as the array dimensions and the number of bits per cell. At the system level, the model also makes it possible to evaluate a variety of inter-accelerator dataflow strategies in terms of performance and power consumption. A simplified sketch of such a parameter-driven estimate is given after the repository link below.

https://github.com/SDL-KU/eFlashCIM
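For illustration only, the following sketch shows the kind of circuit parameters the model exposes and a back-of-the-envelope cycle count for tiling a matrix-vector product onto a CiM array. The struct fields, default values, and the formula are assumptions and are not the released eFlashCIM model.

```cpp
// Hypothetical CiM parameter set and a simple cycle estimate for mapping an
// M x N matrix-vector product onto an eFlash array by tiling; illustrative only.
#include <cstdint>
#include <iostream>

struct CimParams {
  int rows = 256;         // wordlines per array
  int cols = 256;         // bitlines per array
  int bits_per_cell = 2;  // storage precision of one eFlash cell
  int adc_cycles = 4;     // cycles per analog-to-digital conversion (assumed)
};

// Tiles the weight matrix over the array and counts array activations.
std::uint64_t estimate_cycles(int m, int n, int weight_bits, const CimParams& p) {
  std::uint64_t col_tiles   = (m + p.cols - 1) / p.cols;
  std::uint64_t row_tiles   = (n + p.rows - 1) / p.rows;
  std::uint64_t cell_slices = (weight_bits + p.bits_per_cell - 1) / p.bits_per_cell;
  return col_tiles * row_tiles * cell_slices * p.adc_cycles;
}

int main() {
  CimParams p;
  std::cout << "estimated cycles: " << estimate_cycles(1024, 1024, 8, p) << "\n";
  return 0;
}
```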

Accelerator Simulators

This is a pre-RTL accelerator simulator that models a specific CNN dataflow under given memory constraints. Specifically, the simulator uses a spatial data dependence graph (SDDG) to exploit spatial information, such as which ALU or register each instruction involves (e.g., which register location an incoming pixel is loaded into). The simulator also makes it possible to evaluate the impact of the constraints imposed by the memory blocks in the memory hierarchy. To maximize performance while preserving operational correctness, the pre-RTL simulator assumes a latency- and bandwidth-insensitive controller for each PE. It was validated against the Eyeriss implementations, showing an estimation error of less than 7%. A sketch of what an SDDG node might carry is given after the repository link below.

https://github.com/SDL-KU/SDDGSim 
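The sketch below shows one plausible shape for an SDDG node, carrying both the operation and the spatial resources it touches, plus edges to its producers. The field names and the tiny example graph are illustrative assumptions, not the data structures used in SDDGSim.

```cpp
// Illustrative SDDG node: an operation annotated with the spatial resources
// (PE position, ALU, register slot) it uses, plus edges to its producers.
#include <string>
#include <vector>

struct SddgNode {
  std::string op;          // e.g. "LOAD_PIXEL", "MAC"
  int pe_row, pe_col;      // which PE in the array executes the op
  int alu_id;              // ALU inside the PE
  int reg_slot;            // register-file location the value lands in
  std::vector<int> deps;   // indices of producer nodes (data dependences)
};

int main() {
  // A three-node fragment: two loads feeding one multiply-accumulate on PE(0,0).
  std::vector<SddgNode> graph = {
    {"LOAD_PIXEL",  0, 0, 0, 3, {}},
    {"LOAD_WEIGHT", 0, 0, 0, 7, {}},
    {"MAC",         0, 0, 1, 0, {0, 1}},
  };
  // A per-PE latency-insensitive controller would issue each node only after
  // all of its producers (deps) have completed, regardless of actual latencies.
  return graph.empty() ? 1 : 0;
}
```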

Architecture Optimizers/Performance Estimators

This algorithm is proposed for evaluating the communication performance of direct memory access (DMA) controlled accelerators. Depending on whether the communication bandwidth is limited by the bus protocol overhead or by the DRAM latency, the algorithm estimates the communication bandwidth according to the performance bottleneck: the DRAM-limited case is estimated using dynamic prediction of DRAM command patterns, whereas the bus-limited case is estimated from the maximum bus burst latency. The algorithm significantly improves the estimation accuracy when applied to CNNs and wireless communications. When it is used together with a full-system simulator to explore a design space defined by a set of tile sizes and bus-related parameters, it is more than 100 times faster than conventional algorithms because it filters out a large number of unpromising design points. The algorithm alone can also approach the maximum accelerator performance with a performance degradation of less than 5%.
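A schematic view of this bottleneck selection is sketched below: compute a bus-limited and a DRAM-limited bandwidth estimate and keep the more restrictive one. Both formulas are simplified placeholders standing in for the published models.

```cpp
// Bottleneck-selection sketch: the effective bandwidth of a DMA-controlled
// accelerator is the minimum of a bus-limited estimate (from the maximum bus
// burst latency) and a DRAM-limited estimate (from the predicted command
// pattern). The formulas are simplified placeholders.
#include <algorithm>
#include <iostream>

double bus_limited_bw(double bytes_per_burst, double max_burst_latency_ns) {
  return bytes_per_burst / max_burst_latency_ns;           // bytes per ns
}

double dram_limited_bw(double bytes_per_burst, double avg_cycles_per_command,
                       double clock_ns, double commands_per_burst) {
  return bytes_per_burst / (avg_cycles_per_command * commands_per_burst * clock_ns);
}

int main() {
  double bus  = bus_limited_bw(64.0, 40.0);                // 64-byte burst, 40 ns worst case
  double dram = dram_limited_bw(64.0, 12.0, 1.25, 4.0);    // assumed command pattern
  std::cout << "estimated accelerator bandwidth: "
            << std::min(bus, dram) << " bytes/ns\n";
  return 0;
}
```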

This algorithm optimizes communication schemes (CSs), which are defined by the number of direct memory access controllers (DMACs) and the DRAM bank allocation. By reusing the communication bandwidths of CPs obtained from prior full-system simulations, the proposed performance estimation algorithm predicts the communication performance of CSs more accurately than conventional performance estimation algorithms. When applied to convolutional neural networks (CNNs) and wireless communications (LDPC-coded MIMO-OFDM), the estimation error is no more than 6.4% and 5%, respectively. In addition, compared with conventional simulation-based approaches, the estimation algorithm provides a speedup of two orders of magnitude. It is used to optimize the CS of the CNNs and to explore a design space characterized by bank interleaving, outstanding transactions, layer shape, tile size, and hardware frequency.
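An illustrative way to rank candidate CSs is sketched below: each scheme fixes the number of DMACs and how DRAM banks are split among them, and per-DMAC bandwidths measured in earlier full-system runs are reused to predict transfer time. The aggregation rule and all numbers are simplifying assumptions, not the paper's estimator.

```cpp
// Ranking communication schemes (CSs) from previously measured bandwidths;
// the slowest DMAC bounds the scheme. Simplified assumption, not the paper's model.
#include <algorithm>
#include <iostream>
#include <vector>

struct CommScheme {
  int num_dmacs;
  std::vector<int> banks_per_dmac;       // DRAM bank allocation per DMAC
};

// bw_per_bank: bandwidth one DMAC achieves per allocated bank (from prior sims)
double predicted_time(const CommScheme& cs, double bytes_per_dmac, double bw_per_bank) {
  double worst = 0.0;
  for (int banks : cs.banks_per_dmac)
    worst = std::max(worst, bytes_per_dmac / (banks * bw_per_bank));
  return worst;                          // slowest DMAC bounds the scheme
}

int main() {
  CommScheme a{2, {4, 4}}, b{4, {4, 4, 4, 4}};
  std::cout << "CS a: " << predicted_time(a, 1e6,   2.0) << " ns\n";
  std::cout << "CS b: " << predicted_time(b, 0.5e6, 2.0) << " ns\n";
  return 0;
}
```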

The proposed algorithm predicts the dynamic communication bandwidth of each direct memory access controller (DMAC) based on the runtime state of the DMACs, making it possible to accurately estimate the amount of communication handled by each DMAC by taking the temporal intervals into account. The experimental results show that the algorithm can estimate the performance of a multicore accelerator with an estimation error of at most 2.8%, regardless of the system communication bandwidth. In addition, the algorithm is used to explore a design space of accelerator core dimensions, and the resulting optimal core dimension provides a performance gain of 10.8% and 31.2% compared with the conventional multicore accelerator and a single-core accelerator, respectively. This result was also verified by hardware implementation on a Xilinx ZYNQ, with a maximum estimation error of 2.9%. A simplified sketch of the per-interval bookkeeping is given after the repository link below.

https://github.com/SDL-KU/OptAccTile
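The sketch below illustrates the kind of per-interval bookkeeping such an estimator performs: within each time step, the set of active DMACs determines the bandwidth each one sees, and the remaining transfer of each DMAC is advanced accordingly. The equal-sharing assumption and all numbers are simplifications, not the released estimator.

```cpp
// Per-interval bandwidth bookkeeping for multiple DMACs; the equal bus-sharing
// rule is an illustrative simplification, not the released estimator.
#include <iostream>
#include <vector>

struct Dmac {
  double bytes_left;   // remaining transfer for the current tile
  bool   active;
};

// Advance the system by dt (ns); total_bw (bytes/ns) is shared by active DMACs.
void step(std::vector<Dmac>& dmacs, double dt, double total_bw) {
  int active = 0;
  for (const auto& d : dmacs) active += d.active ? 1 : 0;
  if (active == 0) return;
  double share = total_bw / active;      // assumed equal sharing of the bus
  for (auto& d : dmacs) {
    if (!d.active) continue;
    d.bytes_left -= share * dt;
    if (d.bytes_left <= 0.0) { d.bytes_left = 0.0; d.active = false; }
  }
}

int main() {
  std::vector<Dmac> dmacs = {{4096, true}, {8192, true}, {2048, false}};
  double t = 0.0;
  while (dmacs[0].active || dmacs[1].active) { step(dmacs, 1.0, 16.0); t += 1.0; }
  std::cout << "estimated finish time: " << t << " ns\n";
  return 0;
}
```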

The proposed heterogeneous SoC optimizer helps to optimize software-hardware partitioning, operation scheduling, and memory blocking by minimizing the number of bus/memory collisions. To search for the optimal design-parameter combination that meets the user requirements, the internal pre-RTL simulator evaluates performance in a cycle-accurate fashion, and the design parameters are adjusted sequentially to cover all design options. The optimizer is evaluated with an example signal-processing algorithm, orthogonal matching pursuit (OMP). The performance of four OMP design cases is predicted by the simulator and compared with the measured performance on a Xilinx ZYNQ. The simulator predicts the performance of heterogeneous SoCs with under 5% error for all the candidate OMP architectures while taking system bus and memory conflicts into account. Moreover, the optimized heterogeneous SoC architecture for the OMP algorithm improves performance by up to 32% compared with the conventional CAG-based approach. The generality of the optimizer is verified for other applications such as LDPC-coded MIMO-OFDM (IEEE 802.11ax) and CNNs (AlexNet). A simplified sketch of the design-space sweep is given after the repository link below.

https://github.com/SDL-KU/HetSoCOpt
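As a rough picture of the sequential sweep described above, the sketch below enumerates a small design space of partitioning, scheduling, and memory-blocking choices and keeps the candidate with the fewest estimated cycles. The struct fields, the placeholder cost lambda, and the interface are illustrative assumptions, not the HetSoCOpt API.

```cpp
// Exhaustive sweep over design options, scoring each candidate with a
// cycle-accurate evaluation callback; names and cost model are illustrative.
#include <functional>
#include <iostream>
#include <limits>
#include <vector>

struct DesignPoint {
  int hw_kernels;      // how many kernels are moved to hardware
  int schedule_id;     // which operation schedule is used
  int mem_block_kb;    // memory blocking granularity
};

DesignPoint optimize(const std::vector<DesignPoint>& space,
                     const std::function<double(const DesignPoint&)>& simulate_cycles) {
  DesignPoint best{};
  double best_cycles = std::numeric_limits<double>::max();
  for (const auto& dp : space) {
    double c = simulate_cycles(dp);      // pre-RTL, cycle-accurate evaluation
    if (c < best_cycles) { best_cycles = c; best = dp; }
  }
  return best;
}

int main() {
  std::vector<DesignPoint> space = {{1, 0, 32}, {2, 0, 64}, {2, 1, 64}, {3, 1, 128}};
  // Placeholder cost model standing in for the internal pre-RTL simulator.
  auto cost = [](const DesignPoint& d) { return 1000.0 / d.hw_kernels + d.mem_block_kb * 0.5; };
  DesignPoint best = optimize(space, cost);
  std::cout << "best: hw=" << best.hw_kernels << " sched=" << best.schedule_id
            << " block=" << best.mem_block_kb << "KB\n";
  return 0;
}
```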