Research

Accelerators make it possible to achieve unprecedented energy efficiency, not only in data centers but also on mobile devices. For example, the A9 application processor in Apple's iPhone contains more than 30 accelerator blocks. Our research focuses on ASIC-type accelerators, which are the most energy-efficient option, particularly when the design is limited by power consumption. Recently developed ASIC-type accelerators achieve an energy efficiency of a few TOPs/J (equivalently, TOPs/sec/W).

The assumed accelerator consists of multiple processing elements (PEs) connected to one another through a network-on-chip (NoC). Each PE transfers a pixel to or from one of the available memory blocks, so the latency and bandwidth it experiences can differ by orders of magnitude depending on which memory block is involved. Most state-of-the-art accelerators exploit the memory hierarchy to maximize energy efficiency. Because the levels of the hierarchy differ so widely in latency and bandwidth, the data movement across memory levels, referred to as dataflow, can significantly affect performance and power consumption. Our major interests lie in the dataflow outside the accelerator (e.g., between scratchpad and off-chip memory) as well as the dataflow inside the accelerator (e.g., between scratchpad and reconfigurable array).
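To make the effect of dataflow on energy concrete, the following minimal sketch compares two hypothetical dataflows that move the same number of operands through different memory levels. The per-access energy values, the level names, and the access counts are illustrative assumptions rather than measured figures; only their relative order of magnitude matters here.

```python
# Hypothetical per-access energy costs in picojoules (illustrative, not measured).
ENERGY_PJ = {"register_file": 1.0, "scratchpad": 6.0, "off_chip_dram": 200.0}


def dataflow_energy_pj(access_counts):
    """Sum the data-movement energy over all memory levels of one dataflow."""
    return sum(ENERGY_PJ[level] * count for level, count in access_counts.items())


# Two hypothetical dataflows that each serve 1,000,000 operand accesses.
# The first fetches every operand from off-chip memory; the second reuses
# data held in the scratchpad and the PE register files.
no_reuse = {"register_file": 0, "scratchpad": 0, "off_chip_dram": 1_000_000}
with_reuse = {"register_file": 900_000, "scratchpad": 90_000, "off_chip_dram": 10_000}

e_no_reuse = dataflow_energy_pj(no_reuse)
e_with_reuse = dataflow_energy_pj(with_reuse)
print(f"energy ratio (no reuse / with reuse): {e_no_reuse / e_with_reuse:.1f}")
```

Under these assumed costs, serving most accesses from the lower memory levels reduces the data-movement energy by more than an order of magnitude, which is why the choice of dataflow deserves attention early in the design.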

As we approach the end of Moore's law and Dennard scaling, it becomes increasingly important to select a good application-specific architecture quickly. The dataflow of a hardware accelerator has a great impact on its performance and energy efficiency. It is therefore important to simulate and optimize the dataflow early in the design, e.g., without going through RTL coding. Our goal is to develop a new design framework that helps to optimize the dataflow and implement the corresponding accelerator.

The need for such a design framework is driven by the increasing demand for algorithm-architecture co-design. The resulting design flow helps designers shorten the design cycle and spend more time optimizing the algorithm and architecture jointly, in particular with the help of state-of-the-art EDA tools such as Mentor's Catapult HLS and Synopsys's ASIP Designer. The algorithm-architecture co-design for today's wireless communication systems (e.g., 5G cellular networks) is illustrated below.

The proposed simulator is an event-driven, cycle-accurate virtual platform simulator based on SystemC TLM (Transaction-Level Modeling) version 2.0. The coding style mainly follows that of the open-source virtual platform simulator GreenSoCs. Our ambition, however, is to develop a virtual platform library that can be plugged into commercial virtual platform simulators such as Synopsys's Platform Architect.

The proposed simulator helps hardware engineers explore a design space consisting of millions of architectural options. For example, it can track the behavior of the AMBA bus in a cycle-accurate manner. Specifically, it helps to evaluate the impact of bus parameters (e.g., arbitration protocol, number of outstanding transactions) on performance, as exemplified below.
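As a rough illustration of how an arbitration protocol changes the latency seen by each bus master, the sketch below models a single shared bus in plain Python. It is not the SystemC TLM model described here and does not implement the AMBA protocol; the function `simulate_bus`, its parameters, and the fixed burst length are hypothetical simplifications chosen for the example.

```python
from collections import deque


def simulate_bus(requests, policy, burst_cycles=4):
    """Toy model of one shared bus that serves one transaction at a time.

    requests: list of (arrival_cycle, master_id) tuples.
    policy:   "fixed" (lowest master id wins) or "round_robin".
    Returns {master_id: [wait cycles of each served transaction]}.
    """
    masters = sorted({m for _, m in requests})
    queues = {m: deque() for m in masters}
    waits = {m: [] for m in masters}
    pending = sorted(requests)
    cycle, idx, served = 0, 0, 0
    last = -1  # index of the most recently granted master

    while served < len(pending):
        # Enqueue every request that has arrived by the current cycle.
        while idx < len(pending) and pending[idx][0] <= cycle:
            arrival, master = pending[idx]
            queues[master].append(arrival)
            idx += 1
        ready = [m for m in masters if queues[m]]
        if not ready:
            # Event-driven step: skip idle cycles up to the next arrival.
            cycle = pending[idx][0]
            continue
        if policy == "fixed":
            grant = ready[0]
        else:
            # Round robin: search from the master after the last granted one.
            start = (last + 1) % len(masters)
            order = masters[start:] + masters[:start]
            grant = next(m for m in order if queues[m])
        last = masters.index(grant)
        waits[grant].append(cycle - queues[grant].popleft())
        cycle += burst_cycles  # the bus stays busy for one burst
        served += 1
    return waits


# Two masters, each issuing three requests at cycle 0.
reqs = [(0, 0)] * 3 + [(0, 1)] * 3
for policy in ("fixed", "round_robin"):
    print(policy, simulate_bus(reqs, policy))
```

In this toy scenario, fixed priority makes master 1 wait until master 0 has drained its queue, whereas round robin interleaves the two masters. The total bus occupancy is the same in both cases; only the distribution of waiting time changes.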

The simulator reports the utilization rate of each hardware resource (e.g., on-chip interconnect, off-chip memory, reconfigurable array), which helps to identify the performance bottleneck of the accelerator-centric SoC. The corresponding design space exploration covers the accelerator (e.g., loop tiling), the DMA controller (e.g., number of outstanding transactions), the off-chip memory (e.g., data layout), and the on-chip interconnect (e.g., arbitration protocol). Each of the millions of architectural options can be evaluated in less than a minute.
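The exploration loop itself can be sketched as an exhaustive sweep over the parameter combinations. In the example below, the parameter names and ranges are hypothetical, and `estimate_cycles` is a placeholder cost model invented for illustration; in an actual flow, that call would be replaced by a run of the simulator.

```python
from itertools import product

# Hypothetical parameter ranges, loosely mirroring the knobs named in the text.
SPACE = {
    "tile_size": [8, 16, 32],
    "outstanding_transactions": [1, 2, 4, 8],
    "arbitration": ["fixed", "round_robin"],
}


def estimate_cycles(tile_size, outstanding_transactions, arbitration):
    """Placeholder cost model (an assumption, not the simulator's model).

    The runtime is bounded by the slower of a compute term and a memory term.
    """
    compute = 4096 // tile_size * 10
    memory = 4096 // outstanding_transactions
    penalty = 1.0 if arbitration == "round_robin" else 1.2
    return max(compute, memory * penalty)


def explore(space):
    """Evaluate every combination and return (best_cycles, best_option)."""
    names = list(space)
    best = None
    for values in product(*space.values()):
        option = dict(zip(names, values))
        cycles = estimate_cycles(**option)
        if best is None or cycles < best[0]:
            best = (cycles, option)
    return best


best_cycles, best_option = explore(SPACE)
print(best_cycles, best_option)
```

An exhaustive sweep is used here only because the toy space has 24 points. For a space with millions of options, a pruned or guided search may be preferable, depending on the evaluation time per option.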

The proposed simulator has been verified against our in-house prototype system. The reference design runs on Digilent's ZYBO board, which includes Xilinx's ZYNQ and Micron's DDR3 memory. According to our measurements, the performance error stays below 5%, which supports the cycle accuracy of the event-driven modeling.

Example signal processing algorithms include the fast Fourier transform (FFT), convolutional neural networks (CNN), and orthogonal matching pursuit (OMP). The following figure summarizes the portfolio of SoC IPs that we have developed so far. Each SoC IP was designed and optimized using our proprietary design space exploration, which is illustrated next.