
Qingyao Yang

Master's degree candidate at UCAS

1. Education

  • 2022.09 - Present University of Chinese Academy of Sciences, GPA 3.72/4 (also with the Institute of Microelectronics, Chinese Academy of Sciences)
  • 2018.09 - 2022.06 Beijing University of Chemical Technology, GPA 87.49/100, Rank 2/58

2. Research Experience

2.1 Overview

My main research areas are currently deep learning for image processing, digital IC design, and hardware-software co-design.
  • Hardware-Software Co-Design:
    In this field, my current research focuses on optimizing transformers for hardware: reducing MACs, the weight storage overhead of the hardware implementation, and the memory access overhead during inference.

  • Deep Learning & Image Processing:
    My research in this area covers remote sensing image processing and the application of deep learning algorithms. I design high-precision, low-parameter deep learning algorithms for target detection in hyperspectral and nighttime remote sensing images, focusing on the YOLO series and transformers. In the remaining year of my master’s program, I will focus on spiking neural networks.

  • Digital IC Design:
    My research in this field mainly covers FPGA and ASIC design for image processing. During my undergraduate studies and internship, I performed RTL design of image processing algorithms on FPGAs. In my master’s program, I designed a lightweight transformer ASIC based on a 64Kb compute-in-memory (CIM) unit.


2.2 Research Area I: Hardware-Software Co-Design

2.2.1 Hardware Friendly Transformer Optimization with Dynamic Attention Matrix Fusion [paper]

Manuscript finished for the 2025 ACM/IEEE Design Automation Conference.

Dynamic matrix multiplication (DMM) in multi-head self-attention poses significant challenges for the design of transformer accelerators, especially those based on compute-in-memory (CIM).

  • Additional Memory Access: In a CIM macro, both the weights and the inputs of a DMM are generated at runtime. This causes redundant memory access or requires a transpose buffer to handle intermediate data efficiently.
  • Power Consumption: For instance, in MulTCIM (JSSC 2024), during BERT-base inference, the DMM operations account for a large share of the total power consumption, even though these computations represent only a small fraction of the total MACs.
[Figure: DMM factor]

I proposed a dynamic attention matrix fusion (DAMF) method that addresses the challenges above at the level of the algorithmic structure.

  • For the first DMM (the query-key product), I introduced a quadratic form fusion of the weight matrices and an SVD approximation, transforming the DMM into fewer scalar operations and eliminating the linear transformations for query and key generation.
  • For the second DMM (the attention-value product), I proposed approximating softmax using a Maclaurin series with a power of 2 as a shift factor, so that the costly arithmetic is replaced with a hardware-friendly shift operation.
  • Experimental results show that the proposed method causes no significant accuracy loss on BERT-base. It also reduces parameters by 1.99×, DMM MACs by 284×, and total MACs by 2.21×.
[Figure: DAMFtable]
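The identity behind fusing the weight matrices can be checked numerically. The sketch below is an illustration only, not the DAMF implementation: it assumes a single attention head without bias terms, uses hypothetical toy matrices (`X`, `W_q`, `W_k`), and leaves out the SVD approximation step. It shows that the attention scores computed from runtime-generated Q and K equal the quadratic form built from one offline-fused matrix.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


# Hypothetical toy sizes: 3 tokens, model width 4, head width 2.
X = [[1.0, 0.5, -1.0, 2.0],
     [0.0, 1.5, 0.5, -0.5],
     [2.0, -1.0, 1.0, 0.0]]
W_q = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1], [0.0, 0.6]]
W_k = [[0.1, 0.2], [-0.3, 0.5], [0.7, -0.2], [0.4, 0.0]]

# Conventional path: generate Q and K at runtime, then multiply them.
Q = matmul(X, W_q)
K = matmul(X, W_k)
scores_dynamic = matmul(Q, transpose(K))

# Fused path: W_qk is computed once offline, so Q and K are never formed.
W_qk = matmul(W_q, transpose(W_k))
scores_fused = matmul(matmul(X, W_qk), transpose(X))

# Largest element-wise difference between the two score matrices.
max_diff = max(abs(a - b)
               for row_a, row_b in zip(scores_dynamic, scores_fused)
               for a, b in zip(row_a, row_b))
```

Because `W_qk` depends only on trained weights, it can be stored ahead of time; how it is then compressed and mapped to hardware is outside the scope of this sketch.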

2.3 Research Area II: Deep Learning & Image Processing

2.3.1 Compressive Hyperspectral Target Detection [paper1] | [paper2] | [code]

Master's Thesis Project. Published in IEEE Transactions on Geoscience and Remote Sensing.

Compressive sensing (CS) is a key technology in hyperspectral imaging. Extracting useful information directly from compressed images reduces the computation load, storage load, and transmission bandwidth of the front-end platform, which helps realize low-power detection systems. Compressive hyperspectral target detection has not been widely explored, as it faces two challenges:

  • Spatial and spectral aliasing from CS can hinder feature extraction.
  • Detection methods cannot adapt to the uncertainty brought by the sensing matrix and sparse matrix in CS.
[Figure: CSTTD]
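For readers unfamiliar with CS, the basic sampling step can be sketched as a linear projection of each pixel's spectrum onto fewer measurements. This is a generic illustration under stated assumptions, not the sensing model of the paper: the Gaussian sensing matrix, the function names, and the sizes (100 bands, 16 measurements) are all hypothetical choices.

```python
import random


def make_sensing_matrix(num_measurements, num_bands, seed=0):
    """Draw a random Gaussian sensing matrix (one common choice in CS)."""
    rng = random.Random(seed)
    scale = 1.0 / num_measurements ** 0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(num_bands)]
            for _ in range(num_measurements)]


def compress_spectrum(phi, spectrum):
    """Compute the compressed measurement y = phi @ spectrum."""
    return [sum(p * s for p, s in zip(row, spectrum)) for row in phi]


# Hypothetical pixel with 100 spectral bands, compressed to 16 measurements.
num_bands, num_measurements = 100, 16
spectrum = [0.5 + 0.4 * (band % 10) / 10 for band in range(num_bands)]
phi = make_sensing_matrix(num_measurements, num_bands)
y = compress_spectrum(phi, spectrum)

# Fraction of the spectral dimension removed by the projection.
compression = 1.0 - num_measurements / num_bands
```

Each entry of `y` mixes all original bands, which is the source of the spectral aliasing mentioned above; a detector operating on `y` must cope with that mixing without reconstructing `spectrum` first.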

By introducing the restricted distribution property (RDP) of compressed hyperspectral images, I proposed the first deep learning method for compressed hyperspectral target detection. My contributions include:

  • Developed a novel Restricted Distribution Property (RDP) in compressive sensing, which preserves the probabilistic characteristics of spectral random vectors.
  • Proposed a triplet transformer with RDP-based data augmentation and a Combined Convolution-based similarity metric.
  • Introduced a two-stage ISIA semi-supervised training method.
  • Experimental results show that the mistake rate is reduced by 39.9%, the signal-to-noise ratio is improved by 4.6 times, and the number of spectral bands is reduced by 84%.

2.3.2 Vessel Detection in Nighttime Remote Sensing Image [paper] | [code]

College Students' Innovative Entrepreneurial Training Plan Program. Published in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing.

Nighttime remote sensing is a vital supplement to visible-light remote sensing, and target detection in nighttime imagery holds significant research value. However, vessel detection in nighttime remote sensing images lacks dataset support, and traditional target detection methods struggle with the small scale of vessel targets in remote sensing images. To address these issues, our research made the following contributions:

  • Published the first nighttime remote sensing vessel dataset.
  • Modified YOLOv5 to adapt to small-scale targets through Adaptively Spatial Feature Fusion (ASFF) and an improved Feature Pyramid Network (FPN).
  • Applied a sea-land mask based on prior geographic information to eliminate false alarms on land.
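The sea-land masking step can be sketched as a post-processing filter on detector output. This is a minimal illustration, assuming a binary mask aligned with the image and a center-point test; the function name, the box format, and the toy data are hypothetical and not taken from the published code.

```python
def filter_detections_by_land_mask(detections, land_mask):
    """Keep only detections whose box center lies on a sea pixel.

    detections: list of (x_min, y_min, x_max, y_max, score) in pixel units.
    land_mask: 2D list indexed [row][col], 1 for land and 0 for sea.
    """
    height, width = len(land_mask), len(land_mask[0])
    kept = []
    for x_min, y_min, x_max, y_max, score in detections:
        # Clamp the box center to the mask bounds before the lookup.
        col = min(max(int((x_min + x_max) / 2), 0), width - 1)
        row = min(max(int((y_min + y_max) / 2), 0), height - 1)
        if land_mask[row][col] == 0:
            kept.append((x_min, y_min, x_max, y_max, score))
    return kept


# Toy 4x6 mask: the two right-hand columns are land.
land_mask = [[0, 0, 0, 0, 1, 1] for _ in range(4)]
detections = [
    (0, 0, 2, 2, 0.9),  # center (col 1, row 1) is sea: kept
    (4, 1, 6, 3, 0.8),  # center (col 5, row 2) is land: dropped
]
vessels = filter_detections_by_land_mask(detections, land_mask)
```

A center-point test is the simplest criterion; an overlap ratio between the box and the land region would be a stricter alternative near coastlines.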

2.4 Research Area III: Digital Integrated Circuit Design

2.4.1 CIM-Based Transformer Accelerator for Hyperspectral Target Detection

Based on a 64Kb CIM unit, I implemented an ASIC design for the lightweight hyperspectral target detection transformer proposed in my earlier research, with the DAMF optimization applied. The main work is as follows:

  • Developed the RTL design of the linear layer based on a behavioral model of the CIM unit, implementing weight writing/updating and MAC operations.
  • Designed a softmax approximation circuit based on the Maclaurin series and power-of-two approximations.
  • Conducted quantization-aware training (QAT) for the transformer and deployed the 8-bit quantized weights to the accelerator.
  • Performed backend design in SMIC 55nm technology, completing synthesis, consistency verification, and ICC layout design, and achieving an energy efficiency of 2.81 TOPS/W at 1.32V and 100MHz.
[Figures: CIM_trans1, CIM_trans2]
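A software reference model helps make the softmax approximation concrete. The text above names only the two ingredients (a Maclaurin series and power-of-two approximations); the particular decomposition below is an assumption for illustration, not the circuit's actual datapath. It splits each exponent into an integer power of two (a shift in hardware) and a small remainder handled by a second-order Maclaurin series, then rounds the normalizer to the nearest power of two so that the division also becomes a shift.

```python
import math

LN2 = math.log(2.0)


def exp_approx(x):
    """Approximate e**x for x <= 0 as 2**k times a Maclaurin series of e**r.

    x = k * ln2 + r with integer k <= 0 and r in (-ln2, 0], so the factor
    2**k is a right shift in hardware and only the short series needs
    multipliers.
    """
    k = math.ceil(x / LN2)          # integer shift amount (k <= 0)
    r = x - k * LN2                 # remainder in (-ln2, 0]
    series = 1.0 + r + r * r / 2.0  # second-order Maclaurin series of e**r
    return math.ldexp(series, k)    # series * 2**k


def softmax_shift(scores):
    """Softmax with the normalizer rounded to the nearest power of two."""
    peak = max(scores)
    exps = [exp_approx(s - peak) for s in scores]
    total = sum(exps)
    shift = round(math.log2(total))  # exponent of the nearest power of two
    return [math.ldexp(e, -shift) for e in exps]


probs = softmax_shift([2.0, 1.0, 0.1])
```

Because the normalizer is rounded, the outputs sum to a value between roughly 0.71 and 1.41 rather than exactly 1; the ranking of the scores is preserved, which is often what matters for attention weights.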

2.4.2 Multi-Model FFT Processor [code]

Course project for Advanced Digital Integrated Circuit, University of Chinese Academy of Sciences.
[Figure: FFT]

2.4.3 FAST-Feature Detection Circuit [code]

FAST (Features from Accelerated Segment Test) is a widely used corner detection method that combines a template test with machine learning. It identifies corners by comparing the intensities of pixels in a circular neighborhood with the intensity of the center point. My work on the FAST circuit includes:

  • Completed the RTL design for FAST detection, including candidate point selection based on contiguous pixels in the Bresenham circle neighborhood and non-maximum suppression.
  • Designed the timing conversion circuit between AXI4-Stream and VGA timing.
  • Verified the design on a Xilinx FPGA.
[Figures: FAST, FASTCMP]
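The candidate selection step can be expressed as a short software reference model, which is also a convenient golden model when checking RTL output. This is a sketch of the standard segment test, not the circuit itself: the threshold of 20 and the run length of 9 (FAST-9) are assumed values, and non-maximum suppression is not included.

```python
# Offsets (dx, dy) of the 16-pixel Bresenham circle of radius 3, in ring order.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]


def is_fast_corner(image, x, y, threshold=20, min_run=9):
    """Segment test: True if min_run contiguous circle pixels are all
    brighter than center + threshold or all darker than center - threshold."""
    center = image[y][x]
    ring = [image[y + dy][x + dx] for dx, dy in CIRCLE]
    for sign in (1, -1):  # +1 tests brighter pixels, -1 tests darker pixels
        flags = [sign * (p - center) > threshold for p in ring]
        run = 0
        # Walk the ring twice so runs that wrap past index 15 are counted.
        for flag in flags + flags:
            run = run + 1 if flag else 0
            if run >= min_run:
                return True
    return False


# Toy 7x7 images, tested at the center pixel (3, 3).
flat = [[100] * 7 for _ in range(7)]
spot = [[50] * 7 for _ in range(7)]
spot[3][3] = 200
edge = [[200 if col <= 3 else 50 for col in range(7)] for _ in range(7)]
```

With these inputs, `flat` and `edge` are rejected and `spot` is accepted: a straight vertical edge yields only 7 contiguous darker pixels on the ring, which is below the run length of 9.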

2.4.4 Realization of Contrast-Limited Adaptive Histogram Equalization on FPGA [code]

Contrast-Limited Adaptive Histogram Equalization (CLAHE) improves image detail and contrast by performing histogram equalization within local regions while limiting contrast to prevent over-enhancement. My work on the CLAHE circuit includes:

  • Designed a dual-port RAM and ping-pong buffer mechanism to simultaneously perform histogram calculation and equalization for two consecutive image frames, with a RAM control circuit to avoid read-write conflicts.
  • Developed a block-based histogram calculation circuit that divides the global image into local blocks and accumulates the grayscale histogram within each block.
  • Created an equalization circuit with grayscale value clipping and redistribution functions.
  • Designed a bilinear interpolation circuit to calculate interpolation coefficients based on the distance between pixels and adjacent blocks, addressing image blocking artifacts.
  • Verified the design on a Xilinx FPGA with an average percentage error of 0.13%.
[Figures: CLAHE, CLAHE2]
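The clipping and redistribution step can be sketched in integer arithmetic, close to what a hardware implementation would compute. This is an illustrative model under assumptions, not the circuit's exact behavior: it uses a single redistribution pass that spreads the clipped excess evenly over all bins, and the function names, the 8-bin histogram, and the clip limit of 10 are hypothetical.

```python
def clip_histogram(hist, clip_limit):
    """Clip each bin at clip_limit and spread the excess evenly over all bins."""
    excess = sum(max(count - clip_limit, 0) for count in hist)
    clipped = [min(count, clip_limit) for count in hist]
    share, remainder = divmod(excess, len(hist))
    # The first `remainder` bins take one extra count so the total is preserved.
    return [count + share + (1 if i < remainder else 0)
            for i, count in enumerate(clipped)]


def equalization_lut(hist, levels):
    """Map each bin index to an output level via the cumulative histogram."""
    total = sum(hist)
    lut, running = [], 0
    for count in hist:
        running += count
        lut.append((running * (levels - 1)) // total)
    return lut


# Toy 8-bin block histogram with one dominant gray level.
block_hist = [2, 3, 40, 5, 4, 3, 4, 3]
clipped_hist = clip_histogram(block_hist, clip_limit=10)
lut = equalization_lut(clipped_hist, levels=8)
```

In a single pass, a bin that was clipped can end up slightly above the limit again after receiving its share of the excess; some implementations iterate the redistribution to avoid this. The per-block lookup tables would then be blended by the bilinear interpolation stage described above.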