THESIS - Energy-Efficient eDRAM Compute-in-Memory Architecture using Multiplication-Free Bitwise Operators for DNN Acceleration
- Pioneered an ultra-compact eDRAM-based Compute-in-Memory (CiM) Framework: Developed a novel CiM macro built on single-transistor eDRAM cells augmented with two control transistors, delivering higher-density, lower-leakage storage and in-memory processing than traditional SRAM bitcells.
- Implemented a Multiplication-Free CiM Operator: Eliminated the need for Digital-to-Analog Converters (DACs) for multi-bit precision operations at the bitline level and relaxed the precision demands on the Analog-to-Digital Converters (ADCs), significantly reducing circuit complexity and power overhead.
- Architected a Structured Macro Design: Structured the framework into μ-arrays and μ-channels, where each μ-array stores one weight channel. Neural Network weights are stored across columns, and weight bitplanes are arranged across rows for efficient bitwise operations.
- Enabled Bitwise Multiplication within the Cell: Integrated two control transistors into the eDRAM bitcell, allowing the cell to perform bitwise multiplication by conditionally charging or discharging based on the stored weight bit and the inverted input fed through control signals.
- Optimized Analog-to-Digital Conversion: Employed 4-bit Flash ADCs and Transmission Gate MUXes for bitline summation and analog-to-digital conversion, followed by a digital shift-and-add operation to produce the final output. The Flash ADC was identified as the major power consumer, accounting for ∼64% of total power.
- Validated Robustness and Linearity: Demonstrated high linearity between the inputs and the summation line (SL) output. The design tolerated up to 35mV of process-induced variation, with two-sigma spreads well within bounds and the variability statistics following a Gaussian trend, confirming the architecture's stability and reliability.
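The bitplane organization and digital shift-and-add readout described above can be sketched as a behavioral golden model. This is an illustrative sketch only: function names are invented, and the analog bitline summation and flash ADC are idealized as exact digital sums.

```python
# Illustrative golden model of the macro's bitplane readout: each row holds
# one bitplane of the weights, a per-bitplane partial sum is digitized, and
# digital shift-and-add logic reconstructs the multi-bit dot product.

def bitplanes(weights, n_bits=4):
    """Decompose unsigned weights into n_bits bitplanes (LSB first)."""
    return [[(w >> b) & 1 for w in weights] for b in range(n_bits)]

def cim_dot(weights, inputs, n_bits=4):
    """Dot product via per-bitplane bitwise products plus shift-and-add."""
    total = 0
    for b, plane in enumerate(bitplanes(weights, n_bits)):
        # Bitline sum of bitwise products (the analog accumulation and the
        # idealized flash ADC conversion, modeled digitally here).
        partial = sum(wb * x for wb, x in zip(plane, inputs))
        total += partial << b  # digital shift-and-add
    return total

# For binary inputs this matches a direct dot product:
weights = [5, 3, 7, 1]
inputs = [1, 0, 1, 1]
assert cim_dot(weights, inputs) == sum(w * x for w, x in zip(weights, inputs))
```

The key point the sketch captures is that only one bitplane is resolved per conversion, which is what relaxes the ADC precision requirement.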
Dynamic Thread Clustering DRAM Scheduler for Optimized Latency and Throughput in Multi-Core Systems
- Designed and implemented a Multi-Metric Thread Clustering Scheduler that dynamically categorizes concurrent threads into Bandwidth and Latency clusters based on Misses Per Kilo Instruction (MPKI). This method strategically balances DRAM throughput and thread-level fairness.
- Developed an innovative composite "Niceness Metric" (Bank Level Parallelism − Row Buffer Locality) to manage thread priorities within the Bandwidth Cluster, shuffling priorities every 800 cycles to ensure fair resource allocation among memory-intensive applications.
- Engineered a Hybrid Scheduling Core that extends the First-Ready, First-Come, First-Served (FR-FCFS) baseline with both a robust High/Low Water Mark Write-Drain policy to prevent queue overflow and a load-aware Batching Policy to manage core request injection.
- Improved Row Buffer Hit Rate and Reduced Latency by integrating an Aggressive Precharge mechanism that monitors recent column accesses and a comprehensive auto-precharge check, minimizing row contention and rapidly preparing banks for new requests.
- Built a Real-Time Profiling and Adaptation System that recalculates all per-thread statistics (MPKI, Bank Level Parallelism, Row Buffer Locality) over periodic time quanta, enabling the scheduler to adapt cluster assignments dynamically to shifting application memory behavior.
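The MPKI-based clustering and niceness ordering above can be sketched as follows. The threshold and data structures are illustrative assumptions, not the scheduler's actual implementation.

```python
# Hedged sketch of MPKI-based thread clustering with niceness ordering
# (niceness = BLP - RBL) inside the Bandwidth Cluster. The threshold value
# is an assumed placeholder.
from dataclasses import dataclass

@dataclass
class ThreadStats:
    tid: int
    mpki: float  # Misses Per Kilo Instruction
    blp: float   # Bank Level Parallelism
    rbl: float   # Row Buffer Locality

MPKI_THRESHOLD = 10.0  # assumed cutoff separating the two clusters

def cluster_threads(stats):
    """Split threads into the Latency and Bandwidth clusters."""
    latency = [s for s in stats if s.mpki < MPKI_THRESHOLD]
    bandwidth = [s for s in stats if s.mpki >= MPKI_THRESHOLD]
    # Rank the Bandwidth Cluster by niceness = BLP - RBL, so threads with
    # high parallelism and low row-buffer locality are prioritized.
    bandwidth.sort(key=lambda s: s.blp - s.rbl, reverse=True)
    return latency, bandwidth
```

In the full design this recomputation runs once per time quantum, with the 800-cycle shuffle applied on top of the niceness ordering.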
Fault Injection Attack Analysis and Implementation on AES Cryptography using Clock and Power Glitching
- Implemented the Advanced Encryption Standard (AES) Algorithm on an FPGA board to establish a hardware-based cryptographic system for security testing and analysis.
- Executed Advanced Side-Channel Fault Injection Attacks against the cryptographic hardware by strategically injecting clock and power glitches to induce errors and observe fault propagation.
- Successfully Demonstrated a Cryptographic Break of the AES implementation by using hardware fault injection techniques (clock and power glitches), proving vulnerability to physical attacks.
- Developed Python scripting to control the fault-injection hardware and automate the entire attack process, from glitch injection to data capture and analysis.
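The glitch-parameter sweep at the core of such automation can be sketched as below. The Glitcher class is a hypothetical stand-in stub, not the real hardware control API, and the faulting parameter point is fabricated for illustration.

```python
# Illustrative sketch of automated glitch-parameter sweeping: iterate over
# (offset, width) pairs, run an encryption under each glitch setting, and
# record any ciphertext that differs from the golden (fault-free) output.
import itertools

class Glitcher:
    """Hypothetical stub for the fault-injection hardware controller."""
    def encrypt_with_glitch(self, offset, width):
        # A real implementation would arm a clock/power glitch with the
        # given timing offset and width, trigger AES on the FPGA, and read
        # back the ciphertext. Here we fake a fault at one parameter point.
        golden = bytes(16)
        if (offset, width) == (40, 3):
            return bytes([0xFF]) + golden[1:]  # simulated faulty ciphertext
        return golden

def sweep(glitcher, offsets, widths, golden):
    """Return (offset, width, ciphertext) triples that produced a fault."""
    faults = []
    for off, wid in itertools.product(offsets, widths):
        ct = glitcher.encrypt_with_glitch(off, wid)
        if ct != golden:
            faults.append((off, wid, ct))
    return faults
```

Collected faulty ciphertexts would then feed a differential fault analysis stage to recover key material.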
45nm Custom Design of an 8-bit MAC Datapath with Integrated SRAM for Neural Network Acceleration
- Designed a foundational 8-bit Multiplication and Accumulation (MAC) Datapath for Neural Network acceleration, optimizing performance and power consumption for its critical operations.
- Implemented the core arithmetic blocks, designing an 8-bit Adder and a 4-bit Multiplier as the fundamental components of the high-speed MAC unit.
- Developed an on-chip memory solution by integrating the design of a 32x32 SRAM array to store intermediate data, ensuring fast access and tight integration with the custom MAC unit.
- Completed the physical design (PD) of the entire MAC unit and memory subsystem using Cadence EDA tools and targeting the standard 45nm CMOS technology node, demonstrating proficiency in industry-standard process and tool flows.