I am a second-year Ph.D. student at the University of Michigan, advised by Dr. Reetuparna Das. I am interested in optimizing memory architectures for memory-bandwidth-bound and latency-bound workloads. My current work uses Compute Express Link (CXL) for memory expansion and for host-to-accelerator and accelerator-to-accelerator interfaces.

Before starting my Ph.D., I was a hardware engineer at NVIDIA and Silicon Labs, where I worked on the microarchitecture of PCIe controllers and cryptographic accelerators, respectively.

  • Processing in Memory
  • CXL-based memory expansion and pooling

  • PhD in Computer Science and Engineering, 2022-Present

    University of Michigan, Ann Arbor

  • BTech in Electrical Engineering, 2016-2020

    Indian Institute of Technology, Jodhpur




NVIDIA
ASIC Engineer
August 2021 – July 2022, Bangalore
Microarchitecture and RTL design for PCIe 6.0
Silicon Labs
Design Engineer
August 2020 – August 2021, Hyderabad
Microarchitecture and RTL design for security accelerators (ChaCha20, Poly1305)
Research Intern
May 2019 – July 2019, Bangalore
Designed arithmetic units for posit numbers as an alternative to floating point


CXL-Based Memory Expansion for Databases
A CXL memory-expansion system that lets large in-memory databases avoid I/O accesses. Tables are partitioned and placed across the DRAM and CXL-memory hierarchy based on query workload analysis, then mapped across CXL devices to exploit device-level parallelism and reduce average access latency.
CXL-Enabled Large Language Model Accelerator
A processing-in-memory (PIM) LLM accelerator with processing units (PUs) placed adjacent to memory banks. Multiple PIM accelerators are connected over a CXL network to accommodate the large parameter sizes. Each CXL-attached accelerator also has a set of specialized functional units and a RISC-V core to support operations that cannot be executed solely on the PIM PUs.
N-Way Superscalar RISC-V Core
Superscalar RISC-V core based on the MIPS R10K, with a multi-ported I-cache with prefetching, branch prediction, and a non-blocking D-cache with a victim cache. Implemented in SystemVerilog as a course project (EECS470).