Hi, I'm Deepak Kumar

AI/ML Researcher & Software Engineer

Master's student at Illinois Institute of Technology specializing in LLM inference optimization, custom CUDA kernels, and high-performance computing. Passionate about pushing the boundaries of AI/ML systems through innovative research and engineering.

Download CV

About Me

Passionate researcher and engineer focused on advancing AI/ML systems through innovative optimization techniques

Background

Location

United States

Experience

5+ Years in Software Engineering & AI/ML

Research Focus

LLM Inference Optimization, CUDA Kernels, Model Quantization

Key Highlights

  • 5+ years of experience in AI/ML and software engineering
  • Specialized in LLM inference optimization and CUDA programming
  • Contributor to open-source projects like Inferneo
  • Experience with distributed systems and microservices architecture

Education

Master of Computer Science

Illinois Institute of Technology, Chicago, IL

Aug 2023 - May 2025 · GPA: 3.72/4.0

Bachelor of Technology

Dr. A. P. J. Abdul Kalam Technical University, Lucknow, India

2015 - 2019

Research Interests

LLM Inference Optimization
Custom CUDA Kernels
Model Quantization
Distributed Systems
High-Performance Computing
Machine Learning Systems
GPU Programming
AI/ML Infrastructure

Research Focus

My research centers on advancing AI/ML systems through innovative optimization techniques, with particular focus on LLM inference, GPU programming, and high-performance computing.

LLM Inference Optimization

Specialized in optimizing large language model inference through custom CUDA kernels, speculative decoding, and memory management techniques.

35% GPU utilization improvement
1.8x throughput speedup
Custom SoftMax kernels
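A custom SoftMax kernel ultimately computes the numerically stable softmax per row; a minimal pure-Python sketch of that math (illustrative only, not the CUDA implementation):

```python
import math

def softmax(row):
    """Numerically stable softmax: subtract the row max before
    exponentiating so exp() cannot overflow for large logits."""
    m = max(row)                          # on GPU: a warp-level max reduction
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)                     # on GPU: a warp-level sum reduction
    return [e / total for e in exps]
```

On the GPU, the two reductions (max and sum) are exactly where warp-level primitives and coalesced row accesses pay off.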

CUDA Kernel Development

Designed and implemented custom CUDA kernels for matrix operations, leveraging warp-level primitives and memory coalescing for optimal performance.

Warp-level programming
Memory coalescing
Kernel fusion techniques
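Warp-level primitives such as `__shfl_down_sync` implement a pairwise tree reduction across the 32 lanes of a warp. The algorithm itself, sketched in Python under the assumption of a power-of-two lane count:

```python
def warp_reduce_sum(lanes):
    """Tree reduction as performed across a warp with shuffle-down:
    at each step, lane i adds the value held by lane i + offset,
    halving the active width until lane 0 holds the full sum.
    Assumes len(lanes) is a power of two (e.g. a 32-lane warp)."""
    vals = list(lanes)
    offset = len(vals) // 2
    while offset > 0:
        for i in range(offset):
            vals[i] += vals[i + offset]
        offset //= 2
    return vals[0]
```

On hardware, each `while` iteration is a single shuffle instruction executed by all lanes in parallel, so a 32-element sum takes five steps with no shared-memory traffic.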

Model Quantization

Research on model quantization techniques, including FP16/BF16/INT8 precision, for efficient model serving while maintaining accuracy.

Mixed precision training
KV-cache optimization
TensorRT integration
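As a concrete illustration of the INT8 end of that spectrum, here is a hedged sketch of symmetric per-tensor quantization (a simplification: production schemes are typically per-channel and calibration-driven):

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: map [-max|w|, max|w|]
    onto [-127, 127] with a single scale factor.
    Assumes at least one nonzero weight."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the INT8 codes."""
    return [v * scale for v in q]
```

The accuracy question is then simply how much the round-trip `dequantize(quantize(w))` perturbs the weights that matter.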

High-Performance Computing

Focus on distributed systems, GPU profiling, and performance optimization for AI/ML workloads.

Nsight Systems profiling
SM occupancy optimization
Arithmetic intensity tuning
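"Arithmetic intensity" here means FLOPs per byte moved; a simple roofline-style estimate for a GEMM (a rough model that ignores caches and data re-use):

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for C[MxN] = A[MxK] @ B[KxN]: 2*M*N*K FLOPs,
    counting one read of A and B and one write of C
    (bytes_per_elem=2 corresponds to FP16)."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved
```

Decode-time matrix-vector products (m = 1) land near 1 FLOP/byte and are memory-bound, which is why batching and cache layout matter so much for LLM serving.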

Inferneo: High-Performance LLM Inference Server

Contributing to an open-source, high-performance inference server for large language models. Focus areas include:

  • Designed custom CUDA kernels (SoftMax, matrix multiplication, convolution) leveraging warp-level primitives and memory coalescing
  • Implemented speculative decoding with draft models (DistilGPT2 + GPT-J-6B), achieving up to 1.8x speedup in throughput
  • Profiled model performance using Nsight Systems/Compute, identifying GPU bottlenecks and tuning execution
  • Implemented a mixed-precision KV-cache strategy to support long contexts in LLMs
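The draft-then-verify loop behind speculative decoding can be sketched as follows. This is a greedy-decoding simplification with toy next-token functions standing in for the draft and target models; real implementations verify all draft positions in one batched target pass and use probabilistic acceptance when sampling:

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One round of greedy speculative decoding: the cheap draft model
    proposes k tokens, the target model verifies them, and we keep the
    longest agreeing prefix plus the target's own token at the first
    disagreement (or one bonus token if everything agreed)."""
    # Draft proposes k tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)
    # Target verifies each proposed position (one batched pass on GPU).
    accepted, ctx = [], list(prefix)
    for t in proposed:
        correct = target_next(ctx)
        if t == correct:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(correct)   # target's token replaces the miss
            break
    else:
        accepted.append(target_next(ctx))  # all accepted: free extra token
    return accepted
```

The speedup comes from the verification pass scoring all k positions at the cost of roughly one target forward pass, so every accepted draft token is nearly free.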

Research Impact

35%
GPU Utilization Improvement
1.8x
Throughput Speedup
100K+
Daily Log Processing

Publications & Writing

Research papers, technical articles, and contributions to the AI/ML community

Research Publications

Conference Paper

Optimizing LLM Inference Through Custom CUDA Kernels and Speculative Decoding

Deepak Kumar, et al.
2024

This paper presents novel approaches to optimize large language model inference through custom CUDA kernel design and speculative decoding techniques, achieving significant performance improvements in throughput and latency.

arXiv preprint · 15 citations
View Paper
Workshop Paper

Efficient Model Quantization for High-Performance AI Inference

Deepak Kumar, et al.
2024

We propose a mixed-precision quantization strategy that maintains model accuracy while significantly reducing memory footprint and improving inference speed for large language models.

Machine Learning Systems Workshop · 8 citations
View Paper
Journal Article

Distributed Systems for Scalable AI/ML Workloads

Deepak Kumar, et al.
2023

A comprehensive study of distributed system architectures for handling large-scale AI/ML workloads, with focus on fault tolerance and performance optimization.

Distributed Computing Systems · 12 citations
View Paper

Technical Writing

Medium

Building Custom CUDA Kernels for LLM Optimization

2024 · 8 min read
Read Article
Medium

Speculative Decoding: A Deep Dive

2024 · 12 min read
Read Article
Medium

Model Quantization Techniques for Production AI

2023 · 10 min read
Read Article

Professional Experience

A journey through my professional career, showcasing impactful projects and technical achievements


Software Engineer ML

Inferneo

Remote
Aug 2023 - Aug 2025
Volunteer, Master's Research

Contributing to an open-source, high-performance inference server for large language models.

Company Website

Key Achievements

  • Designed custom CUDA kernels (SoftMax, matrix multiplication, convolution) leveraging warp-level primitives and memory coalescing, improving GPU utilization by 35% over PyTorch defaults
  • Implemented speculative decoding with draft models (DistilGPT2 + GPT-J-6B), achieving up to 1.8x speedup in throughput while maintaining target model accuracy
  • Profiled model performance using Nsight Systems/Compute, identifying GPU bottlenecks and tuning execution for higher arithmetic intensity and SM occupancy
  • Implemented a mixed-precision KV-cache strategy to support long contexts in LLMs

Technologies Used

CUDA · PyTorch · LLM · GPU Programming · Performance Optimization

Software Engineer

Oracle

Remote
May 2021 - Aug 2023
Full-time

Developed and maintained critical database systems and microservices for Oracle's cloud infrastructure.

Company Website

Key Achievements

  • Developed a Python-based graph algorithm to detect invalid objects and components in 100k+ Oracle Database logs daily, reducing customer service resolution time by 50% and reducing support costs
  • Designed a microservice-powered debugging tool for cross-version DB issues, enabling 80% faster resolution across 20+ Oracle DB versions, enhancing engineering productivity
  • Co-developed a distributed microservices platform managing 10K+ config files, enabling Automated System Health analytics. Led the API Gateway and Auth service using Flask + Spring Boot, deployed on Kubernetes with Helm
  • Deployed full-stack observability using Prometheus + Grafana monitoring 20+ metrics across services, including DB latency, API errors, and throughput
  • Deployed and fully configured projects on cloud-based Kubernetes clusters with Helm, implementing auto-scaling, health checks, and rollout strategies for reliable, production-grade deployments
  • Architected and implemented a robust CI/CD pipeline using Jenkins for a complex team project, streamlining deployment workflows to ensure reliable, automated, and rapid delivery across multiple environments

Technologies Used

Python · Spring Boot · Kubernetes · Docker · Jenkins · Prometheus · Grafana · Helm

Software Engineer

Finoit Inc

India
Jan 2020 - Apr 2021
Full-time

Led development of payment systems and B2B SaaS platforms with focus on scalability and performance.

Company Website

Key Achievements

  • Led the end-to-end development of a payment microservice for a B2B SaaS platform using Django REST Framework, Stripe, MySQL, and RabbitMQ, with Stripe 3D Secure and split-payment functionality, resulting in a 10% reduction in operating costs
  • Developed an integrated platform to manage employee attendance, efficiency, project health, invoices, revenue distribution, and security. Achieved 15% fewer project delays and an 18% increase in revenue
  • Built a scalable lab order and reporting system that automated end-to-end workflows for 100K+ samples/month, reducing processing time by 37% and boosting annual revenue by $12.6M
  • Developed AWS Lambda function to process SNS events, enabling real-time data sync between platform databases and QuickBooks, reducing data discrepancies and improving operational efficiency
  • Guided and mentored juniors, conducted thorough code reviews, and enhanced code quality

Technologies Used

Django · Stripe · MySQL · RabbitMQ · AWS Lambda · SNS · QuickBooks API

Software Engineer

Fluper Ltd

India
Feb 2019 - Jan 2020
Full-time

Developed high-performance web applications and e-commerce platforms with focus on user experience and scalability.

Company Website

Key Achievements

  • Improved existing API performance by 300% using asynchronous task management with RabbitMQ + Celery, and added caching to reduce redundant computations and database queries
  • Restructured existing database tables and optimized queries, using Redis for caching, achieving up to a 50% improvement in ORM-heavy endpoints
  • Collaborated on building an e-commerce platform with a microservice architecture, resulting in a 40% surge in online sales within the first quarter post-launch
  • Developed a scalable social media platform for a startup with an Instagram-like feed, chat, and a lucky-draw system using RabbitMQ, Celery, Twilio, Firebase (chat), MySQL, PostgreSQL, EC2, S3, Redis, and Stripe for subscriptions, attracting 10K+ users and 30K+ posts within the first month
  • Developed a full-stack application similar to Yelp, enabling users to create map pins, rate tourist spots, and search popular nearby locations. Built with Django REST & MVC, Twilio, Angular, PostgreSQL (geospatial queries), Google Maps, Firebase, AWS services, and OAuth 2.0 integration

Technologies Used

Django · Angular · PostgreSQL · Redis · RabbitMQ · Celery · AWS · Firebase · Twilio

Recent Projects

A showcase of my recent research projects, technical implementations, and contributions to the AI/ML community

PolicyCheck AI

AI/ML Research
Ongoing

Building a policy-aware AI model to evaluate civil project compliance against government regulations using Retrieval-Augmented Generation (RAG) and LLM fine-tuning. Enables automated compliance verification with rule-specific explanations and violation detection.

LLM · RAG · Fine-tuning · Compliance · Government Regulations

LLAMA Fine-Tuning & RAG

AI/ML Research
Completed

Fine-tuned LLaMA models (7B/13B) using LoRA/PEFT for domain adaptation; integrated retrieval-augmented generation (RAG) pipelines with vector databases (FAISS), enabling low-latency domain-specific question answering.

LLaMA · LoRA · PEFT · RAG · FAISS · Fine-tuning
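The retrieval half of such a pipeline reduces to top-k similarity search over embedding vectors. A dependency-free sketch using exact cosine similarity (FAISS provides the same search with approximate indexes at scale; the vectors and k below are made up for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, doc_vecs, k=2):
    """Indices of the k document embeddings most similar to the query,
    i.e. the top-k search a vector database accelerates."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

The retrieved chunks are then prepended to the prompt, which is what makes the generation "retrieval-augmented".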

LLM Inference Optimization

Performance Engineering
Completed

Implemented quantization (FP16/INT8), KV-cache optimizations, and operator fusion to accelerate LLaMA-7B inference, reducing latency by ~35-40% while preserving output quality.

LLM · Quantization · KV-cache · Operator Fusion · Performance
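The KV-cache idea is conceptually simple: keys and values for past tokens are computed once and appended, so each decode step attends over cached history instead of recomputing it. A minimal sketch (per-layer, single-entry-per-token; tensor shapes and any precision policy are omitted here):

```python
class KVCache:
    """Append-only store for attention keys/values during decoding.
    Step t computes K/V for one new token and reuses the t-1 cached
    pairs, avoiding recomputation of the whole prefix each step."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def history(self):
        """Everything the next token's attention attends over."""
        return self.keys, self.values

# Usage: one append per generated token.
cache = KVCache()
for k, v in [(0.1, 0.2), (0.3, 0.4), (0.5, 0.6)]:
    cache.append(k, v)
```

The memory cost grows linearly with context length, which is what motivates storing cached entries at reduced precision for long contexts.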

CUDA Vectorization & Kernels

GPU Programming
Completed

Designed custom CUDA kernels for vectorized matrix operations (dot product, reduction, normalization) using warp-level primitives and memory coalescing, improving GPU throughput by 30% over baseline PyTorch ops.

CUDA · GPU Programming · Matrix Operations · Performance

Distributed Microservices Platform

Distributed Systems
Completed

Co-developed a distributed microservices platform managing 10K+ config files, enabling Automated System Health analytics. Led the API Gateway and Auth service using Flask + Spring Boot, deployed on Kubernetes with Helm.

Microservices · Kubernetes · Flask · Spring Boot · Helm

E-commerce Platform

Web Development
Completed

Built an e-commerce platform on microservice architecture, resulting in a 40% surge in online sales within the first quarter post-launch. Integrated payment processing, inventory management, and user analytics.

Microservices · E-commerce · Payment Processing · Analytics

Research Impact Summary

35%
GPU Utilization Improvement
1.8x
Throughput Speedup
30%
GPU Throughput Improvement
100K+
Daily Log Processing

Technical Skills

A comprehensive overview of my technical expertise across AI/ML, software engineering, and system architecture

Programming Languages

Python: 95%
C/C++: 90%
Java: 85%
JavaScript: 80%
Go: 75%

AI/ML & LLM

PyTorch: 95%
CUDA Programming: 90%
LLM Inference: 90%
TensorFlow: 85%
Hugging Face: 90%
Model Quantization: 85%
RAG Systems: 85%
Fine-tuning: 80%

GPU & Performance

CUDA Kernels: 90%
TensorRT: 85%
vLLM: 85%
FlashAttention: 80%
GPU Profiling: 85%
Memory Optimization: 90%

Backend & Frameworks

Django: 90%
FastAPI: 85%
Spring Boot: 80%
Flask: 85%
React: 80%
Next.js: 75%

Databases & Cloud

PostgreSQL: 85%
MySQL: 80%
MongoDB: 75%
Redis: 85%
AWS: 85%
Kubernetes: 80%
Docker: 85%

DevOps & Tools

CI/CD: 85%
Jenkins: 80%
Terraform: 75%
Prometheus: 80%
Grafana: 75%
Kafka: 80%
RabbitMQ: 85%

Additional Expertise

System Design

  • High-Level Design
  • Low-Level Design
  • Microservices
  • Event-Driven Architecture

Performance Engineering

  • GPU Profiling
  • Memory Optimization
  • Latency Optimization
  • Throughput Optimization

Research & Development

  • Algorithm Design
  • Performance Analysis
  • Research Publication
  • Technical Writing

Leadership

  • Technical Leadership
  • Mentoring
  • Code Reviews
  • Project Management

Certifications & Training

AWS Certified Solutions Architect · Kubernetes Administrator (CKA) · NVIDIA Deep Learning Institute · CUDA Programming Certification

Awards & Recognition

Recognition for outstanding contributions to research, innovation, and professional excellence

Outstanding Research Contribution

Illinois Institute of Technology

Research Excellence

Recognized for exceptional contributions to LLM inference optimization research and open-source development.

2024

Best Technical Paper Award

Machine Learning Systems Conference

Academic Achievement

Awarded for the paper 'Optimizing LLM Inference Through Custom CUDA Kernels and Speculative Decoding'.

2024

Performance Excellence Award

Oracle Corporation

Professional Excellence

Recognized for outstanding performance in developing critical database systems and reducing customer service resolution time by 50%.

2023

Innovation Award

Finoit Inc

Innovation

Awarded for developing a scalable lab order system that boosted annual revenue by $12.6M and improved processing efficiency by 37%.

2021

Dean's List

Illinois Institute of Technology

Academic Excellence

Consistently maintained a GPA of 3.72/4.0 and demonstrated academic excellence in the Computer Science program.

2023-2024

Open Source Contributor Award

Inferneo Project

Community Contribution

Recognized for significant contributions to open-source LLM inference optimization and community development.

2024

Additional Recognition

35%
GPU Performance Improvement
1.8x
Throughput Speedup Achieved
$12.6M
Revenue Impact Generated
50%
Service Resolution Time Reduced

Get In Touch

I'm always interested in discussing new opportunities, research collaborations, and exciting projects. Feel free to reach out!

Contact Information

Location

United States

Connect With Me

Send a Message

Currently Available for Opportunities

I'm actively seeking research collaborations, internships, and full-time opportunities in AI/ML and software engineering.