Hi, I'm Deepak Kumar
AI/ML Researcher & Software Engineer
Master's student at Illinois Institute of Technology specializing in LLM inference optimization, custom CUDA kernels, and high-performance computing. Passionate about pushing the boundaries of AI/ML systems through innovative research and engineering.
About Me
Passionate researcher and engineer focused on advancing AI/ML systems through innovative optimization techniques
Background
Location
United States
Experience
5+ Years in Software Engineering & AI/ML
Research Focus
LLM Inference Optimization, CUDA Kernels, Model Quantization
Key Highlights
- 5+ years of experience in AI/ML and software engineering
- Specialized in LLM inference optimization and CUDA programming
- Contributor to open-source projects like Inferneo
- Experience with distributed systems and microservices architecture
Education
Master of Computer Science
Illinois Institute of Technology, Chicago, IL
Bachelor of Technology
Dr. A. P. J. Abdul Kalam Technical University, Lucknow, India
Research Interests
Research Focus
My research centers on advancing AI/ML systems through innovative optimization techniques, with particular focus on LLM inference, GPU programming, and high-performance computing.
LLM Inference Optimization
Specialized in optimizing large language model inference through custom CUDA kernels, speculative decoding, and memory management techniques.
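To illustrate the draft-and-verify idea behind speculative decoding, here is a toy pure-Python sketch. The model callables, token values, and greedy acceptance rule are hypothetical stand-ins for real LLM logits; a production system verifies all draft positions in one batched target forward pass.

```python
def speculative_decode(target, draft, prompt, k=4, steps=8):
    """Toy draft-and-verify loop: a cheap `draft` model proposes k tokens,
    the expensive `target` model checks them; tokens are accepted up to the
    first disagreement, where one corrected token is appended instead."""
    seq = list(prompt)
    target_calls = 0
    for _ in range(steps):
        # Draft model proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # In a real system this verification is ONE batched target pass;
        # here we count it as a single call and check greedily per position.
        target_calls += 1
        accepted, ctx = [], list(seq)
        for t in proposal:
            expected = target(ctx)
            if expected != t:
                accepted.append(expected)  # correct the draft and stop
                break
            accepted.append(t)
            ctx.append(t)
        seq.extend(accepted)
    return seq, target_calls
```

When the draft agrees with the target, each verification step accepts all k tokens, which is where the throughput gain comes from.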
CUDA Kernel Development
Designed and implemented custom CUDA kernels for matrix operations, leveraging warp-level primitives and memory coalescing for optimal performance.
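The warp-level reduction pattern such kernels rely on (CUDA's `__shfl_down_sync`) can be modeled in plain Python as a log2-depth tree sum. The function below is an illustrative model of the lane behavior, not GPU code:

```python
def warp_tree_sum(lanes):
    """Model of a warp-level tree reduction: each round, lane i adds the
    value held by lane i + offset (the CUDA __shfl_down_sync pattern),
    halving the active offset each time. After log2(width) rounds, lane 0
    holds the sum of all lanes."""
    vals = list(lanes)
    width = len(vals)  # warp width; 32 on NVIDIA GPUs
    assert width & (width - 1) == 0, "width must be a power of two"
    offset = width // 2
    while offset > 0:
        # All lanes read their partner's pre-round value simultaneously,
        # which the list comprehension models by building a new list.
        vals = [vals[i] + (vals[i + offset] if i + offset < width else 0)
                for i in range(width)]
        offset //= 2
    return vals[0]
```

On a GPU this takes five shuffle instructions per warp with no shared-memory traffic, which is why it beats naive shared-memory reductions.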
Model Quantization
Research on model quantization techniques, including FP16/BF16/INT8 precision, for efficient model serving while maintaining accuracy.
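A minimal sketch of the symmetric INT8 scheme this kind of quantization builds on, in pure Python with a single per-tensor scale (real serving stacks typically quantize per-channel and fuse the scales into kernels):

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: map floats in [-amax, amax]
    onto integers in [-127, 127] via one scale factor."""
    amax = max(abs(w) for w in weights) or 1.0
    scale = amax / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate floats; error is bounded by scale / 2."""
    return [qi * scale for qi in q]
```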
High-Performance Computing
Focus on distributed systems, GPU profiling, and performance optimization for AI/ML workloads.
Inferneo: High-Performance LLM Inference Server
Contributing to an open-source, high-performance inference server for large language models. Focus areas include:
Designed custom CUDA kernels (SoftMax, matrix multiplication, convolution) leveraging warp-level primitives and memory coalescing
Implemented speculative decoding with draft models (DistilGPT2 + GPT-J-6B), achieving up to 1.8x speedup in throughput
Profiled model performance using Nsight Systems/Compute, identifying GPU bottlenecks and tuning execution
Developed a mixed-precision KV-cache strategy to support long-context inference in LLMs
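One way to picture a mixed-precision KV-cache is a recent full-precision window plus a compressed tail. The toy class below is an illustrative assumption (flat lists, per-entry INT8 scales), not the Inferneo implementation:

```python
class MixedPrecisionKVCache:
    """Toy mixed-precision KV cache: the most recent `window` entries stay
    in full precision; older entries are compressed to INT8 plus a scale,
    trading a little fidelity for roughly 4x less memory on the long tail."""

    def __init__(self, window=4):
        self.window = window
        self.recent = []    # full-precision vectors
        self.archive = []   # (int8 values, scale) pairs

    def append(self, vec):
        self.recent.append(list(vec))
        if len(self.recent) > self.window:
            old = self.recent.pop(0)
            amax = max(abs(x) for x in old) or 1.0
            scale = amax / 127.0
            self.archive.append(([round(x / scale) for x in old], scale))

    def entries(self):
        # Dequantize archived entries on read; recent ones pass through.
        tail = [[q * s for q in qs] for qs, s in self.archive]
        return tail + self.recent
```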
Research Impact
Publications & Writing
Research papers, technical articles, and contributions to the AI/ML community
Research Publications
Optimizing LLM Inference Through Custom CUDA Kernels and Speculative Decoding
This paper presents novel approaches to optimize large language model inference through custom CUDA kernel design and speculative decoding techniques, achieving significant performance improvements in throughput and latency.
Efficient Model Quantization for High-Performance AI Inference
We propose a mixed-precision quantization strategy that maintains model accuracy while significantly reducing memory footprint and improving inference speed for large language models.
Distributed Systems for Scalable AI/ML Workloads
A comprehensive study of distributed system architectures for handling large-scale AI/ML workloads, with focus on fault tolerance and performance optimization.
Technical Writing
Professional Experience
A journey through my professional career, showcasing impactful projects and technical achievements
Software Engineer ML
Inferneo
Contributing to an open-source, high-performance inference server for large language models.
Key Achievements
- Designed custom CUDA kernels (SoftMax, matrix multiplication, convolution) leveraging warp-level primitives and memory coalescing, improving GPU utilization by 35% over PyTorch defaults
- Implemented speculative decoding with draft models (DistilGPT2 + GPT-J-6B), achieving up to 1.8x speedup in throughput while maintaining target model accuracy
- Profiled model performance using Nsight Systems/Compute, identifying GPU bottlenecks and tuning execution for higher arithmetic intensity and SM occupancy
- Developed a mixed-precision KV-cache strategy to support long-context inference in LLMs
Technologies Used
Software Engineer
Oracle
Developed and maintained critical database systems and microservices for Oracle's cloud infrastructure.
Key Achievements
- Developed a Python-based graph algorithm to detect invalid objects and components in 100K+ Oracle Database logs daily, reducing customer service resolution time by 50% and cutting support costs
- Designed a microservice-powered debugging tool for cross-version DB issues, enabling 80% faster resolution across 20+ Oracle DB versions and enhancing engineering productivity
- Co-developed a distributed microservices platform managing 10K+ config files, enabling automated system-health analytics; led the API Gateway and Auth services using Flask + Spring Boot, deployed on Kubernetes with Helm
- Deployed full-stack observability using Prometheus + Grafana, monitoring 20+ metrics across services, including DB latency, API errors, and throughput
- Deployed and fully configured projects on cloud-based Kubernetes clusters with Helm, implementing auto-scaling, health checks, and rollout strategies for reliable, production-grade deployments
- Architected a robust CI/CD pipeline using Jenkins for a complex team project, streamlining deployment workflows for reliable, automated, and rapid delivery across multiple environments
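The invalid-object detection above can be pictured as reachability in a dependency graph: once an object is invalid, everything that depends on it transitively is affected. A hedged sketch (graph shape and object names are hypothetical):

```python
from collections import deque

def find_affected(dependents, invalid):
    """Given a dependency graph (object -> objects that depend on it) and a
    seed set of directly invalid objects, return every object transitively
    affected -- the cascade a recompile/repair pass would need to cover."""
    affected = set(invalid)
    queue = deque(invalid)
    while queue:
        obj = queue.popleft()
        for dep in dependents.get(obj, ()):
            if dep not in affected:
                affected.add(dep)
                queue.append(dep)   # BFS over the dependency edges
    return affected
```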
Technologies Used
Software Engineer
Finoit Inc
Led development of payment systems and B2B SaaS platforms with focus on scalability and performance.
Key Achievements
- Led the end-to-end development of a payment microservice for a B2B SaaS platform using Django REST Framework, Stripe, MySQL, and RabbitMQ, with Stripe 3D Secure and split-payment functionality, resulting in a 10% reduction in operating costs
- Developed an integrated platform to manage employee attendance, efficiency, project health, invoices, revenue distribution, and security, achieving 15% fewer project delays and an 18% increase in revenue
- Built a scalable lab order and reporting system that automated end-to-end workflows for 100K+ samples/month, reducing processing time by 37% and boosting annual revenue by $12.6M
- Developed an AWS Lambda function to process SNS events, enabling real-time data sync between platform databases and QuickBooks, reducing data discrepancies and improving operational efficiency
- Mentored junior engineers, conducted thorough code reviews, and raised overall code quality
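Split payments hinge on dividing an amount across parties without losing cents to rounding. This small largest-remainder sketch is illustrative only, independent of the actual Stripe integration:

```python
def split_amount(total_cents, weights):
    """Split an integer amount of cents across parties in proportion to
    `weights`, using largest-remainder rounding so the shares always sum
    exactly to the total (no lost or invented cents)."""
    total_w = sum(weights)
    raw = [total_cents * w / total_w for w in weights]
    shares = [int(r) for r in raw]          # floor each share
    leftover = total_cents - sum(shares)
    # Hand out the remaining cents to the largest fractional remainders.
    order = sorted(range(len(raw)),
                   key=lambda i: raw[i] - shares[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares
```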
Technologies Used
Software Engineer
Fluper Ltd
Developed high-performance web applications and e-commerce platforms with focus on user experience and scalability.
Key Achievements
- Enhanced existing API performance by 300% using asynchronous task management with RabbitMQ + Celery, and added caching to reduce redundant computations and database queries
- Restructured existing database tables and optimized queries, using Redis for caching, achieving up to a 50% improvement in ORM-heavy endpoints
- Collaborated on an e-commerce platform built on a microservice architecture, resulting in a 40% surge in online sales within the first quarter post-launch
- Developed a scalable social media platform for a startup with an Instagram-like feed, chat, and a lucky-draw system using RabbitMQ, Celery, Twilio, Firebase (chat), MySQL, PostgreSQL, EC2, S3, Redis, and Stripe for subscriptions, attracting 10K+ users and 30K+ posts within the first month
- Developed a full-stack application similar to Yelp, enabling users to create map pins, rate tourist spots, and search popular nearby locations; built with Django REST & MVC, Twilio, Angular, PostgreSQL (for geospatial queries), Google Maps, Firebase, AWS services, and OAuth 2.0 integration
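The caching idea behind those API optimizations can be sketched as a simple TTL memoization decorator. This is an in-process stand-in for illustration; the production setup used Redis as the shared cache:

```python
import time

def ttl_cache(ttl_seconds, clock=time.monotonic):
    """Decorator that caches a function's results for ttl_seconds, so
    repeated calls with the same arguments skip the recomputation (and,
    in the real system, the database round-trip)."""
    def wrap(fn):
        store = {}  # args tuple -> (timestamp, result)
        def inner(*args):
            now = clock()
            hit = store.get(args)
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]          # fresh cached value
            result = fn(*args)
            store[args] = (now, result)
            return result
        return inner
    return wrap
```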
Technologies Used
Recent Projects
A showcase of my recent research projects, technical implementations, and contributions to the AI/ML community
PolicyCheck AI
AI/ML Research: Building a policy-aware AI model to evaluate civil project compliance against government regulations using Retrieval-Augmented Generation (RAG) and LLM fine-tuning. Enables automated compliance verification with rule-specific explanations and violation detection.
LLAMA Fine-Tuning & RAG
AI/ML Research: Fine-tuned LLaMA models (7B/13B) using LoRA/PEFT for domain adaptation; integrated retrieval-augmented generation (RAG) pipelines with vector databases (FAISS), enabling low-latency domain-specific question answering.
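The retrieval step of such a RAG pipeline reduces to nearest-neighbor search over embeddings. This brute-force cosine-similarity sketch shows the semantics that a FAISS index accelerates at scale (the doc IDs and vectors are made up):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, index, k=2):
    """Rank (doc_id, embedding) pairs by similarity to the query and return
    the top-k doc IDs -- the context-selection step of a RAG pipeline."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]
```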
LLM Inference Optimization
Performance Engineering: Implemented quantization (FP16/INT8), KV-cache optimizations, and operator fusion to accelerate LLaMA-7B inference, reducing latency by ~35-40% while preserving output quality.
CUDA Vectorization & Kernels
GPU Programming: Designed custom CUDA kernels for vectorized matrix operations (dot product, reduction, normalization) using warp-level primitives and memory coalescing, improving GPU throughput by 30% over baseline PyTorch ops.
Distributed Microservices Platform
Distributed Systems: Co-developed a distributed microservices platform managing 10K+ config files, enabling automated system-health analytics. Led the API Gateway and Auth services using Flask + Spring Boot, deployed on Kubernetes with Helm.
E-commerce Platform
Web Development: Built an e-commerce platform on a microservice architecture, resulting in a 40% surge in online sales within the first quarter post-launch. Integrated payment processing, inventory management, and user analytics.
Research Impact Summary
Technical Skills
A comprehensive overview of my technical expertise across AI/ML, software engineering, and system architecture
Programming Languages
AI/ML & LLM
GPU & Performance
Backend & Frameworks
Databases & Cloud
DevOps & Tools
Additional Expertise
System Design
- High-Level Design
- Low-Level Design
- Microservices
- Event-Driven Architecture
Performance Engineering
- GPU Profiling
- Memory Optimization
- Latency Optimization
- Throughput Optimization
Research & Development
- Algorithm Design
- Performance Analysis
- Research Publication
- Technical Writing
Leadership
- Technical Leadership
- Mentoring
- Code Reviews
- Project Management
Certifications & Training
Awards & Recognition
Recognition for outstanding contributions to research, innovation, and professional excellence
Outstanding Research Contribution
Illinois Institute of Technology
Recognized for exceptional contributions to LLM inference optimization research and open-source development.
Best Technical Paper Award
Machine Learning Systems Conference
Awarded for the paper on 'Optimizing LLM Inference Through Custom CUDA Kernels and Speculative Decoding'.
Performance Excellence Award
Oracle Corporation
Recognized for outstanding performance in developing critical database systems and reducing customer service resolution time by 50%.
Innovation Award
Finoit Inc
Awarded for developing a scalable lab order system that boosted annual revenue by $12.6M and improved processing efficiency by 37%.
Dean's List
Illinois Institute of Technology
Consistently maintained GPA of 3.72/4.0 and demonstrated academic excellence in Computer Science program.
Open Source Contributor Award
Inferneo Project
Recognized for significant contributions to open-source LLM inference optimization and community development.
Additional Recognition
Get In Touch
I'm always interested in discussing new opportunities, research collaborations, and exciting projects. Feel free to reach out!
Send a Message
Currently Available for Opportunities
I'm actively seeking research collaborations, internships, and full-time opportunities in AI/ML and software engineering.