Experienced research engineer deploying and accelerating multimodal LLMs end-to-end. I stand up HPC clusters (Slurm/Kubernetes), push GPU MFU with FSDP, quantization, and custom CUDA kernels, and ship ultra-low-latency model serving (Triton Inference Server/Ray Serve, KV caching, paged attention, Triton kernels). Patented inventor who owns SLOs, throughput, and $/token.
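A minimal sketch of the FSDP plus bf16 mixed-precision pattern mentioned above, assuming a torchrun launch; the stand-in encoder layer and its sizes are illustrative, not production training code.

```python
# Minimal FSDP sharding sketch with bf16 mixed precision; assumes a torchrun launch.
# The stand-in TransformerEncoderLayer and sizes are illustrative only.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

def shard(model: torch.nn.Module) -> FSDP:
    # bf16 params and gradient reduction keep tensor cores saturated and raise MFU
    mp = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )
    return FSDP(model, mixed_precision=mp, device_id=torch.cuda.current_device())

if __name__ == "__main__":
    dist.init_process_group("nccl")  # torchrun supplies rank/world-size env vars
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    layer = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True).cuda()
    sharded = shard(layer)  # drop into a standard training loop from here
```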
Seed-stage startup backed by Khosla Ventures, Greylock, and Asimov Ventures, where I build, scale, and ship multimodal LLMs end-to-end: from data and training to ultra-low-latency serving.
Tech: PyTorch, CUDA, OpenAI Triton, FlashAttention, Slurm, Ray, Kubernetes, Terraform/Ansible, NVMe/InfiniBand, S3/GCS, Triton Inference Server/Ray Serve, gRPC/WebSockets, Prometheus/Grafana
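A hedged sketch of the serving side of this stack, shown as a Ray Serve deployment; the deployment name, replica count, GPU request, and stand-in model are assumptions for illustration, not the production serving graph.

```python
# Minimal Ray Serve deployment sketch; replica count, GPU request, and the stand-in
# model are illustrative assumptions.
from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
class Generator:
    def __init__(self) -> None:
        # a real deployment would load the multimodal LLM onto this replica's GPU
        self.model = lambda prompt: prompt.upper()  # stand-in "model"

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        return {"completion": self.model(payload["prompt"])}

app = Generator.bind()
# Launch with: serve run <your_module>:app   (routes HTTP POSTs to __call__)
```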
Tech: Kubernetes, Docker, PyTorch, XGBoost, PySpark, Flink, Airflow, gRPC, React/Redux
Skills: ETL · Bandits/RL · Experimentation Platforms · Pricing Science · Distributed Systems
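A minimal sketch in the spirit of the Bandits/RL and Pricing Science items above; the price arms, Bernoulli conversion feedback, and Thompson-sampling policy are illustrative assumptions, not the patented or production method.

```python
# Thompson-sampling price selection sketch; price arms and Bernoulli conversion
# feedback are illustrative assumptions only.
import random

PRICES = [9.99, 12.99, 14.99]   # hypothetical price arms
alpha = [1.0] * len(PRICES)     # Beta posterior: observed conversions + 1
beta = [1.0] * len(PRICES)      # Beta posterior: observed non-conversions + 1

def choose_arm() -> int:
    # Sample a conversion rate per arm, pick the arm with the highest sampled revenue.
    sampled_revenue = [random.betavariate(alpha[i], beta[i]) * PRICES[i]
                       for i in range(len(PRICES))]
    return max(range(len(PRICES)), key=sampled_revenue.__getitem__)

def record(arm: int, converted: bool) -> None:
    # Update the chosen arm's posterior with the observed conversion outcome.
    if converted:
        alpha[arm] += 1.0
    else:
        beta[arm] += 1.0
```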
Tech: Amazon Lex, AWS Lambda, Snowflake, Amazon Forecast, AWS Step Functions, Node.js, React, Amplitude, GPT-2, T5, spaCy, EconML (ORF), gRPC
Skills: Conversational AI · Forecasting/MLOps · Causal/Elasticity Modeling · NER/IE · Data Apps & ETL
Patent: US 20240104645A1 (Pricing Optimization & Multi-Armed Bandits); additional patents pending.
Open-source contributions: NVIDIA NeMo, TorchTitan, TorchAO, Liger-Kernel, and Hugging Face.