Narasimhan Venkadeswaran

San Jose

Summary

Site Reliability Engineer/DevOps with over 15 years of experience designing, implementing, and managing distributed systems and infrastructure. Skilled in optimizing AI and machine learning infrastructure, automating processes, and improving the reliability and performance of cloud-based systems. Extensive expertise in Kubernetes, CI/CD pipelines, cloud platforms, and monitoring tools. Proven ability to bridge the gap between development and operations to drive reliability, scalability, and efficiency.

Overview

17 years of professional experience
1 Certification

Work History

Site Reliability Engineer/DevOps

Domino Data Lab, San Francisco, CA
04.2021 - 09.2024

  • Manage and optimize infrastructure for Domino AI, an enterprise MLOps platform, ensuring high reliability and performance.
  • Develop monitoring solutions for critical AI applications, leveraging tools like Prometheus, Grafana, and the ELK stack.
  • Automate deployments and configuration management using Terraform and Ansible, reducing manual effort and errors.
  • Design CI/CD pipelines for continuous training and deployment of machine learning models using Jenkins and Argo CD.
  • Implement cost-saving strategies on Google Cloud for Vertex AI by automating resource optimization for idle models and data sources.
  • Serve as Embedded SRE within multiple teams, including Infrastructure, IdSM, and Domino Cloud, enhancing collaboration and system efficiency.
  • Maintain security compliance by implementing best practices in networking and data protection.

Principal Systems Engineer/Production Engineer

Yahoo
12.2007 - 04.2021
  • Led Kubernetes-based API middle-tier systems, serving 50,000 RPS with 99.99% uptime.
  • Implemented canary deployments and automated ingress routing for robust, performant deployments.
  • Designed and scaled a 30TB Cassandra backend across five data centers, achieving sub-500ms latency and supporting 100,000 RPS for Yahoo's Media Content Platform.
  • Spearheaded Big Data infrastructure automation with Terraform and Kubernetes, optimizing batch and real-time analytics pipelines for 5+ million daily events.
  • Conducted log analysis with Splunk and ELK, reducing incident resolution time by 40%.
  • Collaborated across teams to implement A/B testing frameworks for traffic delivery and large-scale deployment paradigms.
  • Improved security and compliance by automating CVE detection and integrating OWASP best practices into CI/CD pipelines.

Education

Master of Science - Computer Science

College of Engineering Guindy
Chennai, India
04-2006

Bachelor of Science - Information Technology

MKU
Madurai, India
04-2004

Skills

  • Cloud Platforms: AWS, Azure, GCP
  • Infrastructure as Code: Terraform, Ansible, CloudFormation
  • Containerization & Orchestration: Kubernetes, Docker, Helm, Istio, EKS, AKS, GKE
  • CI/CD: Jenkins, Argo CD, CircleCI, GitLab CI/CD
  • Monitoring & Observability: Prometheus, Grafana, Splunk, ELK, New Relic
  • Logging & Analysis: Splunk, Fluentd, Logstash, Kibana
  • Traffic & Load Balancing: NGINX, Apache, Load Balancers (AWS, Azure, GCP)
  • Programming & Scripting: Python, Go, Shell, Java
  • Databases: Cassandra, MySQL, PostgreSQL, Redis, Memcached, Apache Druid, MongoDB
  • Streaming: Apache Storm, Apache Spark
  • Message Queues: Apache Kafka, Apache Pulsar, RabbitMQ

Certification

CKA: Certified Kubernetes Administrator
CKAD: Certified Kubernetes Application Developer
HashiCorp Certified: Terraform Associate
Microsoft Azure Fundamentals (AZ-900)
AWS Certified Cloud Practitioner
MySQL Database Administrator
LinkedIn certifications in Kubernetes and Big Data

Contact

San Jose, CA 95129

Additional Achievements/Interests

Big Data Pipelines
  • Built and maintained batch and real-time pipelines in cloud and on-prem environments.
Achievements
  • Migrated cloud-native applications with zero downtime while enhancing security.
  • Improved incident resolution times by 40% with enhanced monitoring and alerting systems.
  • Designed and deployed CI/CD pipelines that increased deployment frequency by 30%.
Operational Revamp of Critical Data Pipeline
  • Reduced ingestion latency from 1.5 s to 500 ms at the 95th percentile at a write throughput of 500k QPS.
Interests
  • AI and Machine Learning Infrastructure
  • Cloud Optimization
  • Distributed Systems Design
