[Remote] Site Reliability Engineer

Work from home | Full-time role | Hiring

Note: This is a remote job open to candidates in the USA. Runpod is a rapidly growing company that provides a foundational platform for developers to build and run custom AI systems. As a Site Reliability Engineer, you will ensure the stability and resilience of Runpod’s distributed platform by partnering with engineering teams, improving system design, and enhancing observability to prevent incidents.

Responsibilities

  • Define and implement SLIs/SLOs for critical services
  • Lead incident response and coordinate cross-team mitigation efforts
  • Conduct blameless postmortems and ensure corrective actions are completed
  • Perform production readiness reviews for new services and features
  • Identify systemic risks and drive preventative improvements
  • Design and improve monitoring, alerting, and dashboards (Prometheus, Grafana, etc.)
  • Improve signal-to-noise ratio in alerts and reduce alert fatigue
  • Build internal tooling for reliability tracking and reporting
  • Improve visibility into GPU performance and distributed systems health
  • Automate recurring operational workflows
  • Build tools and scripts (Python, Go, Bash) to eliminate manual processes
  • Improve deployment safety through automation and guardrails
  • Strengthen CI/CD reliability and release processes
  • Partner with engineering teams to improve system resilience
  • Provide guidance on fault tolerance, scalability, and failure handling
  • Contribute to architectural discussions with a reliability-first mindset

Skills

  • 5+ years of experience in SRE, Reliability Engineering, or Production Engineering
  • Strong Linux systems and networking expertise
  • Experience managing containerized production systems
  • Strong understanding of distributed systems and failure modes
  • Experience defining and managing SLIs/SLOs
  • Proven incident response and postmortem leadership experience
  • Strong scripting or programming skills
  • Experience with monitoring and alerting systems
  • Excellent written communication skills
  • Successful completion of a background check
  • Experience with GPU infrastructure or AI/ML platforms
  • Experience improving reliability in high-growth or large-scale environments
  • Familiarity with GPU observability tooling
  • Experience with Infrastructure as Code
  • Experience working in startup environments
  • Experience building internal reliability platforms or frameworks

Benefits

  • Meaningful equity in a fast-growing company: everyone on the team receives stock options, so your impact drives our growth and you share in the upside.
  • Generous medical, dental & vision plans
  • Flexible PTO: take the time you need to recharge
  • Most roles are remote-first, with inclusive, collaborative teams that use Slack as the main form of internal communication
  • Join a passionate team on the cutting edge of AI infrastructure — where culture, learning, and ownership are at the heart of how we scale.

Company Overview

  • Runpod is a cloud platform designed for GPUs, enabling developers to deploy customized full-stack AI applications. It was founded in 2022, and is headquartered in Mount Laurel, New Jersey, USA, with a workforce of 51-200 employees. Its website is https://www.runpod.io.

Company H1B Sponsorship

  • Runpod has a track record of offering H1B sponsorships, with 4 in 2025 and 3 in 2024. Please note that this does not guarantee sponsorship for this specific role.