ALL CAPS, NO SPACES BETWEEN UNDERSCORES: PTN_US_GBAMSREQID_
Candidate BeelineID e.g. PTN_US_9999999_SKIPJOHNSON0413
MSP Owner: Thomas Hodges
Targeted - -
REQUIREMENT_CITY - Alpharetta, GA - Must work from the office 3 days a week - Interview process will include a face-to-face round
REQUIREMENT_ID - 10780881
Role Name - AI SRE
ROLE_DESCRIPTION -
Skill Set - Expertise in UNIX/Linux administration, AWS/Azure cloud monitoring, Terraform/Ansible, and Prometheus/Grafana observability.
Work Location - Alpharetta
Experience required for role - 6+ years
• Production experience in SRE / Infrastructure / ops for large-scale systems
• Strong programming/scripting skills (Python, Go, Java, or equivalent)
• Deep experience with containerization (Docker) and orchestration (Kubernetes, etc.)
• Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
• Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
• Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
• Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
• Solid experience in capacity planning, performance tuning, scaling, and incident response
• Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
• Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
• Excellent communication, documentation, and cross-team collaboration skills
• Proven track record of reducing operational toil via automation
Experience: 6+ years as a Site Reliability Engineer or in a similar role, with hands-on experience supporting IaaS platforms and with networking and systems engineering knowledge.
• Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
• Design and build automation for core platform capabilities, reducing manual toil
• Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
• Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
• Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
• Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting
• Optimize cost vs. performance tradeoffs in large-scale compute environments
• Harden systems for security, compliance, auditability, and data governance
• Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
• Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms
• Maintain runbooks, operational playbooks, documentation, and training materials
• Participate in on-call rotations and respond to production incidents 24/7 as needed
• Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability
Skills: Python, Docker, Kubernetes, Site Reliability Engineering (SRE)
Experience Required: 6-8 years, Project Code :