SRE specialized in High Performance Computing/AI

Scaleway 🌐

SRE - High Performance Computing 💻

About the job

  • Join our new team specialized in HPC (High Performance Computing)
  • Deploy and maintain multiple HPC clusters composed of Nvidia hardware
  • Collaborate with Engineering teams to troubleshoot high-impact issues
  • Ensure a high quality of service for customers using observability and monitoring technologies
  • Manage the life cycle of HPC clusters in production and escalate hardware and software issues to suppliers
  • Empower teammates to integrate and deploy software components across systems
  • Implement best practices for stability, resiliency, scalability, security, and performance

Minimum qualifications

  • Experience in system programming using Python, Bash, Go, etc.
  • Demonstrated ability to troubleshoot production system failures
  • Positive mindset and desire to work with a team
  • Passion for automation and incremental improvements on tooling
  • Experience with Linux systems based on Debian and CentOS derivatives
  • Experience with batch job schedulers like Slurm, OAR, SGE
  • Good understanding of computer networks (TCP/IP, DNS, load balancing, IPv6, firewalls, InfiniBand, VLANs/partitions, etc.)
  • Storage knowledge (large pools, NAS, S3, etc.)
  • Experience with Nvidia, CUDA, MPI
  • Good command of English

Preferred qualifications

  • Ability to meticulously identify and solve any kind of bug in any codebase
  • Experience with infrastructure-as-code and continuous deployment
  • Experience dealing with physical hardware automation
  • Experience with monitoring and logging systems
  • Experience handling account management (LDAP)
  • Knowledge of at least one cloud platform and related use-cases
  • Experience as an OSS contributor and/or maintainer
  • Knowledge of AI / LLM / ML / neural networks

Responsibilities

  • Create or optimize tools & documentation to identify, diagnose, and solve production incidents
  • Troubleshoot high-impact issues by working with multiple Engineering teams
  • Take on-call responsibilities, mitigate issues encountered in production, and answer customers in real time
  • Ensure a high quality of service for customers by leveraging observability and monitoring technologies
  • Manage the life cycle of HPC clusters in production and take part in the escalation of hardware and software issues to suppliers
  • Empower teammates to swiftly integrate and deploy software components across systems
  • Help implement best practices for stability, resiliency, scalability, security, and performance across systems

Technical Stack

  • Python/Bash
  • MySQL
  • S3 API, Lustre, NAS
  • Sentry, Prometheus, Grafana, Elasticsearch, Fluentd, Kibana
  • Ansible, Salt
  • GitLab, Nexus
  • Ubuntu, Debian, CentOS
  • Nvidia hardware and software
  • MPI, Module, AI software
  • Slurm
  • K8s
  • Jira, Confluence, Slack, GSuite

Location

  • Paris or Lille (France)

Recruitment Process

  • Screening call (30 mins) with the recruiter
  • Manager Interview (45 mins)
  • Technical Interviews (1h 30 mins)
  • HR Interview (45 mins)
  • Offer sent within 48 hours

Don't meet all the requirements? Apply anyway!
