Cette offre de poste n'est plus disponible
ScalewayPublié il y a environ 1 mois
Logo Scaleway

SRE specialized in High Performance Computing/AI

Scaleway 🌐

SRE - High Performance Computing 💻

About the job

  • Join our new team specialized in HPC (High Performance Computing)
  • Deploy and maintain multiple HPC clusters composed of Nvidia hardware
  • Collaborate with Engineering teams to troubleshoot high-impact issues
  • Ensure a high quality of service for customers using observability and monitoring technologies
  • Manage the life cycle of HPC clusters in production and escalate hardware and software issues to suppliers
  • Empower teammates to integrate and deploy software components across systems
  • Implement best stability, resiliency, scalability, security, and performance practices

Minimum qualifications

  • Experience in system programming using Python, Bash, Go, etc.
  • Demonstrated ability to troubleshoot production system failures
  • Positive mindset and desire to work with a team
  • Passion for automation and incremental improvements on tooling
  • Experience with Linux systems based on Debian and Centos derivatives
  • Experience with batch job schedulers like Slurm, OAR, SGE
  • Good understanding of computer networks (TCP/IP, DNS, load balancing, IPv6, firewall, network, Infiniband, vlan/partition, etc.)
  • Storage knowledge (large pools, NAS, S3, etc.)
  • Experience with Nvidia, Cuda, MPI
  • Good command of English

Preferred qualifications

  • Ability to meticulously identify and solve any kind of bug in any codebase
  • Experience with infrastructure-as-code and continuous deployment
  • Experience dealing with physical hardware automation
  • Experience monitoring & logging systems
  • Experience handling account management (LDAP)
  • Knowledge of at least one cloud platform and related use-cases
  • Experience as an OSS contributor and/or maintainer
  • Knowledge in AI / LLM / ML / neuronal networks

Responsibilities

  • Create or optimize tools & documentation to identify, diagnose, and solve production incidents
  • Troubleshoot high-impact issues by working with multiple Engineering teams
  • Take on-call responsibilities, mitigate issues encountered in production, and answer customers in real time
  • Ensure a high quality of service for customers by leveraging observability and monitoring technologies
  • Manage the life cycle of HPC clusters in production and take part in the escalation of hardware and software issues to suppliers
  • Empower teammates to swiftly integrate and deploy software components across systems
  • Help implement best stability, resiliency, scalability, security, and performance practices across systems

Technical Stack

  • Python/Bash
  • MySQL
  • S3 API, Lustre, NAS
  • Sentry, Prometheus, Grafana, ElasticSearch, Fluentd, Kibana
  • Ansible, Salt
  • GitLab, Nexus
  • Ubuntu, Debian, CentOS
  • Nvidia hardware and software
  • MPI, Module, AI software
  • Slurm
  • K8s
  • Jira, Confluence, Slack, GSuite

Location

  • Paris or Lille (France)

Recruitment Process

  • Screening call (30 mins) with the recruiter
  • Manager Interview (45 mins)
  • Technical Interviews (1h 30 mins)
  • HR Interview (45 mins)
  • Offer sent within 48 hours

Don't meet all the requirements? Apply anyway!

Scaleway 🌐 | Scaleway Blog | Scaleway on X

Skills

Gestion de projet
Confluence
Jira
Management
Slack
Tooling
Bash
Gitlab
Sentry
Data
Elasticsearch
Grafana
Mysql
Back-end
Ops
Ansible
Sécurité
Firewall
Cloud
Prometheus