Scaleway 🌐
SRE - High Performance Computing 💻
About the job
- Join our new team specialized in HPC (High Performance Computing)
- Deploy and maintain multiple HPC clusters composed of Nvidia hardware
- Collaborate with Engineering teams to troubleshoot high-impact issues
- Ensure a high quality of service for customers using observability and monitoring technologies
- Manage the life cycle of HPC clusters in production and escalate hardware and software issues to suppliers
- Empower teammates to integrate and deploy software components across systems
- Implement best stability, resiliency, scalability, security, and performance practices
Minimum qualifications
- Experience in system programming using Python, Bash, Go, etc.
- Demonstrated ability to troubleshoot production system failures
- Positive mindset and desire to work with a team
- Passion for automation and incremental improvements on tooling
- Experience with Linux systems based on Debian and Centos derivatives
- Experience with batch job schedulers like Slurm, OAR, SGE
- Good understanding of computer networks (TCP/IP, DNS, load balancing, IPv6, firewall, network, Infiniband, vlan/partition, etc.)
- Storage knowledge (large pools, NAS, S3, etc.)
- Experience with Nvidia, Cuda, MPI
- Good command of English
Preferred qualifications
- Ability to meticulously identify and solve any kind of bug in any codebase
- Experience with infrastructure-as-code and continuous deployment
- Experience dealing with physical hardware automation
- Experience monitoring & logging systems
- Experience handling account management (LDAP)
- Knowledge of at least one cloud platform and related use-cases
- Experience as an OSS contributor and/or maintainer
- Knowledge in AI / LLM / ML / neuronal networks
Responsibilities
- Create or optimize tools & documentation to identify, diagnose, and solve production incidents
- Troubleshoot high-impact issues by working with multiple Engineering teams
- Take on-call responsibilities, mitigate issues encountered in production, and answer customers in real time
- Ensure a high quality of service for customers by leveraging observability and monitoring technologies
- Manage the life cycle of HPC clusters in production and take part in the escalation of hardware and software issues to suppliers
- Empower teammates to swiftly integrate and deploy software components across systems
- Help implement best stability, resiliency, scalability, security, and performance practices across systems
Technical Stack
- Python/Bash
- MySQL
- S3 API, Lustre, NAS
- Sentry, Prometheus, Grafana, ElasticSearch, Fluentd, Kibana
- Ansible, Salt
- GitLab, Nexus
- Ubuntu, Debian, CentOS
- Nvidia hardware and software
- MPI, Module, AI software
- Slurm
- K8s
- Jira, Confluence, Slack, GSuite
Location
Recruitment Process
- Screening call (30 mins) with the recruiter
- Manager Interview (45 mins)
- Technical Interviews (1h 30 mins)
- HR Interview (45 mins)
- Offer sent within 48 hours
Don't meet all the requirements? Apply anyway!
Scaleway 🌐 | Scaleway Blog | Scaleway on X