
This job posting is no longer available

Scaleway, published about 1 month ago

Paris, France; Lille, France

SRE specialized in High Performance Computing/AI

> 3 years of experience

Permanent contract (CDI)

Site Reliability Engineer (SRE)

Scaleway 🌐

SRE - High Performance Computing 💻

About the job

Join our new team specialized in HPC (High Performance Computing)
Deploy and maintain multiple HPC clusters composed of Nvidia hardware
Collaborate with Engineering teams to troubleshoot high-impact issues
Ensure a high quality of service for customers using observability and monitoring technologies
Manage the life cycle of HPC clusters in production and escalate hardware and software issues to suppliers
Empower teammates to integrate and deploy software components across systems
Implement best stability, resiliency, scalability, security, and performance practices

Minimum qualifications

Experience in system programming using Python, Bash, Go, etc.
Demonstrated ability to troubleshoot production system failures
Positive mindset and desire to work with a team
Passion for automation and incremental improvements on tooling
Experience with Linux systems based on Debian and CentOS derivatives
Experience with batch job schedulers like Slurm, OAR, SGE
Good understanding of computer networks (TCP/IP, DNS, load balancing, IPv6, firewalls, InfiniBand, VLANs/partitions, etc.)
Storage knowledge (large pools, NAS, S3, etc.)
Experience with Nvidia, CUDA, MPI
Good command of English

Preferred qualifications

Ability to meticulously identify and solve any kind of bug in any codebase
Experience with infrastructure-as-code and continuous deployment
Experience dealing with physical hardware automation
Experience with monitoring and logging systems
Experience handling account management (LDAP)
Knowledge of at least one cloud platform and related use cases
Experience as an OSS contributor and/or maintainer
Knowledge of AI / LLM / ML / neural networks

Responsibilities

Create or optimize tools & documentation to identify, diagnose, and solve production incidents
Troubleshoot high-impact issues by working with multiple Engineering teams
Take on-call responsibilities, mitigate issues encountered in production, and answer customers in real time
Ensure a high quality of service for customers by leveraging observability and monitoring technologies
Manage the life cycle of HPC clusters in production and take part in the escalation of hardware and software issues to suppliers
Empower teammates to swiftly integrate and deploy software components across systems
Help implement best stability, resiliency, scalability, security, and performance practices across systems

Technical Stack

Python/Bash
MySQL
S3 API, Lustre, NAS
Sentry, Prometheus, Grafana, Elasticsearch, Fluentd, Kibana
Ansible, Salt
GitLab, Nexus
Ubuntu, Debian, CentOS
Nvidia hardware and software
MPI, Module, AI software
Slurm
K8s
Jira, Confluence, Slack, GSuite

Location

Paris or Lille (France)

Recruitment Process

Screening call (30 mins) with the recruiter
Manager Interview (45 mins)
Technical Interviews (1h 30 mins)
HR Interview (45 mins)
Offer sent within 48 hours

Don't meet all the requirements? Apply anyway!



Skills

Project management: Confluence, Jira
Management: Slack
Tooling: Bash, GitLab, Sentry
Data: Elasticsearch, Grafana, MySQL
Back-end: Go
Ops: Ansible
Security: Firewall
Cloud: Prometheus
