Job Description

Voleon is a technology company that applies state-of-the-art machine learning techniques to real-world problems in finance. For more than a decade, we have led our industry and worked at the frontier of applying machine learning to investment management. We have become a multibillion-dollar asset manager, and we have ambitious goals for the future.

As a Senior Cluster Site Reliability Engineer (SRE), you will help scale our research compute cluster to meet our growing needs, and you will apply engineering skills to ensure high uptime, reliability, and robustness. Our research clusters are at the core of our R&D, and you will be directly responsible for keeping this key resource available and performant. Your work will provide a world-class HPC platform that lets researchers focus on cutting-edge machine learning problems at scale. You will support both on-prem and cloud infrastructure, and work to provide the best experience to our technical staff. You will use infrastructure-as-code (IaC), automation, and SRE principles to refine a product that operates 24/7 to support Voleon.

The Cluster Operations team works on the frontline to triage and mitigate real-time operational issues. You will be an integral member of this team, solving day-to-day issues with high urgency, while also engineering systemic improvements and architectural fixes to prevent recurring issues. You will collaborate with engineering teams to develop improvements to monitoring/telemetry. You will help design and oversee operational frameworks to ensure the cluster operates within a set of rigorous SLAs.

Responsibilities

Be a first responder in the event of cluster outages or issues. Triage and resolve urgent issues as they arise.
Ensure a high degree of cluster uptime (measured in multiple nines), and define and track SLAs to quantify reliability.
Diagnose systemic/recurring patterns of problems, and engineer precision solutions to them in collaboration with engineering teams.
Develop robust metrics and observability for cluster health and use those metrics to inform your work. Build out custom observability mechanisms when off-the-shelf ones won't do.
Help software and research teams design policies around fair cluster usage, and help develop enforcement mechanisms for said policies.
Assist in forecasting cluster growth, and help select appropriate scale-up strategies. Help optimize operations across dimensions of cost and usability.

Requirements

5+ years of experience in SRE or DevOps roles, preferably working as a senior engineer or tech lead.
Knowledge of HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or machine learning training systems (Kubeflow, MLflow, Horovod).
Ability to develop scripts and utilities of moderate complexity in a common scripting language (Python, Ruby, etc.).
Familiarity with infrastructure-as-code and configuration management tools (Terraform, Ansible).
Experience with cloud infrastructure (AWS or GCP).
Familiarity with designing and implementing modern observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry).
Experience with distributed storage technologies (Lustre, Ceph, S3).
Embodies a "system engineer" rather than "system administrator" mindset, thinking systematically and leveraging automation.
Bachelor's degree in computer science or equivalent experience.

Preferred Qualifications

Hands-on experience with HPC frameworks (Slurm, Grid Engine) and Kubernetes-based job orchestrators (Airflow, Kueue, Kubeflow Pipelines), along with other distributed computing frameworks (Ray, Modin, Dask, Spark).
Familiarity with ML frameworks (PyTorch/TensorFlow, JAX, Horovod, DeepSpeed).
Familiarity with hybrid/on-prem environments.
Experience with containerization (Docker, Podman, Singularity), particularly for HPC/batch compute environments.
Experience with HPC networking (InfiniBand, RDMA).
Solid security/IAM foundations (Identity management systems, AWS/GCP IAM, Zero Trust).

Job Tags

Remote job, Full time

Senior Cluster Site Reliability Engineer Job at Voleon, Remote