In our AI Lab, we merge the stability of a bank with the dynamism of a startup. Our mission is to build groundbreaking AI products from scratch. We're looking for a Senior Platform Engineer to architect and build the high-availability, scalable platform that will power our entire AI operation.
Our platform will be built on a multi-region Azure foundation (AKS + Cosmos DB + Event Hubs). We are just starting to build our Platform team, and you will be a founding member. You won't just be operating a platform; you will be building it from the ground up: from the Terraform code for our AKS clusters to the CI/CD pipelines for our models. This is a hands-on role focused on engineering & automation. We work according to SRE best practices with the goal of creating a platform that will achieve 99.9%+ availability.
What You'll Do
- Build the Platform from Scratch:
  - Code new AKS clusters, networking (VNet), and IAM guardrails using Terraform and Helm charts.
  - Create "golden" Docker images, GitOps pipelines (ArgoCD/Flux), automatic node provisioning, and scaling policies for both CPU and GPU workloads.
  - Design and implement the core MLOps infrastructure, including artifact repositories, model registries, and feature stores.
- Automate for Reliability:
  - Implement and fine-tune our observability stack: Azure Monitor metrics, Prometheus, Grafana dashboards.
  - Build automated recovery mechanisms and chaos engineering tests to proactively find and fix weaknesses in the system.
- Champion Platform Best Practices:
  - Work with development teams to ensure they are building reliable, observable, and secure applications from day one.
  - Create runbooks and documentation to prepare for future incident management.
Key Responsibilities
- IaC Development and Maintenance: Manage our infrastructure state with Terraform Cloud or Atlantis.
- Kubernetes Operations: Handle version upgrades, manage node pools (including GPU nodes), and define network policies.
- Data Environment Reliability: Ensure the reliability of our data stores (e.g., Cosmos DB geo-replication, Event Hubs consumer group management).
- Security Hardening: Implement security best practices, including CVE scanning for Docker images and regular patching of AKS node images.
- Observability Pipeline: Manage log processing, alerting rules, and capacity forecasting to stay ahead of problems.
- Support AI Engineers: Provide a self-service platform and tooling that enables AI Engineers to train, deploy, and monitor their models with minimal friction.
What You'll Bring
- 5+ years of experience in a DevOps, SRE, or Platform Engineering role.
- Deep, hands-on experience with at least one major cloud provider (Azure is a strong plus).
- Proven experience with containerization (Docker) and orchestration (Kubernetes) in a production environment.
- Expertise in Infrastructure as Code (Terraform is a must).
- Strong programming skills in a scripting language (Python is a strong plus).
- Experience building and maintaining production-grade CI/CD systems.
- A proactive mindset focused on preventing incidents rather than just reacting to them.
What We Offer
- A Green-field Opportunity: You will be building a state-of-the-art AI platform from the ground up, using the best tools for the job.
- A Modern Toolkit: Work with GitHub, Kubernetes, Managed Grafana, Terraform, and the latest Azure AI services.
- Real Impact: Your work is the foundation upon which our entire AI strategy is built. You are a critical enabler for the entire team.
- Focus on Engineering, Not Firefighting: In the initial phase, your role is 100% focused on building and automating, not on reactive, on-call firefighting.
- A Laid-back, Senior Team: We have one daily stand-up, then we focus on deep work.
- Competitive Salary.
- Home-office friendly, with a cool HQ in Budapest.
This is NOT the job for you if
- You are looking for a role that is primarily about maintaining existing systems. We are building from scratch.
- You enjoy manual configuration and doing the same task twice.
- You are not passionate about building secure, reliable, and highly automated systems.
Basic Job Information
- Field: Development
- Language skills: No language skills required
- Work schedule: Hybrid work