
How Uber Built Odin to Handle 3.8 Million Containers
ByteByteGo
Mar 4

Disclaimer: The details in this post have been derived from Uber Engineering
Blog and other sources. All credit for the technical details goes to the Uber
engineering team. The links to the original articles are present in the
references section at the end of the post. We’ve attempted to analyze the
details and provide our input about them. If you find any inaccuracies or
omissions, please leave a comment, and we will do our best to fix them.

In the early days, the engineers at Uber had to take care of databases and
storage systems manually. Whenever they needed to set up, update, or fix
something, they followed written instructions called "runbooks". These runbooks served as step-by-step guides.

As Uber grew, this manual process became overwhelming. They had thousands of databases spread globally, and managing them by hand was a slow and difficult process prone to mistakes.

To solve this, Uber created Odin, an automated system that manages all
these databases and storage clusters without human intervention. Unlike
older systems that work with only specific types of databases, Odin is
technology-agnostic, meaning it can handle many different databases and
storage systems seamlessly.

Odin helps Uber’s engineers organize, scale, and maintain their storage
infrastructure, ensuring everything runs smoothly and reliably. In this article,
we will look at a comprehensive breakdown of Odin and the challenges Uber
faced while developing it.

The Scale of Odin


Since 2014, Uber’s data infrastructure has expanded at an unprecedented
scale.

What started with a few hundred hosts has now evolved into a massive fleet
of over 100,000 hosts, supporting a huge number of stateful workloads.

These workloads are essential for Uber’s real-time services, including ride-hailing, food delivery, and payment processing, all of which depend on highly available and scalable storage solutions.

By one estimate, Uber’s fleet collectively manages multiple exbibytes of storage. To put this into perspective, 1 exbibyte equals roughly 1,152,921.5 terabytes (TB), which places Uber’s storage firmly in exabyte-scale territory.

This level of storage capacity is necessary because Uber generates massive volumes of real-time data, such as:

Ride and trip history

GPS tracking logs

User profiles and payment transactions

Machine learning models

Operational logs

Odin allows Uber to manage this enormous ecosystem through automation, self-healing mechanisms, and efficient resource scheduling. It is responsible for handling:

300,000 workloads, where each workload represents a collection of processes running on a machine, similar to Kubernetes pods.

3.8 million individual containers, which means each workload can have
multiple containers running different components of Uber’s stateful
services.

In essence, Odin optimizes how storage is allocated and accessed to ensure fast read/write performance while preventing unnecessary duplication and inefficiency. Odin is also technology-agnostic, meaning it doesn’t just manage a single type of database or storage system but instead integrates with 23 different storage technologies, including:

Traditional Online Databases such as MySQL (relational) and Cassandra (distributed NoSQL database)

Big Data and Streaming Platforms such as HDFS, Kafka, and Presto

Resource Scheduling and Workflow Management such as Yarn and
Buildkite

Each of these technologies serves a specific purpose within Uber’s ecosystem, and Odin ensures they can all operate efficiently within the same unified infrastructure. For example:

MySQL and Cassandra require high availability and read/write consistency, so Odin ensures replicas are correctly placed and synchronized.

Kafka requires high-throughput storage and low-latency access, so Odin manages partition distribution across nodes.

HDFS and Presto require large-scale batch processing, so Odin makes sure that storage resources are efficiently utilized.

Odin’s Architecture and Design Principles


Unlike traditional imperative systems that require explicit commands to
perform tasks, Odin follows a declarative model where engineers define what
the system should look like (goal state) rather than how to achieve it.

To maintain the goal state, Odin employs self-healing remediation loops that
continuously monitor the system state and detect deviations. If any deviation
is found, it takes corrective actions automatically without human intervention.

The diagram below shows the steps within Odin’s remediation loop.

For example, an engineer defines that a database should always have three
replicas. Odin automatically ensures this is always the case. If a replica fails
or a node crashes, Odin self-heals by spinning up a new instance to meet the
goal state.

This design is similar to Kubernetes but optimized for stateful workloads such
as databases and large-scale storage systems.
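As a concrete illustration of the declarative model, here is a minimal sketch in Go of what a goal-state declaration might look like. The type and field names are our own illustrative assumptions; Odin’s actual schema is not public.

```go
package main

import "fmt"

// GoalState is an illustrative schema (not Odin's actual one) for declaring
// what a storage cluster should look like, rather than how to get there.
type GoalState struct {
	Cluster    string
	Technology string // e.g. "mysql" or "cassandra"
	Replicas   int
	Version    string
}

func main() {
	// The engineer declares only the desired end state; Odin's remediation
	// loops are responsible for converging the fleet onto it.
	goal := GoalState{
		Cluster:    "payments-db", // hypothetical cluster name
		Technology: "mysql",
		Replicas:   3,
		Version:    "8.0",
	}
	fmt.Printf("goal state: %+v\n", goal)
}
```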

The diagram below shows a high-level view of Odin’s architecture.

The key architectural components of Odin are as follows:

1 - Grail: The Data Integration Platform


Odin’s intelligence depends on having an accurate and up-to-date view of
Uber’s global infrastructure. This is handled by Grail.

Grail provides a real-time snapshot of all hosts, containers, and workloads across Uber’s infrastructure. It works similarly to the Kubernetes API Server but at a much larger scale. It allows engineers and remediation loops to query global system state instantly, enabling informed scheduling and decision-making.

The key advantages of Grail are as follows:

Operates across tens of thousands of hosts in multiple data centers and cloud regions.

Works with all storage technologies managed by Odin.

Unlike traditional database monitoring tools that operate per data center, Grail aggregates data across all Uber locations.
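To give a rough feel for how a remediation loop might consult Grail, here is a small Go sketch. The `GrailClient` interface and the in-memory `fakeGrail` below are purely hypothetical stand-ins; Grail’s real query API is internal to Uber and not documented publicly.

```go
package main

import (
	"context"
	"fmt"
)

// HostSnapshot is an assumed shape for per-host data returned by Grail.
type HostSnapshot struct {
	Hostname   string
	Datacenter string
	FreeDiskGB int
}

// GrailClient is a hypothetical interface standing in for Grail's real
// (internal) query API over the global infrastructure state.
type GrailClient interface {
	HostsWithFreeDisk(ctx context.Context, minFreeGB int) ([]HostSnapshot, error)
}

// fakeGrail is an in-memory stand-in used only to make the sketch runnable.
type fakeGrail struct{ hosts []HostSnapshot }

func (f fakeGrail) HostsWithFreeDisk(_ context.Context, minFreeGB int) ([]HostSnapshot, error) {
	var out []HostSnapshot
	for _, h := range f.hosts {
		if h.FreeDiskGB >= minFreeGB {
			out = append(out, h)
		}
	}
	return out, nil
}

func main() {
	var grail GrailClient = fakeGrail{hosts: []HostSnapshot{
		{Hostname: "host-a", Datacenter: "dca", FreeDiskGB: 900},
		{Hostname: "host-b", Datacenter: "phx", FreeDiskGB: 120},
	}}

	// A remediation loop asks one question against the global view,
	// regardless of which datacenter or cloud region a host lives in.
	candidates, err := grail.HostsWithFreeDisk(context.Background(), 500)
	if err != nil {
		panic(err)
	}
	for _, h := range candidates {
		fmt.Println("placement candidate:", h.Hostname, "in", h.Datacenter)
	}
}
```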

2 - Remediation Loops

At the core of Odin’s automation capabilities are remediation loops.

Each remediation loop is a separate microservice. This allows independent
development and deployment without affecting other parts of the system. It
follows a four-step cycle:

Inspect the goal state (what the system should look like).

Collect the actual state (what the system currently looks like).

Identify discrepancies between the goal and the actual state.

Trigger corrective actions using Cadence workflows.

This process is continuous, ensuring that Odin actively maintains stability and
performance.
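The four steps map naturally onto a simple control loop. The Go sketch below shows that shape under simplifying assumptions (the function names are ours); in Odin, each loop runs as its own microservice and corrective actions are dispatched as Cadence workflows rather than called inline.

```go
package main

import (
	"fmt"
	"time"
)

// State captures just enough for the example: how many replicas a
// cluster should have versus how many are actually healthy.
type State struct{ Replicas int }

// Hypothetical hooks: in Odin the goal state comes from the control plane,
// the actual state from Grail, and actions run as Cadence workflows.
func inspectGoalState() State   { return State{Replicas: 3} }
func collectActualState() State { return State{Replicas: 2} }

func triggerCorrectiveAction(missing int) {
	fmt.Printf("starting workflow to provision %d replica(s)\n", missing)
}

func main() {
	// One remediation loop; three iterations here for demonstration only.
	for i := 0; i < 3; i++ {
		goal := inspectGoalState()                 // 1. what the system should look like
		actual := collectActualState()             // 2. what it currently looks like
		missing := goal.Replicas - actual.Replicas // 3. identify discrepancies
		if missing > 0 {
			triggerCorrectiveAction(missing) // 4. trigger corrective action
		}
		time.Sleep(100 * time.Millisecond) // the real loop runs continuously
	}
}
```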

3 - The Control Plane


Odin’s control plane is responsible for orchestrating workloads, scheduling
tasks, and managing cluster topology.

The core responsibilities of the Control Plane are as follows:

Workload Scheduling: It decides where and how workloads should be deployed based on system health, resource availability, and performance needs.

Cluster Management and Topology Decisions: Determines how databases and storage clusters should be distributed across Uber’s infrastructure.

Workflow Execution (via Cadence): Engineers interact with the control plane through Cadence workflows. These workflows define specific actions, such as upgrading a database or rescheduling workloads.
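To make “workflow execution via Cadence” more tangible, here is a minimal sketch using the open-source Cadence Go client. The workflow and activity names (`UpgradeDatabaseWorkflow`, `takeBackup`, `applyUpgrade`) are illustrative placeholders, not Odin’s real workflows, and worker setup and activity registration are omitted for brevity.

```go
// Package odinworkflows sketches how a control-plane action could be
// expressed as a Cadence workflow.
package odinworkflows

import (
	"context"
	"time"

	"go.uber.org/cadence/workflow"
)

// takeBackup and applyUpgrade are hypothetical activities; Odin's real
// activity set is internal to Uber.
func takeBackup(ctx context.Context, clusterID string) error   { return nil }
func applyUpgrade(ctx context.Context, clusterID string) error { return nil }

// UpgradeDatabaseWorkflow sketches "upgrade a database" as a sequence of
// activities: back up first, then apply the upgrade.
func UpgradeDatabaseWorkflow(ctx workflow.Context, clusterID string) error {
	ao := workflow.ActivityOptions{
		ScheduleToStartTimeout: time.Minute,
		StartToCloseTimeout:    time.Hour,
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	if err := workflow.ExecuteActivity(ctx, takeBackup, clusterID).Get(ctx, nil); err != nil {
		return err
	}
	return workflow.ExecuteActivity(ctx, applyUpgrade, clusterID).Get(ctx, nil)
}
```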

4 - Host-Level Agents

Odin’s architecture separates global control from local execution using two
host-level agents.

Odin-Agent: This runs on every host in Uber’s infrastructure and handles generic host-level tasks, such as resource allocation (CPU, memory, and disk), container lifecycle management, and disk volume and cgroup management.

Technology-Specific Worker: A containerized agent that runs inside each database or storage workload. It is tailored to the specific technology being used (for example, MySQL or Cassandra). This agent ensures that the database internals align with the goal state.

The diagram below shows the two host-level agents in more detail.

By separating general host management (Odin-Agent) from database-specific logic (the Worker), Odin balances standardization with customization across different workloads.
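One way to picture the split is as an interface boundary: the Odin-Agent deals in generic containers and resources, while each technology plugs its own logic in behind a common contract. The interface below is an assumed sketch of that contract, not Odin’s actual worker API.

```go
package main

import (
	"context"
	"fmt"
)

// TechnologyWorker is an assumed contract for the containerized,
// technology-specific agent that runs alongside each workload. The
// Odin-Agent stays generic; each storage technology would supply its
// own implementation of something like this interface.
type TechnologyWorker interface {
	// Converge compares the database internals against the goal state
	// (replication config, schema version, topology) and fixes drift.
	Converge(ctx context.Context) error
	// Healthy reports whether the workload is serving correctly.
	Healthy(ctx context.Context) (bool, error)
}

// mysqlWorker is a stub illustrating a technology-specific implementation.
type mysqlWorker struct{}

func (mysqlWorker) Converge(ctx context.Context) error        { return nil }
func (mysqlWorker) Healthy(ctx context.Context) (bool, error) { return true, nil }

func main() {
	var w TechnologyWorker = mysqlWorker{}
	ok, _ := w.Healthy(context.Background())
	fmt.Println("mysql worker healthy:", ok)
}
```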

Odin’s Key Features and Innovations


As mentioned, Odin is a stateful workload management system designed to
handle Uber’s massive-scale database and storage infrastructure.

1 - Self Healing System

Odin continuously compares the desired goal state with the actual state of the
system. If a deviation is detected, it automatically triggers corrective actions
through Cadence workflows.

No human intervention is required in this process. The system detects and fixes failures autonomously. This eliminates downtime caused by manual debugging. Moreover, Odin constantly fine-tunes itself to maintain optimal performance.

2 - High Availability and Fault Tolerance

Odin ensures high availability through dynamic workload rescheduling and failure-handling mechanisms.

60% of workloads are rescheduled each month to balance resource utilization. Workloads move between hosts dynamically to optimize performance and maintain resilience against failures. Odin makes real-time rescheduling decisions based on resource usage, fault tolerance, and availability requirements.

One of Odin’s key innovations is its make-before-break strategy. Before shutting down a workload, Odin provisions a replacement instance. The old instance is only removed once the new one is fully operational. This approach is important for stateful workloads like databases, where shutting down a node without a replacement can cause data loss, unavailability, or increased latency.
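A minimal Go sketch of the make-before-break ordering, with hypothetical helper functions standing in for Odin’s real provisioning, replication, and decommissioning steps:

```go
package main

import (
	"errors"
	"fmt"
)

// These helpers are hypothetical stand-ins for Odin's real steps.
func provisionReplacement(workload string) (string, error) {
	return workload + "-new", nil
}
func replicateAndCatchUp(oldID, newID string) error { return nil }
func isFullyOperational(id string) bool             { return true }
func decommission(id string) error                  { return nil }

// rescheduleMakeBeforeBreak never removes the old instance until the
// replacement is provisioned, caught up on data, and serving traffic.
func rescheduleMakeBeforeBreak(oldID string) error {
	newID, err := provisionReplacement(oldID)
	if err != nil {
		return err
	}
	if err := replicateAndCatchUp(oldID, newID); err != nil {
		return err
	}
	if !isFullyOperational(newID) {
		return errors.New("replacement not ready; keeping old instance")
	}
	return decommission(oldID) // "break" only after the "make" succeeded
}

func main() {
	fmt.Println(rescheduleMakeBeforeBreak("payments-db-replica-1"))
}
```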

3 - Efficient Storage Management

When Odin was introduced, containerizing databases was still a relatively uncommon and controversial practice. However, Uber fully embraced containerized stateful workloads.

An innovation Odin brought to the table was placing multiple database instances on a single host. Traditionally, databases run on dedicated machines, wasting resources. Odin enables up to 100 databases to be colocated on the same host by managing CPU, memory, and disk allocation.

Unlike traditional NAS-based database solutions, Odin relies on locally attached SSDs and HDDs. This results in lower latency, higher throughput, and reduced costs.
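The kind of bookkeeping that makes dense colocation safe can be sketched as a simple fit check across CPU, memory, and disk before another instance is placed on a host. The types and numbers below are illustrative, not Uber’s actual scheduler logic.

```go
package main

import "fmt"

// Resources is a simplified bundle of what a workload reserves on a host.
type Resources struct {
	CPUCores int
	MemGB    int
	DiskGB   int
}

// Host tracks capacity and what colocated workloads have already reserved.
type Host struct {
	Capacity Resources
	Used     Resources
}

// Fits reports whether one more instance with the given demands can be
// colocated on the host without overcommitting any dimension.
func (h Host) Fits(d Resources) bool {
	return h.Used.CPUCores+d.CPUCores <= h.Capacity.CPUCores &&
		h.Used.MemGB+d.MemGB <= h.Capacity.MemGB &&
		h.Used.DiskGB+d.DiskGB <= h.Capacity.DiskGB
}

func main() {
	host := Host{
		Capacity: Resources{CPUCores: 64, MemGB: 512, DiskGB: 8000},
		Used:     Resources{CPUCores: 40, MemGB: 300, DiskGB: 6000},
	}
	smallDB := Resources{CPUCores: 2, MemGB: 8, DiskGB: 100}
	fmt.Println("can colocate another instance:", host.Fits(smallDB))
}
```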

4 - Workload Identity and Stateful Migration

Stateful applications require persistent identity across rescheduling events. Unlike Kubernetes StatefulSets, which assign stable pod identities, Odin takes a different approach.

In this approach, workload identity is preserved across rescheduling events. Data replication is performed before workload termination. Goal-state propagation ensures that the new instance picks up exactly where the old one left off. In other words, the workload transition is seamless, with no downtime or loss of service.
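A tiny sketch of the identity idea: the workload’s identity stays fixed while its placement changes, so the replacement instance is, from the platform’s point of view, the same workload on a new host. The struct is our assumption of the shape, not Odin’s real data model.

```go
package main

import "fmt"

// Workload pairs a stable identity with a (changeable) placement.
// The identity survives rescheduling; only Host changes.
type Workload struct {
	ID   string // stable identity, preserved across rescheduling
	Host string // current placement, free to change
}

// rescheduled returns the same workload identity pinned to a new host,
// modeling how the replacement picks up where the old instance left off.
func rescheduled(w Workload, newHost string) Workload {
	return Workload{ID: w.ID, Host: newHost}
}

func main() {
	old := Workload{ID: "cassandra-orders-replica-2", Host: "host-a"}
	moved := rescheduled(old, "host-b")
	fmt.Printf("before: %+v\nafter:  %+v\n", old, moved)
}
```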

Challenges and Optimization Strategies in Odin


Managing a platform as large as Odin comes with significant engineering
challenges.

As Uber scaled, Odin had to evolve from a human-driven system to a fully automated, highly coordinated, and resilient infrastructure management platform.

Some key takeaways are as follows:

Manual workload management was no longer scalable. Fleet-wide optimizations were needed for resiliency, cost savings, upgrades, and data center migrations.

Uncoordinated operations (like workload migrations and container upgrades) could take down clusters. Consensus-based databases needed careful migration strategies.

A global coordination system was needed to solve these problems. This was done by using predefined budgets for allowable disruptions and by enforcing platform-wide concurrency constraints.
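As a rough sketch of what disruption budgets and concurrency constraints can mean mechanically, the snippet below gates disruptive operations on a simple fleet-wide counter; Odin’s actual policy engine is more sophisticated and not publicly documented.

```go
package main

import "fmt"

// DisruptionBudget caps how many instances of a cluster may be down or
// in flight (being migrated or upgraded) at the same time, fleet-wide.
type DisruptionBudget struct {
	MaxConcurrentDisruptions int
	currentDisruptions       int
}

// TryAcquire reserves one disruption slot if the budget allows it.
func (b *DisruptionBudget) TryAcquire() bool {
	if b.currentDisruptions >= b.MaxConcurrentDisruptions {
		return false // defer the operation; the cluster cannot safely lose more nodes
	}
	b.currentDisruptions++
	return true
}

// Release frees a slot once the disruptive operation has finished.
func (b *DisruptionBudget) Release() {
	if b.currentDisruptions > 0 {
		b.currentDisruptions--
	}
}

func main() {
	// e.g. a 3-node consensus-based database can tolerate one node down.
	budget := &DisruptionBudget{MaxConcurrentDisruptions: 1}
	fmt.Println("upgrade node 1:", budget.TryAcquire()) // true
	fmt.Println("upgrade node 2:", budget.TryAcquire()) // false: would break quorum
	budget.Release()
	fmt.Println("upgrade node 2 retry:", budget.TryAcquire()) // true
}
```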

Conclusion
Odin represents a significant leap in stateful workload management, enabling Uber to scale its infrastructure from hundreds to over 100,000 hosts while maintaining high availability, cost efficiency, and operational stability.

By transitioning from manual, human-driven operations to an intent-based, self-healing, and automated orchestration system, Odin has eliminated inefficiencies and minimized downtime in Uber’s mission-critical storage systems.

The key innovations behind Odin, such as self-healing remediation loops, intelligent workload rescheduling, make-before-break migrations, and platform-wide coordination mechanisms, have allowed Uber to achieve 95%+ resource utilization.

As Uber continues to grow, Odin is set to evolve further, integrating new optimizations, smarter automation, and deeper AI-driven workload management.
