Titus consists of a replicated, leader-elected scheduler called Titus Master,
which handles the placement of containers onto a large pool of EC2 virtual
machines called Titus Agents, which
manage each container’s life cycle.
ZooKeeper9 manages leader election,
and Cassandra11 persists the master’s
data. The Titus architecture is shown
in Figure 1.
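The replicated-master design depends on leader election to keep exactly one Titus Master active. A minimal, self-contained sketch of the ephemeral-sequential-node pattern that ZooKeeper-based elections typically use follows; the class and names are illustrative only, not Titus or ZooKeeper code:

```python
# Toy simulation of ZooKeeper-style leader election among master replicas.
# Each replica "creates" an ephemeral sequential node; the replica holding
# the lowest sequence number is the leader, and leadership fails over when
# that node disappears (e.g., the leader's session expires).

import itertools

class ElectionSim:
    """In-memory stand-in for a ZooKeeper election path (illustrative)."""
    def __init__(self):
        self._seq = itertools.count()
        self.nodes = {}  # candidate name -> sequence number

    def join(self, candidate):
        # Analogous to creating an EPHEMERAL_SEQUENTIAL znode.
        self.nodes[candidate] = next(self._seq)

    def leave(self, candidate):
        # Analogous to the candidate's session expiring: its znode vanishes.
        del self.nodes[candidate]

    def leader(self):
        # The candidate with the lowest sequence number leads.
        return min(self.nodes, key=self.nodes.get) if self.nodes else None

election = ElectionSim()
for replica in ("master-a", "master-b", "master-c"):
    election.join(replica)

print(election.leader())    # first joiner leads: master-a
election.leave("master-a")  # simulate the leader crashing
print(election.leader())    # leadership moves on: master-b
```

The standby replicas do no scheduling work; they simply hold their place in the sequence and take over when the current leader's node disappears.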
Work in Titus is described by a job
specification that details what to run
(for example, a container image and
entry point), metadata (for example,
the job’s purpose and who owns it),
and what resources are required to run
it, such as CPU, memory, or scheduling constraints (for example, availability zone balancing or host affinity).
Job specifications are submitted to the
master and consist of a number of tasks
that represent an individual instance
of a running application. The master
schedules tasks onto Titus agents that
launch containers based on the task’s specification.
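As a sketch, a job specification of this kind might look like the following. Every field name here is hypothetical, chosen only to illustrate the what-to-run/metadata/resources split described above; it is not the actual Titus API:

```python
# Hypothetical Titus-style job specification and its expansion into tasks.
# All field names are illustrative, not the real Titus job API.

job_spec = {
    "applicationName": "recommendations",            # metadata: purpose
    "owner": "personalization-team@example.com",     # metadata: ownership
    "container": {
        "image": "registry.example.com/recs:1.4.2",  # what to run
        "entryPoint": ["./serve", "--port=7001"],
    },
    "resources": {"cpus": 2, "memoryMB": 4096},      # required resources
    "constraints": {"balance": "availability-zone"}, # placement constraint
    "instances": 3,  # the job consists of this many tasks
}

def expand_to_tasks(spec):
    """Each task represents an individual running instance of the job."""
    return [
        {"taskId": f"{spec['applicationName']}-{i}",
         "container": spec["container"],
         "resources": spec["resources"]}
        for i in range(spec["instances"])
    ]

tasks = expand_to_tasks(job_spec)
print(len(tasks))          # 3
print(tasks[0]["taskId"])  # recommendations-0
```

The master would then place each of these tasks onto an agent that satisfies the resource requirements and constraints.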
Designing for easy container
adoption. Most Netflix microservices and batch applications are built
around parts of Netflix’s cloud infrastructure, AWS services, or both. The
Netflix cloud infrastructure consists of
a variety of systems that provide core
functionality for a Netflix application
running in the cloud. For example,
Eureka,18 a service-discovery system,
and Ribbon,21 an IPC library, provide
the mechanism that connects services.
Atlas,16 a time-series telemetry system,
and Edda,17 an indexing service for
cloud resources, provide tooling for
monitoring and analyzing services.
Many of these systems are available
as open source software.20 Similarly,
many Netflix applications use AWS
services such as S3 (Simple Storage Service) or SQS (Simple Queue Service).
To avoid requiring the applications
using these services to change in order
to adopt containers, Titus integrates
with many of the Netflix cloud and AWS
services, allowing containerized applications to access and use them easily.
Using this approach, application developers can continue to depend on these
existing systems, rather than needing
to adopt alternative, but similar, infrastructure.
Unique Netflix container challenges. In many companies, container
adoption happens when building new
greenfield applications or as part of
a larger infrastructure refactor, such
as moving to the cloud or decomposing a monolithic application into microservices. Container adoption at
Netflix differs because it is driven by
applications that are already running
on a cloud-native infrastructure. This
unique environment influenced how
we approached both the technology
we built and how we managed internal
adoption in several ways:
˲ Since applications were not already
being refactored, it was important that
they could migrate to containers without any significant changes.
˲ Since Netflix culture promotes
bottom-up decisions, there is no mandate that teams adopt containers. As
a result, we initially focused on only a
few internal users and use cases that
wanted to try containers and would see
major benefits from adoption.
˲ We expect some applications to
continue to run in VMs while others
run in containers, so it was important
to ensure seamless connectivity between them.
˲ Early container adoption use cases
included both traditional microservices and a wide variety of batch jobs.
Thus, the aim was to support both
kinds of workloads.
˲ Since applications would be moving from a stable AWS EC2 (Elastic
Compute Cloud) substrate to a new
container-management layer running
on top of EC2, providing an appropriate level of reliability was critical.
Containers in an Existing Cloud
Netflix’s unique requirements led us
to develop Titus, a container-management system aimed at Netflix’s cloud
infrastructure. The design of Titus focuses on a few key areas:
˲ Allowing existing Netflix applications to run unmodified in containers,
˲ Enabling these applications to easily use existing Netflix and AWS cloud
infrastructure and services,
˲ Scheduling batch and service jobs
on the same pool of resources, and
˲ Managing cloud capacity effectively and reliably.
Titus was built as a framework on
top of Apache Mesos,8 a cluster-management system that brokers available
resources across a fleet of machines.
Mesos enabled us to control the aspects we deemed important, such as
scheduling and container execution,
while handling details such as which
machines exist and what resources
are available. Additionally, Mesos was
already being run at large scale at several other major companies.7,12,14 Other
systems, such as Kubernetes10 and
Docker Swarm,6 which were launched
around the time Titus was developed,
provided their own ways of scheduling
and executing containers. Given the
specific requirements noted here, we felt we would end up diverging from their common capabilities quickly enough to limit their benefits.
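The offer-based division of labor described above, in which Mesos advertises per-machine resources and the framework's scheduler decides placement, can be sketched roughly as follows. This is a simplified illustration of the pattern, not the real Mesos or Titus APIs:

```python
# Simplified sketch of the Mesos resource-offer flow that a framework such
# as Titus builds on: Mesos presents offers describing free resources on
# each agent, and the framework's scheduler matches tasks against them.
# All names and structures here are illustrative.

def first_fit(offers, task_needs):
    """Accept the first offer with enough CPU and memory for the task."""
    for offer in offers:
        if (offer["cpus"] >= task_needs["cpus"]
                and offer["memMB"] >= task_needs["memMB"]):
            return offer["agent"]
    return None  # decline all offers and wait for the next round

offers = [
    {"agent": "titus-agent-1", "cpus": 1, "memMB": 2048},
    {"agent": "titus-agent-2", "cpus": 8, "memMB": 32768},
]
print(first_fit(offers, {"cpus": 4, "memMB": 8192}))  # titus-agent-2
```

A production scheduler applies far richer placement logic (availability-zone balancing, host affinity, bin packing), but the offer/accept loop itself is what Mesos handles on the framework's behalf.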
Figure 1. Titus architecture components.