because they were not good fits. These
teams either would not see significant
benefits from containers or had requirements that Titus could not easily
meet at this stage.
The early users drove our focus on
Netflix and AWS integrations, scheduling performance, and system availability, which in turn aided other early adopters. As
we improved these aspects, we began
to work on service job support. Early
service adopters included polyglot applications and those where rapid development iteration was important. These
users drove the scheduler enhancements described earlier, integrations
commonly used by services such as the
automated canary-analysis system, and
better end-to-end developer experience.
Titus currently launches around
150,000 containers daily, and its agent
pool consists of thousands of EC2 VMs
across multiple AWS regions. As usage has grown, so has the investment
in operations. This focus has improved
Titus’s reliability and scalability, and
increased the confidence that internal
teams have in it. As a result, Titus supports a continually growing variety of internal use cases. It powers services that
are part of customers’ interactive
streaming experience, batch jobs that
drive content recommendations and
purchasing decisions, and applications
that aid studio and content production.
Future Focus Areas
So far, Titus has focused on the basic
features and functionality that enable
Netflix applications to use containers.
As more use cases adopt containers
and as the scale increases, the areas
of development focus are expected to
shift. Key areas where Netflix plans to invest include:
Multi-tenancy. While current container technologies provide important
process-isolation mechanisms, they
do not completely eliminate noisy
neighbor interference. Sharing CPU resources can lead to context-switch and cache-contention overheads,28,13 and shared kernel components (for example, the Network File System kernel module) are not all container-aware. We plan to improve the isolation Titus agents provide at both the user-space and kernel levels.
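One user-space technique in this direction is giving each container a dedicated, non-overlapping set of host CPUs so tenants do not share cores. The sketch below is hypothetical (the function and names are ours, not Titus’s actual implementation); on a real agent, the resulting CPU lists would typically be applied through each container’s cpuset cgroup.

```python
# Hypothetical sketch of static CPU partitioning: each container gets a
# dedicated, disjoint set of host CPUs, one user-space way to reduce
# context-switch and cache-contention interference between tenants.

def assign_cpusets(num_host_cpus, requests):
    """Map each container to a disjoint list of CPU ids.

    requests: dict of container name -> number of CPUs requested.
    Raises ValueError if the host cannot satisfy all requests.
    """
    if sum(requests.values()) > num_host_cpus:
        raise ValueError("host is oversubscribed")
    cpusets = {}
    next_cpu = 0  # hand out CPU ids in order, never reusing one
    for name, count in requests.items():
        cpusets[name] = list(range(next_cpu, next_cpu + count))
        next_cpu += count
    return cpusets
```

For example, `assign_cpusets(8, {"svc": 4, "batch": 2})` pins the service to CPUs 0–3 and the batch job to CPUs 4–5, leaving two CPUs for system use.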
More reliable scheduling. For both batch and service applications, there are a number of advanced scheduler features that can improve their reliability and efficiency. For example,
Titus currently does not reschedule a task once it is placed. As the agent pool changes or other tasks complete, it would be better for the master to reconsider a task’s optimal placement, such as improving its balance across availability zones.
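To make the rescheduling idea concrete, here is a hypothetical sketch (the names and structure are ours, not the Titus master’s) of a periodic check that proposes moving one task when a job’s tasks have become unevenly spread across zones:

```python
from collections import Counter

def rebalance_candidate(task_zones):
    """Suggest a (task_id, from_zone, to_zone) move if relocating one
    task would narrow the gap between the most- and least-loaded zones;
    return None if the spread is already minimal.

    task_zones: dict of task id -> zone name. Zones currently holding
    no tasks are not considered in this simplified sketch.
    """
    counts = Counter(task_zones.values())
    most = max(counts, key=counts.get)
    least = min(counts, key=counts.get)
    if counts[most] - counts[least] <= 1:
        return None  # as balanced as a single move can make it
    task = next(t for t, z in task_zones.items() if z == most)
    return (task, most, least)
```

A master could run such a check as the agent pool changes or tasks complete, and feed any suggested move back through its normal placement path.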
Better resource efficiency. In addition to more densely packing EC2 VMs, Titus can improve cloud usage by more intelligently using resources.
For example, when capacity groups are allocated but not used, Titus could run preemptible, best-effort batch jobs on these idle resources and yield them to the reserved application when needed. Similarly, Netflix brokers its already purchased but idle EC2 Reserved Instances among a few internal use cases.23
Titus could make it easier for more internal teams to use these instances through a low-cost, ephemeral agent pool.
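The idle-capacity idea can be sketched as follows; this is a hypothetical model (the class and method names are ours), showing preemptible work admitted only onto unused reserved capacity and evicted when the owning application scales up:

```python
class CapacityGroup:
    """Hypothetical model of a reserved capacity group that lends its
    idle CPUs to preemptible, best-effort jobs."""

    def __init__(self, reserved_cpus):
        self.reserved = reserved_cpus
        self.used_by_owner = 0
        self.best_effort = {}  # job id -> CPUs borrowed

    def idle_cpus(self):
        return self.reserved - self.used_by_owner - sum(self.best_effort.values())

    def admit_best_effort(self, job_id, cpus):
        """Admit a preemptible job only if it fits in idle capacity."""
        if cpus > self.idle_cpus():
            return False
        self.best_effort[job_id] = cpus
        return True

    def owner_scale_up(self, cpus):
        """The reserving application takes back capacity; preempt
        best-effort jobs until what remains fits. Returns the ids of
        the preempted jobs."""
        self.used_by_owner += cpus
        preempted = []
        while self.best_effort and self.idle_cpus() < 0:
            job_id, _ = self.best_effort.popitem()
            preempted.append(job_id)
        return preempted
```

An analogous admit-and-preempt policy could also back an ephemeral agent pool built on idle Reserved Instances.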
While only a fraction of Netflix’s internal applications use Titus, we believe our approach has enabled Netflix to quickly adopt and benefit from containers. Though the details may be Netflix-specific, the approach of providing low-friction container adoption by integrating with existing infrastructure and working with the right early adopters can be a successful strategy for any organization looking to adopt containers.
Acknowledgments. We would like to
thank Amit Joshi, Corin Dwyer, Fabio
Kung, Sargun Dhillon, Tomasz Bak,
and Lorin Hochstein for their helpful
input on this article.
References
1. AWS EC2 Security Groups for Linux instances; http://
2. AWS Elastic Network Interfaces; http://docs.aws.
3. AWS Identity and Access Management; https://aws.
4. AWS Instance metadata and user data; http://docs.
5. Cloud Native Compute Foundation projects; https://
6. Docker Swarm; https://github.com/docker/swarm.
7. Harris, D. Airbnb is engineering itself into a data-driven
company. Gigaom; https://gigaom.com/2013/07/29/
8. Hindman, B. et al. Mesos: A platform for fine-grained
resource sharing in the data center. In Proceedings
of the 8th USENIX Conference on Networked Systems
Design and Implementation. (2011), 295–308.
9. Hunt, P., Konar, M., Junqueira, F.P., and Reed, B.
Zookeeper: Wait-free coordination for Internet-scale
systems. In Proceedings of the USENIX Annual
Technical Conference, June 2010.
10. Kubernetes; http://kubernetes.io.
11. Lakshman, A. and Malik, P. Cassandra: A decentralized
structured storage system. In LADIS, Oct. 2009.
12. Lester, D. All about Apache Aurora; https://blog.
13. Leverich, J. and Kozyrakis, C. Reconciling high server
utilization and sub-millisecond quality-of-service. In
Proceedings of the European Conference on Computer
Systems, 2014.
14. Mesosphere. Apple details how it rebuilt Siri on Mesos,
15. Netflix Archaius; https://github.com/Netflix/archaius.
16. Netflix Atlas; https://github.com/Netflix/atlas.
17. Netflix Edda; https://github.com/Netflix/edda.
18. Netflix Eureka; https://github.com/Netflix/eureka.
19. Netflix Fenzo; https://github.com/Netflix/Fenzo.
20. Netflix Open Source Software Center; https://netflix.
21. Netflix Ribbon; https://github.com/Netflix/ribbon.
22. Netflix Spinnaker; https://www.spinnaker.io/.
23. Park, A., Denlinger, D. and Watson, C. Creating your
own EC2 spot market. Netflix Technology Blog; http://
24. Schmaus, B., Carey, C., Joshi, N., Mahilani, N. and
Podila, S. Stream-processing with Mantis. Netflix
Technology Blog; http://techblog.netflix.com/2016/03/
25. Schwarzkopf, M., Konwinski, A., Abd-El-Malek, M. and
Wilkes, J. Omega: Flexible, scalable schedulers for large
compute clusters. In Proceedings of the 8th European
Conference on Computer Systems, 2013, 351–364.
26. Vavilapalli, V.K. et al. Apache Hadoop YARN: Yet
another resource negotiator. In Proceedings of the
4th annual Symposium on Cloud Computing, 2013,
Article No. 5.
27. Wu, S., et al. Evolution of the Netflix Data Pipeline.
Netflix Technology Blog; https://techblog.netflix.
28. Zhang, X. et al. CPI2: CPU performance isolation
for shared compute clusters. In Proceedings of the
European Conference on Computer Systems, 2013.
Andrew Leung (@anwleung) is a senior software engineer
at Netflix, where he helps design, build, and operate Titus.
Prior to Netflix, he worked at NetApp, EMC, and several
startups on distributed file and storage systems.
Andrew Spyker (@aspyker) manages the Titus development
team. His career focus has spanned functional, performance,
and scalability work on middleware and infrastructure.
Before helping with the cloud platform at Netflix, he worked
as a lead performance engineer for IBM WebSphere
software and the IBM cloud.
Tim Bozarth (@timbozarth) is a Netflix platform director
focused on enabling Netflix engineers to efficiently
develop and integrate their applications at scale. His
career has focused on building systems to optimize for
developer productivity and scalability at both Netflix
and a range of startups.
Copyright held by authors/owners.
Publication rights licensed to ACM. $15.00.