Every single day, millions of people stream their favorite movies and TV
shows on Netflix.
However, Netflix's growth has led to an explosion of time series data (data
recorded over time, like a user’s viewing history). The company relies heavily
on this data to enhance the user experience, but handling such a vast and
ever-increasing amount of information also presents technical challenges:
Every time a user watches a show or movie, new records are added to
their viewing history. This history keeps growing, making it harder to
store and retrieve efficiently.
When Netflix introduces new content and features, people spend more
time watching. With the rise of binge-watching and higher-quality
streaming (like 4K videos), the amount of viewing data per user is also
increasing.
In this article, we’ll learn how Netflix tackled these problems and improved
their storage system to handle millions of hours of viewing data every day.
Netflix chose Apache Cassandra® as the storage system for this data. Apache
Cassandra® allows for a flexible structure, where each row can store a
growing number of viewing records without performance issues.
Netflix’s system processes significantly more writes (data being stored)
than reads (data being retrieved). The ratio is approximately 9:1,
meaning for every 9 new records added, only 1 is read. Apache
Cassandra® excels in handling such workloads.
See the diagram below, which shows the data model of Apache Cassandra®
using column families.
To structure the data efficiently, Netflix designed a simple yet scalable storage
model in Cassandra®.
Each user’s viewing history was stored under their unique ID (CustomerId).
Every viewing record (such as a movie or TV show watched) was stored in a
separate column under that user’s ID. To handle millions of users, Netflix
used "horizontal partitioning," meaning data was spread across multiple
servers based on CustomerId. This ensured that no single server was
overloaded.
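To make this concrete, here is a minimal sketch of what such a wide-row table could look like, written with the DataStax Python driver. The keyspace, table, and column names are illustrative assumptions, not Netflix's actual schema. The partition key (customer_id) is what spreads users across servers, and each viewing event becomes another entry inside that user's partition.

```python
# Hypothetical sketch of a wide-row viewing-history table in Apache Cassandra,
# using the DataStax Python driver. All names are illustrative, not Netflix's schema.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # assume a locally reachable Cassandra node
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS viewing
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

session.execute("""
    CREATE TABLE IF NOT EXISTS viewing.viewing_history (
        customer_id  text,        -- partition key: spreads users across servers
        watched_at   timestamp,   -- clustering column: one entry per viewing event
        title_id     text,
        bookmark_sec int,         -- how far into the title the user got
        PRIMARY KEY (customer_id, watched_at)
    ) WITH CLUSTERING ORDER BY (watched_at DESC)
""")
```

Because everything for one user lives in a single partition, that partition simply keeps growing as the user watches more.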
The diagram below shows how the initial system handled reads and writes to
the viewing history data.
Every time a user started watching a show or movie, Netflix added a new
column to their viewing history record in the database. If the user paused or
stopped watching, that same column was updated to reflect their latest
progress.
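A small sketch of that write path, reusing the hypothetical table above: starting playback inserts a new entry, and pausing or stopping re-writes the same entry with the latest position (in Apache Cassandra®, an INSERT with the same primary key acts as an update).

```python
# Hypothetical write path: a new entry per playback, overwritten in place on pause/stop.
# Assumes the viewing.viewing_history table sketched earlier.
from datetime import datetime, timezone
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("viewing")

upsert = session.prepare("""
    INSERT INTO viewing_history (customer_id, watched_at, title_id, bookmark_sec)
    VALUES (?, ?, ?, ?)
""")

def start_playback(customer_id: str, title_id: str) -> datetime:
    """Record the start of a viewing session as a new entry in the user's row."""
    started_at = datetime.now(timezone.utc)
    session.execute(upsert, (customer_id, started_at, title_id, 0))
    return started_at

def update_progress(customer_id: str, started_at: datetime, title_id: str, bookmark_sec: int) -> None:
    """On pause/stop, write the same primary key again with the latest position.
    In Cassandra this INSERT acts as an update, so the earlier bookmark is overwritten."""
    session.execute(upsert, (customer_id, started_at, title_id, bookmark_sec))
```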
While storing data was easy, retrieving it efficiently became more challenging
as users' viewing histories grew. Netflix used three different methods to fetch
data, each with its advantages and drawbacks:
At first, this system worked well because it provided a fast and scalable way
to store viewing history. However, as more users watched more content, this
system started to hit performance limits. Some of the issues were as follows:
Apache Cassandra® periodically runs compactions to merge and reorganize its
data files on disk. As each user's row accumulated more viewing data, these
compactions took longer and required more processing power. Other
operations like read repair and full column repair also became expensive.
To solve this, Netflix redesigned its storage model by splitting viewing history
into two categories:
Live Viewing History (LiveVH): Recent viewing records that are still updated
frequently and need to be read quickly.
Compressed Viewing History (CompressedVH): Older viewing records that
rarely change, stored in compressed form to save space.
Since LiveVH and CompressedVH serve different purposes, they were tuned
differently to maximize performance.
For LiveVH, which stores recent viewing records, Netflix prioritized speed and
real-time updates. Frequent compactions were performed to clean up old
data and keep the system running efficiently. Additionally, a low GC (Garbage
Collection) grace period was set, meaning outdated records were removed
quickly to free up space. Since this data was accessed often, frequent read
repairs were implemented to maintain consistency, ensuring that users
always saw accurate and up-to-date viewing progress.
On the other hand, CompressedVH, which stores older viewing records, was
optimized for storage efficiency rather than speed. Since this data was rarely
updated, fewer compactions were needed, reducing unnecessary processing
overhead. Read repairs were also performed less frequently, as data
consistency was less critical for archival records. The most significant
optimization was compressing the stored data, which drastically reduced the
storage footprint while still making older viewing history accessible when
needed.
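The sketch below shows how that kind of per-table tuning could be expressed in CQL. The table names and option values (such as the gc_grace_seconds figure and the compression choice) are illustrative assumptions, not Netflix's real settings; the point is simply that the hot and cold tables can carry very different options even though they hold the same kind of data.

```python
# Illustrative per-table tuning; table names and option values are assumptions,
# not Netflix's real settings.
live_vh_ddl = """
CREATE TABLE IF NOT EXISTS viewing.live_vh (
    customer_id  text,
    watched_at   timestamp,
    title_id     text,
    bookmark_sec int,
    PRIMARY KEY (customer_id, watched_at)
) WITH gc_grace_seconds = 3600    -- short GC grace period (hypothetical): outdated records reclaimed quickly
"""

compressed_vh_ddl = """
CREATE TABLE IF NOT EXISTS viewing.compressed_vh (
    customer_id text,
    chunk_id    int,
    payload     blob,             -- compressed blob of older viewing records
    PRIMARY KEY (customer_id, chunk_id)
) WITH compression = {'class': 'LZ4Compressor'}   -- tuned for storage footprint
"""
```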
If CompressedVH grows too large, retrieving data becomes slow. Also, single
large files create performance bottlenecks when read or written. To avoid
these issues, Netflix introduced chunking, where large compressed data is
split into smaller parts and stored across multiple Apache Cassandra®
database nodes.
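Here is a rough Python sketch of the chunking idea; the chunk size and record layout are assumptions chosen for illustration. Older records are serialized, compressed, and split into fixed-size chunks, each of which can then be written as its own row and therefore land on a different node.

```python
import json
import zlib

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size (64 KB)

def chunk_compressed_history(customer_id: str, old_records: list[dict]) -> list[tuple]:
    """Compress a user's older viewing records and split the blob into fixed-size chunks.
    Each (customer_id, chunk_id, payload) tuple can be written as its own row, so the
    chunks can be spread across multiple nodes instead of forming one huge value."""
    blob = zlib.compress(json.dumps(old_records).encode("utf-8"))
    chunks = [blob[i:i + CHUNK_SIZE] for i in range(0, len(blob), CHUNK_SIZE)]
    return [(customer_id, chunk_id, payload) for chunk_id, payload in enumerate(chunks)]

def reassemble_history(ordered_chunks: list[bytes]) -> list[dict]:
    """Read the chunks back in chunk_id order, stitch them together, and decompress."""
    return json.loads(zlib.decompress(b"".join(ordered_chunks)).decode("utf-8"))
```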
This system gave Netflix the headroom needed to handle future growth.
At the same time, Netflix introduced video previews in the user interface (UI),
a feature that allowed users to watch short clips before selecting a title. While
this improved the browsing experience, it also dramatically increased the
volume of time-series data being stored.
When Netflix analyzed the performance of its system, it found several
inefficiencies:
Netflix’s clients (such as apps and web interfaces) retrieved more data
than they needed. Most queries only required recent viewing data, but
the system fetched entire viewing histories regardless. Because filtering
happened after data was fetched, huge amounts of unnecessary data
were transferred across the network, leading to high bandwidth costs
and slow performance.
Apache Cassandra® had been a solid choice for scalability, but by this stage,
Netflix was already operating one of the largest Apache Cassandra® clusters
in existence. The company had already pushed Apache Cassandra® to its
limits, and without a new approach, performance issues would continue to
worsen as the platform grew.
Netflix solved this by splitting viewing history into three separate categories,
each with its own dedicated storage cluster:
Full Plays: Complete records of the movies and shows users actually
watched, i.e., the core viewing history described so far.
Video Previews: Short clips that users watch while browsing content.
Since this data grows quickly but isn’t as important as full plays, it
required a different storage strategy.
Language Preferences: Information on which subtitles or audio tracks
a user selects. This was previously stored redundantly across multiple
viewing records, wasting storage. Now, it is stored separately and
referenced when needed.
Each of the three data categories (Full Plays, Previews, and Language
Preferences) was assigned its own separate cluster.
This allowed Netflix to tune each cluster differently based on how frequently
the data was accessed and how long it needed to be stored. It also prevented
one type of data from overloading the entire system.
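As a simple illustration of that separation (the cluster endpoints and tuning notes below are invented for the example), each category can be routed to its own cluster with its own profile, so a spike in preview writes never competes with full-play reads:

```python
# Hypothetical mapping from data category to its own dedicated cluster and profile.
# Endpoints and tuning notes are invented for illustration.
CLUSTERS = {
    "full_plays":     {"contact_points": ["full-plays.db.internal"],
                       "notes": "detailed records, balanced read/write tuning"},
    "previews":       {"contact_points": ["previews.db.internal"],
                       "notes": "fast-growing, lower-value data; aggressive expiry"},
    "language_prefs": {"contact_points": ["lang-prefs.db.internal"],
                       "notes": "small, deduplicated data referenced when needed"},
}

def cluster_for(category: str) -> dict:
    """Route a record to the cluster dedicated to its data category."""
    return CLUSTERS[category]
```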
Not all data is accessed equally. Users frequently check their recent viewing
history, but older data is rarely needed. To optimize for this, Netflix divided its
data storage into three time-based clusters (see the sketch after this list):
Recent Cluster (Short-Term Data): Stores viewing data from the past
few days or weeks. Optimized for fast reads and writes since recent
data is accessed most frequently.
Past Cluster (Archived Data): Holds viewing records from the past few
months to a few years. Contains detailed records, but is tuned for
slower access since older data is less frequently requested.
Historical Cluster (Summarized Data for Long-Term Storage):
Stores compressed summaries of viewing history from many years ago.
Instead of keeping every detail, this cluster only retains key information,
reducing storage size.
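Here is a small sketch of how such age-based routing could look; the cut-off values are assumptions, since the article does not specify the exact thresholds.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical age thresholds; the article does not specify the exact cut-offs.
RECENT_WINDOW = timedelta(days=30)
PAST_WINDOW = timedelta(days=3 * 365)

def cluster_for_record(watched_at: datetime, now: Optional[datetime] = None) -> str:
    """Pick the storage cluster for a viewing record based on its age."""
    now = now or datetime.now(timezone.utc)
    age = now - watched_at
    if age <= RECENT_WINDOW:
        return "recent"      # full detail, optimized for fast reads and writes
    if age <= PAST_WINDOW:
        return "past"        # full detail, tuned for infrequent access
    return "historical"      # compressed summary only
```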
Netflix also reduced what was written in the first place. Many previews were
watched for only a few seconds, which wasn't a strong enough signal of user
interest. Instead of storing every short preview play, Netflix filtered out these
records before they were written to the database, significantly reducing
storage overhead.
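A minimal sketch of that write-time filter; the exact threshold is an assumption, since the article only says plays of "a few seconds" were dropped.

```python
# Drop preview plays that are too short to signal real interest, before they reach storage.
MIN_PREVIEW_SECONDS = 5  # hypothetical threshold; the article only says "a few seconds"

def should_store_preview(play_duration_sec: float) -> bool:
    """Persist a preview play only if it lasted long enough to indicate genuine interest."""
    return play_duration_sec >= MIN_PREVIEW_SECONDS

events = [{"title_id": "t1", "duration": 2.0}, {"title_id": "t2", "duration": 41.5}]
to_write = [e for e in events if should_store_preview(e["duration"])]
# Only the 41.5-second play survives; the 2-second play is never written to the database.
```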
Instead of fetching all data at once, the system was redesigned to retrieve
only what was needed:
Recent to Past Cluster: If a viewing record was older than a certain
threshold, it was moved from the Recent Cluster to the Past Cluster.
This ensured that frequently accessed data stayed in a fast-access
storage area, while older data was archived more efficiently.
Parallel writes & chunking were used to ensure that moving large
amounts of data didn’t slow down the system.
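One way to picture this rotation is the sketch below; the batch size and the two storage helpers (write_to_past, delete_from_recent) are placeholders, not Netflix's actual APIs. Aged records are copied to the Past cluster in parallel batches and removed from the Recent cluster only after the copy succeeds.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta, timezone

AGE_THRESHOLD = timedelta(days=30)  # hypothetical cut-off for "recent" data
BATCH_SIZE = 500                    # hypothetical chunk size for parallel writes

def rotate_to_past(records: list[dict], write_to_past, delete_from_recent) -> None:
    """Move viewing records older than the threshold from the Recent to the Past cluster.
    `write_to_past` and `delete_from_recent` are placeholders for the real storage calls."""
    cutoff = datetime.now(timezone.utc) - AGE_THRESHOLD
    aged = [r for r in records if r["watched_at"] < cutoff]

    # Chunk the aged records and copy the chunks in parallel so a single large
    # transfer does not block the rest of the system.
    batches = [aged[i:i + BATCH_SIZE] for i in range(0, len(aged), BATCH_SIZE)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(write_to_past, batches))

    # Only after the copy has succeeded is the data removed from the Recent cluster.
    delete_from_recent(aged)
```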
A new summary cache cluster was introduced, storing precomputed
summaries of viewing data for most users. This meant that instead of
computing summaries every time a user made a request, Netflix could fetch
them instantly from the cache. They managed to achieve a 99% cache hit
rate, meaning that nearly all requests were served from memory rather than
querying Apache Cassandra®, reducing overall database load.
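Conceptually, this is a cache-aside lookup, sketched below with an in-memory dictionary standing in for the dedicated cache cluster and a placeholder function standing in for the expensive summary computation in Apache Cassandra®.

```python
# Cache-aside lookup for precomputed viewing summaries. With a ~99% hit rate,
# almost every request is answered from the cache and never touches Cassandra.
summary_cache: dict[str, dict] = {}  # stand-in for the dedicated summary cache cluster

def get_viewing_summary(customer_id: str, compute_from_cassandra) -> dict:
    """Return a user's viewing summary, preferring the cache over the database."""
    cached = summary_cache.get(customer_id)
    if cached is not None:
        return cached                              # ~99% of requests end here

    summary = compute_from_cassandra(customer_id)  # expensive: query and aggregate in Cassandra
    summary_cache[customer_id] = summary           # repopulate so the next read is a hit
    return summary
```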
Conclusion
With the growth that Netflix went through, their engineering team had to
evolve the time-series data storage system to meet the increasing demands.