
How Netflix Stores 140 Million Hours of Viewing Data Per Day


ByteByteGo
Mar 18

Every single day millions of people stream their favorite movies and TV
shows on Netflix.

With each stream, a massive amount of data is generated: what a user watches, when they pause, rewind, or stop, and what they return to later. This information is essential for providing users with features like resume watching, personalized recommendations, and content suggestions.

However, Netflix's growth has led to an explosion of time series data (data
recorded over time, like a user’s viewing history). The company relies heavily
on this data to enhance the user experience, but handling such a vast and
ever-increasing amount of information also presents a technical challenge in
the following ways:

Every time a user watches a show or movie, new records are added to
their viewing history. This history keeps growing, making it harder to
store and retrieve efficiently.

As millions of people use Netflix, the number of viewing records increases not just for individual users but across the entire platform. This rapid expansion means Netflix must continuously scale its data storage systems.

When Netflix introduces new content and features, people spend more
time watching. With the rise of binge-watching and higher-quality
streaming (like 4K videos), the amount of viewing data per user is also
increasing.

A failure to manage this data properly could seriously degrade the user experience: long delays when loading watch history, useless recommendations, or even lost progress in shows.

In this article, we’ll learn how Netflix tackled these problems and improved
their storage system to handle millions of hours of viewing data every day.

The Initial Approach


When Netflix first started handling large amounts of viewing data, they chose
Apache Cassandra® for the following reasons:

Apache Cassandra® allows for a flexible structure, where each row can
store a growing number of viewing records without performance issues.

Netflix’s system processes significantly more writes (data being stored)
than reads (data being retrieved). The ratio is approximately 9:1,
meaning for every 9 new records added, only 1 is read. Apache
Cassandra® excels in handling such workloads.

Instead of enforcing strict consistency, Netflix prioritizes availability and speed, ensuring users always have access to their watch history, even if data updates take a little longer to sync. Apache Cassandra® supports this tradeoff with eventual consistency, meaning that all copies of data will eventually match up across the system.

See the diagram below, which shows the data model of Apache Cassandra®
using column families.

To structure the data efficiently, Netflix designed a simple yet scalable storage
model in Cassandra®.

Each user’s viewing history was stored under their unique ID (CustomerId).
Every viewing record (such as a movie or TV show watched) was stored in a
separate column under that user’s ID. To handle millions of users, Netflix
used "horizontal partitioning," meaning data was spread across multiple
servers based on CustomerId. This ensured that no single server was
overloaded.
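To make the model concrete, here is a minimal sketch of that wide-row layout using the open-source DataStax Python driver. The keyspace, table, and column names are illustrative assumptions, not Netflix's actual schema.

```python
# A minimal sketch of the wide-row model described above. All names and
# settings here are illustrative assumptions, not Netflix's real schema.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS viewing
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# One partition per CustomerId; every viewing record becomes another
# clustered column inside that partition, so a user's history grows
# within their own row while partitions are spread across servers.
session.execute("""
    CREATE TABLE IF NOT EXISTS viewing.viewing_history (
        customer_id text,
        watched_at  timestamp,   -- when this viewing session started
        title_id    text,
        position_ms bigint,      -- latest playback position
        PRIMARY KEY (customer_id, watched_at, title_id)
    ) WITH CLUSTERING ORDER BY (watched_at DESC, title_id ASC)
""")
```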

Reads and Writes in the Initial System

The diagram below shows how the initial system handled reads and writes to
the viewing history data.

Every time a user started watching a show or movie, Netflix added a new
column to their viewing history record in the database. If the user paused or
stopped watching, that same column was updated to reflect their latest
progress.

While storing data was easy, retrieving it efficiently became more challenging as users' viewing histories grew. Netflix used three different methods to fetch data, each with its advantages and drawbacks (a rough sketch of these query patterns follows the list):

Retrieving the Entire Viewing History: If a user had only watched a few shows, Netflix could quickly fetch their entire history in one request. However, for long-time users with large viewing histories, this approach became slow and inefficient as more data accumulated.

Searching by Time Range: Sometimes, Netflix only needed to fetch records from a specific time period, like a user’s viewing history from the last month. While this method worked well in some cases, performance varied depending on how many records were stored within the selected time range.

Using Pagination for Large Histories: To avoid loading huge amounts of data at once, Netflix used pagination. This approach prevented timeouts (when a request takes too long and fails), but it also increased the overall time needed to retrieve all records.
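Against the hypothetical table sketched earlier (again, not Netflix's real schema), those three access patterns might look roughly like this with the Python driver:

```python
# Rough sketches of the three retrieval patterns, using the hypothetical
# viewing.viewing_history table from the earlier example.
from datetime import datetime, timedelta, timezone

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("viewing")

# 1. Entire viewing history in one request: fine for light viewers,
#    increasingly slow for long-time users with large histories.
full_history = session.execute(
    "SELECT * FROM viewing_history WHERE customer_id = %s", ("user-123",)
)

# 2. Only a time range, e.g. the last month of viewing activity.
last_month = session.execute(
    "SELECT * FROM viewing_history "
    "WHERE customer_id = %s AND watched_at >= %s",
    ("user-123", datetime.now(timezone.utc) - timedelta(days=30)),
)

# 3. Pagination: fetch the history in pages to avoid timeouts, at the
#    cost of more round trips overall.
paged = SimpleStatement(
    "SELECT * FROM viewing_history WHERE customer_id = %s", fetch_size=500
)
for record in session.execute(paged, ("user-123",)):
    pass  # the driver fetches the next page lazily as we iterate
```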

At first, this system worked well because it provided a fast and scalable way
to store viewing history. However, as more users watched more content, this
system started to hit performance limits. Some of the issues were as follows:

Too Many SSTables: Apache Cassandra® stores data in files called SSTables (Sorted String Tables). Over time, the number of these SSTables increased significantly. For every read, the system had to scan multiple SSTables across the disk, making the process slower.

Compaction Overhead: Apache Cassandra® performs compactions to merge multiple SSTables into fewer, more efficient ones. With more data, these compactions took longer and required more processing power. Other operations like read repair and full column repair also became expensive.

To speed up data retrieval, Netflix introduced a caching solution called EVCache.

Instead of reading everything from the Apache Cassandra® database every time, each user’s viewing history is stored in a cache in a compressed format. When a user watches a new show, their viewing data is added to Apache Cassandra® and merged with the cached value in EVCache. If a user’s viewing history is not found in the cache, Netflix fetches it from Cassandra®, compresses it, and then stores it in EVCache for future use.
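In pseudocode-style Python, that flow looks roughly like the sketch below. A plain dictionary stands in for the EVCache cluster, and fetch_history_from_cassandra is a hypothetical helper; none of this is Netflix's actual code or EVCache's real API.

```python
# A simplified stand-in for the caching flow described above.
import json
import zlib

cache: dict[str, bytes] = {}  # stand-in for the EVCache cluster

def fetch_history_from_cassandra(customer_id: str) -> list[dict]:
    """Hypothetical helper that reads the full history from Cassandra."""
    return []

def get_history(customer_id: str) -> list[dict]:
    blob = cache.get(customer_id)
    if blob is not None:                        # cache hit: decompress and return
        return json.loads(zlib.decompress(blob))
    history = fetch_history_from_cassandra(customer_id)  # cache miss: read the DB,
    cache[customer_id] = zlib.compress(json.dumps(history).encode())  # then backfill
    return history

def record_view(customer_id: str, record: dict) -> None:
    # New viewing data is written to Cassandra (omitted here) and merged
    # into the compressed value already sitting in the cache.
    history = get_history(customer_id)
    history.append(record)
    cache[customer_id] = zlib.compress(json.dumps(history).encode())
```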

By adding EVCache, Netflix significantly reduced the load on their Apache Cassandra® database. However, this solution also had its limits.

The New Approach: Live & Compressed Storage Model

One important fact about the viewing history data was this: most users
frequently accessed only their recent viewing history, while older data was
rarely needed. However, storing everything the same way led to unnecessary
performance and storage costs.

To solve this, Netflix redesigned its storage model by splitting viewing history
into two categories:

Live Viewing History (LiveVH): Stores recently watched content that users access frequently. It is kept uncompressed for fast reads and writes, which makes it well suited to quick updates, like when a user pauses or resumes a show.

Compressed Viewing History (CompressedVH): Stores older viewing records that are accessed less frequently. It is compressed to save storage space and improve performance.

Since LiveVH and CompressedVH serve different purposes, they were tuned
differently to maximize performance.

For LiveVH, which stores recent viewing records, Netflix prioritized speed and
real-time updates. Frequent compactions were performed to clean up old
data and keep the system running efficiently. Additionally, a low GC (Garbage
Collection) grace period was set, meaning outdated records were removed
quickly to free up space. Since this data was accessed often, frequent read
repairs were implemented to maintain consistency, ensuring that users
always saw accurate and up-to-date viewing progress.

On the other hand, CompressedVH, which stores older viewing records, was
optimized for storage efficiency rather than speed. Since this data was rarely
updated, fewer compactions were needed, reducing unnecessary processing
overhead. Read repairs were also performed less frequently, as data
consistency was less critical for archival records. The most significant
optimization was compressing the stored data, which drastically reduced the
storage footprint while still making older viewing history accessible when
needed.
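As a rough illustration of what this differential tuning can look like in CQL, the sketch below applies different compaction and GC-grace settings to two hypothetical tables. The table names and option values are made up for illustration; they are not Netflix's actual settings.

```python
# Illustrative per-table tuning for hot (LiveVH) vs. cold (CompressedVH)
# data; table names and option values are assumptions, not Netflix's.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("viewing")

# LiveVH: small, frequently updated data -> aggressive compaction and a
# short GC grace period so outdated records are cleaned up quickly.
session.execute("""
    ALTER TABLE live_viewing_history
    WITH gc_grace_seconds = 3600
     AND compaction = {'class': 'SizeTieredCompactionStrategy',
                       'min_threshold': '2'}
""")

# CompressedVH: large, rarely updated data -> infrequent compaction; the
# history itself is compressed by the application before being written.
session.execute("""
    ALTER TABLE compressed_viewing_history
    WITH gc_grace_seconds = 864000
     AND compaction = {'class': 'SizeTieredCompactionStrategy',
                       'min_threshold': '8'}
""")
```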

The Large Viewing History Performance Issue


Even with compression, some users had extremely large viewing histories.
For these users, reading and writing a single massive compressed file
became inefficient.

If CompressedVH grows too large, retrieving data becomes slow. Also, single
large files create performance bottlenecks when read or written. To avoid
these issues, Netflix introduced chunking, where large compressed data is
split into smaller parts and stored across multiple Apache Cassandra®
database nodes.

Here’s how it works:

Instead of storing everything in one large file, Netflix stores metadata that tracks data version and chunk count.

Instead of writing one large file, each chunk is written separately. A metadata entry keeps track of all the chunks, ensuring efficient storage.

Metadata is read first, so Netflix knows how many chunks to retrieve.


Chunks are fetched in parallel, significantly improving read speeds.

This system gave Netflix the headroom needed to handle future growth.
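A rough sketch of that chunked read path, with hypothetical metadata and chunk tables (not Netflix's real ones), might look like this: metadata first, then all chunks fetched in parallel and stitched back together.

```python
# Read path for a chunked, compressed viewing history. Table and column
# names are hypothetical; the flow matches the description above.
import zlib

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("viewing")

def read_compressed_history(customer_id: str) -> bytes:
    # 1. Metadata tells us which version to read and how many chunks it has.
    meta = session.execute(
        "SELECT version, chunk_count FROM compressed_vh_meta "
        "WHERE customer_id = %s", (customer_id,)
    ).one()

    # 2. Fire one async query per chunk so the chunks are fetched in parallel.
    futures = [
        session.execute_async(
            "SELECT data FROM compressed_vh_chunks "
            "WHERE customer_id = %s AND version = %s AND chunk_id = %s",
            (customer_id, meta.version, chunk_id),
        )
        for chunk_id in range(meta.chunk_count)
    ]

    # 3. Reassemble the chunks in order and decompress the full history.
    blob = b"".join(future.result().one().data for future in futures)
    return zlib.decompress(blob)
```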

The New Challenges


With the global expansion of Netflix, the company launched its service in
130+ new countries and introduced support for 20 languages. This led to a
massive surge in data storage and retrieval needs.

At the same time, Netflix introduced video previews in the user interface (UI),
a feature that allowed users to watch short clips before selecting a title. While
this improved the browsing experience, it also dramatically increased the
volume of time-series data being stored.

When Netflix analyzed the performance of its system, it found several
inefficiencies:

The storage system did not differentiate between different types of viewing data. Video previews and full-length plays were stored identically, despite having very different usage patterns.

Language preferences were stored repeatedly across multiple viewing records, leading to unnecessary duplication and wasted storage.

Since video previews generated many short playback records, they accounted for 30% of the data growth every quarter.

Netflix’s clients (such as apps and web interfaces) retrieved more data
than they needed. Most queries only required recent viewing data, but
the system fetched entire viewing histories regardless. Because filtering
happened after data was fetched, huge amounts of unnecessary data
were transferred across the network, leading to high bandwidth costs
and slow performance.

Read latencies at the 99th percentile (worst-case scenarios) became highly unpredictable. Some queries were fast, but others took too long to complete, causing an inconsistent user experience.

At this stage, Netflix’s existing architecture could no longer scale efficiently.

Apache Cassandra® had been a solid choice for scalability, but by this stage,
Netflix was already operating one of the largest Apache Cassandra® clusters
in existence. The company had already pushed Apache Cassandra® to its
limits, and without a new approach, performance issues would continue to
worsen as the platform grew.

A more fundamental redesign was needed.

Netflix’s New Storage Architecture


To overcome the growing challenges of data overload, inefficient retrieval,
and high latency, Netflix introduced a new storage architecture that
categorized and stored data more intelligently.

Step 1: Categorizing Data by Type


One of the biggest inefficiencies in the previous system was that all viewing
data (whether it was a full movie playback, a short video preview, or a
language preference setting) was stored together.

Netflix solved this by splitting viewing history into three separate categories, each with its own dedicated storage cluster (a simple classification sketch follows the list):

Full Title Plays: Data related to complete or partial streaming of movies and TV shows. This is the most critical data because it affects resume points, recommendations, and user engagement metrics.

Video Previews: Short clips that users watch while browsing content.
Since this data grows quickly but isn’t as important as full plays, it
required a different storage strategy.

Language Preferences: Information on which subtitles or audio tracks
a user selects. This was previously stored redundantly across multiple
viewing records, wasting storage. Now, it is stored separately and
referenced when needed.
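As a toy illustration, classifying an incoming event and routing it to the right dedicated cluster could look like the snippet below. The cluster names and event fields are assumptions made for the example.

```python
# Route each viewing event to a dedicated store based on its type.
# Cluster names and the event shape are illustrative assumptions.
from enum import Enum

class ViewingDataType(Enum):
    FULL_TITLE_PLAY = "full_plays_cluster"
    VIDEO_PREVIEW = "previews_cluster"
    LANGUAGE_PREFERENCE = "language_prefs_cluster"

def target_cluster(event: dict) -> str:
    """Pick the storage cluster for one event based on what kind of data it is."""
    if event.get("kind") == "preview":
        return ViewingDataType.VIDEO_PREVIEW.value
    if event.get("kind") == "language_change":
        return ViewingDataType.LANGUAGE_PREFERENCE.value
    return ViewingDataType.FULL_TITLE_PLAY.value

print(target_cluster({"kind": "preview", "title_id": "t1"}))  # previews_cluster
```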

Step 2: Sharding Data for Better Performance


Netflix sharded (split) its data across multiple clusters based on data type and
data age, improving both storage efficiency and query performance.

See the diagram below for reference:

1 - Sharding by Data Type

Each of the three data categories (Full Plays, Previews, and Language Preferences) was assigned its own cluster.

This allowed Netflix to tune each cluster differently based on how frequently
the data was accessed and how long it needed to be stored. It also prevented
one type of data from overloading the entire system.

2 - Sharding by Data Age

Not all data is accessed equally. Users frequently check their recent viewing history, but older data is rarely needed. To optimize for this, Netflix divided its data storage into three time-based clusters (a routing sketch follows the list):

Recent Cluster (Short-Term Data): Stores viewing data from the past
few days or weeks. Optimized for fast reads and writes since recent
data is accessed most frequently.

Past Cluster (Archived Data): Holds viewing records from the past few
months to a few years. Contains detailed records, but is tuned for
slower access since older data is less frequently requested.

Historical Cluster (Summarized Data for Long-Term Storage):
Stores compressed summaries of viewing history from many years ago.
Instead of keeping every detail, this cluster only retains key information,
reducing storage size.
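Age-based routing can be sketched very simply; the thresholds and cluster names below are made-up examples rather than Netflix's real configuration.

```python
# Pick a time-based cluster by record age. Thresholds and names are
# illustrative assumptions, not Netflix's actual values.
from datetime import datetime, timedelta, timezone

RECENT_WINDOW = timedelta(weeks=4)       # "past few days or weeks"
PAST_WINDOW = timedelta(days=3 * 365)    # "past few months to a few years"

def cluster_for(record_time: datetime) -> str:
    age = datetime.now(timezone.utc) - record_time
    if age <= RECENT_WINDOW:
        return "recent_cluster"      # hot: optimized for fast reads and writes
    if age <= PAST_WINDOW:
        return "past_cluster"        # detailed records, tuned for slower access
    return "historical_cluster"      # compressed summaries only

print(cluster_for(datetime.now(timezone.utc) - timedelta(days=2)))    # recent_cluster
print(cluster_for(datetime.now(timezone.utc) - timedelta(days=400)))  # past_cluster
```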

Optimizations in the New Architecture


To keep the system efficient, scalable, and cost-effective, the Netflix
engineering team introduced several key optimizations.

1 - Improving Storage Efficiency


With the introduction of video previews and multi-language support, Netflix
had to find ways to reduce the amount of unnecessary data stored. The
following strategies were adopted:

Many previews were only watched for a few seconds, which wasn’t a
strong enough signal of user interest. Instead of storing every short
preview play, Netflix filtered out these records before they were written
to the database, significantly reducing storage overhead.

Previously, language preference data (subtitles/audio track choices) was duplicated across multiple viewing records. This was inefficient, so Netflix changed its approach. Now, each user's language preference is stored only once, and only changes (deltas) are recorded when the user selects a different language.

To keep the database from being overloaded, Netflix implemented Time-To-Live (TTL)-based expiration, meaning that unneeded records were deleted automatically after a set period (see the sketch after this list).

Netflix continued using two separate storage methods to store recent data in an uncompressed format and older data in compressed format.
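A condensed sketch of those write-side rules (short preview plays filtered out, language choices stored only as deltas, and a TTL on preview rows) might look like this. The thresholds, table names, and event shape are assumptions for illustration only.

```python
# Write-side filtering sketch: drop low-signal preview plays, record
# language preferences only when they change, and expire preview rows
# with a TTL. All names and thresholds are illustrative assumptions.
MIN_PREVIEW_SECONDS = 30                 # made-up "meaningful preview" cutoff
PREVIEW_TTL_SECONDS = 90 * 24 * 3600     # made-up retention period

def handle_event(session, event: dict, last_language: dict) -> None:
    customer = event["customer_id"]

    if event["kind"] == "preview":
        if event["watched_seconds"] < MIN_PREVIEW_SECONDS:
            return  # too short to signal real interest: never stored at all
        session.execute(
            "INSERT INTO previews (customer_id, title_id, watched_seconds) "
            f"VALUES (%s, %s, %s) USING TTL {PREVIEW_TTL_SECONDS}",
            (customer, event["title_id"], event["watched_seconds"]),
        )
    elif event["kind"] == "language_change":
        if last_language.get(customer) == event["language"]:
            return  # unchanged preference: only deltas are written
        last_language[customer] = event["language"]
        session.execute(
            "INSERT INTO language_prefs (customer_id, language) VALUES (%s, %s)",
            (customer, event["language"]),
        )
```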

2 - Making Data Retrieval More Efficient

Instead of fetching all data at once, the system was redesigned to retrieve
only what was needed:

Recent data requests were directed to the Recent Cluster to support low-latency access for users checking their recently watched titles.

Historical data retrieval was improved with parallel reads. Instead of pulling old records from a single location, Netflix queried multiple clusters in parallel, making data access much faster (sketched after this list).

Records from different clusters were intelligently stitched together. Even when a user requested their entire viewing history, the system combined data from various sources quickly and efficiently.
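In outline, the parallel-read-and-stitch pattern can be expressed as below. The per-cluster fetch helpers are hypothetical stand-ins rather than real client code.

```python
# Query the three clusters in parallel and stitch the results into one
# chronologically ordered history. The fetch_* helpers are hypothetical.
from concurrent.futures import ThreadPoolExecutor

def fetch_recent(customer_id: str) -> list[dict]:
    return []  # hypothetical read from the Recent Cluster

def fetch_past(customer_id: str) -> list[dict]:
    return []  # hypothetical read from the Past Cluster

def fetch_historical(customer_id: str) -> list[dict]:
    return []  # hypothetical read from the Historical Cluster

def full_history(customer_id: str) -> list[dict]:
    # Issue all three reads at the same time instead of one after another.
    with ThreadPoolExecutor(max_workers=3) as pool:
        parts = list(pool.map(
            lambda fetch: fetch(customer_id),
            (fetch_recent, fetch_past, fetch_historical),
        ))
    # Stitch the partial results back together in time order.
    merged = [record for part in parts for record in part]
    return sorted(merged, key=lambda record: record["watched_at"])
```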

3 - Automating Data Movement with Data Rotation

To keep the system optimized, Netflix introduced a background process that automatically moved older data to the appropriate storage location (sketched in code after the list):

Recent to Past Cluster: If a viewing record was older than a certain
threshold, it was moved from the Recent Cluster to the Past Cluster.
This ensured that frequently accessed data stayed in a fast-access
storage area, while older data was archived more efficiently.

Past to Historical Cluster: When data became older, it was summarized and compressed, then moved to the Historical Cluster for long-term storage with minimal resource usage.

Parallel writes & chunking were used to ensure that moving large
amounts of data didn’t slow down the system.
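In simplified form, such a rotation job could look like the sketch below. The store objects, the summarize() helper, and the age thresholds are all hypothetical.

```python
# A toy background rotation job: move aging records from Recent to Past,
# and summarize very old ones into Historical. Stores, thresholds, and
# summarize() are illustrative assumptions, not Netflix's implementation.
from datetime import datetime, timedelta, timezone

RECENT_RETENTION = timedelta(weeks=4)        # made-up threshold
PAST_RETENTION = timedelta(days=3 * 365)     # made-up threshold

def summarize(record: dict) -> dict:
    """Hypothetical roll-up that keeps only the key fields of a record."""
    return {"title_id": record["title_id"], "watched_at": record["watched_at"]}

def rotate(recent_store, past_store, historical_store) -> None:
    now = datetime.now(timezone.utc)

    # Recent -> Past: keeps the hot cluster small and fast.
    for record in recent_store.older_than(now - RECENT_RETENTION):
        past_store.write(record)
        recent_store.delete(record)

    # Past -> Historical: summarize and compress before archiving.
    for record in past_store.older_than(now - PAST_RETENTION):
        historical_store.write(summarize(record))
        past_store.delete(record)
```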

See the diagram below:

4 - Caching for Faster Data Access


Netflix restructured its EVCache (in-memory caching layer) to mirror the
backend storage architecture.

A new summary cache cluster was introduced, storing precomputed
summaries of viewing data for most users. This meant that instead of
computing summaries every time a user made a request, Netflix could fetch
them instantly from the cache. They managed to achieve a 99% cache hit
rate, meaning that nearly all requests were served from memory rather than
querying Apache Cassandra®, reducing overall database load.
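The lookup path for such a summary cache is essentially a cache-aside read; the sketch below uses a plain dictionary and a hypothetical compute_summary_from_cassandra helper in place of the real EVCache cluster and backend query.

```python
# Serve precomputed viewing summaries from memory, falling back to the
# database only on a miss. The dict and helper are stand-ins, not
# Netflix's actual EVCache API.
summary_cache: dict[str, dict] = {}

def compute_summary_from_cassandra(customer_id: str) -> dict:
    """Hypothetical expensive path: scan the clusters and roll up a summary."""
    return {"customer_id": customer_id, "titles_watched": 0}

def get_viewing_summary(customer_id: str) -> dict:
    summary = summary_cache.get(customer_id)
    if summary is not None:
        return summary                                      # ~99% of requests
    summary = compute_summary_from_cassandra(customer_id)   # the rare miss
    summary_cache[customer_id] = summary                    # backfill the cache
    return summary
```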

Conclusion
With the growth that Netflix went through, their engineering team had to
evolve the time-series data storage system to meet the increasing demands.

Initially, Netflix relied on a simple Apache Cassandra®-based architecture, which worked well for early growth but struggled as data volume soared. The introduction of video previews, global expansion, and multilingual support pushed the system to its breaking point, leading to high latency, storage inefficiencies, and unnecessary data retrieval.

To address these challenges, Netflix redesigned its architecture by categorizing data into Full Title Plays, Video Previews, and Language Preferences and sharding data by type and age. This allowed for faster access to recent data and efficient storage for older records.

These innovations allowed Netflix to scale efficiently, reducing storage costs, improving data retrieval speeds, and ensuring a great streaming experience for millions of users worldwide.

Note: Apache Cassandra® is a registered trademark of the Apache Software Foundation.
