Why System Design
Any candidate who does not have experience building systems might
think such questions grossly unfair. On top of that, there generally isn't
one correct answer to such questions. The way you answer the question
says a great deal about your professional competence and background
experience, and that is what the interviewer will evaluate you on.
Try to learn from existing systems: how have they been designed?
Another important point to keep in mind is that the interviewer
expects the candidate's analytical ability and questioning of the
problem to be commensurate with his/her experience. If you have a few years
of software development experience, you are expected to have certain
knowledge and should avoid delving into basic questions that
might have been appropriate coming from a fresh graduate. For that, you
should prepare sufficiently ahead of time. Try to go through real projects
and practices well in advance of the interview, as most questions are
based on real-life products, issues, and challenges.
3. Summary
Solving system design questions could be broken down into three steps:
4. Conclusion
Design interviews present formidable, open-ended problems that cannot be
completely solved in the allotted time. Therefore, you should try to understand what
your interviewer intends to focus on and spend sufficient time on it. Be
aware that the discussion on a system design problem can
go in different directions depending on the preferences of the
interviewer. The interviewer might want to see how you create a
high-level architecture covering all aspects of the system, or they could be
interested in picking specific areas and diving deep into them. This
means that you must deal with the situation strategically, as there is a
chance of even good candidates failing the interview, not because
they don't have the knowledge, but because they lack the ability to focus
on the right things while discussing the problem.
If you have no idea how to solve these kinds of problems, you can
familiarize yourself with the common patterns of system design by
reading widely from blogs and watching videos of tech talks from
conferences. It is also advisable to arrange discussions, and even mock
interviews, with experienced engineers at big tech companies.
Consistent Hashing
index = hash_function(key)
Suppose we are designing a distributed caching system. Given ‘n’ cache
servers, an intuitive hash function would be ‘key % n’. It is simple and
commonly used. But it has two major drawbacks:
In Consistent Hashing when the hash table is resized (e.g. a new cache
host is added to the system), only k/n keys need to be remapped, where k
is the total number of keys and n is the total number of servers. Recall
that in a caching system using the ‘mod’ as the hash function, all keys
need to be remapped.
How does it work?
To add a new server, say D, keys that were originally residing at C will be
split. Some of them will be shifted to D, while other keys will not be
touched.
If a cache is removed or fails, say A, all keys that originally
mapped to A will fall onto B, and only those keys need to be moved to B;
other keys will not be affected.
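As a rough illustration, here is a minimal consistent-hashing ring in Python. It is only a sketch: the MD5-based hash, the server names, and the number of virtual points per server are assumptions made for the example.

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, servers, points_per_server=100):
        self._ring = []                      # sorted list of (hash, server) points
        self._points = points_per_server
        for server in servers:
            self.add_server(server)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_server(self, server):
        # Each server is placed on the ring at several virtual points to spread load.
        for i in range(self._points):
            self._ring.append((self._hash(f"{server}#{i}"), server))
        self._ring.sort()

    def remove_server(self, server):
        self._ring = [(h, s) for h, s in self._ring if s != server]

    def get_server(self, key):
        # Walk clockwise to the first server point at or after the key's hash.
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h,)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["A", "B", "C"])
print(ring.get_server("user:42"))            # one of A, B, or C
ring.add_server("D")                          # only ~1/4 of the keys move to D

Because a key only moves when a server point between its old and new position changes, adding or removing a server remaps roughly k/n keys, as described above.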
CAP Theorem
Consistency: All nodes see the same data at the same time. Consistency is
achieved by updating several nodes before allowing further reads.
Load Balancing
To utilize full scalability and redundancy, we can try to balance the load
at each layer of the system. We can add LBs at three places:
1. Smart Clients
A smart client takes a pool of service hosts and balances load across
them. It also detects hosts that are not responding, to avoid sending
requests their way. Smart clients also have to detect recovered hosts, deal
with adding new hosts, and so on.
Adding load-balancing functionality into the database (cache, service,
etc.) client is usually an attractive solution for the developer. It looks
easy to implement and manage, especially when the system is not large.
But as the system grows, LBs often need to evolve into standalone servers.
Dedicated hardware load balancers are expensive, so even large companies
with large budgets will often avoid using dedicated hardware for all their
load-balancing needs. Instead, they use them only as the first point of
contact for user requests to their infrastructure and use other mechanisms
(smart clients or the hybrid approach discussed in the next section) to
balance traffic within their network.
For most systems, we should start with a software load balancer and
move to smart clients or hardware load balancing if the need arises.
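To make the smart-client idea concrete, here is a toy round-robin client in Python that skips hosts marked unhealthy. The host addresses and the health-marking interface are assumptions for illustration, not a prescribed design.

import itertools

class SmartClient:
    def __init__(self, hosts):
        self.hosts = hosts
        self.unhealthy = set()
        self._cycle = itertools.cycle(hosts)

    def mark_down(self, host):
        self.unhealthy.add(host)      # e.g., after a failed health check or timeout

    def mark_up(self, host):
        self.unhealthy.discard(host)  # host has recovered

    def pick_host(self):
        # Round-robin over the pool, skipping hosts currently marked down.
        for _ in range(len(self.hosts)):
            host = next(self._cycle)
            if host not in self.unhealthy:
                return host
        raise RuntimeError("no healthy hosts available")

client = SmartClient(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
print(client.pick_host())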
Caching
Load balancing helps you scale horizontally across an ever-increasing
number of servers, but caching will enable you to make vastly better use
of the resources you already have, as well as making otherwise
unattainable product requirements feasible. Caches take advantage of
the locality of reference principle: recently requested data is likely to be
requested again. They are used in almost every layer of computing:
hardware, operating systems, web browsers, web applications and more.
A cache is like short-term memory: it has a limited amount of space, but
is typically faster than the original data source and contains the most
recently accessed items. Caches can exist at all levels in architecture but
are often found at the level nearest to the front end, where they are
implemented to return data quickly without taxing downstream levels.
Placing a cache directly on a request layer node enables the local storage
of response data. Each time a request is made to the service, the node
will quickly return local, cached data if it exists. If it is not in the cache,
the requesting node will query the data from disk. The cache on one
request layer node could also be located both in memory (which is very
fast) and on the node’s local disk (faster than going to network storage).
What happens when you expand this to many nodes? If the request layer
is expanded to multiple nodes, it’s still quite possible to have each node
host its own cache. However, if your load balancer randomly distributes
requests across the nodes, the same request will go to different nodes,
thus increasing cache misses. Two choices for overcoming this hurdle are
global caches and distributed caches.
2. Distributed cache
In a distributed cache, each of its nodes owns part of the cached data.
Typically, the cache is divided up using a consistent hashing function,
such that if a request node is looking for a certain piece of data, it can
quickly know where to look within the distributed cache to determine if
that data is available. In this case, each node has a small piece of the
cache, and will then send a request to another node for the data before
going to the origin. Therefore, one of the advantages of a distributed
cache is the increased cache space that can be had just by adding nodes
to the request pool.
3. Global Cache
A global cache is just as it sounds: all the nodes use the same single cache
space. This involves adding a server, or file store of some sort, faster than
your original store and accessible by all the request layer nodes. Each of
the request nodes queries the cache in the same way it would a local one.
This kind of caching scheme can get a bit complicated because it is very
easy to overwhelm a single cache as the number of clients and requests
increase, but is very effective in some architectures (particularly ones
with specialized hardware that make this global cache very fast, or that
have a fixed dataset that needs to be cached).
There are two common forms of global caches depicted in the following
diagram. First, when a cached response is not found in the cache, the
cache itself becomes responsible for retrieving the missing piece of data
from the underlying store. Second, it is the responsibility of request
nodes to retrieve any data that is not found in the cache.
Most applications leveraging global caches tend to use the first type,
where the cache itself manages eviction and fetching data to prevent a
flood of requests for the same data from the clients. However, there are
some cases where the second implementation makes more sense. For
example, if the cache is being used for very large files, a low cache hit
percentage would cause the cache buffer to become overwhelmed with
cache misses; in this situation, it helps to have a large percentage of the
total data set (or hot data set) in the cache. Another example is an
architecture where the files stored in the cache are static and shouldn’t
be evicted. (This could be because of application requirements around
that data latency—certain pieces of data might need to be very fast for
large data sets—where the application logic understands the eviction
strategy or hot spots better than the cache.)
CDNs are a kind of cache that comes into play for sites serving large
amounts of static media. In a typical CDN setup, a request will first ask
the CDN for a piece of static media; the CDN will serve that content if it
has it locally available. If it isn’t available, the CDN will query the back-
end servers for the file and then cache it locally and serve it to the
requesting user.
If the system we are building isn’t yet large enough to have its own CDN,
we can ease a future transition by serving the static media off a separate
subdomain (e.g. static.yourservice.com) using a lightweight HTTP server
like Nginx, and cut over the DNS from your servers to a CDN later.
Cache Invalidation
Write-through cache: Under this scheme data is written into the cache
and the corresponding database at the same time. The cached data
allows for fast retrieval, and since the same data gets written in the
permanent storage, we will have complete data consistency between
cache and storage. Also, this scheme ensures that nothing will get lost in
case of a crash, power failure, or other system disruptions.
Although write-through minimizes the risk of data loss, every write
operation must be done twice before returning success to the client, so this
scheme has the disadvantage of higher latency for write operations.
Write-back cache: Under this scheme, data is written to cache alone, and
completion is immediately confirmed to the client. The write to the
permanent storage is done after specified intervals or under certain
conditions. This results in low latency and high throughput for write-
intensive applications; however, this speed comes with the risk of data
loss in case of a crash or other adverse event, because the only copy of the
written data is in the cache.
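As a minimal sketch of the two write policies, assuming dict-like cache and database objects purely for illustration:

class WriteThroughCache:
    def __init__(self, db):
        self.cache, self.db = {}, db

    def write(self, key, value):
        self.cache[key] = value
        self.db[key] = value               # success only after the permanent store is updated

class WriteBackCache:
    def __init__(self, db):
        self.cache, self.db = {}, db
        self.dirty = set()

    def write(self, key, value):
        self.cache[key] = value            # acknowledged immediately: low latency
        self.dirty.add(key)                # DB write deferred; data is at risk until flushed

    def flush(self):
        # Called periodically or under certain conditions.
        for key in self.dirty:
            self.db[key] = self.cache[key]
        self.dirty.clear()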
1. First In First Out (FIFO): The cache evicts the first block accessed
first without any regard to how often or how many times it was
accessed before.
2. Last In First Out (LIFO): The cache evicts the block added most
recently, without any regard to how often or how many times it
was accessed before.
3. Least Recently Used (LRU): Discards the least recently used items
first.
4. Most Recently Used (MRU): Discards, in contrast to LRU, the most
recently used items first.
5. Least Frequently Used (LFU): Counts how often an item is needed.
Those that are used least often are discarded first.
6. Random Replacement (RR): Randomly selects a candidate item
and discards it to make space when necessary.
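Since LRU is the eviction policy most of the designs in this text reach for, here is a small illustrative LRU cache in Python built on OrderedDict (a sketch, not a production implementation):

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)         # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry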
Data Partitioning
Data partitioning (also known as sharding) is a technique to break up a
big database (DB) into many smaller parts. It is the process of splitting
up a DB/table across multiple machines to improve the manageability,
performance, availability and load balancing of an application. The
justification for data sharding is that, after a certain scale point, it is
cheaper and more feasible to scale horizontally by adding more machines
than to grow it vertically by adding beefier servers.
1. Partitioning Methods
There are many different schemes one could use to decide how to break
up an application database into multiple smaller DBs. Below are three of
the most popular schemes used by various large scale applications.
The key problem with this approach is that if the value whose range is
used for sharding isn't chosen carefully, the partitioning scheme
will lead to unbalanced servers. In the previous example, splitting
locations based on their ZIP codes assumes that places will be evenly
distributed across the different ZIP codes. This assumption is not valid, as
there will be a lot more places in a densely populated area like Manhattan
than in its suburban cities.
1. The data distribution is not uniform, e.g., there are a lot of places
for a particular ZIP code that cannot fit into one database
partition.
2. There is a lot of load on a shard, e.g., there are too many requests
being handled by the DB shard dedicated to user photos.
Indexes
Indexes are well known when it comes to databases; they are used to
improve the speed of data retrieval operations on the data store. An
index makes the trade-offs of increased storage overhead, and slower
writes (since we not only have to write the data but also have to update
the index) for the benefit of faster reads. Indexes are used to quickly
locate data without having to examine every row in a database table.
Indexes can be created using one or more columns of a database table,
providing the basis for both rapid random lookups and efficient access of
ordered records.
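As a hypothetical illustration (the table and column names are made up, not taken from the text), here is how an index is created and used from Python with SQLite:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (url_key TEXT, original_url TEXT, created_at TEXT)")
# Without an index, looking up a url_key scans every row; with one, lookups use a B-tree.
conn.execute("CREATE INDEX idx_urls_key ON urls (url_key)")
conn.execute("INSERT INTO urls VALUES (?, ?, ?)",
             ("jlg8zpc", "https://www.example.com/some/very/long/url", "2024-01-01"))
row = conn.execute("SELECT original_url FROM urls WHERE url_key = ?", ("jlg8zpc",)).fetchone()
print(row[0])

The extra write cost mentioned above is visible here too: every INSERT must also update idx_urls_key.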
Imagine there is a request for the same data across several nodes, and
that piece of data is not in the cache. If these requests are routed through
the proxy, then all of them can be collapsed into one, which means we will
be reading the required data from disk only once.
Another great way to use the proxy is to collapse requests for data that is
spatially close together in storage (consecutive on disk). This
strategy will decrease request latency. For example, let's say a
bunch of servers request parts of a file: part1, part2, part3, etc. We can set
up our proxy in such a way that it recognizes the spatial locality of the
individual requests, collapsing them into a single request and
reading the complete file, which will greatly minimize the reads from the
data origin. Such a scheme makes a big difference in request time when we
are doing random accesses across TBs of data. Proxies are particularly
useful under high-load situations, or when we have limited caching, since
proxies can mostly batch several requests into one.
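A sketch of request collapsing in Python; the fetch_from_origin callable and the threading details are illustrative assumptions:

import threading

class CollapsingProxy:
    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin
        self._lock = threading.Lock()
        self._in_flight = {}                     # key -> (Event, shared result holder)

    def get(self, key):
        with self._lock:
            entry = self._in_flight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._in_flight[key] = entry
        event, result = entry
        if leader:
            result["value"] = self._fetch(key)   # only one read hits the origin/disk
            with self._lock:
                del self._in_flight[key]
            event.set()
        else:
            event.wait()                         # followers reuse the leader's result
        return result["value"]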
Queues
Queues are also used for fault tolerance as they can provide some
protection from service outages and failures. For example, we can create
a highly robust queue that can retry service requests that have failed due
to transient system failures. It is preferable to use a queue to enforce
quality-of-service guarantees rather than to expose clients directly to
intermittent service outages, which would require complicated and often
inconsistent client-side error handling.
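A minimal in-process sketch of such a retry queue; the call_service function, backoff policy, and attempt limit are assumptions:

import queue
import time

def retry_worker(task_queue, call_service, max_attempts=5):
    while True:
        task, attempt = task_queue.get()     # items are (task, attempt) tuples
        try:
            call_service(task)
        except Exception:
            if attempt < max_attempts:
                time.sleep(2 ** attempt)     # simple exponential backoff
                task_queue.put((task, attempt + 1))
            # after max_attempts the failure would be surfaced or logged instead
        finally:
            task_queue.task_done()

# Producers enqueue work as: task_queue.put((request, 0))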
Failover
SQL vs. NoSQL
In the world of databases, there are two main types of solutions: SQL and
NoSQL - or relational databases and non-relational databases. Both of
them differ in the way they were built, the kind of information they store,
and how they store it.
SQL
Relational databases store data in rows and columns. Each row contains
all the information about one entity, and columns are all the separate
data points. Some of the most popular relational databases are MySQL,
Oracle, MS SQL Server, SQLite, Postgres, MariaDB, etc.
NoSQL
The following are the most common types of NoSQL databases:
Graph Databases: These databases are used to store data whose relations
are best represented in a graph. Data is saved in graph structures with
nodes (entities), properties (information about the entities) and lines
(connections between the entities). Examples of graph databases include
Neo4J and InfiniteGraph.
NoSQL databases have different data storage models. The main ones are
key-value, document, graph and columnar. We will discuss differences
between these databases below.
When all the other components of our application are fast and seamless,
NoSQL databases prevent data from being the bottleneck. Big data has
contributed greatly to the success of NoSQL databases, mainly because they
handle data differently than traditional relational databases. A few
popular examples of NoSQL databases are MongoDB, CouchDB,
Cassandra, and HBase.
1. Storing large volumes of data that often have little to no structure.
A NoSQL database sets no limits on the types of data we can store
together and allows us to add different new types as the need
changes. With document-based databases, you can store data in
one place without having to define what “types” of data those are in
advance.
2. Making the most of cloud computing and storage. Cloud-based
storage is an excellent cost-saving solution but requires data to be
easily spread across multiple servers to scale up. Using commodity
(affordable, smaller) hardware on-site or in the cloud saves you the
hassle of additional software, and NoSQL databases like Cassandra
are designed to be scaled across multiple data centers out of the
box without a lot of headaches.
3. Rapid development. NoSQL is extremely useful for rapid
development as it doesn’t need to be prepped ahead of time. If
you’re working on quick iterations of your system which require
making frequent updates to the data structure without a lot of
downtime between versions, a relational database will slow you
down.
Just like coding interviews, candidates who haven’t spent time preparing
for SDIs mostly perform poorly. This gets aggravated when you’re
interviewing at the top companies like Google, Facebook, Uber, etc. In
these companies, if a candidate doesn’t perform above average, they have
a limited chance to get an offer. On the other hand, a good performance
always results in a better offer (higher position and salary), since it
reflects your ability to handle complex systems.
Always ask questions to find the exact scope of the problem you're
solving. Design questions are mostly open-ended and don't have
ONE correct answer; that's why clarifying ambiguities early in the
interview becomes critical. Candidates who spend enough time to clearly
define the end goals of the system always have a better chance of being
successful in the interview. Also, since you only have 35-40 minutes to
design a (supposedly) large system, you should clarify what parts of the
system you will be focusing on.
● Will users of our service be able to post tweets and follow other
people?
● Should we also design the system to create and display the user's timeline?
● Will tweets contain photos and videos?
● Are we focusing on backend only or are we developing front-end
too?
● Will users be able to search tweets?
● Do we need to display hot trending topics?
● Would there be any push notification for new (or important)
tweets?
All such questions will determine how our end design will look.
Define what APIs are expected from the system. This will not only
establish the exact contract expected from the system, but will also
ensure that you haven't gotten any requirements wrong. Some examples for
our Twitter-like service would be:
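As an illustration only, such an API could look like the hypothetical signatures below (the parameter names are assumptions, not a specified interface):

def tweet(api_dev_key, tweet_data, tweet_location=None, media_ids=None):
    """Posts a new tweet; returns the URL of the created tweet or an error."""

def generate_timeline(api_dev_key, user_id, since_id=None, count=20):
    """Returns recent tweets from the people the given user follows."""

Pinning down signatures like these early forces agreement on inputs (auth key, media, location) before any architecture is drawn.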
It’s always a good idea to estimate the scale of the system you’re going to
design. This would also help later when you’ll be focusing on scaling,
partitioning, load balancing and caching.
Defining the data model early will clarify how data will flow among
different components of the system. Later, it will guide data
partitioning and management. The candidate should be able to identify
various entities of the system, how they will interact with each other, and
different aspects of data management like storage, transportation,
encryption, etc. Here are some entities for our Twitter-like service:
Let's design a URL shortening service like TinyURL. This service will provide
short aliases redirecting to long URLs.
https://www.educative.io/collection/page/5668639101419520/5649050225344512/5668600916475904/
We would get:
http://tinyurl.com/jlg8zpc
The shortened URL is nearly 1/3rd of the size of the actual URL.
Functional Requirements:
Non-Functional Requirements:
Extended Requirements:
Storage estimates: Since we expect 500M new URLs every month, and
assuming we keep these objects for five years, the total number of
objects we will be storing would be 30 billion.
Let’s assume that each object we are storing can be of 500 bytes (just a
ballpark, we will dig into it later); we would need 15TB of total storage:
30 billion * 500 bytes = 15 TB
For read requests, since every second we expect ~19K URL redirections,
the total outgoing data for our service would be about 9MB per second.
Memory estimates: If we want to cache some of the hot URLs that are
frequently accessed, how much memory would we need to store them? If
we follow the 80-20 rule, meaning 20% of URLs generating 80% of
traffic, we would like to cache these 20% hot URLs.
4. System APIs
Parameters:
Returns: (string)
deleteURL(api_dev_key, url_key)
Where “url_key” is a string representing the shortened URL to be
deleted. A successful deletion returns ‘URL Removed’.
How do we detect and prevent abuse? For instance, any service could put
us out of business by consuming all our keys in the current design. To
prevent abuse, we can limit users through their api_dev_key to a certain
number of URL creations and redirections per time period.
5. Database Design
Database Schema:
We would need two tables, one for storing information about the URL
mappings and the other for users’ data.
What kind of database should we use? Since we are likely going to store
billions of rows and we don't need to use relationships between objects,
a NoSQL key-value store like Dynamo or Cassandra is a better choice,
and it would also be easier to scale. Please see SQL vs NoSQL for more
details. If we choose NoSQL, we cannot store UserID in the URL table
(as there are no foreign keys in NoSQL); for that, we would need a third
table that stores the mapping between URLs and users.
The problem we are solving here is to generate a short and unique key
for the given URL. In the above-mentioned example, the shortened URL
we got was: “http://tinyurl.com/jlg8zpc”, the last six characters of this
URL is the short key we want to generate. We’ll explore two solutions
here:
We can compute a unique hash (e.g., MD5 or SHA256) of the given
URL. The hash can then be encoded for display. This encoding could
be base36 ([a-z, 0-9]) or base62 ([A-Z, a-z, 0-9]), and if we add '-' and '.',
we can use base64 encoding. A reasonable question would be: what
should be the length of the short key? 6, 8, or 10 characters?
Using base64 encoding, a 6 letter long key would result in 64^6 ~= 68.7
billion possible strings
Using base64 encoding, an 8 letter long key would result in 64^8 ~= 281
trillion possible strings
With 68.7B unique strings, let's assume that six-letter keys would suffice
for our system.
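A sketch of the hash-and-encode approach in Python. Note that Python's URL-safe base64 alphabet uses '-' and '_' rather than '-' and '.', which is an assumption of this example:

import base64
import hashlib

def short_key(url, length=6):
    digest = hashlib.md5(url.encode()).digest()             # 16-byte hash of the URL
    encoded = base64.urlsafe_b64encode(digest).decode()     # A-Z, a-z, 0-9, '-', '_'
    return encoded[:length]                                 # take the first six characters

print(short_key("http://www.educative.io/distributed.php?id=design"))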
What are different issues with our solution? We have the following
couple of problems with our encoding scheme:
1. If multiple users enter the same URL, they can get the same
shortened URL, which is not acceptable.
2. What if parts of the URL are URL-encoded? e.g.,
http://www.educative.io/distributed.php?id=design and
http://www.educative.io/distributed.php%3Fid%3Ddesign are
identical except for the URL encoding.
Servers can use KGS to read/mark keys in the database. KGS can use two
tables to store keys, one for keys that are not used yet and one for all the
used keys. As soon as KGS gives keys to one of the servers, it can move
them to the used keys table. KGS can always keep some keys in memory
so that whenever a server needs them, it can quickly provide them. For
simplicity, as soon as KGS loads some keys in memory, it can move them
to used keys table. This way we can make sure each server gets unique
keys. If KGS dies before assigning all the loaded keys to some server, we
will be wasting those keys, which is acceptable given the huge number of
keys we have. KGS also has to make sure not to give the same key to
multiple servers. For that, it must synchronize (or get a lock to) the data
structure holding the keys before removing keys from it and giving them
to a server.
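A minimal in-memory sketch of such a Key Generation Service, with a lock guarding the key structures; the key values shown are placeholders:

import threading

class KeyGenerationService:
    def __init__(self, unused_keys):
        self._unused = list(unused_keys)   # stand-in for the "unused keys" table
        self._used = []                    # stand-in for the "used keys" table
        self._lock = threading.Lock()

    def get_keys(self, count):
        # Synchronize so two app servers can never receive the same key.
        with self._lock:
            batch = self._unused[:count]
            del self._unused[:count]
            self._used.extend(batch)       # marked used as soon as they are handed out
        return batch

kgs = KeyGenerationService(["jlg8zpc", "a1B2c3d", "Zz9xY7w"])
keys_for_server_1 = kgs.get_keys(2)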
What would be the key-DB size? With base64 encoding, we can generate
68.7B unique six-letter keys. If we need one byte to store one alphanumeric
character, we can store all these keys in:
6 (characters per key) * 68.7B (unique keys) = 412 GB
Isn’t KGS the single point of failure? Yes, it is. To solve this, we can have
a standby replica of KGS, and whenever the primary server dies, it can
take over to generate and provide keys.
Can each app server cache some keys from key-DB? Yes, this can surely
speed things up. Although in this case, if the application server dies
before consuming all the keys, we will end up losing those keys. This
could be acceptable since we have 68B unique six-letter keys.
How would we perform a key lookup? We can look up the key in our
database or key-value store to get the full URL. If it’s present, issue a
“302 Redirect” status back to the browser, passing the stored URL in the
“Location” field. If that key is not present in our system, issue a “404 Not
Found” status, or redirect the user back to the homepage.
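A framework-agnostic sketch of this lookup-and-redirect step; the datastore object and the (status, headers) return shape are assumptions:

def handle_redirect(url_key, datastore):
    original_url = datastore.get(url_key)          # key-value lookup by the short key
    if original_url is not None:
        return 302, {"Location": original_url}     # redirect the browser to the stored URL
    return 404, {}                                  # unknown key: not found / homepage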
The main problem with this approach is that it can lead to unbalanced
servers. For instance, if we decide to put all URLs starting with the letter 'E'
into a DB partition, we may later realize that we have too many URLs
starting with the letter 'E' to fit into one DB partition.
8. Cache
We can cache URLs that are frequently accessed. We can use an off-
the-shelf solution like Memcache, which can store full URLs with their
respective hashes. The application servers, before hitting the backend
storage, can quickly check if the cache has the desired URL.
How much cache should we have? We can start with 20% of daily traffic
and, based on clients' usage patterns, adjust how many cache
servers we need. As estimated above, we need 15GB of memory to cache
20% of daily traffic, which can easily fit into one server.
Which cache eviction policy would best fit our needs? When the cache is
full, and we want to replace a link with a newer/hotter URL, how would
we choose? Least Recently Used (LRU) can be a reasonable policy for our
system. Under this policy, we discard the least recently used URL first.
We can use a Linked Hash Map or a similar data structure to store our
URLs and Hashes, which will also keep track of which URLs are accessed
recently.
How many times a short URL has been used, what were user locations,
etc.? How would we store these statistics? If it is part of a DB row that
gets updated on each view, what will happen when a popular URL is
slammed with a large number of concurrent requests?
We can have statistics about the country of the visitor, date and time of
access, web page that refers the click, browser or platform from where
the page was accessed and more.
Can users create private URLs or allow a particular set of users to access
a URL?
Designing Pastebin
Let's design a Pastebin like web service, where users can store plain text. Users
of the service will enter a piece of text and get a randomly generated URL to
access it.
1. What is Pastebin?
Pastebin like services enable users to store plain text or images over the
network (typically the Internet) and generate unique URLs to access the
uploaded data. Such services are also used to share data over the
network quickly, as users would just need to pass the URL to let other
users see it.
If you haven’t used pastebin.com before, please try creating a new ‘Paste’
there and spend some time going through different options their service
offers. This will help you a lot in understanding this chapter better.
Functional Requirements:
Non-Functional Requirements:
1. The system should be highly reliable; any data uploaded should not
be lost.
2. The system should be highly available. This is required because if
our service is down, users will not be able to access their Pastes.
3. Users should be able to access their Pastes in real-time with
minimum latency.
4. Paste links should not be guessable (not predictable).
Extended Requirements:
What should be the limit on the amount of text a user can paste at a time?
We can limit users to Pastes no bigger than 10MB to prevent abuse
of the service.
Our service will be read-heavy; there will be more read requests
than new Paste creations. We can assume a 5:1 ratio between reads
and writes.
If we want to store this data for ten years, we would need the total
storage capacity of 36TB.
With 1M pastes every day, we will have 3.6 billion Pastes in 10 years. We
need to generate and store keys to uniquely identify these pastes. If we
use base64 encoding ([A-Z, a-z, 0-9, ., -]), we would need six-letter
strings:
If it takes one byte to store one character, total size required to store 3.6B
keys would be:
3.6B * 6 => 22 GB
Although total ingress and egress are not big, we should keep these
numbers in mind while designing our service.
Memory estimates: We can cache some of the hot pastes that are
frequently accessed. Following the 80-20 rule, meaning 20% of pastes
generate 80% of the traffic, we would like to cache these 20% of pastes.
Since we have 5M read requests per day, to cache 20% of these requests,
we would need:
0.2 * 5M * 10KB ~= 10 GB
5. System APIs
Parameters:
api_dev_key (string): The API developer key of a registered account.
This will be used to, among other things, throttle users based on their
allocated quota.
Returns: (string)
A successful insertion returns the URL through which the paste can be
accessed, otherwise, returns an error code.
getPaste(api_dev_key, api_paste_key)
deletePaste(api_dev_key, api_paste_key)
A successful deletion returns ‘true’, otherwise returns ‘false’.
6. Database Design
Database Schema:
We would need two tables, one for storing information about the Pastes
and the other for users’ data.
At a high level, we need an application layer that will serve all the read
and write requests. The application layer will talk to a storage layer to store
and retrieve data. We can segregate our storage layer, with one database
storing metadata related to each paste, users, etc., while the other stores
paste contents in some sort of block storage or a database. This division
of data will allow us to scale them individually.
8. Component Design
a. Application layer
Our application layer will process all incoming and outgoing requests.
The application servers will be talking to the backend data store
components to serve the requests.
Isn't KGS a single point of failure? Yes, it is. To solve this, we can have a
standby replica of KGS, and whenever the primary server dies, it can take
over to generate and provide keys.
Can each app server cache some keys from key-DB? Yes, this can surely
speed things up. Although in this case, if the application server dies
before consuming all the keys, we will end up losing those keys. This
could be acceptable since we have 68B unique six-letter keys, which are
a lot more than we require.
How to handle a paste read request? Upon receiving a read paste
request, the application service layer contacts the datastore. The
datastore searches for the key, and if it is found, returns the paste’s
contents. Otherwise, an error code is returned.
b. Datastore layer
9. Purging or DB Cleanup
Please see Designing a URL Shortening service.
Designing Instagram
Let's design a photo-sharing service like Instagram, where users can upload
photos to share them with other users.
1. Why Instagram?
Functional Requirements
Non-functional Requirements
We need to store data about users, their uploaded photos, and the people
they follow. The Photo table will store all data related to a photo; we need to
have an index on (PhotoID, CreationDate) since we need to fetch recent
photos first.
One simple approach for storing the above schema would be to use an
RDBMS like MySQL since we require joins. But relational databases
come with their challenges, especially when we need to scale them. For
details, please take a look at SQL vs. NoSQL.
1. ‘Photos’ table to store the actual pictures. The ‘key’ would be the
‘PhotoID’ and ‘value’ would be the raw image data.
2. ‘PhotoMetadata’ to store all the metadata information about a
photo, where the ‘key’ would be the ‘PhotoID’ and ‘value’ would be
an object containing PhotoLocation, UserLocation,
CreationTimestamp, etc.
Since most web servers have a connection limit, we should keep this
in mind while designing our system. Uploads use synchronous connections,
but downloads can be asynchronous. Let's assume a web
server can have a maximum of 500 connections at any time; then it can't handle
more than 500 concurrent uploads simultaneously. Since reads can be
asynchronous, the web server can serve a lot more than 500 users at any
time, as it can switch between users quickly. This guides us to have
separate dedicated servers for reads and writes so that uploads don't hog
the system.
Separating image read and write requests will also allow us to scale or
optimize each of them independently.
8. Reliability and Redundancy
Losing files is not an option for our service. Therefore, we will store
multiple copies of each file, so that if one storage server dies, we can
retrieve the image from the other copy present on a different storage
server.
9. Data Sharding
So we will find the shard number by UserID % 200 and then store the
data there. To uniquely identify any photo in our system, we can append
shard number with each PhotoID.
How can we generate PhotoIDs? Each DB shard can have its own auto-
increment sequence for PhotoIDs, and since we will append ShardID
with each PhotoID, it will make it unique throughout our system.
1. How would we handle hot users? Several people follow such hot
users, and any photo they upload is seen by a lot of other people.
2. Some users will have a lot of photos compared to others, thus
making a non-uniform distribution of storage.
3. What if we cannot store all pictures of a user on one shard? If we
distribute photos of a user onto multiple shards, will it cause higher
latencies?
4. Storing all pictures of a user on one shard can cause issues like
unavailability of all of the user's data if that shard is down, or higher
latency if it is serving a high load, etc.
For generating unique IDs, we can run two key-generating database
instances, one producing odd-numbered IDs and the other even-numbered
ones, using MySQL auto-increment settings:
KeyGeneratingServer1:
auto-increment-increment = 2
auto-increment-offset = 1
KeyGeneratingServer2:
auto-increment-increment = 2
auto-increment-offset = 2
To create the timeline for any given user, we need to fetch the latest,
most popular and relevant photos of other people the user follows.
For simplicity, let's assume we need to fetch the top 100 photos for a user's
timeline. Our application server will first get a list of the people the user
follows and then fetch metadata info of the latest 100 photos from each
of them. In the final step, the server will submit all these photos to our
ranking algorithm, which will determine the top 100 photos (based on
recency, likes, etc.) to be returned to the user. A possible problem
with this approach is higher latency, as we have to query multiple
tables and perform sorting/merging/ranking on the results. To improve
efficiency, we can pre-generate the timeline and store it in a separate
table.
Pre-generating the timeline: We can have dedicated servers that are
continuously generating users’ timelines and storing them in a
‘UserTimeline’ table. So whenever any user needs the latest photos for
their timeline, we will simply query this table and return the results to
the user.
Whenever these servers need to generate the timeline of a user, they will
first query the UserTimeline table to see what was the last time the
timeline was generated for that user. Then, new timeline data will be
generated from that time onwards (following the abovementioned steps).
What are the different approaches for sending timeline data to the users?
1. Pull: Clients can pull the timeline data from the server on a regular
basis or manually whenever they need it. Possible problems with this
approach are a) New data might not be shown to the users until clients
issue a pull request b) Most of the time pull requests will result in an
empty response if there is no new data.
2. Push: Servers can push new data to the users as soon as it is available.
To efficiently manage this, users have to maintain a Long Poll request
with the server to receive the updates. A possible problem with this
approach arises when a user has a lot of follows, or with a celebrity user
who has millions of followers; in this case, the server has to push updates
quite frequently.
3. Hybrid: We can adopt a hybrid approach. We can move all the users
with high followings to pull based model and only push data to those
users who have a few hundred (or thousand) follows. Another approach
could be that the server pushes updates to all the users not more than a
certain frequency, letting users with a lot of follows/updates to regularly
pull data.
To create the timeline for any given user, one of the most important
requirements is to fetch latest photos from all people the user follows.
For this, we need to have a mechanism to sort photos on their time of
creation. This can be done efficiently if we can make photo creation time
part of the PhotoID. Since we will have a primary index on PhotoID, it
will be quite quick to find latest PhotoIDs.
We can use epoch time for this. Let’s say our PhotoID will have two
parts; the first part will be representing epoch seconds and the second
part will be an auto-incrementing sequence. So to make a new PhotoID,
we can take the current epoch time and append an auto-incrementing ID
from our key-generating DB. We can figure out the shard number from this
PhotoID (PhotoID % 200) and store the photo there.
What could be the size of our PhotoID? Let's say our epoch time starts
today; how many bits would we need to store the number of seconds for
the next 50 years?
86400 sec/day * 365 (days a year) * 50 (years) => 1.6 billion seconds
We would need 31 bits to store this number. Since, on average, we are
expecting 23 new photos per second, we can allocate 9 bits to store the
auto-incremented sequence. So every second we can store (2^9 => 512) new
photos. We can reset our auto-incrementing sequence every second.
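An illustrative sketch of this ID scheme in Python. The epoch origin, the single-process counter, and the shard lookup follow the description above, but they are simplifying assumptions (a real system would coordinate the sequence across servers):

import itertools
import time

EPOCH_ORIGIN = int(time.time())            # "our epoch time starts today"
_sequence = itertools.count()
_current_second = None

def next_photo_id():
    global _current_second, _sequence
    now = int(time.time()) - EPOCH_ORIGIN  # 31 bits cover ~50 years of seconds
    if now != _current_second:
        _current_second, _sequence = now, itertools.count()   # reset every second
    seq = next(_sequence) % 512            # 9-bit sequence: up to 512 photos per second
    return (now << 9) | seq

photo_id = next_photo_id()
shard_number = photo_id % 200              # shard selection, as described above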
We will discuss more details about this technique under ‘Data Sharding’
in Designing Twitter.
Designing Dropbox
Let's design a file hosting service like Dropbox or Google Drive. Cloud file
storage enables users to store their data on remote servers. Usually, these
servers are maintained by cloud storage providers and made available to users
over a network (typically through the Internet). Users pay for their cloud data
storage on a monthly basis.
Cloud file storage services have become very popular recently as they
simplify the storage and exchange of digital resources among multiple
devices. The shift from using a single personal computer to using multiple
devices with different platforms and operating systems, such as
smartphones and tablets, with portable access from various
geographical locations at any time, is believed to be responsible for the
huge popularity of cloud storage services. Some of the top benefits of
such services are:
Availability: The motto of cloud storage services is to have data
availability anywhere anytime. Users can access their files/photos from
any device whenever and wherever they like.
Scalability: Users will never have to worry about getting out of storage
space. With cloud storage, you have unlimited storage as long as you are
ready to pay for it.
What do we wish to achieve from a Cloud Storage system? Here are the
top-level requirements for our system:
1. Users should be able to upload and download their files/photos
from any device.
2. Users should be able to share files or folders with other users.
3. Our service should support automatic synchronization between
devices, i.e., after updating a file on one device, it should get
synchronized on all devices.
4. The system should support storing large files up to a GB.
5. ACID-ity is required. Atomicity, Consistency, Isolation and
Durability of all file operations should be guaranteed.
6. Our system should support offline editing. Users should be able to
add/delete/modify files while offline, and as soon as they come
online, all their changes should be synced to the remote servers and
other online devices.
Extended Requirements
The user will specify a folder as the workspace on their device. Any
file/photo/folder placed in this folder will be uploaded to the cloud, and
whenever a file is modified or deleted, it will be reflected in the same way
in the cloud storage. The user can specify similar workspaces on all their
devices and any modification done on one device will be propagated to
all other devices to have the same view of the workspace everywhere.
At a high level, we need to store files and their metadata information like
File Name, File Size, Directory, etc., and who this file is shared with. So,
we need some servers that can help the clients to upload/download files
to Cloud Storage and some servers that can facilitate updating metadata
about files and users. We also need some mechanism to notify all clients
whenever an update happens so they can synchronize their files.
As shown in the diagram below, Block servers will work with the clients
to upload/download files from cloud storage, and Metadata servers will
keep metadata of files updated in a SQL or NoSQL database.
Synchronization servers will handle the workflow of notifying all clients
about different changes for synchronization.
6. Component Design
a. Client
Based on the above considerations we can divide our client into following
four parts:
I. Internal Metadata Database will keep track of all the files, chunks,
their versions, and their location in the file system.
II. Chunker will split the files into smaller pieces called chunks. It will
also be responsible for reconstructing a file from its chunks. Our
chunking algorithm will detect the parts of the files that have been
modified by the user and only transfer those parts to the Cloud Storage;
this will save us bandwidth and synchronization time.
III. Watcher will monitor the local workspace folders and notify the
Indexer (discussed below) of any action performed by the users, e.g.,
when users create, delete, or update files or folders. Watcher also listens
to any changes happening on other clients that are broadcasted by
Synchronization service.
IV. Indexer will process the events received from the Watcher and
update the internal metadata database with information about the
chunks of the modified files. Once the chunks are successfully
submitted/downloaded to the Cloud Storage, the Indexer will
communicate with the remote Synchronization Service to broadcast
changes to other clients and update remote metadata database.
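To make the Chunker component described above concrete, here is a naive sketch; the 4 MB chunk size and SHA-256 fingerprinting are assumptions for illustration:

import hashlib

CHUNK_SIZE = 4 * 1024 * 1024              # assumed fixed chunk size (4 MB)

def chunk_file(path):
    chunks = []
    with open(path, "rb") as f:
        index = 0
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            chunks.append({
                "index": index,
                "size": len(data),
                "hash": hashlib.sha256(data).hexdigest(),  # lets us detect modified chunks
            })
            index += 1
    return chunks

Comparing these hashes against those recorded in the internal metadata database tells the client which chunks actually changed and need to be uploaded.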
b. Metadata Database
1. Chunks
2. Files
3. User
4. Devices
5. Workspace (sync folders)
c. Synchronization Service
e. Cloud/Block Storage
Cloud/Block Storage stores chunks of files uploaded by the users. Clients
directly interact with the storage to send and receive objects from it.
Separation of the metadata from storage enables us to use any storage
either in cloud or in-house.
8. Data Deduplication
a. Post-process deduplication
9. Metadata Partitioning
The main problem with this approach is that it can lead to unbalanced
servers. For example, if we decide to put all files starting with the letter 'E'
into a DB partition, we may later realize that we have too many files that
start with the letter 'E', to such an extent that we cannot fit them into one DB
partition.
This approach can still lead to overloaded partitions, which can be solved
by using Consistent Hashing.
10. Caching
We can have two kinds of caches in our system. To deal with hot
files/chunks, we can introduce a cache for Block storage. We can use an
off-the-shelf solution like Memcache, that can store whole chunks with
their respective IDs/Hashes, and Block servers before hitting Block
storage can quickly check if the cache has desired chunk. Based on
clients’ usage pattern we can determine how many cache servers we
need. A high-end commercial server can have up to 144GB of memory;
So, one such server can cache 36K chunks.
Which cache replacement policy would best fit our needs? When the
cache is full, and we want to replace a chunk with a newer/hotter chunk,
how would we choose? Least Recently Used (LRU) can be a reasonable
policy for our system. Under this policy, we discard the least recently
used chunk first.
We can add a load-balancing layer at two places in our system: 1) between
Clients and Block servers and 2) between Clients and Metadata servers.
Initially, a simple Round Robin approach can be adopted that
distributes incoming requests equally among backend servers. This LB is
simple to implement and does not introduce any overhead. Another
benefit of this approach is that if a server is dead, the LB will take it out of
the rotation and stop sending any traffic to it. A problem with Round
Robin LB is that it doesn't take server load into consideration. If a server is
overloaded or slow, the LB will not stop sending new requests to that
server. To handle this, a more intelligent LB solution can be placed that
periodically queries backend servers about their load and adjusts traffic
based on that.
One of the primary concerns users will have while storing their files in the
cloud is the privacy and security of their data, especially since in
our system users can share their files with other users or even make
them public. To handle this, we will store the permissions of each file
in our metadata DB to reflect which files are visible or modifiable by
which users.
Functional Requirements:
Non-functional Requirements:
Extended Requirements:
Let’s assume that we have 500 million daily active users and on average
each user sends 40 messages daily; this gives us 20 billion messages per
day.
Storage Estimation: Let’s assume that on average a message is 100 bytes,
so to store all the messages for one day we would need 2TB of storage.
Although Facebook Messenger stores all previous chat history, just
for estimation, to save five years of chat history we would need 3.6
petabytes of storage.
Other than the chat messages, we would also need to store users'
information and messages' metadata (ID, timestamp, etc.). Also, the above
calculations don't take data compression and replication into
consideration.
At a high level, we will need a chat server that would be the central piece
orchestrating all the communications between users. When a user wants
to send a message to another user, they will connect to the chat server
and send the message to the server; the server then passes that message
to the other user and also stores it in the database.
The detailed workflow would look like this:
Let's try to build a simple solution first where everything runs on one
server. At a high level, our system needs to handle the following use cases:
a. Messages Handling
1. Pull model: Users can periodically ask the server if there are any
new messages for them.
2. Push model: Users can keep a connection open with the server and
can depend upon the server to notify them whenever there are new
messages.
If we go with our first approach, then the server needs to keep track of
messages that are still waiting to be delivered, and as soon as the
receiving user connects to the server to ask for any new message, the
server can return all the pending messages. To minimize latency for the
user, they have to check the server quite frequently, and most of the time
they will get an empty response if there are no pending messages.
This wastes a lot of resources and does not look like an efficient
solution.
If we go with our second approach, where all the active users keep a
connection open with the server, then as soon as the server receives a
message it can immediately pass it to the intended user. This
way, the server does not need to keep track of pending messages, and we
will have minimal latency, as messages are delivered instantly over the
open connection.
How will clients maintain an open connection with the server? We can
use HTTP Long Polling. In long polling, clients can request information
from the server with the expectation that the server may not respond
immediately. If the server has no new data for the client when the poll is
received, instead of sending an empty response, the server holds the
request open and waits for response information to become available.
Once it does have new information, the server immediately sends the
response to the client, completing the open request. Upon receipt of the
server response, the client can immediately issue another server request
for future updates. This gives a lot of improvements in latencies,
throughputs, and performance. The long polling request can time out or
receive a disconnect from the server; in that case, the client has to
open a new request.
How can the server keep track of all the open connections to efficiently redirect
messages to the users? The server can maintain a hash table, where the "key"
is the UserID and the "value" is the connection object. So
whenever the server receives a message for a user, it looks up that user in
the hash table to find the connection object and sends the message on
the open request.
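A minimal sketch of this registry in Python; the connection object's send() interface is an assumption for illustration:

class ChatServer:
    def __init__(self):
        self.connections = {}              # UserID -> open connection object

    def on_connect(self, user_id, connection):
        self.connections[user_id] = connection

    def on_disconnect(self, user_id):
        self.connections.pop(user_id, None)

    def deliver(self, recipient_id, message):
        connection = self.connections.get(recipient_id)
        if connection is None:
            return False                   # receiver offline: notify sender or retry later
        connection.send(message)           # push on the receiver's open (long-poll) request
        return True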
What will happen when the server receives a message for a user who has
gone offline? If the receiver has disconnected, the server can notify the
sender about the delivery failure. If it is a temporary disconnect, e.g., the
receiver’s long-poll request just timed out, then we should expect a
reconnect from the user. In that case, we can ask the sender to retry
sending the message. This retry could be embedded in the client’s logic
so that users don’t have to retype the message. The server can also store
the message for a while and retry sending it once the receiver reconnects.
How many chat servers do we need? Let's plan for 500 million connections
at any time. Assuming a modern server can handle 500K concurrent
connections, we would need 1K such servers.
How to know which server holds the connection to which user? We can
introduce a software load balancer in front of our chat servers; that can
map each UserID to a server to redirect the request.
How should the server process a 'deliver message' request? The server
needs to do the following things upon receiving a new message: 1) store the
message in the database, 2) send the message to the receiver, and 3) send an
acknowledgment to the sender.
The chat server will first find the server that holds the connection for the
receiver and pass the message to that server to send it to the receiver.
The chat server can then send the acknowledgment to the sender; we
don’t need to wait for storing the message in the database; this can
happen in the background. Storing the message is discussed in the next
section.
How does the messenger maintain the sequencing of the messages? We
can store a timestamp with each message, which would be the time when
the message is received at the server. But this will still not ensure correct
ordering of messages for clients. The scenario where the server
timestamp cannot determine the exact ordering of messages would look
like this:
So User-1 will see M1 first and then M2, whereas User-2 will see M2 first
and then M1.
How should clients efficiently fetch data from the server? Clients should
paginate while fetching data from the server. Page size could be different
for different clients; e.g., cell phones have smaller screens, so we need a
smaller number of messages/conversations in the viewport.
We need to keep track of user’s online/offline status and notify all the
relevant users whenever a status change happens. Since we are
maintaining a connection object on the server for all active users, we can
easily figure out user’s current status from this. With 500M active users
at any time, if we have to broadcast each status change to all the relevant
active users, it will consume a lot of resources. We can do the following
optimization around this:
1. Whenever a client starts the app, it can pull current status of all
users in their friends’ list.
2. Whenever a user sends a message to another user that has gone
offline, we can send a failure to the sender and update the status on
the client.
3. Whenever a user comes online, the server can broadcast that status
with a delay of a few seconds, to see whether the user goes
offline again immediately.
4. Clients can pull the status from the server for those users that
are being shown in the user's viewport. This should not be a
frequent operation, as the server is broadcasting the online status
of users and we can live with a stale offline status for a
while.
5. Whenever the client starts a new chat with another user, we can
pull the status at that time.
6. Data partitioning
Since we will be storing a lot of data (3.6PB for five years), we need to
distribute it onto multiple database servers. What would be our
partitioning scheme?
In the beginning, we can start with fewer database servers with multiple
shards residing on one physical server. Since we can have multiple
database instances on a server, we can easily store multiple partitions on
a single server. Our hash function needs to understand this logical
partitioning scheme so that it can map multiple logical partitions on one
physical server.
Since we will store an infinite history of messages, we can start with a big
number of logical partitions, which would be mapped to fewer physical
servers, and as our storage demand increases, we can add more physical
servers to distribute our logical partitions.
7. Cache
We can cache a few recent messages (say last 15) in a few recent
conversations that are visible in user’s viewport (say last 5). Since we
decided to store all of the user’s messages on one shard, cache for a user
should completely reside on one machine too.
8. Load balancing
We will need a load balancer in front of our chat servers; that can map
each UserID to a server that holds the connection for the user and then
direct the request to that server. Similarly, we would need a load
balancer for our cache servers.
What will happen when a chat server fails? Our chat servers are holding
connections with the users. If a server goes down, should we devise a
mechanism to transfer those connections to some other server? It’s
extremely hard to failover TCP connections to other servers; an easier
approach can be to have clients automatically reconnect if the connection
is lost.
a. Group chat
We can have separate group-chat objects in our system that can be
stored on the chat servers. A group-chat object is identified by
GroupChatID and will also maintain a list of people who are part of that
chat. Our load balancer can direct each group chat message based on
GroupChatID and the server handling that group chat can iterate
through all the users of the chat to find the server handling the
connection of each user to deliver the message.
b. Push notifications
In our current design, users can only send messages to active users, and if
the receiving user is offline, we send a failure to the sending user. Push
notifications will enable our system to send messages to offline users.
Push notifications only work for mobile clients. Each user can opt-in
from their device to get notifications whenever there is a new message or
event. Each mobile manufacturer maintains a set of servers that handles
pushing these notifications to the user’s device.
Let's design a Twitter like social networking service. Users of the service will be
able to post tweets, follow other people and favorite tweets.
1. What is Twitter?
Twitter is an online social networking service where users post and read
short 140-character messages called "tweets". Registered users can post
and read tweets, but those who are not registered can only read them.
Users access Twitter through their website interface, SMS or mobile app.
Functional Requirements
Non-functional Requirements
Extended Requirements
1. Searching tweets.
2. Reply to a tweet.
3. Trending topics – current hot topics/searches.
4. Tagging other users.
5. Tweet Notification.
6. Who to follow? Suggestions?
7. Moments.
Let’s assume we have one billion total users, with 200 million daily
active users (DAU). Also, we have 100 million new tweets every day, and
on average each user follows 200 people.
How many favorites per day? If on average each user favorites five tweets
per day, we will have:
How many total tweet-views will our system generate? Let's assume on
average a user visits their timeline two times a day and visits five other
people's pages. On each page, if a user sees 20 tweets, the total tweet-views
our system will generate:
200M DAU * (2 + 5) * 20 tweets => 28B tweet-views per day
What would be our storage needs for five years? How much storage would
we need for users' data, follows, and favorites? We will leave this as an
exercise.
Not all tweets will have media; let's assume that on average every fifth
tweet has a photo and every tenth has a video. Let's also assume that, on
average, a photo is 200KB and a video is 2MB. This will lead us to have
24TB of new media every day:
(100M/5 photos * 200KB) + (100M/10 videos * 2MB) => 24TB/day
Bandwidth Estimates: Since total ingress is 24TB per day, this would
translate into 290MB/sec.
Remember that we have 28B tweet views per day. We must show the
photo of every tweet (if it has a photo), but let's assume that users
watch every 3rd video they see in their timeline. So, total egress will be
approximately:
(28B/5 * 200KB) / 86400s of photos => ~13GB/s
(28B/10/3 * 2MB) / 86400s of videos => ~22GB/s
Total ~= 35GB/s
4. System APIs
Parameters:
Returns: (string)
A successful post will return the URL to access that tweet. Otherwise, an
appropriate HTTP error is returned.
We need a system that can efficiently store all the new tweets,
100M/86400s => 1150 tweets per second and read 28B/86400s =>
325K tweets per second. It is clear from the requirements that this will
be a read-heavy system.
6. Database Schema
We need to store data about users, their tweets, their favorite tweets, and
people they follow.
For choosing between SQL and NoSQL databases to store the above
schema, please see ‘Database schema’ under Designing Instagram.
7. Data Sharding
Since we have a huge number of new tweets every day and our read load
is extremely high too, we need to distribute our data onto multiple
machines such that we can read/write it efficiently. We have many
options to shard our data; let’s go through them one by one:
Sharding based on UserID: We can try storing all the data of a user on
one server. While storing, we can pass the UserID to our hash function
that will map the user to a database server where we will store all of the
user’s tweets, favorites, follows, etc. While querying for
tweets/follows/favorites of a user, we can ask our hash function where
the data of that user can be found and then read it from there. This approach
has a couple of issues: if a user becomes hot (e.g., a celebrity with millions of
followers), the server holding that user will see a very high number of queries;
and, over time, some users can end up storing a lot more tweets than others,
making it hard to keep the data evenly distributed.
Sharding based on TweetID: Alternatively, our hash function can map each
TweetID to a random server where we will store that tweet. To generate a
user's timeline, we would then have to query all servers:
1. Our application (app) server will find all the people the user
follows.
2. App server will send the query to all database servers to find tweets
from these people.
3. Each database server will find the tweets for each user, sort them
by recency and return the top tweets.
4. App server will merge all the results and sort them again to return
the top results to the user.
What if we make each TweetID encode its creation time as well, so that
tweets are naturally sortable by recency? We can use epoch time for this.
Let's say our TweetID will have two parts: the first part will represent
epoch seconds, and the second part will be an auto-incrementing sequence.
So, to make a new TweetID, we can take the current epoch time and append
an auto-incrementing number to it. We can figure out the shard number
from this TweetID and store the tweet there.
What could be the size of our TweetID? Let's say our epoch time starts
today; how many bits would we need to store the number of seconds for
the next 50 years?
86400 sec/day * 365 days * 50 years => ~1.6 billion seconds, which needs 31 bits.
We would append an auto-incrementing sequence to these epoch seconds, for example:
1483228800 000001
1483228800 000002
1483228800 000003
1483228800 000004
If we make our TweetID 64 bits (8 bytes) long, we can easily store tweets
for the next 100 years and also store them with millisecond granularity.
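A small sketch of such an ID generator; concatenating a 6-digit sequence (the width is an illustrative choice matching the examples above) keeps IDs sortable by time while still fitting comfortably within 64 bits.

import itertools
import time

_sequence = itertools.count()

def next_tweet_id():
    # First part: current epoch seconds; second part: an auto-incrementing
    # 6-digit sequence, so IDs created within the same second stay ordered.
    seconds = int(time.time())
    seq = next(_sequence) % 1_000_000
    return int(f"{seconds}{seq:06d}")

def shard_for(tweet_id, num_shards):
    # Map a TweetID to the database shard that stores it.
    return tweet_id % num_shards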
8. Cache
We can introduce a cache for database servers to cache hot tweets and
users. We can use an off-the-shelf solution like Memcache that can store
the whole tweet objects. Application servers, before hitting the database, can
quickly check if the cache has the desired tweets. Based on clients' usage
patterns, we can determine how many cache servers we need.
Which cache replacement policy would best fit our needs? When the
cache is full, and we want to replace a tweet with a newer/hotter tweet,
how would we choose? Least Recently Used (LRU) can be a reasonable
policy for our system. Under this policy, we discard the least recently
viewed tweet first.
How can we have a more intelligent cache? If we go with the 80-20 rule,
i.e., 20% of tweets generate 80% of the read traffic, then certain tweets are
so popular that the majority of people read them. This suggests we can try
to cache 20% of the daily read volume from each shard.
What if we cache the latest data? Our service can benefit from this
approach. Let's say 80% of our users see tweets from the past three days
only; then we can try to cache all the tweets from the past three days. Let's say we
have dedicated cache servers that cache all the tweets from all users from
past three days. As estimated above, we are getting 100 million new
tweets or 30GB of new data every day (without photos and videos). If we
want to store all the tweets from last three days, we would need less than
100GB of memory. This data can easily fit into one server, but we should
replicate it onto multiple servers to distribute all the read traffic to
reduce the load on the cache servers. So, whenever we are generating a user's
timeline, we can ask the cache servers if they have all the recent tweets
for that user; if yes, we can simply return all the data from the cache. If
we don't have enough tweets in the cache, we have to query the backend to
fetch that data. Following a similar design, we can also try caching photos and
videos from the last three days.
Our cache would be like a hash table, where ‘key’ would be ‘OwnerID’
and ‘value’ would be a doubly linked list containing all the tweets from
that user in past three days. Since we want to retrieve most recent data
first, we can always insert new tweets at the head of the linked list, which
means all the older tweets will be near the tail of the linked list.
Therefore, we can remove tweets from the tail to make space for newer
tweets.
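A compact sketch of this structure, using Python's OrderedDict as a stand-in for the hash map plus doubly linked list; the per-user cap is an illustrative parameter:

from collections import OrderedDict

class UserTweetCache:
    """'key' is OwnerID; the value keeps that owner's recent tweets, newest first."""

    def __init__(self, max_tweets_per_user=500):
        self.cache = {}                                # OwnerID -> OrderedDict(TweetID -> tweet)
        self.max_tweets_per_user = max_tweets_per_user

    def add_tweet(self, owner_id, tweet_id, tweet):
        tweets = self.cache.setdefault(owner_id, OrderedDict())
        tweets[tweet_id] = tweet
        tweets.move_to_end(tweet_id, last=False)       # insert at the head (most recent)
        while len(tweets) > self.max_tweets_per_user:
            tweets.popitem(last=True)                  # evict from the tail (oldest)

    def recent_tweets(self, owner_id, count=20):
        return list(self.cache.get(owner_id, OrderedDict()).values())[:count]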
9. Timeline Generation
For a detailed discussion about timeline generation, take a look at
Designing Facebook’s Newsfeed.
How to serve feeds? Get all the latest tweets from the people someone
follows and merge/sort them by time. Use pagination to fetch/show
tweets. Only fetch top N tweets from all the people someone follows. This
N will depend on the client’s Viewport, as on mobile we show fewer
tweets compared to a Web client. We can also cache next top tweets to
speed things up.
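A rough sketch of that merge step, assuming a helper that returns each followee's recent tweets as (TweetID, tweet) pairs sorted newest first (e.g., from the cache above); because our TweetIDs grow with time, ordering by descending TweetID is ordering by recency.

import heapq
import itertools

def generate_timeline(followee_ids, get_recent_tweets, page_size=20):
    # get_recent_tweets(user_id) -> list of (tweet_id, tweet), newest first (assumed helper).
    per_user = [get_recent_tweets(uid) for uid in followee_ids]
    merged = heapq.merge(*per_user, key=lambda pair: pair[0], reverse=True)
    return [tweet for _, tweet in itertools.islice(merged, page_size)]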
Who to follow? How to give suggestions? This feature will improve user
engagement. We can suggest friends of people someone follows. We can
go two or three levels down to find famous people for the suggestions. We
can give preference to people with more followers.
Moments: Get the top news from different websites for the past 1 or 2 hours,
figure out related tweets, prioritize them, and categorize them (news, support,
financials, entertainment, etc.) using ML – supervised learning or
clustering. Then we can show these articles as trending topics in
Moments.
Search: Search involves Indexing, Ranking, and Retrieval of tweets. A
similar solution is discussed in our next problem Design Twitter Search.
Designing Youtube
Let's design a video sharing service like Youtube, where users will be able to
upload/view/search videos.
1. Why Youtube?
Youtube is one of the most popular video sharing websites in the world.
Users of the service can upload, view, share, rate, and report videos as
well as add comments on videos.
Functional Requirements:
Let’s assume we have 1.5 billion total users, 800 million of whom are
daily active users. If, on the average, a user views five videos per day,
total video-views per second would be:
Let’s assume our upload:view ratio is 1:200 i.e., for every video upload
we have 200 video viewed, giving us 230 videos uploaded per second.
Storage Estimates: Let’s assume that every minute 500 hours worth of
videos are uploaded to Youtube. If on average, one minute of video needs
50MB of storage (videos need to be stored in multiple formats), total
storage needed for videos uploaded in a minute would be:
500 hours * 60 min * 50MB => 1500 GB/min (25 GB/sec)
4. System APIs
uploadVideo(api_dev_key, video_title, video_description, tags[], category_id, default_language, recording_details, video_contents)
Parameters:
api_dev_key (string): The API developer key of a registered account.
This will be used to, among other things, throttle users based on their
allocated quota.
category_id (string): Category of the video, e.g., Film, Song, People, etc.
Returns: (string)
A successful upload will return HTTP 202 (request accepted), and once
the video encoding is completed, the user is notified through email with a
link to access the video. We can also expose a queryable API to let users
know the current status of their uploaded video.
searchVideo(api_dev_key, search_query, user_location,
maximum_videos_to_return, page_token)
Parameters:
page_token (string): This token will specify a page in the result set that
should be returned.
Returns: (JSON)
6. Database Schema
● CommentID
● VideoID
● UserID
● Comment
● TimeOfCreation
Let’s evaluate storing all the thumbnails on disk. Given that we have a
huge number of files; to read these files we have to perform a lot of seeks
to different locations on the disk. This is quite inefficient and will result
in higher latencies.
Bigtable can be a reasonable choice here, as it combines multiple files
into one block to store on the disk and is very efficient in reading a small
amount of data. Both of these are the two biggest requirements of our
service. Keeping hot thumbnails in the cache will also help in improving
the latencies, and given that thumbnails files are small in size, we can
easily cache a large number of such files in memory.
Video Encoding: Newly uploaded videos are stored on the server, and a
new task is added to the processing queue to encode the video into
multiple formats. Once all the encoding is completed, the uploader is
notified, and the video is made available for viewing/sharing.
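A bare-bones sketch of that flow, using an in-process queue as a stand-in for a real message queue; the storage, encoder, and notification calls are caller-supplied placeholders, and a production system would use a durable queue with atomic counters instead of a plain dict.

import queue

encoding_queue = queue.Queue()
remaining_formats = {}                 # video_id -> number of formats still to encode

def handle_upload(video_id, raw_bytes, save_raw, formats=("240p", "480p", "1080p")):
    # Store the original upload, then enqueue one encoding task per target format.
    save_raw(video_id, raw_bytes)
    remaining_formats[video_id] = len(formats)
    for fmt in formats:
        encoding_queue.put((video_id, fmt))
    return 202                         # request accepted; encoding happens asynchronously

def encoding_worker(encode, notify_uploader):
    # encode(video_id, fmt) and notify_uploader(video_id) are caller-supplied.
    while True:
        video_id, fmt = encoding_queue.get()
        encode(video_id, fmt)
        remaining_formats[video_id] -= 1
        if remaining_formats[video_id] == 0:
            notify_uploader(video_id)  # e.g., email the link once all formats are ready
        encoding_queue.task_done()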
Since we have a huge number of new videos every day and our read load
is extremely high too, we need to distribute our data onto multiple
machines so that we can perform read/write operations efficiently. We
have many options to shard our data. Let’s go through different
strategies of sharding this data one by one:
Sharding based on UserID: We can try storing all the data for a
particular user on one server. While storing, we can pass the UserID to
our hash function which will map the user to a database server where we
will store all the metadata for that user’s videos. While querying for
videos of a user, we can ask our hash function to find the server holding
user’s data and then read it from there. To search videos by titles, we will
have to query all servers, and each server will return a set of videos. A
centralized server will then aggregate and rank these results before
returning them to the user.
Sharding based on VideoID: Our hash function will map each VideoID to
a random server where we will store that Video’s metadata. To find
videos of a user we will query all servers, and each server will return a set
of videos. A centralized server will aggregate and rank these results
before returning them to the user. This approach solves our problem of
popular users but shifts it to popular videos.
9. Video Deduplication
With a huge number of users uploading a massive amount of video data, our
service will have to deal with widespread video duplication, which wastes
storage, caching, and bandwidth. For the end user, these inefficiencies will be
realized in the form of duplicate search results, longer video startup times,
and interrupted streaming.
For our service, deduplication makes most sense early, when a user is
uploading a video; as compared to post-processing it to find duplicate
videos later. Inline deduplication will save us a lot of resources that can
be used to encode, transfer and store the duplicate copy of the video. As
soon as any user starts uploading a video, our service can run video
matching algorithms (e.g., Block Matching, Phase Correlation, etc.) to
find duplications. If we already have a copy of the video being uploaded,
we can either stop the upload and use the existing copy or use the newly
uploaded video if it is of higher quality. If the newly uploaded video is a
subpart of an existing video or vice versa, we can intelligently divide the
video into smaller chunks, so that we only upload those parts that are
missing.
We should use Consistent Hashing among our cache servers, which will
also help in balancing the load between cache servers. Since we will be
using a static hash-based scheme to map videos to hostnames, it can lead
to uneven load on the logical replicas due to the different popularity for
each video. For instance, if a video becomes popular, the logical replica
corresponding to that video will experience more traffic than other
servers. These uneven loads for logical replicas can then translate into
uneven load distribution on corresponding physical servers. To resolve
this issue, any busy server in one location can redirect a client to a less
busy server in the same cache location. We can use dynamic HTTP
redirections for this scenario.
However, the use of redirections also has its drawbacks. First, since our
service tries to load balance locally, it leads to multiple redirections if the
host that receives the redirection can’t serve the video. Also, each
redirection requires a client to make an additional HTTP request; it also
leads to higher delays before the video starts playing back. Moreover,
inter-tier (or cross data-center) redirections lead a client to a distant
cache location because the higher tier caches are only present at a small
number of locations.
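A compact sketch of a consistent hashing ring for the cache layer; the number of virtual nodes per server is an illustrative knob that smooths out the load when servers join or leave.

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, servers, vnodes=100):
        # Place each physical cache server on the ring at many virtual positions.
        self._ring = []                                   # sorted (position, server) pairs
        for server in servers:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{server}#{i}"), server))
        self._ring.sort()
        self._positions = [pos for pos, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, video_id):
        # Walk clockwise to the first virtual node at or after the key's position.
        idx = bisect.bisect(self._positions, self._hash(str(video_id)))
        return self._ring[idx % len(self._ring)][1]

For example, ConsistentHashRing(["cache-1", "cache-2", "cache-3"]).server_for("video-42") would keep resolving the same video to the same cache server until the server list changes.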
11. Cache
How can we build a more intelligent cache? If we go with the 80-20 rule,
i.e., 20% of the daily read volume of videos generates 80% of the traffic,
meaning that certain videos are so popular that the majority of people
view them, it follows that we can try caching 20% of the daily read volume
of videos and metadata.
Less popular videos (1-20 views per day) that are not cached by CDNs
can be served by our servers in various data centers.
13. Fault Tolerance
Designing Typeahead Suggestion
Let's design a real-time suggestion service, which will recommend terms to users
as they enter text for searching.
The problem we are solving is that we have a lot of ‘strings’ that we need
to store in such a way that users can search on any prefix. Our service
will suggest the next terms that match the given prefix. For example, if
our database contains the following terms: cap, cat, captain, capital; and the
user has typed in ‘cap’, our system should suggest ‘cap’, ‘captain’ and
‘capital’.
One of the most appropriate data structures that can serve our purpose
would be the Trie (pronounced "try"). A trie is a tree-like data structure
used to store phrases where each node stores a character of the phrase in
a sequential manner. For example, if we need to store ‘cap, cat, caption,
captain, capital’ in the trie, it would look like:
Now if the user has typed ‘cap’, our service can traverse the trie to go to
the node ‘P’ to find all the terms that start with this prefix (e.g., cap-tion,
cap-ital etc).
We can merge nodes that have only one branch to save storage space.
The above trie can be stored like this:
Should we’ve case insensitive trie? For simplicity and search use case
let’s assume our data is case insensitive.
How to find top suggestion? Now that we can find all the terms given a
prefix, how can we know what’re the top 10 terms that we should
suggest? One simple solution could be to store the count of searches that
terminated at each node, e.g., if users have searched about ‘CAPTAIN’
100 times and ‘CAPTION’ 500 times, we can store this number with the
last character of the phrase. So now if the user has typed ‘CAP’ we know
the top most searched word under the prefix ‘CAP’ is ‘CAPTION’. So
given a prefix, we can traverse the sub-tree under it, to find the top
suggestions.
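A bare-bones Python sketch of such a trie: each terminal node keeps a search count, and finding suggestions for a prefix means locating the prefix node and walking its sub-tree (the cost of that walk, and the optimization of pre-computing top suggestions per node, are discussed next).

class TrieNode:
    def __init__(self):
        self.children = {}          # character -> TrieNode
        self.count = 0              # searches that terminated at this node

class SuggestionTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term, count=1):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.count += count

    def top_suggestions(self, prefix, k=10):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        # Walk the sub-tree under the prefix node, collecting (count, term) pairs.
        results, stack = [], [(node, prefix)]
        while stack:
            current, term = stack.pop()
            if current.count:
                results.append((current.count, term))
            for ch, child in current.children.items():
                stack.append((child, term + ch))
        results.sort(reverse=True)
        return [term for _, term in results[:k]]

For instance, after insert("caption", 500) and insert("captain", 100), top_suggestions("cap") returns ['caption', 'captain'].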
Given a prefix, how much time can it take to traverse its sub-tree? Given
the amount of data we need to index, we should expect a huge tree. Even
traversing a sub-tree would take really long; e.g., the phrase 'system
design interview questions' is 30 levels deep. Since we have very tight
latency requirements, we do need to improve the efficiency of our
solution.
Can we store top suggestions with each node? This can surely speed up
our searches but will require a lot of extra storage. We can store top 10
suggestions at each node that we can return to the user. We have to bear
a big increase in our storage capacity to achieve the required efficiency.
How would we build this trie? We can efficiently build our trie bottom
up. Each parent node will recursively call all the child nodes to calculate
their top suggestions and their counts. Parent nodes will combine top
suggestions from all of their children to determine their top suggestions.
How to update the trie? Assuming five billion searches every day,
which would give us approximately 60K queries per second. If we try to
update our trie for every query it’ll be extremely resource intensive and
this can hamper our read requests too. One solution to handle this could
be to update our trie offline after a certain interval.
As the new queries come in, we can log them and also track their
frequencies. Either we can log every query or do sampling and log every
1000th query. For example, if we don’t want to show a term which is
searched for less than 1000 times, it’s safe to log every 1000th searched
term.
We can have a Map-Reduce (MR) setup to process all the logging data
periodically, say every hour. These MR jobs will calculate frequencies of
all searched terms in the past hour. We can then update our trie with this
new data. We can take the current snapshot of the trie and update it with
all the new terms and their frequencies. We should do this offline, as we
don’t want our read queries to be blocked by update trie requests. We
can have two options:
After inserting a new term in the trie, we’ll go to the terminal node of the
phrase and increase its frequency. Since we’re storing the top 10 queries
in each node, it is possible that this particular search term jumped into
the top 10 queries of a few other nodes. So, we need to update the top 10
queries of those nodes as well. We have to traverse back from the node all
the way up to the root. For every parent, we check if the current query is
part of the top 10. If so, we update the corresponding frequency. If not,
we check if the current query’s frequency is high enough to be a part of
the top 10. If so, we insert this new term and remove the term with the
lowest frequency.
How can we remove a term from the trie? Let's say we have to remove a
term from the trie because of a legal issue, hate speech, piracy, etc. We
can completely remove such terms from the trie when the regular update
happens; meanwhile, we can add a filtering layer on each server that will
remove any such term before sending the suggestions to users.
What could be different ranking criteria for suggestions? In addition to a
simple count, for terms ranking, we have to consider other factors too,
e.g., freshness, user location, language, demographics, personal history
etc.
How do we store the trie in a file so that we can rebuild it easily - this will
be needed when a machine restarts? We can take a snapshot of our trie
periodically and store it in a file. This will enable us to rebuild the trie if the
server goes down. To store it, we can start with the root node and save the
trie in a pre-order (depth-first) fashion: with each node we store the character
it contains and how many children it has, and right after each node we write
out its children's sub-trees, one after another. Let's assume we have the
following trie:
If we store this trie in a file with the above-mentioned scheme, we will
have: “C2,A2,R1,T,P,O1,D”. From this, we can easily rebuild our trie.
If you’ve noticed we are not storing top suggestions and their counts with
each node, it is hard to store this information, as our trie is being stored
top down, we don’t have child nodes created before the parent, so there
is no easy way to store their references. For this, we have to recalculate
all the top terms with counts. This can be done while we are building the
trie. Each node will calculate its top suggestions and pass it to its parent.
Each parent node will merge results from all of its children to figure out
its top suggestions.
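A small sketch of this pre-order, child-count encoding (assuming one character per node); it reproduces the "C2,A2,R1,T,P,O1,D" format above, and, as noted, top suggestions and counts would still have to be recomputed after loading.

class Node:
    def __init__(self, char=""):
        self.char = char
        self.children = []                  # child Nodes, in a fixed order

def serialize(node):
    parts = []
    def visit(n):
        count = len(n.children)
        parts.append(n.char + (str(count) if count else ""))   # e.g. "C2" or "T"
        for child in n.children:
            visit(child)
    visit(node)
    return ",".join(parts)

def deserialize(data):
    tokens = iter(data.split(","))
    def build():
        token = next(tokens)
        char, count = token[0], int(token[1:]) if len(token) > 1 else 0
        node = Node(char)
        node.children = [build() for _ in range(count)]
        return node
    return build()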
5. Scale Estimation
If we are building a service, which has the same scale as that of Google,
we should expect 5 billion searches every day, which would give us
approximately 60K queries per second.
We can expect some growth in this data every day, but we should also be
removing some terms that are not searched anymore. If we assume we
have 2% new queries every day and if we are maintaining our index for
last one year, total storage we should expect:
6. Data Partition
Although our index can easily fit on one server, we'll partition it to meet our
requirements of higher efficiency and lower latency. How can we
efficiently partition our data to distribute it onto multiple servers?
a. Range-based partitioning: We can store our phrases in separate partitions
based on their first letter; for example, all terms starting with 'A' go to one
server. The problem with this approach is that it can lead to unbalanced
servers, since far more terms start with some letters than with others. We can
see that this kind of problem will happen with every statically defined scheme;
it is not possible to calculate statically whether each of our partitions will fit
on one server.
b. Partition based on the maximum capacity of the server: We can partition
our trie based on the maximum memory capacity of the servers: we keep
storing terms on a server as long as it has capacity, and whenever a sub-tree
cannot fit, we break the partition there and move on to the next server. For
example:
Server 1, A-AABC
Server 2, AABD-BXA
Server 3, BXB-CDA
For querying, if the user has typed 'A', we have to query both servers 1 and
2 to find the top suggestions. When the user has typed 'AA', we still have
to query servers 1 and 2, but when the user has typed 'AAA', we only need
to query server 1.
We can have a load balancer in front of our trie servers, which can store
this mapping and redirect traffic. Also if we are querying from multiple
servers, either we need to merge the results at the server side to calculate
overall top results, or make our clients do that. If we prefer to do this on
the server side, we need to introduce another layer of servers between
load balancers and trie servers; let's call them aggregators. These servers
results to the client.
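A quick sketch of that aggregation step, assuming each trie server returns its own list of (term, frequency) pairs for the typed prefix:

import heapq

def aggregate_suggestions(per_server_results, k=10):
    # per_server_results: one list of (term, frequency) pairs per trie server.
    combined = {}
    for results in per_server_results:
        for term, freq in results:
            combined[term] = combined.get(term, 0) + freq
    top = heapq.nlargest(k, combined.items(), key=lambda item: item[1])
    return [term for term, _ in top]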
c. Partition based on the hash of the term: Each term will be passed to a
hash function, which will generate a server number and we will store the
term on that server. This will make our term distribution random and
hence minimize hotspots. To find typeahead suggestions for a term, we
have to ask all servers and then aggregate the results. We have to use
consistent hashing for fault tolerance and load distribution.
7. Cache
We should realize that caching the top searched terms will be extremely
helpful in our service. There will be a small percentage of queries that
will be responsible for most of the traffic. We can have separate cache
servers in front of the trie servers, holding most frequently searched
terms and their typeahead suggestions. Application servers should check
these cache servers before hitting the trie servers to see if they have the
desired searched terms.
We can also build a simple Machine Learning (ML) model that can try to
predict the engagement on each suggestion based on simple counting,
personalization, or trending data etc., and cache these terms.
We should have replicas for our trie servers both for load balancing and
also for fault tolerance. We also need a load balancer that keeps track of
our data partitioning scheme and redirects traffic based on the prefixes.
9. Fault Tolerance
What will happen when a trie server goes down? As discussed above, we
can have a master-slave configuration; if the master dies, the slave can take
over after failover. Any server that comes back up can rebuild the trie
from the last snapshot.
1. The client should only try hitting the server if the user has not
pressed any key for 50ms.
2. If the user is constantly typing, the client can cancel the in-progress
requests.
3. Initially, the client can wait until the user enters a couple of
characters.
4. Clients can pre-fetch some data from the server to save future
requests.
5. Clients can store the recent history of suggestions locally. Recent
history has a very high rate of being reused.
6. Establishing an early connection with the server turns out to be one of
the most important factors. As soon as the user opens the search
engine website, the client can open a connection with the server. So,
when the user types in the first character, the client doesn't waste time
establishing the connection.
7. The server can push some part of their cache to CDNs and Internet
Service Providers (ISPs) for efficiency.
11. Personalization
Designing Twitter Search
Twitter is one of the largest social networking services where users can share
photos, news, and text-based messages. In this chapter, we will design a service
that can store and search user tweets.
Twitter users can update their status whenever they like. Each status
consists of plain text, and our goal is to design a system that allows
searching over all the user statuses.
We need to design a system that can efficiently store and query user
statuses.
3. Capacity Estimation and Constraints
Storage Capacity: Since we have 400 million new statuses every day and
each status on average is 300 bytes, the total storage we need will be:
4. System APIs
Parameters:
page_token (string): This token will specify a page in the result set that
should be returned.
Returns: (JSON)
At a high level, we need to store all the statuses in a database and also
build an index that can keep track of which word appears in which
status. This index will help us quickly find the statuses that users are trying
to search for.
High level design for Twitter search
1. Storage: We need to store 112GB of new data every day. Given this
huge amount of data, we need to come up with a data partitioning
scheme that will distribute it efficiently onto multiple servers. If we
plan for the next five years, we will need the following storage:
If we never want to be more than 80% full, we would need 240TB. Let’s
assume that we want to keep an extra copy of all the statuses for fault
tolerance, then our total storage requirement will be 480 TB. If we
assume a modern server can store up to 4TB of data, then we would need
120 such servers to hold all of the required data for next five years.
2. Index: What should our index look like? Since our status queries will
consist of words, let's build an index that can tell us which word appears
in which status object. Let's first estimate how big our index
will be. If we want to build an index for all the English words and some
famous nouns like people's names, city names, etc., and if we assume that
we have around 300K English words and 200K nouns, then we will have
500K total words in our index. Let's assume that the average length of a
word is five characters. If we are keeping our index in memory, we would
need 2.5MB of memory to store all the words:
So our index would be like a big distributed hash table, where ‘key’ would
be the word, and ‘value’ will be a list of StatusIDs of all those status
objects which contain that word. Assuming on the average we have 40
words in each status and since we will not be indexing prepositions and
other small words like ‘the’, ‘an’, ‘and’ etc., let’s assume we will have
around 15 words in each status that need to be indexed. This means each
StatusID will be stored 15 times in our index. So the total memory we will
need to store our index will be:
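A toy sketch of building such an index in memory; the stop-word list and the tokenizer are deliberately simplistic stand-ins.

from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}    # illustrative subset

def tokenize(text):
    return [word.strip(".,!?'\"").lower() for word in text.split()]

class InvertedIndex:
    def __init__(self):
        self.index = defaultdict(set)        # word -> set of StatusIDs containing it

    def add_status(self, status_id, text):
        for word in tokenize(text):
            if word and word not in STOP_WORDS:
                self.index[word].add(status_id)

    def search(self, word):
        return self.index.get(word.lower(), set())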
Sharding based on the status object: While storing, we will pass the
StatusID to our hash function to find the server and index all the words
of the status on that server. While querying for a particular word, we
have to query all the servers, and each server will return a set of
StatusIDs. A centralized server will aggregate these results to return
them to the user.
Detailed component design
7. Fault Tolerance
What will happen when an index server dies? We can have a secondary
replica of each server, and if the primary server dies, the secondary can take
control after failover. Both primary and secondary servers will have the same
copy of the index.
What if both primary and secondary servers die at the same time? We
have to allocate a new server and rebuild the same index on it. How can
we do that? We don’t know what words/statuses were kept on this
server. If we were using ‘Sharding based on the status object’, the brute-
force solution would be to iterate through the whole database and filter
StatusIDs using our hash function to figure out all the required Statuses
that will be stored on this server. This would be inefficient; also, during the
time the server is being rebuilt, we will not be able to serve any query from
it, thus missing some statuses that should have been seen by the user.
How can we efficiently retrieve the mapping between statuses and their index
server? We have to build a reverse index that will map each StatusID to
its index server. Our Index-Builder server can hold this information.
We will need to build a Hashtable where the 'key' would be the index
server number and the 'value' would be a HashSet containing all the
StatusIDs being kept at that index server. Notice that we are keeping all
the StatusIDs in a HashSet; this will enable us to add/remove statuses
from our index quickly. So now whenever an index server has to rebuild
itself, it can simply ask the Index-Builder server for all the Statuses it
needs to store, and then fetch those statuses to build the index. This
approach will surely be quite fast. We should also have a replica of
Index-Builder server for fault tolerance.
8. Cache
To deal with hot status objects, we can introduce a cache in front of our
database. We can use Memcache, which can store all such hot status
objects in memory. Application servers, before hitting the backend database,
can quickly check if the cache has that status object. Based on clients'
usage patterns, we can adjust how many cache servers we need. For cache
eviction policy, Least Recently Used (LRU) seems suitable for our
system.
9. Load Balancing
We can add a load balancing layer at two places in our system: 1) between
clients and application servers and 2) between application servers and
backend servers. Initially, a simple Round Robin approach can be
adopted that distributes incoming requests equally among backend
servers. This LB is simple to implement and does not introduce any
overhead. Another benefit of this approach is that if a server is dead, the LB
will take it out of the rotation and stop sending any traffic to it. A
problem with Round Robin LB is that it doesn't take server load into
consideration. If a server is overloaded or slow, the LB will not stop
sending new requests to that server. To handle this, a more intelligent LB
solution can be placed that periodically queries backend servers about
their load and adjusts traffic based on that.
10. Ranking
How about if we want to rank the search results by social graph distance,
popularity, relevance, etc?
Designing a Web Crawler
A web crawler is a software program which browses the World Wide Web
in a methodical and automated manner. It collects documents by
recursively fetching links from a set of starting pages. Many sites,
particularly search engines, use web crawling as a means of providing
up-to-date data. Search engines download all the pages to create an
index on them to perform faster searches.
● To test web pages and links for valid syntax and structure.
● To monitor sites to see when their structure or contents change.
● To maintain mirror sites for popular Web sites.
● To search for copyright infringements.
● To build a special-purpose index, e.g., one that has some
understanding of the content stored in multimedia files on the
Web.
Crawling the web is a complex task, and there are many ways to go about
it. We should be asking a few questions before going any further:
Is it a crawler for HTML pages only? Or should we fetch and store other
types of media, such as sound files, images, videos, etc.? This is
important because the answer can change the design. If we are writing a
general-purpose crawler to download different media types, we might
want to break down the parsing module into different sets of modules:
one for HTML, another for images, another for videos, where each
module extracts what is considered interesting for that media type.
Let’s assume for now that our crawler is going to deal with HTML only,
but it should be extensible and make it easy to add support for new
media types.
What protocols are we looking at? HTTP? What about FTP links? What
different protocols should our crawler handle? For the sake of the
exercise, we will assume HTTP. Again, it shouldn’t be hard to extend the
design to use FTP and other protocols later.
What is the expected number of pages we will crawl? How big will the
URL database become? Let's assume we need to crawl one billion websites.
Since a website can contain many URLs, let's assume an upper
bound of 15 billion different web pages that will be reached by our
crawler.
If we want to crawl 15 billion pages within four weeks, how many pages
do we need to fetch per second?
15B / (4 weeks * 7 days * 86400 sec) => ~6200 pages/sec
What about storage? Page sizes vary a lot but, as mentioned above, since
we will be dealing with HTML text only, let's assume an average page
size of 100KB. If with each page we are also storing 500 bytes of metadata,
the total storage we would need is:
15B * (100KB + 500 bytes) ~= 1.5 petabytes
Assuming a 70% capacity model (we don't want to go above 70% of the
total capacity of our storage system), the total storage we will need is:
1.5 petabytes / 0.7 ~= 2.14 petabytes
The basic algorithm executed by any Web crawler is to take a list of seed
URLs as its input and repeatedly execute the following steps:
1. Pick a URL from the unvisited URL list.
2. Determine the IP address of its host name.
3. Establish a connection to the host to download the corresponding document.
4. Parse the document contents to look for new URLs.
5. Add the new URLs to the list of unvisited URLs.
6. Process the downloaded document, e.g., store it or index its contents.
7. Go back to step 1.
How to crawl?
There are two important characteristics of the Web that make Web
crawling a very difficult task:
1. The large volume of Web pages: A large volume of web pages implies that
the crawler can only download a fraction of them at any time; hence it is
critical that the crawler be intelligent enough to prioritize downloads.
2. The rate of change of Web pages: Web pages change very frequently; by
the time the crawler is downloading the last pages of a site, earlier pages
may already have changed, or new pages may have been added.
Let’s assume our crawler is running on one server, and all the crawling is
done by multiple working threads, where each working thread performs
all the steps needed to download and process a document in a loop.
The first step of this loop is to remove an absolute URL from the shared
URL frontier for downloading. An absolute URL begins with a scheme
(e.g., “HTTP”), which identifies the network protocol that should be used
to download it. We can implement these protocols in a modular way for
extensibility, so that later if our crawler needs to support more protocols,
it can be easily done. Based on the URL’s scheme, the worker calls the
appropriate protocol module to download the document. After
downloading, the document is placed into a Document Input Stream
(DIS). Putting documents into DIS will enable other modules to re-read
the document multiple times.
Once the document has been written to the DIS, the worker thread
invokes the dedupe test to determine whether this document (associated
with a different URL) has been seen before. If so, the document is not
processed any further, and the worker thread removes the next URL
from the frontier.
Furthermore, our HTML processing module will extract all links from
the page. Each link is converted into an absolute URL and tested against
a user-supplied URL filter to determine if it should be downloaded. If the
URL passes the filter, the worker performs the URL-seen test, which
checks if the URL has been seen before, namely, if it is in the URL
frontier or has already been downloaded. If the URL is new, it is added to
the frontier.
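Condensing the worker loop above into a short sketch: the frontier, downloader, link extractor, and URL filter are passed in as simple callables and sets rather than the real modules.

import hashlib
from urllib.parse import urljoin

def crawl_worker(frontier, download, extract_links, url_filter,
                 seen_doc_checksums, seen_url_checksums):
    # frontier: queue-like object with get()/put(); download(url) -> HTML text or None;
    # extract_links(html) -> iterable of href values; url_filter(url) -> bool.
    while True:
        url = frontier.get()                          # take the next URL to crawl
        html = download(url)                          # fetch via the protocol module (HTTP here)
        if html is None:
            continue
        doc_sum = hashlib.md5(html.encode()).hexdigest()
        if doc_sum in seen_doc_checksums:             # document dedupe test
            continue
        seen_doc_checksums.add(doc_sum)
        for link in extract_links(html):
            absolute = urljoin(url, link)             # convert to an absolute URL
            if not url_filter(absolute):
                continue
            url_sum = hashlib.md5(absolute.encode()).hexdigest()
            if url_sum not in seen_url_checksums:     # URL-seen test
                seen_url_checksums.add(url_sum)
                frontier.put(absolute)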
Let’s discuss these components one by one, and see how they can be
distributed onto multiple machines:
1. The URL frontier: The URL frontier is the data structure that contains
all the URLs that remain to be downloaded. We can crawl by performing
a breadth-first traversal of the Web, starting from the pages in the seed
set. Such traversals are easily implemented by using a FIFO queue.
Since we will have a huge list of URLs to crawl, we can distribute our
URL frontier onto multiple servers. Let's assume that on each server we
have multiple worker threads performing the crawling tasks, and that our
hash function maps each URL to the server responsible for crawling it.
How big would our URL frontier be? The size would be in the hundreds
of millions of URLs. Hence, we need to store our URLs on disk. We can
implement our queues in such a way that they have separate buffers for
enqueuing and dequeuing. The enqueue buffer, once filled, will be dumped to
disk, whereas the dequeue buffer will keep a cache of URLs that need to
be visited; it can periodically read from disk to refill itself.
A DIS is an input stream that caches the entire contents of the document
read from the internet. It also provides methods to re-read the
document. The DIS can cache small documents (64 KB or less) entirely
in memory, while larger documents can be temporarily written to a
backing file.
How big would the checksum store be? If the whole purpose of our
checksum store is to do dedup, then we just need to keep a unique set
containing the checksums of all previously processed documents. Considering
15 billion distinct web pages, we would need:
7. URL dedup test: While extracting links, any Web crawler will
encounter multiple links to the same document. To avoid downloading
and processing a document multiple times, a URL dedup test must be
performed on each extracted link before adding it to the URL frontier.
To perform the URL dedup test, we can store all the URLs seen by our
crawler in canonical form in a database. To save space, we do not store
the textual representation of each URL in the URL set, but rather a fixed-
sized checksum.
To reduce the number of operations on the database store, we can keep
an in-memory cache of popular URLs on each host shared by all threads.
The reason to have this cache is that links to some URLs are quite
common, so caching the popular ones in memory will lead to a high in-
memory hit rate.
How much storage we would need for URL’s store? If the whole purpose
of our checksum is to do URL dedup, then we just need to keep a unique
set containing checksums of all previously seen URLs. Considering 15
billion distinct URLs and 2 bytes for checksum, we would need:
Can we use bloom filters for deduping? Bloom filters are a probabilistic
data structure for set membership testing that may yield false positives.
A large bit vector represents the set. An element is added to the set by
computing ‘n’ hash functions of the element and setting the
corresponding bits. An element is deemed to be in the set if the bits at all
‘n’ of the element’s hash locations are set. Hence, a document may
incorrectly be deemed to be in the set, but false negatives are not
possible.
The disadvantage of using a bloom filter for the URL-seen test is that each
false positive will cause the URL not to be added to the frontier, and
therefore the document will never be downloaded. The chance of a false
positive can be reduced by making the bit vector larger.
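A tiny bloom filter sketch for this URL-seen test; the bit-vector size and the number of hash functions are tunable assumptions that trade memory for false-positive rate.

import hashlib

class BloomFilter:
    def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # Never a false negative; false positives occur with small probability.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))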
8. Checkpointing: A crawl of the entire Web takes weeks to complete. To
guard against failures, our crawler can write regular snapshots of its state
to disk. An interrupted or aborted crawl can easily be restarted from the
latest checkpoint.
7. Fault tolerance
8. Data Partitioning
Our crawler will be dealing with three kinds of data: 1) URLs to visit 2)
URL checksums for dedup 3) Document checksums for dedup.
9. Crawler Traps
There are many crawler traps, spam sites, and cloaked content. A crawler
trap is a URL or set of URLs that cause a crawler to crawl indefinitely.
Some crawler traps are unintentional. For example, a symbolic link
within a file system can create a cycle. Other crawler traps are introduced
intentionally. For example, people have written traps that dynamically
generate an infinite Web of documents. The motivations behind such
traps vary. Anti-spam traps are designed to catch crawlers used by
spammers looking for email addresses, while other sites use traps to
catch search engine crawlers to boost their search ratings.
One way to protect against traps is to measure page importance with a credit
system (as in the AOPIC algorithm): every page receives credit when pages
linking to it are crawled, a small tax on each page's credit is continuously
paid to a special 'Lambda' page, and we always crawl the page with the most
credit. Since the Lambda page continuously collects the tax, eventually it will
be the page with the largest amount of credit, and we'll have to "crawl" it. By
crawling the Lambda page, we just take its credits and distribute them
equally to all the pages in our database.
Since bot traps only give credit to their internal links and rarely receive credit
from the outside, they will continually leak credit (through taxation) to the
Lambda page. The Lambda page will distribute those credits evenly to all the
pages in the database, and upon each cycle, the bot trap page will lose more
and more credit until it has so little that it almost never gets crawled again.
This will not happen with good pages, because they often receive credit from
backlinks found on other pages.
Designing Facebook's Newsfeed
Let's design Facebook's Newsfeed, which would contain posts, photos, videos
and status updates from all the people and pages a user follows.
For any social media site you design - Twitter, Instagram or Facebook - you
will need some sort of newsfeed system to display updates from friends
and followers.
Functional requirements:
Non-functional requirements:
Let’s assume on average a user has 300 friends and follows 200 pages.
Traffic estimates: Let’s assume 300M daily active users, with each user
fetching their timeline an average of five times a day. This will result in
1.5B newsfeed requests per day or approximately 17,500 requests per
second.
4. System APIs
Parameters:
user_id (number): The ID of the user for whom the system will generate
the newsfeed.
max_id (number): Optional; returns results with an ID less than (that is,
older than) or equal to the specified ID.
5. Database Design
There are three basic objects: User, Entity (e.g., page, group, etc.) and
FeedItem (or Post). Here are some observations about the relationships
between these entities:
● A User can follow entities and can become friends with other users.
● Both users and entities can post FeedItems which can contain text,
images or videos.
● Each FeedItem will have a UserID which would point to the User
who created it. For simplicity, let’s assume that only users can
create feed items, although on Facebook, Pages can post feed item
too.
● Each FeedItem can optionally have an EntityID pointing to the
page or the group where that post was created.
Feed generation: The newsfeed is generated from the posts (or feed items) of
the users and entities (pages and groups) that a user follows. So, whenever
our system receives a request to generate the feed for a user (say Jane),
we will perform the following steps:
1. Retrieve the IDs of all the users and entities that Jane follows.
2. Retrieve the latest, most popular and relevant posts for those IDs; these
are the potential posts for Jane's newsfeed.
3. Rank these posts based on their relevance to Jane; this represents Jane's
current feed.
4. Store this feed in the cache and return the top posts (say 20) to be
rendered on Jane's feed.
5. When Jane reaches the end of her current feed, she can fetch the next
batch of posts from the server.
One thing to notice here is that we generated the feed once and stored it
in cache. What about new incoming posts from people that Jane follows?
If Jane is online, we should have a mechanism to rank and add those
new posts to her feed. We can periodically (say every five minutes)
perform the above steps to rank and add the newer posts to her feed.
Jane can then be notified that there are newer items in her feed that she
can fetch.
Feed publishing: Whenever Jane loads her newsfeed page, she has to
request and pull feed items from the server. When she reaches the end of
her current feed, she can pull more data from the server. For newer items
either the server can notify Jane and then she can pull, or the server can
push these new posts. We will discuss these options in detail later.
At a high level, we would need following components in our Newsfeed
service:
Let’s take the simple case of the newsfeed generation service fetching
most recent posts from all the users and entities that Jane follows; the
query would look like this:
LIMIT 100
Here are the issues with this design for the feed generation service: for users
who follow a lot of people and pages, the query has to sort, merge, and rank
a huge number of posts, and since we run it only when a user loads their
timeline, feed generation is slow and adds high latency. To address this, we
can pre-generate the timeline offline: dedicated servers can continuously
generate users' newsfeeds and store them in memory.
Whenever these servers need to generate the feed for a user, they would
first query to see what was the last time the feed was generated for that
user. Then, new feed data would be generated from that time onwards.
We can store this data in a hash table, where the "key" would be UserID
and "value" would be a STRUCT like this:
Struct {
    LinkedHashMap<FeedItemID, FeedItem> feedItems;
    DateTime lastGenerated;
}
Should we generate (and keep in memory) newsfeed for all users? There
will be a lot of users that don’t login frequently. Here are a few things we
can do to handle this. A simpler approach could be to use an LRU based
cache that can remove users from memory that haven’t accessed their
newsfeed for a long time. A smarter solution can figure out the login
pattern of users to pre-generate their newsfeed, e.g., At what time of the
day a user is active? Which days of the week a user accesses their
newsfeed? etc.
Let’s now discuss some solutions to our “live updates” problems in the
following section.
b. Feed publishing
How many feed items can we return to the client in each request? We
should have a maximum limit for the number of items a user can fetch in
one request (say 20). But we should let clients choose to specify how
many feed items they want with each request, as the user may like to
fetch a different number of posts depending on the device (mobile vs
desktop).
Should we always notify users if there are new posts available for their
newsfeed? It could be useful for users to get notified whenever new data
is available. However, on mobile devices, where data usage is relatively
expensive, it can consume unnecessary bandwidth. Hence, at least for
mobile devices, we can choose not to push data, instead let users “Pull to
Refresh” to get new posts.
8. Feed Ranking
Since we have a huge number of new posts every day and our read load is
extremely high too, we need to distribute our data onto multiple
machines such that we can read/write it efficiently. For sharding our
databases that are storing posts and their metadata, we can have a
similar design as discussed under Designing Twitter.
For feed data, which is being stored in memory, we can partition it based
on UserID. We can try storing all the data of a user on one server. When
storing, we can pass the UserID to our hash function, which will map the
user to a cache server where we will store the user's feed objects. Also, for
any given user, since we don't expect to store more than 500
FeedItemIDs, we wouldn't run into a scenario where the feed data for a user
doesn’t fit on a single server. To get the feed of a user, we would always
have to query only one server. For future growth and replication, we
must use Consistent Hashing.
Designing Yelp
Let's design a Yelp like service, where users can search for nearby places like
restaurants, theaters or shopping malls, etc., and can also add/view reviews of
places.
Similar Services: Proximity server.
Difficulty Level: Hard
What do we wish to achieve from a Yelp like service? Our service will be
storing information about different places so that users can perform a
search on them. Upon querying, our service will return a list of places
around the user.
Functional Requirements:
3. Scale Estimation
Let’s build our system assuming that we have 500M places and 100K
queries per second (QPS). Let’s also assume a 20% growth in the number
of places and QPS each year.
4. Database Schema
1. LocationID (8 bytes)
2. ReviewID (4 bytes): Uniquely identifies a review, assuming any
location will not have more than 2^32 reviews.
3. ReviewText (512 bytes)
4. Rating (1 byte): how many stars a place gets out of ten.
Similarly, we can have a separate table to store photos for Places and
Reviews.
5. System APIs
Parameters:
page_token (string): This token will specify a page in the result set that
should be returned.
Returns: (JSON)
At a high level, we need to store and index each dataset described above
(places, reviews, etc.). For users to query this massive database, the
indexing should be read efficient, since while searching for nearby places
users expect to see the results in real-time.
Given that the location of a place doesn’t change that often, we don’t
need to worry about frequent updates of the data. As a contrast, if we
intend to build a service where objects do change their location
frequently, e.g., people or taxis, then we might come up with a very
different design.
Let’s see what are different ways to store this data and find out which
method will suit best for our use cases:
a. SQL solution
One simple solution could be to store all the data in a database like
MySQL. Each place will be stored in a separate row, uniquely identified
by LocationID. Each place will have its longitude and latitude stored
separately in two different columns, and to perform a fast search, we
should have indexes on both these fields.
To find all the nearby places of a given location (X, Y) within a radius ‘D’,
we can query like this:
Select * from Places where Latitude between X-D and X+D and
Longitude between Y-D and Y+D
How efficient would this query be? We have estimated 500M places to be
stored in our service. Since we have two separate indexes, each index can
return a huge list of places, and performing an intersection on those two
lists won’t be efficient. Another way to look at this problem is that there
could be too many locations between ‘X-D’ and ‘X+D’, and similarly
between ‘Y-D’ and ‘Y+D’. If we can somehow shorten these lists, it can
improve the performance of our query.
b. Grids
We can divide the whole map into smaller grids to group locations into
smaller sets. Each grid will store all the Places residing within a certain
range of longitude and latitude. This scheme would enable us to query
only a few grids to find nearby places. Based on given location and
radius, we can find all the nearby grids and then only query these grids to
find nearby places.
Let’s assume that GridID (a four bytes number) would uniquely identify
grids in our system.
What could be a reasonable grid size? Grid size could be equal to the
distance we would like to query since we also want to reduce the number
of grids. If the grid size is equal to the distance we want to query, then we
only need to search within the grid which contains the given location and
neighboring eight grids. Since our grids would be statically defined (from
the fixed grid size), we can easily find the grid number of any location
(lat, long) and its neighboring grids.
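A rough sketch of this static scheme, using a fixed cell size in degrees (an illustrative simplification of a 10-mile search radius): a location quantizes to a row/column cell, and the eight neighbors come from shifting the row and column by one.

GRID_SIZE_DEGREES = 0.15          # illustrative cell size; ~10 miles of latitude

def grid_cell(lat, lng):
    # Quantize latitude/longitude into a (row, column) cell of the static grid.
    row = int((lat + 90) // GRID_SIZE_DEGREES)
    col = int((lng + 180) // GRID_SIZE_DEGREES)
    return row, col

def grid_id(lat, lng):
    cols_per_row = int(360 // GRID_SIZE_DEGREES)
    row, col = grid_cell(lat, lng)
    return row * cols_per_row + col                   # single integer GridID

def neighboring_grid_ids(lat, lng):
    cols_per_row = int(360 // GRID_SIZE_DEGREES)
    row, col = grid_cell(lat, lng)
    ids = []
    for d_row in (-1, 0, 1):
        for d_col in (-1, 0, 1):
            ids.append((row + d_row) * cols_per_row + (col + d_col) % cols_per_row)
    return ids                                        # the cell itself plus its eight neighbors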
In the database, we can store the GridID with each location and have an
index on it too for faster searching. Now, our query will look like:
Select * from Places where Latitude between X-D and X+D and
Longitude between Y-D and Y+D and GridID in (GridID, GridID1,
GridID2, ..., GridID8)
How much memory will we need to store the index? Let’s assume our
search radius is 10 miles, given that total area of the earth is around 200
million square miles; we will have 20 million grids. We would need a
four bytes number to uniquely identify each grid, and since LocationID is
8 bytes, therefore we would need 4GB of memory (ignoring hash table
overhead) to store the index.
(4 * 20M) + (8 * 500M) ~= 4 GB
This solution can still be slow for grids that have a lot of places, since our
places are not uniformly distributed among grids: we can have densely packed
areas with a lot of places and, on the other hand, areas that are sparsely
populated.
This problem can be solved if we can dynamically adjust our grid size,
such that whenever we have a grid with a lot of places we break it down
to create smaller grids. One challenge with this approach could be, how
would we map these grids to locations? Also, how can we find all the
neighboring grids of a grid?
Let’s assume we don’t want to have more than 500 places in a grid so
that we can have a faster searching. So, whenever a grid reaches this
limit, we break it down into four grids of equal size and distribute places
among them. This means thickly populated areas like downtown San
Francisco will have a lot of grids, and sparsely populated area like the
Pacific Ocean will have large grids with places only around the coastal
lines.
What data-structure can hold this information? A tree in which each
node has four children can serve our purpose. Each node will represent a
grid and will contain information about all the places in that grid. If a
node reaches our limit of 500 places, we will break it down to create four
child nodes under it and distribute places among them. In this way, all
the leaf nodes will represent the grids that cannot be further broken
down. So leaf nodes will keep a list of places with them. This tree
structure, in which each node can have four children, is called a QuadTree.
How will we build QuadTree? We will start with one node that would
represent the whole world in one grid. Since it will have more than 500
locations, we will break it down into four nodes and distribute locations
among them. We will keep repeating this process with each child node
until there are no nodes left with more than 500 locations.
How will we find the grid for a given location? We will start with the root
node and search downward to find our required node/grid. At each step,
we will see if the current node we are visiting has children; if it has, we will
move to the child node that contains our desired location and repeat this
process. If the node does not have any children, then that is our desired
node.
How will we find neighboring grids of a given grid? Since only leaf nodes
contain a list of locations, we can connect all leaf nodes with a doubly
linked list. This way we can iterate forward or backward among the
neighboring leaf nodes to find out our desired locations. Another
approach for finding adjacent grids would be through parent nodes. We
can keep a pointer in each node to access its parent, and since each
parent node has pointers to all of its children, we can easily find siblings
of a node. We can keep expanding our search for neighboring grids by
going up through the parent pointers.
What will be the search workflow? We will first find the node that
contains the user’s location. If that node has enough desired places, we
can return them to the user. If not, we will keep expanding to the
neighboring nodes (either through the parent pointers or doubly linked
list), until either we find the required number of places or exhaust our
search based on the maximum radius.
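A condensed QuadTree sketch covering subdivision at 500 places, insertion, and the descend-to-leaf search described above; neighbor expansion via parent pointers or a linked list of leaves is left out for brevity, and the root must be created with bounds slightly larger than (-90, 90) x (-180, 180) so that boundary points fall inside.

MAX_PLACES_PER_NODE = 500

class QuadTreeNode:
    def __init__(self, min_lat, max_lat, min_lng, max_lng):
        self.bounds = (min_lat, max_lat, min_lng, max_lng)
        self.places = []              # (location_id, lat, lng); only leaves hold places
        self.children = None          # four children once this node is subdivided

    def _contains(self, lat, lng):
        min_lat, max_lat, min_lng, max_lng = self.bounds
        return min_lat <= lat < max_lat and min_lng <= lng < max_lng

    def _child_for(self, lat, lng):
        return next(child for child in self.children if child._contains(lat, lng))

    def _subdivide(self):
        min_lat, max_lat, min_lng, max_lng = self.bounds
        mid_lat, mid_lng = (min_lat + max_lat) / 2, (min_lng + max_lng) / 2
        self.children = [
            QuadTreeNode(min_lat, mid_lat, min_lng, mid_lng),
            QuadTreeNode(min_lat, mid_lat, mid_lng, max_lng),
            QuadTreeNode(mid_lat, max_lat, min_lng, mid_lng),
            QuadTreeNode(mid_lat, max_lat, mid_lng, max_lng),
        ]
        for place in self.places:
            self._child_for(place[1], place[2]).insert(*place)
        self.places = []

    def insert(self, location_id, lat, lng):
        if self.children is not None:
            self._child_for(lat, lng).insert(location_id, lat, lng)
            return
        self.places.append((location_id, lat, lng))
        if len(self.places) > MAX_PLACES_PER_NODE:
            self._subdivide()

    def leaf_for(self, lat, lng):
        # Descend from the root to the leaf grid containing the given location.
        node = self
        while node.children is not None:
            node = node._child_for(lat, lng)
        return node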
How much memory will be needed to store the QuadTree? For each
Place, if we cache only LocationID and Lat/Long, we would need 12GB to
store all places.
24 * 500M => 12 GB
Since each grid can have a maximum of 500 places and we have 500M
locations, how many total grids will we have?
500M / 500 => 1M grids
Which means we will have 1M leaf nodes and they will be holding 12GB
of location data. A QuadTree with 1M leaf nodes will have approximately
1/3rd internal nodes, and each internal node will have 4 pointers (for its
children). If each pointer is 8 bytes, then the memory we need to store all
internal nodes would be:
1M * 1/3 * 4 * 8 = 10 MB
How would we insert a new Place into our system? Whenever a new
Place is added by a user, we need to insert it into the databases, as well
as, in the QuadTree. If our tree resides on one server, it is easy to add a
new Place, but if the QuadTree is distributed among different servers,
first we need to find the grid/server of the new Place and then add it
there (discussed in the next section).
7. Data Partitioning
What if we have a huge number of places such that our index does not fit
into a single machine's memory? With 20% growth each year, we will
reach the memory limit of the server at some point in the future. Also, what if one
server cannot serve the desired read traffic? To resolve these issues, we
must partition our QuadTree!
a. Sharding based on regions: We can divide our places into regions (like
zip codes), such that all places belonging to a region will be stored on a
fixed node. While storing, we will find the region of each place to find the
server and store the place there. Similarly, while querying for nearby
places, we can ask the region server that contains the user's location. This
approach has a couple of issues: a region can become hot (e.g., a popular
downtown area receiving a lot of queries), overloading the server holding it,
and over time some regions can end up storing far more places than others,
making the distribution uneven.
What will happen when a QuadTree server dies? We can have a secondary
replica of each server, and if the primary dies, the secondary can take control
after failover. What if both primary and secondary servers die at the same
time? We have to allocate a new server and rebuild the same QuadTree on it.
How can we do that, since we don't know which places were kept on this
server? The brute-force solution would be to iterate through the whole
database and filter LocationIDs using our hash function to figure out all
the required places that will be stored on this server. This would be
inefficient and slow, also during the time when the server is being
rebuilt; we will not be able to serve any query from it, thus missing some
places that should have been seen by users.
9. Cache
10. Load Balancing
A problem with Round Robin LB is that it doesn't take server load into
consideration. If a server is overloaded or slow, the load balancer will not
stop sending new requests to that server. To handle this, a more
intelligent LB solution would be needed that periodically queries backend
servers about their load and adjusts traffic based on that.
11. Ranking
How about if we want to rank the search results not just by proximity but
also by popularity or relevance?
How can we return the most popular places within a given radius? Let's
assume we keep track of the overall popularity of each place. An
aggregated number can represent this popularity in our system, e.g., how
many stars a place gets out of ten (this would be an average of the different
ratings given by users). We will store this number in the database, as
well as, in the QuadTree. While searching for top 100 places within a
given radius, we can ask each partition of the QuadTree to return top 100
places having maximum popularity. Then the aggregator server can
determine top 100 places among all the places returned by different
partitions.
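The aggregation step could look roughly like this; the (popularity, place_id) tuple format returned by each partition is an assumption made for the sketch:

import heapq

def top_places(partition_results, k=100):
    # partition_results: one list per QuadTree partition,
    # each containing (popularity, place_id) tuples for its local top k.
    merged = [item for partition in partition_results for item in partition]
    return heapq.nlargest(k, merged, key=lambda item: item[0])

# Example: merge the local top results of three partitions, keeping the best 2.
# top_places([[(9.1, "p1"), (8.7, "p2")], [(9.5, "p3")], [(7.9, "p4")]], k=2)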
Designing Uber backend
Let's design a ride-sharing service like Uber, which connects passengers who
need a ride with drivers who have a car.
1. What is Uber?
Uber enables its customers to book drivers for taxi rides. Uber drivers
use their personal cars to drive customers around. Both customers and
drivers communicate with each other through their smartphones using
the Uber app.
● Since all active drivers report their locations every three seconds,
we need to update our data structures to reflect that. If we had to
update the QuadTree for every change in a driver’s position, it
would take a lot of time and resources. To update a driver to its new
location, we must find the right grid based on the driver’s previous
location. If the new position does not belong to the current grid, we
have to remove the driver from the current grid and move/reinsert
the driver into the correct grid (see the sketch after this list). After
this move, if the new grid reaches the maximum limit of drivers, we
have to repartition it.
● We need a quick mechanism to propagate the current location of all
nearby drivers to any active customer in that area. Also, when a ride
is in progress, our system needs to notify both the driver and the
passenger about the current location of the car.
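A minimal sketch of that driver-location update, reusing the hypothetical find_leaf and insert_place helpers from the earlier QuadTree sketches; the driver object with .lat/.lng fields is likewise an assumption:

def update_driver_location(root, driver, new_lat, new_lng):
    # Find the grid for the driver's previous position.
    old_leaf = find_leaf(root, driver.lat, driver.lng)
    driver.lat, driver.lng = new_lat, new_lng
    if old_leaf.contains(new_lat, new_lng):
        return                            # still in the same grid: nothing to move
    old_leaf.places.remove(driver)        # remove from the old grid
    insert_place(root, driver)            # re-insert; may repartition the new grid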
Assuming 500K active drivers, 3 bytes for each DriverID, and (on
average) five subscribed customers per driver with 8 bytes for each
CustomerID, the memory needed for this subscription data would be
roughly:
(500K * 3) + (500K * 5 * 8) ~= 21 MB
How will new publishers/drivers get added for a current customer? As
proposed above, customers will be subscribed to nearby drivers when
they open the Uber app for the first time; so what will happen when a
new driver enters the area the customer is looking at? To add a new
customer/driver subscription dynamically, we need to keep track of the
area each customer is watching. This would make our solution
complicated; what if, instead of pushing this information, clients pull it
from the server?
How about if clients pull information about nearby drivers from the
server? Clients can send their current location, and the server will find all
the nearby drivers in the QuadTree and return them to the client. Upon
receiving this information, the client can update its screen to reflect the
current positions of the drivers. Clients can query every five seconds to
limit the number of round trips to the server. This solution looks quite a
bit simpler than the push model described above.
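A minimal client-side sketch of this pull model; the /nearby_drivers endpoint, its parameters, and the on_update callback are hypothetical, while the five-second interval comes from the text above:

import time
import requests

POLL_INTERVAL_SECONDS = 5
SERVER_URL = "https://example.com/nearby_drivers"   # hypothetical endpoint

def poll_nearby_drivers(lat, lng, radius_km, on_update=print):
    while True:
        resp = requests.get(
            SERVER_URL,
            params={"lat": lat, "lng": lng, "radius": radius_km},
        )
        on_update(resp.json())          # e.g. redraw the driver pins on the map
        time.sleep(POLL_INTERVAL_SECONDS)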
6. Ranking
How about if we want to rank the search results not just by proximity but
also by popularity or relevance?
How can we return the top-rated drivers within a given radius? Let’s
assume we keep track of the overall rating of each driver in our database
and QuadTree. An aggregated number can represent this rating in our
system, e.g., how many stars a driver gets out of ten. While searching for
the top 10 drivers within a given radius, we can ask each partition of the
QuadTree to return its top 10 drivers by rating. The aggregator server
can then determine the top 10 drivers among all the drivers returned by
the different partitions.
7. Advanced Issues
1. How to handle clients on slow and disconnecting networks?
2. What if a client gets disconnected while it was part of a ride? How
will we handle billing in such a scenario?
3. What if clients pull all the information from the server, as compared
to the server always pushing it?