System Design
Contents
• Rate limiter
• Consistent hashing
• Key-value store
• Job scheduler
• Unique ID generator
• URL Shortener
• Web Crawler
• Notification System
• News feed
• Chat system
• Search autocomplete
• YouTube
• Google Drive
• Proximity service
• Nearby friends
• Maps
General questions
• How is cache coherency maintained?
• SQL vs NoSQL
• Sharding and replication of databases
• Bloom filter
• DB normalization and denormalization
• ZooKeeper
Rate limiter
• Design goals:
• API rate limiter for HTTP requests to servers
• Should support global as well as granular limits (based on IP address / user ID)
• Rate-limit rules should be configurable
• Algorithm should be configurable
• Low latency, low memory footprint
• Distributed
• Throttled users should see a clear exception (e.g. HTTP 429)
• Fault tolerance
High-level design
• Clients send requests through a rate limiter gateway (middleware) sitting in front of the API servers
• The rate-limit state (counters) lives in Redis; rules are kept in a persistent rules DB and pulled by workers into a rules cache used by the gateway
• If the limit is exceeded, the request is rejected (rate limit exceeded) or pushed onto a queue of rate-limited requests for later processing
Rate limiter algorithms
• Token bucket / leaking bucket: a bucket (counter) per actor; tokens are added periodically and consumed when requests come in; a request is rejected when the bucket is empty (sketch below)
• Sliding window: count the number of requests in the last window defined by the rule; increment on each request and reject once the count exceeds the limit
• Whichever algorithm we use, the counters need to be stored in a DB/Redis; Redis is better due to fast access and its INCR, DECR and EXPIRE operations
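A minimal in-memory token-bucket sketch, assuming a local dict of per-actor buckets; in the distributed design these counters would live in Redis. The names (`TokenBucket`, `is_allowed`, `capacity`, `refill_rate`) are illustrative, not from the notes.

```python
import time

class TokenBucket:
    """One bucket per actor: tokens refill at a fixed rate, each request consumes one."""
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity          # max tokens in the bucket
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}  # actor (IP address / user ID) -> bucket

def is_allowed(actor: str) -> bool:
    bucket = buckets.setdefault(actor, TokenBucket(capacity=10, refill_rate=5))
    return bucket.allow()
```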
Consistent Hashing
• For horizontal scaling, we need to distribute requests/data across multiple servers
• Normal hashing (hash mod n) works, but when the number of servers changes, almost all keys need to be redistributed
• With consistent hashing, only about k/n keys need to be redistributed (k = number of keys, n = number of servers)
Basic approach
• Create a hash space (ring) using a hash function such as SHA-1
• Map servers onto the hash space
• Hash keys onto the same ring
• Move clockwise on the ring from a key to the next server to decide which server the key is assigned to, e.g. k0 to s0; k1, k2 to s1; and so on
• When a new server is added, only a few keys have to be reassigned
• Problem: partition sizes are not the same
• For balancing the data distribution, create virtual nodes: each physical server is mapped to multiple points on the ring (s1_0, s1_1, s2_2, ...); a sketch of the ring follows
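A sketch of the hash ring with virtual nodes, using SHA-1 and Python's bisect for the clockwise lookup; the class and parameter names are assumptions made for illustration.

```python
import hashlib
from bisect import bisect_left, insort

def _hash(key: str) -> int:
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, vnodes: int = 100):
        self.vnodes = vnodes   # virtual nodes per server, to balance partition sizes
        self.ring = []         # sorted hash positions on the ring
        self.owner = {}        # hash position -> server name

    def add_server(self, server: str):
        for i in range(self.vnodes):
            h = _hash(f"{server}_{i}")
            insort(self.ring, h)
            self.owner[h] = server

    def remove_server(self, server: str):
        for i in range(self.vnodes):
            h = _hash(f"{server}_{i}")
            self.ring.remove(h)
            del self.owner[h]

    def get_server(self, key: str) -> str:
        # Move clockwise from the key's position to the next virtual node.
        h = _hash(key)
        idx = bisect_left(self.ring, h) % len(self.ring)
        return self.owner[self.ring[idx]]

ring = HashRing()
for s in ("s0", "s1", "s2", "s3"):
    ring.add_server(s)
print(ring.get_server("k0"))
```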
Key value store
• Small key-value pair size
• Big data should be storable
• HA: should respond quickly even during failures
• High scalability: can be scaled out to support a large dataset
• Auto scaling
• Tunable consistency
• Low latency
CAP theorem
• To store big data, distributed storage across multiple nodes (n1, n2, n3) is necessary
• CAP theorem: there is a trade-off between consistency and availability, since partition tolerance can never be given up in a distributed system
• Ideally, with no network failure, both C and A can be provided, as data written to one node can be replicated to the other nodes
• If the network partitions, say n3 is down: data written to n3 is not propagated to n1 and n2, or data written to n1/n2 is not propagated to n3, so there will be stale data
• Consistency can be guaranteed by blocking write operations when nodes are down (sacrificing availability)
• Choosing availability ensures reads and writes always go through, at the risk of returning stale data
Components
• Replication: tunable replication factor; in the hash ring, replicate the key to the next R virtual nodes, choosing only unique physical servers. Even better if replicas are placed in different data centers
• Partition: consistent hashing, which supports auto scaling and heterogeneity (servers with higher capacity get more virtual nodes)
• Consistency: quorum consensus with N replicas, a write quorum W and a read quorum R; for a write to be considered successful, W replicas must ack; W + R > N gives strong consistency (sketch below)
• Inconsistency resolution: for high availability we adopt eventual consistency, detecting and resolving conflicts with vector clocks
• Failure handling: detection using a gossip protocol (detailed in the next section)
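A toy, single-process illustration of quorum reads and writes; the `Replica` class stands in for remote storage nodes, and all names are assumptions. With N = 3, W = 2, R = 2 we get W + R > N.

```python
class Replica:
    """Toy in-memory replica; real replicas would be remote nodes."""
    def __init__(self):
        self.store = {}
    def write(self, key, value, version):
        self.store[key] = (value, version)
        return True                               # ack
    def read(self, key):
        return self.store.get(key, (None, 0))     # (value, version)

def quorum_write(replicas, key, value, version, W):
    acks = sum(1 for r in replicas if r.write(key, value, version))
    return acks >= W              # write succeeds once at least W replicas ack

def quorum_read(replicas, key, R):
    results = [r.read(key) for r in replicas[:R]]
    return max(results, key=lambda vv: vv[1])     # newest version wins

# N = 3 replicas, W = 2, R = 2
replicas = [Replica() for _ in range(3)]
quorum_write(replicas, "k", "v1", version=1, W=2)
print(quorum_read(replicas, "k", R=2))
```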
Failure handling
• Detection using a gossip protocol: nodes exchange heartbeat counters with random nodes; if a node's heartbeat has not incremented for a certain time, ask other random nodes about it, and once enough of them confirm, mark the node as dead
• Temporary failure handling:
• Sloppy quorum for HA: instead of enforcing strict quorum consensus, the first W live servers for writes and the first R live servers for reads are considered
• Hinted handoff: when A is down, another node handles its requests, and once A is back up it pushes the changes back to restore consistency
• Permanent failure handling: use a Merkle tree to optimize syncing: bucketize keys and build a tree where each parent is the hash of its children; if the roots differ, walk down the tree to find which buckets are out of sync, then sync only those (sketch below)
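A simplified, flat version of the Merkle-tree idea: hash each key bucket and compare bucket hashes between two replicas; a real Merkle tree would hash the bucket hashes pairwise up to a root so only differing subtrees are walked. Names and the bucket count are assumptions.

```python
import hashlib

NUM_BUCKETS = 16

def bucket_of(key: str) -> int:
    # Bucketize keys with a stable hash so both replicas agree on bucket placement.
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % NUM_BUCKETS

def bucket_hashes(store: dict) -> dict:
    """Hash each bucket's sorted contents (the leaves of the Merkle tree)."""
    buckets = {b: [] for b in range(NUM_BUCKETS)}
    for k, v in store.items():
        buckets[bucket_of(k)].append((k, v))
    return {b: hashlib.sha1(repr(sorted(items)).encode()).hexdigest()
            for b, items in buckets.items()}

def out_of_sync_buckets(replica_a: dict, replica_b: dict):
    ha, hb = bucket_hashes(replica_a), bucket_hashes(replica_b)
    # Only buckets whose hashes differ need to be compared key by key and synced.
    return [b for b in range(NUM_BUCKETS) if ha[b] != hb[b]]
```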
Write path and read path
• Write path: the client's write goes into the in-memory cache first and is later flushed to disk
• Read path: the client's read checks the memory cache first; on a miss it goes to disk
URL Shortener
• POST api/v1/shorturl with a longURL: the server checks whether the longURL is already in the DB; if not, it generates a unique ID, derives the shortURL from it, and stores the mapping in the DB
• GET shortURL: the server looks up the longURL (cache first, then DB) and returns a 301 redirect to the longURL (base-62 sketch below)
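A sketch of base-62 encoding a unique numeric ID into the short key, which is one common way to derive the shortURL; the example value is only illustrative.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 characters

def encode_base62(num: int) -> str:
    """Convert a unique numeric ID into a short URL key."""
    if num == 0:
        return ALPHABET[0]
    out = []
    while num:
        num, rem = divmod(num, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

print(encode_base62(11157))   # "2TX" with this alphabet
```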
Web Crawler
• Algorithm: given a set of seed URLs,
• Download all web pages addressed by those URLs
• Parse the web pages and extract links
• Add new links to the queue and continue (see the sketch below)
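A minimal sketch of that crawl loop, assuming the third-party `requests` library and regex-based link extraction; a real crawler would use an HTML parser, a persistent URL frontier and politeness controls.

```python
from collections import deque
import re
import requests

def store(url: str, html: str):
    """Stand-in for content storage / the indexing pipeline."""
    pass

def crawl(seed_urls, max_pages=100):
    queue = deque(seed_urls)                 # URL frontier
    visited = set(seed_urls)                 # URL-seen filter
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=5).text   # download the page
        except requests.RequestException:
            continue                                    # robustness: skip dead or malicious links
        store(url, html)
        pages += 1
        # Crude link extraction for the sketch only.
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in visited:
                visited.add(link)
                queue.append(link)
```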
• Functional Reqs
• Purpose? Search engine indexing, data mining, web monitoring
• Web pages per month: 1 billion
• What content types? HTML for now
• Storage requirements? N years
• Duplicate content? Ignore?
• NFR
• Scalability: requires parallelization
• Robustness: handle crashes, malicious links, etc.
• Politeness: too many requests to the same site should be limited
HTML Downloader
• robots.txt should be parsed and followed; cache it per host to avoid repeated downloads (see the sketch after this list)
• Distributed crawl: the URL queue is polled by multiple servers, each multithreaded
• Cache DNS resolver: cache IP addresses of websites to avoid too many DNS lookups
• Distribute servers geographically to decrease latency
• Configure timeouts, as some pages are extremely slow
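A sketch of per-host robots.txt caching with the standard library's urllib.robotparser; the cache dict and the fallback behaviour are assumptions.

```python
from urllib import robotparser
from urllib.parse import urlparse

_robots_cache = {}   # host -> parsed robots.txt, cached to avoid repeated downloads

def can_fetch(url: str, user_agent: str = "my-crawler") -> bool:
    host = urlparse(url).netloc
    rp = _robots_cache.get(host)
    if rp is None:
        rp = robotparser.RobotFileParser()
        rp.set_url(f"https://{host}/robots.txt")
        try:
            rp.read()                      # one download per host
        except OSError:
            return True                    # assumption: allow if robots.txt is unreachable
        _robots_cache[host] = rp
    return rp.can_fetch(user_agent, url)
```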
NFRs
• Robustness:
• HA using consistent hashing to distribute load among downloaders
• Save crawl states and data: can guard against traps and failures, resume failed crawl
• Exception handling: gracefully handle without crashing the system
• Data validation: can avoid traps and prevent errors
• Extensibility: extension modules in parallel to link extractor can be added
• Redundant content: Duplicate detection using hashes and other algorithms
• Spider trap detection and avoidance
• Data noise avoidance such as ads snippets, spam URLs
• Spam detection
• Stateless servers, database replication and sharding
Notification System
• What kind of notifications? Push, SMS, email
• Realtime? Or is a slight delay fine?
• Supported devices? Android, iOS, desktop
• Notification triggers? Client side or server side
• Client-side push notification framework? Third party?
• SMS and email gateways?
• Any other requirements? User registration and deregistration
• Scale?
Initial design
• Services 1..n call the notification servers, which provide the API to the services, validate requests, build the notification payload, and handle auth and rate limiting
• The notification servers send messages to third-party services (Android push, iOS push, SMS and email gateways), which deliver them to user devices
• A cache holds user info, device info and notification templates; a notification log DB records sent notifications
• Problems:
• Single point of failure
• Not scalable, as the notification server is a bottleneck and cannot store much data
• Improvements:
• Separate microservices for different notification types to decrease complexity
• Message queues and workers between the notification servers and the third-party services
• Retry mechanism: put the notification message back in the queue if sending failed, and record it in the notification log (sketch below)
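A toy sketch of the retry mechanism: a worker pulls from the queue, calls a placeholder third-party gateway function, records the outcome in the notification log, and requeues on failure. All names are illustrative.

```python
import queue

notif_queue = queue.Queue()          # stands in for the message queue
notif_log = []                       # stands in for the notification log DB

def send_via_third_party(notification) -> bool:
    """Placeholder for the FCM / APNs / SMS / email gateway call."""
    return True

def notification_worker(max_retries: int = 3):
    while True:
        notification = notif_queue.get()
        ok = send_via_third_party(notification)
        notif_log.append((notification, "sent" if ok else "failed"))
        if not ok and notification.get("retries", 0) < max_retries:
            notification["retries"] = notification.get("retries", 0) + 1
            notif_queue.put(notification)          # put the message back in the queue
        notif_queue.task_done()
```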
News feed
• Requirements:
• User can publish post and see posts from friends
• Sorted in reverse chronological order, close friends have higher score
• How many friends?
• How many DAU?
• Type of data in the feed: can be images, text, etc.
• APIs:
• Publish post: POST /v1/userid/feed
• Content of the post
• auth_token for auth
• Get posts: GET /v1/userid/feed
• auth_token
High-level design: the client sends publish-post and get-posts requests through load balancers to the web servers; the feed itself is served from a news-feed cache (a fanout-on-write sketch follows).
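A hedged sketch of fanout-on-write for the feed (one common approach, not spelled out in the notes): publishing pushes the post into each friend's cached feed, and reads simply return the cached list in reverse-chronological order. All names are assumptions.

```python
from collections import defaultdict

feed_cache = defaultdict(list)   # user_id -> list of (post_id, timestamp)
friends = defaultdict(set)       # user_id -> friend ids (from the social graph in a real system)

def publish_post(user_id, post_id, timestamp):
    """Fanout on write: push the new post into every friend's feed cache."""
    for friend_id in friends[user_id]:
        feed_cache[friend_id].append((post_id, timestamp))

def get_feed(user_id, limit=20):
    """Feed is already materialized, so reads are cheap; newest posts first."""
    return sorted(feed_cache[user_id], key=lambda p: p[1], reverse=True)[:limit]
```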
Chat system
• 1v1 or group based
• Scale?
• Group size
• 1v1 chat, group chat, online indicator
• Message size limit
• Storage of chat history needed?
• Login and use from multiple devices
• Push notifs
Basic design
• The sender connects to the chat service using some protocol:
• HTTP is fine on the sender side
• The receiver must maintain a longer-lived connection:
• Long polling: the client keeps a long connection open with the server; when a message is received, the server sends it and closes the connection
• WebSocket is the best option: a bidirectional connection between server and client, so both send and receive can be done over it
• The chat service (1) stores the message and (2) relays it to the receiver
API servers and storage
• API servers handle user auth and service discovery (which can be done with ZooKeeper); they cover service finding, authentication, user data and group data
• Chat servers store messages and relay them to receivers; message queues sit between the chat servers and the receivers, and a notification service handles notifications
• Storage holds user info, group membership and chat history: SQL for static data (user info, contacts, etc.) and NoSQL for chat history (a store-and-relay sketch follows)
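A toy, in-process sketch of the chat service's store-and-relay step, with asyncio queues standing in for the message queues and the WebSocket push to the receiver; names are assumptions.

```python
import asyncio
from collections import defaultdict

message_store = []                      # stands in for the chat-history key-value store
inbox = defaultdict(asyncio.Queue)      # receiver user_id -> per-receiver message queue

async def handle_send(sender, receiver, text):
    message_store.append((sender, receiver, text))    # 1. store the message
    await inbox[receiver].put((sender, text))          # 2. relay it to the receiver's queue

async def deliver(receiver):
    """In the real system this side is the WebSocket push to the receiver's device."""
    while True:
        sender, text = await inbox[receiver].get()
        print(f"{receiver} <- {sender}: {text}")
```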
Search autocomplete system
• Matching to be done only at the beginning of the query, or anywhere within it?
• How many suggestions?
• Which 5 suggestions?
• Input string format and validation
• Scalable and HA
• Results should return within 100 ms
• Relevant
• Sorted by popularity
Components
• Data gathering service: analytics logs are processed by aggregators into an aggregated DB
• The frequency of searches in storage can be updated weekly
• Workers build the trie from the aggregated data and store it in the trie DB
• Query service:
• Given a search query, return the 5 most frequently used terms (served from a trie cache backed by the trie DB)
• Needs to be fast
• Trie data structure (sketch below):
• Tree with the root as "" and up to 26 children per node, one for each next character
• Frequency info is also part of the nodes
• To optimize, store the top-k most frequent terms in each node
• Trie DB:
• Can be a document store like MongoDB so that serialization and snapshots are easy
• Can be NoSQL, as the tree can be stored as a hash table
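A sketch of the trie with the top-k terms cached per prefix node, assuming it is built offline by the workers from aggregated (term, frequency) data; names are illustrative.

```python
class TrieNode:
    def __init__(self):
        self.children = {}        # char -> TrieNode
        self.top_k = []           # cached top-k (frequency, term) pairs under this prefix

class Trie:
    def __init__(self, k=5):
        self.root = TrieNode()
        self.k = k

    def insert(self, term: str, freq: int):
        """Insert each aggregated term once; every prefix node keeps its k most frequent terms."""
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
            node.top_k = sorted(node.top_k + [(freq, term)], reverse=True)[:self.k]

    def suggest(self, prefix: str):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return [term for _, term in node.top_k]

trie = Trie(k=5)
for term, freq in [("tree", 10), ("true", 35), ("try", 29), ("toy", 14), ("wish", 25)]:
    trie.insert(term, freq)
print(trie.suggest("tr"))   # ['true', 'try', 'tree']
```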
Optimizations and scalability
• Browser caching
• AJAX requests so that the full page need not be refreshed
• Scaling storage:
• Sharding based on the first letter
• Combine less frequent letters into one shard
YouTube
• Upload and watch videos
• International users
• Different video resolutions
• Fast uploads
• Smooth streaming
• Change video quality
• Low cost
• HA, scalability, reliability
Upload flow
• The user uploads the video to blob storage; transcoded output is written to transcoded storage and served through the CDN; a completion queue with completion handlers processes completion events
System optimizations
• Video upload from the client can be made parallel by dividing the video into chunks and uploading them separately
• Transcoding can use fan-out/fan-in for multiple tasks, with message queues and distributed servers
• Use the CDN as the upload center: geographically distributed upload servers
• Message queues between servers to have parallelism everywhere
• Pre-signed URLs from the API servers for auth and security (example below)
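A hedged example of issuing a pre-signed upload URL with boto3, assuming an S3-compatible blob store; the bucket and key names are placeholders.

```python
import boto3

s3 = boto3.client("s3")

def presigned_upload_url(bucket: str, key: str, expires_in: int = 3600) -> str:
    """API servers hand this URL to the client so it can upload directly to blob storage."""
    return s3.generate_presigned_url(
        ClientMethod="put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )

# url = presigned_upload_url("videos-raw", "uploads/video123.mp4")
```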
Error handling
• Recoverable errors can be worked around by retrying
• Non-recoverable errors: stop all work related to the request and return an error
Google Drive
• Upload and download files.
• Sync files across devices
• See file revisions
• Share files
• Notifications sent when a file is edited, deleted or shared
• Reliability
• Sync speed
• BW usage
• Scalability
• HA
APIs
• Upload
• POST /files/upload?uploadType=resumable
• Download
• POST /files/download, params: file path
• Revisions
• POST /files/revisions, params: path, numrevisions
Design
• Keep file storage in S3 and metadata in a DB, with a metadata cache; block servers and API servers sit between the client and storage
• Files are split into blocks; when a file changes, only the corresponding blocks are changed in storage (delta sync, sketch below)
• Blocks can be compressed
• Consistency:
• The cache must be kept strongly consistent, using consensus and cache invalidation on DB writes
• The DB should be SQL to get ACID guarantees
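A sketch of block-level delta sync: split the file into fixed-size blocks, hash each one, and upload only the blocks whose hash changed. The block size and function names are assumptions.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024     # 4 MB blocks (size is an assumption)

def block_hashes(data: bytes):
    """Split a file into fixed-size blocks and hash each one."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]

def changed_blocks(old_hashes, new_data: bytes):
    """Delta sync: only blocks whose hash differs need to be uploaded again."""
    new_hashes = block_hashes(new_data)
    return [i for i, h in enumerate(new_hashes)
            if i >= len(old_hashes) or h != old_hashes[i]]
```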
Proximity service
• Diagram components: user, location service, location cache, user DB, location history DB, pub/sub
Job scheduler
Maps
• Diagram components: user, geocoding service, navigation service, location service, route planner (shortest-path service, ETA service, ranker), routing tiles, user location DB, and Kafka streams feeding user analytics, personalization and traffic updates