
System design

Contents
• Rate limiter
• Consistent hashing
• Key-value store
• Job scheduler
• Unique ID generator
• URL Shortener
• Web Crawler
• Notification System
• News feed
• Chat system
• Search autocomplete
• Youtube
• Google Drive
• Proximity service
• Nearby friends
• Maps
General questions
• How is cache coherency maintained
• SQL vs NoSQL
• Sharding and replication of databases
• Bloom filter
• DB normalization / denormalization
• ZooKeeper
Rate limiter
• Design goals:
• API rate limiter for HTTP requests to servers
• Should be global as well as granular (based on IP address / user ID)
• Rate-limit rules should be configurable
• Algorithm should be configurable
• Low latency, low memory
• Distributed
• Users should be shown a clear error (HTTP 429) when they are throttled
• Fault tolerance
Architecture (diagram)
• Clients → Rate Limiter GW (multiple instances) → API servers → response
• Requests that exceed the rate limit are rejected or placed on a queue and handled later by workers
• Rate-limit rules live in a persistent Rules DB and are kept in a rules cache on the gateways
• Rate-limit counters/state are kept in Redis, shared by all Rate Limiter GW instances
Rate limiter algorithms
• Token bucket / leaky bucket – a bucket per actor; the counter is refilled periodically and decremented when requests come in
• Sliding window – count the number of requests in the last window defined by the rule, increment on each request
• Whichever algorithm we use, counters need to be stored in a DB/Redis; Redis is better due to fast access and its INCR, DECR and EXPIRE operations
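A minimal in-memory token-bucket sketch (single process, not the distributed Redis-backed version described above; the class name, `capacity` and `refill_rate` are illustrative):

```python
import time

class TokenBucket:
    """Token bucket for one actor (e.g. one user ID or IP)."""

    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity          # max tokens the bucket can hold
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to the time elapsed since the last check.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True          # request admitted
        return False             # request rate-limited (would map to HTTP 429)

# Usage: one bucket per user, e.g. burst of 10 requests, 5 requests/second sustained.
bucket = TokenBucket(capacity=10, refill_rate=5)
print(bucket.allow())
```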
Consistent Hashing
• For horizontal scaling, need to distribute requests/data across
multiple servers
• Normal hashing (hash(key) mod n) works, but when the number of servers changes almost all keys must be redistributed
• With consistent hashing only about k/n keys need to be redistributed (k = number of keys, n = number of servers)
Basic approach (diagram: keys k0–k3 and servers s0–s4 placed on a hash ring)
• Create a hash space (ring) using a SHA hash function
• Map servers onto the hash space
• Hash keys onto the ring
• Move clockwise on the ring from a key to the next server to decide which server the key is assigned to (k0 → s0, k1/k2 → s1, ...)
• When a new server is added, only a few keys have to be reassigned
• Drawbacks: partition sizes are not the same, and key distribution is non-uniform

Virtual nodes (diagram: virtual nodes s0_0, s0_1, s1_0, s1_1, s1_2, s2_1, ... spread around the ring)
• For balancing data distribution, create virtual (fake) nodes on the ring, with multiple virtual nodes belonging to each physical server
• Resolves the hotspot problem
• With enough virtual nodes, partition sizes also become similar
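A minimal sketch of a consistent hash ring with virtual nodes (the class name and `vnodes` parameter are illustrative, not from the slides):

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, servers, vnodes=100):
        self.ring = []          # sorted list of hash positions
        self.owner = {}         # hash position -> physical server
        for server in servers:
            self.add_server(server, vnodes)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def add_server(self, server: str, vnodes: int = 100):
        # Each physical server is placed on the ring vnodes times.
        for i in range(vnodes):
            pos = self._hash(f"{server}#{i}")
            bisect.insort(self.ring, pos)
            self.owner[pos] = server

    def get_server(self, key: str) -> str:
        # Walk clockwise: first virtual node at or after the key's hash position.
        pos = self._hash(key)
        idx = bisect.bisect_right(self.ring, pos) % len(self.ring)
        return self.owner[self.ring[idx]]

ring = ConsistentHashRing(["s0", "s1", "s2", "s3"])
print(ring.get_server("user:42"))
```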
Key value store
• Small key-value pairs
• Big data should be storable
• High availability: should respond quickly even during failures
• High scalability: can be scaled to support large datasets
• Auto scaling
• Tunable consistency
• Low latency
CAP theorem (diagram: three replicated nodes n1, n2, n3)
• To store big data, distributed storage is necessary
• Since network partitions cannot be avoided, partition tolerance must be accepted; the CAP theorem then forces a trade-off between consistency and availability
• Ideally, with no network failures, C and A can both be provided, as data written to one node can be replicated to the other nodes
• If there is a partition, say n3 is unreachable: data written to n3 is not propagated to n1/n2, or data written to n1/n2 is not propagated to n3, so there will be stale data
• Consistency can be guaranteed by blocking write operations while nodes are down
• Availability ensures reads and writes always go through, at the risk of returning stale data
Components
• Replication: tunable replication factor, in hash ring add key to next R
virtual nodes, choose only unique servers. Even better if replicated on
different datacenters
• Partition: Consistent hashing, supports auto scaling and heterogeneity
(servers with higher capacity have more virtual nodes)
• Consistency: Quorum consensus, N replicas, W write quorum, R read
quorum, for write to be considered successful, W replicas must ack,
W+R>N means strong consistency
• Inconsistency resolution: for high availability we adopt eventual consistency plus inconsistency resolution using vector clocks (a small sketch follows this list)
• Failure handling: detection using a gossip protocol (details on the next slide)
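A minimal vector-clock sketch to illustrate how conflicting versions are detected (the `VectorClock` class and its method names are illustrative):

```python
class VectorClock:
    """Map of node_id -> logical counter, attached to each stored version."""

    def __init__(self, counters=None):
        self.counters = dict(counters or {})

    def increment(self, node_id: str):
        self.counters[node_id] = self.counters.get(node_id, 0) + 1

    def descends_from(self, other: "VectorClock") -> bool:
        # True if self is a (not necessarily strict) successor of other.
        return all(self.counters.get(n, 0) >= c for n, c in other.counters.items())

    def conflicts_with(self, other: "VectorClock") -> bool:
        # Neither version descends from the other -> concurrent writes, needs resolution.
        return not self.descends_from(other) and not other.descends_from(self)

a = VectorClock({"n1": 2, "n2": 1})
b = VectorClock({"n1": 1, "n2": 2})
print(a.conflicts_with(b))   # True: concurrent versions, the client/app must reconcile
```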
Failure handling
• Detection using a gossip protocol: nodes exchange heartbeat counters with random peers; if a node's heartbeat counter does not increment for a certain time, ask other random nodes about it, and once enough of them confirm, mark the node as dead
• Temporary failure handling:
  • Sloppy quorum for HA: instead of enforcing strict quorum membership, the first W and R healthy servers on the ring are used for writes and reads
  • Hinted handoff: while A is down, another node handles its requests and pushes the changes back once A is up, restoring consistency
• Permanent failure handling: use a Merkle tree to optimize anti-entropy; bucketize keys and build a tree where each parent is the hash of its children; if the roots differ, walk down the tree to find which buckets are out of sync, then sync only those (sketch below)
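A minimal Merkle-root sketch over key buckets, assuming a fixed bucket count (the function names and bucket count are illustrative):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def bucket_hashes(kv_pairs, num_buckets=8):
    """Deterministically bucket each key-value pair, then hash each bucket's contents."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in sorted(kv_pairs.items()):
        idx = int.from_bytes(h(key.encode()), "big") % num_buckets
        buckets[idx].append(f"{key}={value}")
    return [h("|".join(b).encode()) for b in buckets]

def merkle_root(leaves):
    """Repeatedly hash pairs of nodes until a single root remains."""
    level = leaves
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last node on odd levels
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica_a = {"k1": "v1", "k2": "v2"}
replica_b = {"k1": "v1", "k2": "v2-stale"}
# Equal roots -> replicas in sync; different roots -> descend to find the stale buckets.
print(merkle_root(bucket_hashes(replica_a)) == merkle_root(bucket_hashes(replica_b)))
```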
Write path and read path (diagram)
• Write path: client → (1) append to commit log → (2) write to in-memory cache (memtable) → (3) flush to SSTable on disk when the memtable is full
• Read path: client → check the memory cache first → on a miss, consult the Bloom filter to find which SSTables may contain the key → read from the SSTable → return the result (Bloom filter sketch below)
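A minimal Bloom-filter sketch as used on the read path to skip SSTables that cannot contain a key (the bit-array size and hash choices are illustrative):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means definitely absent; True means possibly present (false positives possible).
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"), bf.might_contain("user:99"))
```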
Unique ID generator in a distributed system
• ID must be unique
• 64 bit integer
• Ordered by date
• Scalable, 10000 IDs per second
Candidate approaches
• Multi-master replication: N database servers, each auto-incrementing by N so each one generates a disjoint set of IDs
  • Hard to scale with multiple datacenters
  • IDs do not increase with time across multiple servers
  • When a server is added/removed the scheme does not scale well, as the formula needs to change everywhere
• UUID: 128 bits, not numeric, no time monotonicity
• Ticket server: a separate central DB doing the auto-increment (single point of failure)
• Snowflake: divide the 64 bits into sections: 1 reserved sign bit | 41 bits timestamp | 5 bits datacenter ID | 5 bits machine ID | 12 bits sequence number
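A minimal single-process sketch of a Snowflake-style generator using the bit layout above (the epoch constant and class name are illustrative choices):

```python
import time
import threading

class SnowflakeGenerator:
    """64-bit IDs: 41-bit ms timestamp | 5-bit datacenter | 5-bit machine | 12-bit sequence."""

    EPOCH_MS = 1_577_836_800_000   # custom epoch (2020-01-01), an illustrative choice

    def __init__(self, datacenter_id: int, machine_id: int):
        assert 0 <= datacenter_id < 32 and 0 <= machine_id < 32
        self.datacenter_id = datacenter_id
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ts = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            ts = int(time.time() * 1000) - self.EPOCH_MS
            if ts == self.last_ts:
                self.sequence = (self.sequence + 1) & 0xFFF   # 12-bit sequence
                if self.sequence == 0:                        # sequence exhausted this ms
                    while ts <= self.last_ts:                 # spin to the next millisecond
                        ts = int(time.time() * 1000) - self.EPOCH_MS
            else:
                self.sequence = 0
            self.last_ts = ts
            return (ts << 22) | (self.datacenter_id << 17) | (self.machine_id << 12) | self.sequence

gen = SnowflakeGenerator(datacenter_id=1, machine_id=3)
print(gen.next_id())
```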
URL Shortener
• Shorter URL that redirects to actual URL
• 100 million URL per day
• As short as possible URL length
• Deletion/update support required
• Given long URL return short URL
• When short URL is called, redirect to long URL
• HA, scalable and fault tolerant
calculations
• Write: 100 million URLs per day ≈ 1160 URLs per second
• Read: more reads than writes; assume 11600 reads per second
• Longevity: say 10 years, then we need to support 100M × 365 × 10 ≈ 365 billion records
• Storage: at roughly 1 KB per record, about 365 TB
API endpoints (diagram: client → server request/response for each endpoint)
• Shorten URL:
  • POST api/v1/shorten
  • Param: longURL (string)
  • Returns: shortURL
• Get long URL:
  • GET api/v1/shortURL
  • Server responds with an HTTP 301/302 redirect to the long URL
Deep Dive
• Store in a relational DB with a primary key plus shortURL and longURL fields
• A relational DB works well: the workload is read-heavy, we don't need joins (so sharding is fine), and we need to fetch by both longURL and shortURL
• Conversion logic:
  • We need to store 365 billion URLs; (26+26+10)^7 = 62^7 ≈ 3.5 trillion, so a 7-character code is enough
  • Hash + collision resolution: hash the longURL; if the code already exists, append a known suffix to the longURL and hash again. Use a Bloom filter to make existence checks faster.
    • No unique-ID generator required
    • Collision paths will be slow
    • Codes are impossible to guess, so more secure
  • Base62 encoding: use the primary key (generated from a unique-ID generator) as the decimal input and convert it to base62 (sketch below)
    • Unique-ID generator required
    • Faster
    • The next URL can be guessed if the ID increments by 1 every time
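A minimal base62 encode/decode sketch for converting a numeric ID into a short code (the alphabet ordering is an illustrative choice):

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)   # 62

def encode_base62(num: int) -> str:
    """Convert a non-negative integer ID into a base62 string."""
    if num == 0:
        return ALPHABET[0]
    digits = []
    while num > 0:
        num, rem = divmod(num, BASE)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def decode_base62(code: str) -> int:
    """Convert a base62 string back into the integer ID."""
    num = 0
    for ch in code:
        num = num * BASE + ALPHABET.index(ch)
    return num

short = encode_base62(365_000_000_000)   # an ID produced by the unique-ID generator
print(short, decode_base62(short))
```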
Flow and architecture (diagram)
• Write flow: client calls api/v1/shorten with longURL → server checks whether the longURL is already in the DB → if not, generate a unique ID → base62-encode it into a shortURL → store <ID, shortURL, longURL> in the DB
• Read flow: client calls api/v1/shorturl → server looks up the shortURL in the cache, falling back to the DB → responds with a 301 redirect to the longURL
Web Crawler
• Algorithm: given a set of URLs,
  • Download all web pages addressed by those URLs
  • Parse the web pages and extract links
  • Add new links to the queue and continue
• Functional requirements
  • Purpose? Search engine indexing, data mining, web monitoring
  • Web pages per month: 1 billion
  • What content types? HTML for now
  • Storage requirements? N years
  • Duplicate content? Ignore?
• NFR
• Scalability: require parallelization for scalability
• Robustness: crashes, malicious links etc
• Politeness: too many requests to same site should be limited
Estimations
• 1B pages per month
• QPS = 1 billion / 30 days / 24 / 3600 ≈ 400 pages per second
• Average web page size: 1 MB
• ≈ 1000 TB storage per month, i.e. ~60 PB over roughly 5 years of retention

Architecture (diagram): Seed URLs → URL queue → HTML Downloader → HTML Parser → content matcher (dedup against content storage) → link extractor → URL matcher (dedup against URL storage) → new URLs back into the URL queue
URL queue (frontier)
• Politeness: avoid hitting the same site too often; simple to implement with a delay between requests
• Priority: prioritize important webpages
• Freshness: recrawl based on priority so content stays up to date
• Storage on disk, as the queue would be huge; write-heavy, so NoSQL is a better fit, with buffers for reads and writes

Front queues (priority, diagram): a prioritizer takes URLs as input and computes a priority; each queue holds elements of the same priority; a queue selector picks queues randomly, weighted by priority.

Back queues (politeness, diagram): a queue router ensures each queue only contains URLs from the same host; a mapping table maps host → queue; each worker is dedicated to a specific host, and delays are added between its requests for politeness.
HTML Downloader
• Robots.txt should be parsed and followed; cache it to avoid repeated downloads (sketch below, after this list)
• Distributed crawl: URL Q is polled by different servers which are multithreaded
• Cache DNS resolver: cache IP address for websites to avoid too many DNS
lookups
• Distribute servers geographically to decrease latency
• Configure timeouts as some pages are extremely slow
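A minimal sketch of robots.txt handling with a per-host cache, using Python's standard urllib.robotparser (the cache dict and user-agent string are illustrative):

```python
from urllib import robotparser
from urllib.parse import urlparse

_robots_cache = {}   # host -> RobotFileParser, cached to avoid re-downloading robots.txt

def can_fetch(url: str, user_agent: str = "MyCrawler") -> bool:
    host = urlparse(url).netloc
    rp = _robots_cache.get(host)
    if rp is None:
        rp = robotparser.RobotFileParser()
        rp.set_url(f"https://{host}/robots.txt")
        rp.read()                 # downloads and parses robots.txt once per host
        _robots_cache[host] = rp
    return rp.can_fetch(user_agent, url)

if can_fetch("https://example.com/some/page.html"):
    pass   # hand the URL to the HTML downloader
```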
NFRs
• Robustness:
• HA using consistent hashing to distribute load among downloaders
• Save crawl states and data: can guard against traps and failures, resume failed crawl
• Exception handling: gracefully handle without crashing the system
• Data validation: can avoid traps and prevent errors
• Extensibility: extension modules in parallel to link extractor can be added
• Redundant content: Duplicate detection using hashes and other algorithms
• Spider trap detection and avoidance
• Data noise avoidance such as ads snippets, spam URLs
• Spam detection
• Stateless servers, database replication and sharding
Notification System
• What kinds of notifications? Push, SMS, email
• Real-time, or is a slight delay fine?
• Supported devices? Android, iOS, desktop
• Notification triggers? Client side or server side
• Client-side push notification framework? 3rd party
• SMS and email gateways?
• Any other requirements? User registration and deregistration
• Scale?
Initial design (diagram: services 1..n → notification servers → per-channel message queues (Android, iOS, SMS, email) → workers → 3rd-party services/gateways → devices)
• Notification servers:
  • Provide APIs to the calling services
  • Validation
  • Build the notification payload
  • Auth and rate limiting
• Cache: user info, device info, notification templates
• DB: notification log
• Problems with a single notification server:
  • Single point of failure
  • Not scalable: the notification server becomes a bottleneck and can't store much data
• Should have microservices for the different notification types to decrease complexity
• Retry mechanism: put the message back in the queue if the send failed, and record it in the notification log
News feed
• Requirements:
• User can publish post and see posts from friends
• Sorted in reverse chronological order, close friends have higher score
• How many friends
• How many DAU
• Type of data in feed can be images text etc
• APIS:
• Publish post: POST /v1/userid/feed
• Content of post
• Auth_token for auth
• Get posts: GET /v1/userid/feed
• Auth_token
Publish post (diagram: client → LB → web servers → post service + publish service → MQ → publish workers → newsfeed cache; post DB and user data in SQL)
• A graph DB (neo4j) for the friend-ID mapping avoids joins and optimizes fetching relations between objects
• The newsfeed cache contains <post_id, user_id> entries
• Publish (fanout) can be push or pull: for ordinary users push is fine; for hotspots, pull is better (sketch after this list)
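A minimal fanout-on-write sketch using in-memory dicts in place of the graph DB and newsfeed cache (all names and the hotspot threshold are illustrative; a real publish worker would read friend IDs from neo4j and push into the cache):

```python
from collections import defaultdict

# Stand-ins for the friend-ID mapping (graph DB) and the newsfeed cache.
friends = {"alice": ["bob", "carol"]}       # author -> friend/follower IDs
newsfeed_cache = defaultdict(list)          # user_id -> post_ids, newest first
HOTSPOT_THRESHOLD = 10_000                  # illustrative follower-count cutoff

def publish_post(author: str, post_id: str):
    """Push model: write the post_id into each friend's cached feed at publish time."""
    followers = friends.get(author, [])
    if len(followers) > HOTSPOT_THRESHOLD:
        # Hotspot user: skip fanout; followers pull this user's posts at read time instead.
        return
    for friend in followers:
        newsfeed_cache[friend].insert(0, post_id)

publish_post("alice", "post:123")
print(newsfeed_cache["bob"])                # ['post:123']
```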
Get posts (diagram: client → CDN for media; client → LB → web servers → newsfeed service → newsfeed cache, friend-ID mapping, post DB, user data)
• Auth and rate limiting happen at the web servers
• Get the post_ids from the newsfeed cache
• Get the rich info from the user data store, the post DB and the CDN
Chat system
• 1v1 or group based
• Scale?
• Group size
• 1v1 chat, group chat online indicator
• Message size limit
• Storage of chat history needed?
• Login and use from multiple devices
• Push notifs
Basic design (diagram: sender → chat service → receiver; the chat service (1) stores the message and (2) relays it)
• The sender connects to the chat service using some protocol; plain HTTP is fine on the sender side
• The receiver must maintain a longer-lived connection:
  • Long polling: the client keeps a long connection open with the server; when a message arrives, the server sends it and closes the connection
  • A WebSocket connection is the best option: it is bidirectional between server and client, so both send and receive can be done over it
API servers and storage (diagram: sender/receiver ↔ chat service (store + relay, NoSQL) and API servers (SQL for static data), with a notification service alongside)
• API servers handle user auth and service discovery (service finder, authentication, user data, group data)
• Service discovery can be done with ZooKeeper
• Storage holds user info, group membership and chat history
• SQL for user info, contacts and other static data; can be scaled using sharding and replication
• Chat message data is enormous and does not require relational patterns; random access and low latency are required, so use NoSQL like Cassandra
Data model
• Message table: message_id, message_from, message_to, content, timestamp
• Group_message table: group_id, message_id, message_to, content, timestamp
• message_id is used for ordering; it needs to be unique within a group and can be generated locally within the group
• message_id can also be used to sync between multiple devices of the same user
1v1 message flow (diagram): sender → chat service (store + relay, persists to NoSQL) → receiver's message queue → receiver.
Group message flow (diagram): sender → chat service (store + relay, persists to NoSQL) → the message is copied into each group member's message queue → receivers.
Search autocomplete system
• Should matching be done only at the beginning of the input, or anywhere within the words?
• How many suggestions
• Which 5 suggestions
• Input string format and validation
• Scalable and HA
• Within 100ms result should return
• Relevant
• Sorted by popularity
Components (diagram: analytics logs → aggregation workers → aggregated DB → trie-building workers → trie DB → trie cache → query service)
• Data gathering service:
  • Frequencies of searches are updated in storage; this can be done weekly
  • Workers build the trie and store it in the trie DB
• Query service:
  • Given a search query, return the 5 most frequently used terms
  • Needs to be fast
• Trie data structure (sketch after this list):
  • Tree with root "" and up to 26 children per node, each child adding one character
  • Frequency info is also part of the nodes
  • To optimize, cache the top-k most frequent completions inside each node
• Trie DB:
  • Can be a document store like MongoDB so that serialization and snapshots are easy
  • Can be NoSQL, as the tree can be stored as a hash table
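A minimal trie sketch with frequencies and top-k lookup (without the per-node top-k cache optimization mentioned above; class and method names are illustrative):

```python
import heapq

class TrieNode:
    __slots__ = ("children", "freq")
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.freq = 0        # > 0 only for nodes that end a complete query

class AutocompleteTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term: str, freq: int):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.freq += freq

    def top_k(self, prefix: str, k: int = 5):
        node = self.root
        for ch in prefix:                     # walk down to the prefix node
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [(node, prefix)]
        while stack:                          # collect all completions under the prefix
            cur, term = stack.pop()
            if cur.freq:
                results.append((cur.freq, term))
            for ch, child in cur.children.items():
                stack.append((child, term + ch))
        return [t for _, t in heapq.nlargest(k, results)]

trie = AutocompleteTrie()
for term, freq in [("tree", 10), ("true", 35), ("try", 29), ("toy", 14)]:
    trie.insert(term, freq)
print(trie.top_k("tr"))   # ['true', 'try', 'tree']
```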
Optimizations and scalability
• Browser caching
• AJAX request so that full page need not be refreshed
• Scaling storage:
• sharding based on first alphabet
• Combine less frequent ones into 1 shard
Youtube
• Upload and watch videos
• International users
• Different video resolutions
• Fast uploads
• Smooth streaming
• Change video quality
• Low cost
• HA, scalability, reliability
Upload flow (diagram)
• The user uploads the raw video to blob storage via the upload module; metadata goes through the API server into the metadata DB and metadata cache
• Transcoding/encoding servers read from blob storage, write the output to transcoded storage, and publish events to a completion queue handled by completion handlers (which update the metadata)
• Transcoded storage is distributed to the CDN
• For video streaming, read from metadata storage and stream from the corresponding CDN
System optimizations
• Video upload from the client can be made parallel by dividing the video into chunks and uploading them concurrently
• Transcoding can use fan-out/fan-in for multiple tasks, with message queues and distributed servers
• Use the CDN as the upload center: geographically distributed upload servers
• Message queues between servers to get parallelism everywhere
• Pre-signed URLs from the API servers for auth and security (sketch below)
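A minimal sketch of issuing a pre-signed upload URL, assuming S3-style blob storage accessed via boto3 (the bucket name, key layout and expiry are illustrative):

```python
import boto3

s3 = boto3.client("s3")

def create_upload_url(video_id: str, expires_seconds: int = 3600) -> str:
    """API server issues a short-lived URL; the client PUTs the raw video directly to blob storage."""
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "raw-videos", "Key": f"uploads/{video_id}.mp4"},
        ExpiresIn=expires_seconds,
    )

# The client then uploads with e.g.: requests.put(url, data=chunk_bytes)
url = create_upload_url("video123")
print(url)
```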
Error handling
• Recoverable errors can be worked around by retry
• Non recoverable errors: stop all work related to this request and
return error
Google Drive
• Upload and download files.
• Sync files across devices
• See file revisions
• Share files
• Notification sent when a file is edited, deleted or shared
• Reliability
• Sync speed
• BW usage
• Scalability
• HA
APIs
• Upload: POST /files/upload?uploadType=resumable
• Download: POST /files/download, param: file path
• Revisions: POST /files/revisions, params: path, numRevisions
Design (diagram: clients → block servers and API servers → S3 file storage, metadata DB and metadata cache)
• Keep file storage in S3 and metadata in a DB
• A file is split into blocks; when the file changes, only the corresponding blocks are updated in storage (delta sync, sketch after this list)
• Blocks can be compressed
• Consistency:
  • The cache is kept strongly consistent using consensus and invalidation on DB writes
  • The DB should be SQL to get ACID
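A minimal delta-sync sketch: split a file into fixed-size blocks, hash each one, and upload only the blocks whose hashes changed (the block size and function names are illustrative):

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024   # 4 MB blocks, an illustrative choice

def block_hashes(data: bytes):
    """Return the list of SHA-256 hashes, one per fixed-size block."""
    return [
        hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
        for i in range(0, len(data), BLOCK_SIZE)
    ]

def changed_blocks(old_hashes, new_data: bytes):
    """Indices of blocks whose content differs from the previously synced version."""
    new_hashes = block_hashes(new_data)
    return [
        i for i, h in enumerate(new_hashes)
        if i >= len(old_hashes) or old_hashes[i] != h
    ]

previous = block_hashes(b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE)
updated = b"A" * BLOCK_SIZE + b"C" * BLOCK_SIZE      # only the second block changed
print(changed_blocks(previous, updated))              # [1] -> upload just block 1
```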
Proximity service (diagram: user → LB → location service and business service → geohash table (Redis cluster with read replicas) and business info table (DB cluster, master + replicas))
• Sharding is fine for the business table; SQL works since only quick reads are required
• For the geohash table we may not need sharding, as the dataset will not be that big; it is better to add read replicas to make reads fast (sketch below)
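A minimal proximity sketch: in the real design a geohash index narrows the candidate set first; here a plain haversine-distance filter over those candidates illustrates the final ranking step (all data and names are illustrative):

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def nearby(user_lat, user_lon, candidates, radius_km=5.0):
    """Filter geohash-matched candidates by true distance and sort nearest-first."""
    hits = [
        (haversine_km(user_lat, user_lon, b["lat"], b["lon"]), b["name"])
        for b in candidates
    ]
    return sorted((d, n) for d, n in hits if d <= radius_km)

businesses = [
    {"name": "Cafe A", "lat": 37.7750, "lon": -122.4195},
    {"name": "Gym B",  "lat": 37.8044, "lon": -122.2712},
]
print(nearby(37.7749, -122.4194, businesses))
```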
Nearby friends (diagram: user → LB → WebSocket servers and API servers → Redis pub/sub, location cache, location history DB, user DB)
• Pub/sub: each user publishes their latest location to their own channel, which their friends subscribe to (sketch after this list)
• The location cache only keeps recent data; it is queried on startup and then periodically refreshed
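A minimal Redis pub/sub sketch for the per-user location channel, using redis-py (the channel naming and payload format are illustrative):

```python
import json
import redis

r = redis.Redis()

def publish_location(user_id: str, lat: float, lon: float):
    """Called by the WebSocket server whenever a client reports a new location."""
    r.publish(f"loc:{user_id}", json.dumps({"lat": lat, "lon": lon}))

def subscribe_to_friends(friend_ids):
    """Each friend's updates arrive on that friend's own channel."""
    p = r.pubsub()
    p.subscribe(*[f"loc:{fid}" for fid in friend_ids])
    return p

sub = subscribe_to_friends(["user42", "user43"])
publish_location("user42", 37.7749, -122.4194)
for message in sub.listen():                      # blocking loop; push updates to the client
    if message["type"] == "message":
        print(message["channel"], message["data"])
```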
Job scheduler
Maps (diagram)
• User-facing services: geocoding service and navigation service
• Navigation service → route planner, composed of the shortest-path service, the ETA service and a ranker, working over routing tiles
• Location service ingests user locations via Kafka into the user location DB, feeding analytics, personalization and traffic updates
