Cassandra PPT Final
PRUTHA DESHPANDE-42
VAISHNAVI DESHMUKH-41
AMEYA DATE-33
Introduction
• Apache Cassandra is an open source distributed database management
system designed to handle large amounts of data across many commodity
servers, providing high availability with no single point of failure.
• Facebook developed Cassandra to meet its reliability and scalability
needs.
• It was originally designed to fulfil the storage needs of the inbox search
problem (inbox search lets users search through their Facebook inbox).
MySQL Cassandra
KEY POINTS
• NoSQL databases commonly follow a key-value storage model.
• NoSQL databases can partition data across servers and scale by adding more servers.
• NoSQL databases are generally schemaless.
• NoSQL databases support replication, which protects against loss of data.
• A table in Cassandra is a distributed multidimensional map indexed by a key. The value is a highly
structured object.
• The row key in a table is a string with no size restrictions, although it is typically 16 to 36 bytes long. Every
operation under a single row key is atomic per replica, no matter how many columns are read or written.
• Columns are grouped together into sets called column families.
• Cassandra exposes two kinds of column families: Simple and Super column families. A Super column family
can be visualized as a column family within a column family.
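The data model above can be sketched as nested maps. This is only an illustration, not how Cassandra stores data internally: the helper names (make_table, build_example_table, messages_containing), the row key "user:42", and the column family names are all hypothetical. The "inbox_terms" family loosely mirrors the inbox search use case, with each search term as a super column holding message ids as columns.

```python
from collections import defaultdict


def make_table():
    """Return an empty table: row key -> column family -> column -> value."""
    return defaultdict(lambda: defaultdict(dict))


def build_example_table():
    """Build a tiny table with one simple and one super column family."""
    table = make_table()
    # Simple column family: column names map directly to values.
    table["user:42"]["profile"]["name"] = "Asha"
    table["user:42"]["profile"]["city"] = "Pune"
    # Super column family (hypothetical layout): each super column is a
    # search term, and it holds its own set of columns (message ids).
    table["user:42"]["inbox_terms"]["hello"] = {"msg-1001": "", "msg-1007": ""}
    table["user:42"]["inbox_terms"]["lunch"] = {"msg-1007": ""}
    return table


def messages_containing(table, user_key, term):
    """Return the sorted message ids filed under one term for one user."""
    super_columns = table.get(user_key, {}).get("inbox_terms", {})
    return sorted(super_columns.get(term, {}))
```

Everything for one user lives under a single row key, which is why a search such as messages_containing(table, "user:42", "hello") touches only one row.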
The design goal of Cassandra is to handle big data workloads across multiple nodes
without any single point of failure.
• All the nodes in a cluster play the same role. Each node is independent and at the
same time interconnected to other nodes.
• Each node in a cluster can accept read and write requests, regardless of where the
data is actually located in the cluster.
• When a node goes down, read/write requests can be served from other nodes in
the network.
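A minimal sketch of the peer-to-peer behaviour described above, under simplifying assumptions: the Node and Cluster classes are hypothetical, replica placement is a toy byte-sum hash, and bringing a recovered node back up to date is ignored. It shows only that any live node can accept a request and that reads survive the loss of one replica.

```python
class Node:
    """One node: a name, an up/down flag, and its local key-value storage."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.store = {}


class Cluster:
    """All nodes play the same role; any live node can coordinate a request."""

    def __init__(self, names, copies=2):
        self.order = list(names)
        self.nodes = {name: Node(name) for name in self.order}
        self.copies = copies

    def replicas_for(self, key):
        """Pick `copies` consecutive nodes (toy placement, assumption)."""
        start = sum(key.encode("utf-8")) % len(self.order)
        picked = [self.order[(start + i) % len(self.order)]
                  for i in range(self.copies)]
        return [self.nodes[name] for name in picked]

    def _check_coordinator(self, coordinator):
        if not self.nodes[coordinator].up:
            raise ConnectionError(f"node {coordinator} is down")

    def write(self, coordinator, key, value):
        """Accept a write on any live node and store it on live replicas."""
        self._check_coordinator(coordinator)
        live = [node for node in self.replicas_for(key) if node.up]
        if not live:
            raise RuntimeError("no live replica for this key")
        for node in live:
            node.store[key] = value
        return len(live)

    def read(self, coordinator, key):
        """Accept a read on any live node and answer from a live replica."""
        self._check_coordinator(coordinator)
        for node in self.replicas_for(key):
            if node.up and key in node.store:
                return node.store[key]
        raise KeyError(key)
```

The coordinator does not need to hold the data itself; it only forwards the request to the replicas that do.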
ARCHITECTURE
• Partitioning: One of the key design features of Cassandra is the ability to scale incrementally. This
requires the ability to dynamically partition the data over the set of nodes (i.e., storage hosts) in the
cluster.
• Replication: Cassandra stores replicas on multiple nodes to ensure
reliability and fault tolerance. A replication strategy determines the nodes
where replicas are placed. The total number of replicas across the cluster is
referred to as the replication factor.
• A replication factor of 1 means there is only one copy of each row in the cluster.
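Partitioning and replication can be illustrated together with a small consistent-hashing ring. This is a sketch under stated assumptions, not Cassandra's implementation: the Ring class and token function are hypothetical, MD5 stands in for the real partitioner, each node owns a single token, and replicas are simply the next nodes clockwise on the ring.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a 64-bit position on the ring (MD5 is an assumption)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Toy hash ring: one token per node, replicas placed clockwise."""

    def __init__(self, nodes, replication_factor=1):
        nodes = list(nodes)
        if not 1 <= replication_factor <= len(nodes):
            raise ValueError("replication factor must be between 1 and node count")
        self.replication_factor = replication_factor
        self._ring = sorted((token(node), node) for node in nodes)
        self._tokens = [t for t, _ in self._ring]

    def add_node(self, node):
        """Scale incrementally: the new node takes over one slice of the ring."""
        bisect.insort(self._ring, (token(node), node))
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, row_key):
        """Return the nodes holding a row: its owner, then the next ones clockwise."""
        start = bisect.bisect_left(self._tokens, token(row_key)) % len(self._ring)
        return [self._ring[(start + k) % len(self._ring)][1]
                for k in range(self.replication_factor)]
```

With a replication factor of 1, replicas() returns a single node per row. Adding a node moves only the keys that fall into the new node's slice, which is what makes incremental scaling cheap.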
TABLE OPERATIONS
CREATING A TABLE
ALTERING A TABLE
ALTER (TABLE | COLUMNFAMILY) <tablename> <instruction>
• Adding a column
ALTER TABLE <tablename>
ADD <new_column> <datatype>;
• Dropping a column
ALTER TABLE <tablename>
DROP <column_name>;
DROPPING A TABLE
DROP TABLE <tablename>;
APPLICATIONS
• Messaging - Cassandra handles very large volumes of data, so it suits companies
that provide mobile phone and messaging services, which generate huge amounts
of data.
• High-speed applications - Cassandra can ingest data arriving at very high speed,
so it suits applications that receive data from many different devices or
sensors.
• Product catalogs and retail apps - Many retailers use Cassandra for durable
shopping cart protection and fast product catalog input and output.
• Social media analytics and recommendation engines - Many online companies and
social media providers use Cassandra for analysis and for recommendations to
their customers.
CASE STUDY-
HOW UBER MANAGES A MILLION WRITES PER SECOND USING MESOS AND CASSANDRA ACROSS MULTIPLE
DATACENTERS
• Since 2010, Uber has served over 14 billion rides, generating and processing
a large amount of data every single day.
• Uber built its own system that runs Cassandra on top of Mesos.
MESOS
• Mesos is a data center OS that lets you program against your datacenter as if it were a single
pool of resources.
• At the time, Mesos was proven to run on tens of thousands of machines, which was one of
Uber's requirements, so Uber chose Mesos. Today Kubernetes could probably work
too.
• Uber has built its own sharded database on top of MySQL, called Schemaless.
• The idea is that Cassandra and Schemaless will be the two data storage options at Uber.
• Uber has about 20 Cassandra clusters now and plans to have 100 in the future.
WHY ARE MESOS AND CASSANDRA USED?
• Uber found little difference (5-10% overhead) between
running Cassandra on bare metal and running Cassandra in a container
managed by Mesos.
• Performance is good: mean read latency is 13 ms and mean write latency is 25 ms.
On its largest clusters, Uber can support more than a million
writes/sec and ~100k reads/sec.
• It is very easy to create and run workloads across clusters.
SPECIFIC USAGE OF CASSANDRA
• Geospatial Data
• Real Time Analytics
• Caching and Quick Data Retrieval
• Data Sharding
• Fault Tolerance
• Consistency and Reliability
• Scalability and High Availability
PERSONALIZATION AT SPOTIFY USING CASSANDRA
After development and testing, the implementation was successfully rolled out to several production
Cassandra clusters at Instagram, where latency was much lower and more consistent.