PPT 2.2.1

Computer Science & Engineering
CHANDIGARH UNIVERSITY, MOHALI

Big Data Analytics
21CSH-471

By: Urvashi
Assistant Professor (Chandigarh University)
Contents to be covered in UNIT 2
UNIT-2: Big Data Technologies (Contact Hours: 15)

Chapter 1 - Big Data Frameworks: Hadoop, Apache Spark, and their comparison; NoSQL databases: MongoDB, Cassandra, and HBase; Big Data Visualization Tools: Tableau, Power BI, and Zeppelin; Real-Time Big Data Processing: Apache Storm and Flink; Emerging trends in Big Data Technologies.

Chapter 2 - Big SQL and NoSQL Databases: Overview of SQL vs. NoSQL: differences and use cases; Introduction to Big SQL: Big SQL features (scalability, support for structured and unstructured data), query optimization techniques in Big SQL; NoSQL database types: key-value stores (Redis, DynamoDB), document stores (MongoDB, CouchDB), column-family stores (Cassandra, HBase), graph databases (Neo4j); Advantages and limitations of Big SQL and NoSQL.

Chapter 3 - AI in Big Data: Introduction to IBM Watson: overview and capabilities of Watson AI, Watson's role in Big Data and decision-making; Key Watson services: Watson Discovery, Watson Studio, and Watson Assistant, integration of Watson with Big Data tools; AI and Machine Learning applications in Big Data: Natural Language Processing (NLP), sentiment analysis and predictive analytics.
Course Outcomes

CO1 Understand the Fundamentals of Big Data.

CO2 Master Big Data Architecture and Tools

CO3 Explore the Hadoop Ecosystem and Data Processing Models

CO4 Develop Data Science Skills and Tools

CO5 Implement Real-Time Data Analytics and Visualization

NoSQL databases and Big Data Features
Introduction
• NOSQL: "Not Only SQL"
• Most NOSQL systems are distributed databases or distributed storage systems
• Focus on semi-structured data storage, high performance, availability, data replication, and scalability
Introduction (cont'd.)
NOSQL systems focus on storage of “big data”
• Typical applications that use NOSQL
• Social media
• Web links
• User profiles
• Marketing and sales
• Posts and tweets
• Road maps and spatial data
• Email
Introduction to NOSQL Systems
• BigTable
• Google's proprietary NOSQL system
• Column-based or wide column store
• DynamoDB (Amazon)
• Key-value data store
• Cassandra (Facebook)
• Uses concepts from both key-value
store and column-based systems
Introduction to NOSQL Systems
Categories of NOSQL systems
• Document-based NOSQL systems
• NOSQL key-value stores
• Column-based or wide column NOSQL systems
• Graph-based NOSQL systems
• Hybrid NOSQL systems
• Object databases
• XML databases
The CAP Theorem
• Various levels of consistency exist among replicated data items
• Enforcing serializability is the strongest form of consistency
  • High overhead; can reduce read/write operation performance
• CAP theorem
  • Consistency, availability, and partition tolerance
  • Not possible to guarantee all three simultaneously in a distributed system with data replication
The CAP Theorem (cont'd.)
• Designer can choose two of three to guarantee
• Weaker consistency level is often
acceptable in NOSQL distributed data store
• Guaranteeing availability and partition
tolerance more important
• Eventual consistency often adopted
Document-Based NOSQL Systems and MongoDB
• Document stores
• Collections of similar documents
• Individual documents resemble complex objects
or XML documents
• Documents are self-describing
• Can have different data elements
• Documents can be specified in various formats
• XML
• JSON
MongoDB Data Model
• Documents stored in binary JSON (BSON) format
• Individual documents stored in a collection
• Example command: db.createCollection(<collection_name>, <options>)
  • First parameter specifies the name of the collection
  • Collection options include limits on the size and number of documents
• Each document in a collection has a unique ObjectId field called _id
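Below is a minimal pymongo sketch of creating a collection with size and document-count limits, as described above. The connection URI, the "company" database, and the "project" collection name are illustrative assumptions, not part of the slides.

from pymongo import MongoClient

# Assumed: a local MongoDB instance; adjust the URI for your deployment.
client = MongoClient("mongodb://localhost:27017")
db = client["company"]

# Capped collection: bounded both by total size in bytes and by document count.
projects = db.create_collection("project", capped=True, size=1_048_576, max=500)

# Every inserted document gets a unique ObjectId in its _id field if none is supplied.
result = projects.insert_one({"Pname": "ProductX", "Plocation": "Bellaire"})
print(result.inserted_id)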
MongoDB Data Model (cont'd.)
• A collection does not have a schema
• Structure of the data fields in documents is chosen based on how documents will be accessed
• User can choose a normalized or denormalized design
• Document creation using the insert operation:
  db.<collection_name>.insert(<document(s)>)
• Document deletion using the remove operation:
  db.<collection_name>.remove(<condition>)
Figure 24.1 Example of simple documents in MongoDB

(a) Denormalized project document with an array of embedded workers:
{
  _id: "P1",
  Pname: "ProductX",
  Plocation: "Bellaire",
  Workers: [
    { Ename: "John Smith", Hours: 32.5 },
    { Ename: "Joyce English", Hours: 20.0 }
  ]
}

(b) Project document with an embedded array of worker ids (document references):
{
  _id: "P1",
  Pname: "ProductX",
  Plocation: "Bellaire",
  WorkerIds: [ "W1", "W2" ]
}
{ _id: "W1", Ename: "John Smith", Hours: 32.5 }
{ _id: "W2", Ename: "Joyce English", Hours: 20.0 }

(c) Normalized project and worker documents (not a fully normalized design for M:N relationships):
{ _id: "P1", Pname: "ProductX", Plocation: "Bellaire" }
{ _id: "W1", Ename: "John Smith", ProjectId: "P1", Hours: 32.5 }
{ _id: "W2", Ename: "Joyce English", ProjectId: "P1", Hours: 20.0 }

(d) Inserting the documents in Figure 24.1(c) into their collections "project" and "worker":
db.project.insert( { _id: "P1", Pname: "ProductX", Plocation: "Bellaire" } )
db.worker.insert( [
  { _id: "W1", Ename: "John Smith", ProjectId: "P1", Hours: 32.5 },
  { _id: "W2", Ename: "Joyce English", ProjectId: "P1", Hours: 20.0 }
] )
MongoDB Distributed Systems Characteristics
• Two-phase commit method
  • Used to ensure atomicity and consistency of multidocument transactions
• Replication in MongoDB
  • Concept of a replica set to create multiple copies on different nodes
  • Variation of the master-slave approach
  • Primary copy, secondary copies, and arbiter
    • Arbiter participates in elections to select a new primary if needed
MongoDB Characteristics (cont'd.)
• Sharding in MongoDB
  • Partitioning field (shard key) must exist in every document in the collection and must have an index
  • Range partitioning
    • Creates chunks by specifying a range of key values
    • Works best with range queries
  • Hash partitioning
    • Partitioning based on the hash value of the shard key
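Below is a hedged sketch of declaring shard keys from Python. It assumes an already-deployed sharded cluster reached through a mongos router at the URI shown, and it reuses the illustrative company.worker collection; none of these names come from the slides.

from pymongo import MongoClient

# Assumed: a mongos router of an existing sharded cluster.
client = MongoClient("mongodb://localhost:27017")

# Enable sharding for the (illustrative) "company" database.
client.admin.command("enableSharding", "company")

# Range partitioning: chunks are created over ranges of the shard key ProjectId.
client.admin.command("shardCollection", "company.worker", key={"ProjectId": 1})

# Hash partitioning would instead declare a hashed shard key, e.g.:
# client.admin.command("shardCollection", "company.worker", key={"ProjectId": "hashed"})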
NOSQL Key-Value Stores
• Key-value stores focus on high performance, availability, and scalability
• Can store structured, unstructured, or semi-structured data
• Key: unique identifier associated with a data item
  • Used for fast retrieval
• Value: the data item itself
  • Can be a string or array of bytes
  • Application interprets the structure
• No query language
DynamoDB Overview
• DynamoDB part of Amazon's Web Services/SDK
platforms
• Proprietary
• Table holds a collection of self-describing items
• Item consists of attribute-value pairs
• Attribute values can be single or multi-valued
• Primary key used to locate items within a table
• Can be single attribute or pair of attributes
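Below is a minimal boto3 sketch of the item and primary-key ideas above. The Worker table, its single-attribute partition key Ename, and the region are assumptions made for illustration (the table must already exist).

import boto3

# Assumed: an existing DynamoDB table named "Worker" with partition key "Ename".
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("Worker")

# An item is a set of attribute-value pairs; attributes can be single- or multi-valued.
table.put_item(Item={
    "Ename": "John Smith",
    "ProjectIds": ["P1", "P2"],   # multi-valued attribute
    "Hours": 32,
})

# The primary key (here the single attribute Ename) locates an item within the table.
response = table.get_item(Key={"Ename": "John Smith"})
print(response.get("Item"))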
Voldemort Key-Value Distributed Data Store
• Voldemort open source key-value system similar
to DynamoDB
• Voldemort features
• Simple basic operations (get, put, and delete)
• High-level formatted data values
• Consistent hashing for distributing (key,
value) pairs
• Consistency and versioning
• Concurrent writes allowed
• Each write associated with a vector clock
Figure 24.2 Example of consistent hashing
(a) Ring with three nodes A, B, and C, with C having greater capacity. The h(K) values that map to circle points in range 1 have their (k, v) items stored in node A, range 2 in node B, and range 3 in node C.
(b) Adding a node D to the ring. Items in range 4 are moved to node D from node B (range 3 is reduced) and from node C (range 2 is reduced).
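Below is a small, self-contained Python sketch of consistent hashing in the spirit of Figure 24.2; the hash function, node names, and the number of points per node are illustrative choices, not Voldemort's actual implementation.

import bisect
import hashlib

def h(key):
    # Map a string key to a point on the ring (first 8 hex digits of its MD5 hash).
    return int(hashlib.md5(key.encode()).hexdigest()[:8], 16)

class ConsistentHashRing:
    def __init__(self, nodes, points_per_node=3):
        # Each node is placed at several points on the ring; a higher-capacity node
        # (like node C in Figure 24.2) could simply be given more points.
        self.ring = sorted((h(f"{node}#{i}"), node)
                           for node in nodes for i in range(points_per_node))
        self.points = [p for p, _ in self.ring]

    def node_for(self, key):
        # A (k, v) item is stored on the first node clockwise from the key's hash point.
        i = bisect.bisect(self.points, h(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["A", "B", "C"])
for k in ["W1", "W2", "P1"]:
    print(k, "is stored on node", ring.node_for(k))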
Examples of Other Key-Value Stores
• Oracle key-value store
  • Oracle NOSQL Database
• Redis key-value cache and store (see the sketch after this list)
  • Caches data in main memory to improve performance
  • Offers master-slave replication and high availability
  • Offers persistence by backing up the cache to disk
• Apache Cassandra
  • Offers features from several NOSQL categories
  • Used by Facebook and others
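Below is a tiny redis-py sketch of the Redis cache-and-store usage described above; the local server address and the key names are assumptions for illustration.

import redis

# Assumed: a Redis server running locally on the default port.
r = redis.Redis(host="localhost", port=6379)

# Keys map to opaque values; the application interprets the structure.
r.set("session:ansh", "logged-in", ex=3600)   # cached value, expires after one hour
r.set("worker:W1", '{"Ename": "John Smith", "Hours": 32.5}')

print(r.get("worker:W1"))   # b'{"Ename": "John Smith", "Hours": 32.5}'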
NOSQL Systems: Column-Based or Wide Column Stores
• BigTable: Google's distributed storage system for big data
  • Used in Gmail
  • Uses the Google File System for data storage and distribution
• Apache HBase is a similar, open-source system
  • Uses the Hadoop Distributed File System (HDFS) for data storage
  • Can also use Amazon's Simple Storage Service (S3)
Reference Books

TEXT BOOKS
1. Mohammed Guller, "Big Data Analytics with Spark", Apress, 2015.
2. Tom Mitchell, "Machine Learning", McGraw Hill, 3rd Edition, 1997.
3. Michael Minelli, Michele Chambers, Ambiga Dhiraj, "Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Business", 1st Edition, Wiley CIO Series, 2013.
4. Arvind Sathi, "Big Data Analytics: Disruptive Technologies for Changing the Game", 1st Edition, IBM Corporation, 2012.

REFERENCE BOOKS
5. Chris Eaton, Dirk deRoos et al., "Understanding Big Data", McGraw Hill, 2012.
6. Vignesh Prajapati, "Big Data Analytics with R and Hadoop", Packt Publishing, 2013.
7. Jay Liebowitz, "Big Data and Business Analytics", CRC Press, 2013.
For more insight
Web sources:
1. https://www.alliant.edu/blog/4-top-online-resources-data-analytics
2. https://www.coursera.org/articles/big-data-technologies
3. https://careerfoundry.com/en/blog/data-analytics/where-to-find-free-datasets/
THANK YOU

For queries
Email: [email protected]
