Big Data Analytics Module-3
When the datasets are replicated on a number of data nodes, the fault tolerance
increases further.
3. Sharding is storing the different parts of data onto different sets of data
nodes, clusters or servers.
Examples:
Class refers to a template of program code that is extendable. A class creates instances,
called objects. A class consists of initial values for member fields, called state (of
variables), and implementations of member functions and methods, called behavior. An
implementation means program code along with values of arguments in the functions
and methods (a Java class uses methods; a C++ class uses functions).
Object is an instance of a class in Java, C++, and other object-oriented languages.
Object can be an instance of another object (for example, in JavaScript).
Tuple is an ordered set of data which constitutes a record, for example, one row
record in a table. A row in a relational database has column fields or attributes.
An example of a tuple is (JLRWSale, Week 1, 138, Week 2, 232, ..., Week 52, 186) in
an RDBMS table. Here, JLRWSale means Jaguar Land Rover Weekly Sale. (JLRWSale,
Week 1, 138) is also a tuple, and gives JLR week 1 sales = 138. (Week 2, 232, ...,
Week 52, 186) means week 2 sales = 232 and week 52 sales = 186 JLRs.
Transaction means execution of instructions in two interrelated entities, such as a
query and the database.
Oracle refers to a widely used object-relational DBMS, written in the C++ language
that provides applications integration with service-oriented architectures and has
high reliability. Oracle has also released the NoSQL database system.
DB2 refers to a family of database server products from IBM with built-in support to
handle advanced Big Data analytics.
Sybase refers to a database server based on the relational model for businesses,
primarily on UNIX. Sybase was the first enterprise-level DBMS on Linux.
SQL creates databases, and RDBMSs use a tabular data store with
relational algebra.
A Join combines rows of two tables if and only if a given Join condition is satisfied.
Any number of Join operations can be specified using relational algebraic expressions.
SQL provides the JOIN clause, which retrieves and joins the related data stored
across multiple tables with a single command.
The statement selects those records in a column named KitKatSales which match
the values in two tables: one TransactionsTbl and other ACVMSalesTbl.
RDBMS Issues:
Relational databases and RDBMSs developed using SQL have issues of scalability
and distributed design. This is because all tuples need to be on the same data
node.
The traditional RDBMS has a problem when storing records beyond a certain physical
storage limit. This is because an RDBMS does not support horizontal scalability.
Example
Consider sharding a big table in a DBMS into two. Assume writing the first 0.1
million records (1 to 100000) in one table and the records from 100001 onwards
in another table.
Sharding a database means breaking up into many, much smaller databases that
share nothing, and can distribute across multiple servers. Handling of the Joins and
managing data in the other related tables are cumbersome processes, when using
the sharding.
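The range-based split in the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names SHARD_SIZE, shard_for, write and read are hypothetical, the shards are plain in-memory dictionaries rather than separate database servers, and the sketch generalizes the two-table split to any number of shards of 100000 records each.

```python
SHARD_SIZE = 100_000  # records per shard, taken from the example above

# Each shard is modelled as an independent dictionary (assumption: in a real
# deployment every shard would live on its own server and share nothing).
shards = {}


def shard_for(record_id):
    """Return the 0-based shard index for a 1-based record id."""
    if record_id < 1:
        raise ValueError("record ids start at 1")
    return (record_id - 1) // SHARD_SIZE


def write(record_id, row):
    """Store a row in the shard that owns this record id."""
    shards.setdefault(shard_for(record_id), {})[record_id] = row


def read(record_id):
    """Look up a row; only one shard has to be consulted."""
    return shards[shard_for(record_id)][record_id]


write(100_000, "last record of the first table")
write(100_001, "first record of the second table")
```

A single-key lookup touches exactly one shard, but a Join across related tables would have to visit several shards, which is why Joins become cumbersome under sharding.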
The problem continues when data has no defined number of fields and formats.
For example, the data associated with the choice of chocolate flavours of the users
of ACVM. Some users provide a single choice, while some users provide two
choices, and a few others want to fill three best flavours of their choice.
User Id Choice
1 Dairy Milk
Defining a field becomes tough when a field in the database offers choice
between two or many. This makes RDBMS unsuitable for data management in
Big Data environments as well as data in their real forms.
3.2.1 NoSQL
The term 'NoSQL' conveys two different meanings:
(ii) "Not only SQL": uses SQL-compliant formats together with a variety of other
querying and access methods.
NoSQL DB does not require specialized RDBMS like storage and hardware for
processing.
- key-value pairs,
- name/value pairs,
- JSON (MongoDB), Graph Store, Object Store, ordered keys and semi-structured data
storage systems.
2. NoSQL does not necessarily have a fixed schema, such as a table;
it does not use the concept of Joins (in distributed data storage systems);
Data written at one node can be replicated to multiple nodes. Data store is
thus fault-tolerant.
MongoDB: HDFS compatible; master-slave distribution model; document-oriented data store
with JSON-like documents and dynamic schemas; open-source, NoSQL, scalable and
non-relational database; used at the backend by websites such as Craigslist, eBay and
Foursquare.
Apache's Cassandra: HDFS compatible; decentralized, peer-to-peer distribution model;
open source; NoSQL; scalable, non-relational, column-family based, fault-tolerant, with
tunable consistency; used by Facebook and Instagram.
Apache's CouchDB: A project of Apache which is also a widely used database for the web.
CouchDB consists of a Document Store. It uses the JSON data exchange format to store its
documents, JavaScript for indexing, combining and transforming documents, and HTTP APIs.
Oracle NoSQL: A step towards a NoSQL data store; distributed key-value data store; provides
transactional semantics for data manipulation, horizontal scalability, simple
administration and monitoring.
Riak: An open-source key-value store written in Erlang; high availability (using the
replication concept), fault tolerance, operational simplicity and scalability.
CAP Theorem
Among C, A and P, at least two are present for the
application/service/process.
Consistency means all copies have the same value like in traditional DBs.
Partition means parts which are active but may not cooperate (share) as in
distributed DBs.
1. Consistency in distributed databases means that all nodes observe the same
data at the same time. Therefore, the operations in one partition of the
database should reflect in other related partitions in case of distributed
database.
2. Availability means that during the transactions, the field values must be
available in other partitions of the database so that each request receives a
response on success as well as failure. (On failure, the response to the request
comes from a replica of the data.) Distributed databases require transparency between
one another. Network failure may lead to data unavailability in a certain partition in
case of no replication. Replication ensures availability.
1. Consistency- All nodes observe the same data at the same time.
• Database must answer, and that answer would be old or wrong data
(AP).
The CAP theorem implies that, for a system subject to network partitions, consistency and
availability are mutually exclusive choices: during a partition, the system can guarantee
one of them but not both.
CA means consistency and availability,
AP means availability and partition tolerance, and
CP means consistency and partition tolerance.
3.2.2 Schema-less Models
• NoSQL data does not necessarily have a fixed table schema. The systems do not use
the concept of Join (between distributed datasets).
• Data written at one node replicates to multiple nodes. Therefore, the replicas are
identical, and the data store is fault-tolerant and partitioned into shards.
• NoSQL data model offers relaxation in one or more of the ACID properties
(atomicity, consistency, isolation and durability) of the database.
• Follows the CAP theorem, which states that out of the three properties, at least two
must be present for the application/service/process.
Characteristics of Schema-less model
Meta Data
NoSQL data stores use non-mathematical relations but store this information as an
aggregate called metadata.
Metadata is a record with all the information about a particular dataset and the
inter-linkages.
BASE is a flexible model for NoSQL data stores. Provisions of BASE increase
flexibility.
BASE Properties: BA stands for basic availability, S stands for soft state and E
stands for eventual consistency.
1. Basic availability is ensured by distributing shards (many partitions of a huge data
store) across many data nodes with a high degree of replication. Then, a segment failure
does not necessarily mean that the complete data store is unavailable.
2. Document Stores
3. Tabular Data
5. Graph Database
3.3.1 Key-Value Store
• The simplest way to implement a schema-less data store is to use key-value
pairs.
• The data store characteristics are high performance, scalability and flexibility.
• The concept is similar to a hash table where a unique key points to a particular
item(s) of data.
key-value pairs architectural pattern
Advantages of a key-value store are as follows:
1. Data Store can store any data type in a value field.
2. A query just requests the values and returns the values as a single item. Values can be of
any data type.
5. Returned values on queries can be used to convert into lists, table-columns, data-frame
fields and columns.
6. Have (i) scalability, (ii) reliability, (iii) portability and (iv) low operational cost.
7. The key can be synthetic or auto-generated. The key is flexible and can be represented in
many formats: (i) Artificially generated strings created from a hash of a value, (ii) Logical path
names to images or files
The key-value store lets a client read and write
values using a key as follows:
(i) Get (key) , returns the value associated with the key.
(ii) Put (key, value), associates the value with the key and updates a value if
this key is already present.
(iii) Multi-get (key1, key2, .., keyN), returns the list of values associated with
the list of keys.
(iv) Delete (key) , removes a key and its value from the data store.
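The four operations above can be sketched with a Python dictionary, which is itself a hash table. This is a minimal in-memory illustration, not the API of any particular product; the class name KeyValueStore and its method names are assumptions chosen to mirror the operation names in the list.

```python
class KeyValueStore:
    """Minimal in-memory key-value store (hypothetical sketch)."""

    def __init__(self):
        self._data = {}  # a hash table: unique key -> value of any type

    def get(self, key):
        """Return the value associated with the key, or None if absent."""
        return self._data.get(key)

    def put(self, key, value):
        """Associate the value with the key, replacing any existing value."""
        self._data[key] = value

    def multi_get(self, *keys):
        """Return the list of values associated with the list of keys."""
        return [self._data.get(key) for key in keys]

    def delete(self, key):
        """Remove the key and its value; a missing key is ignored."""
        self._data.pop(key, None)


store = KeyValueStore()
store.put("user:1", {"choice": "Dairy Milk"})    # value can be any data type
store.put("img/logo.png", b"\x89PNG")            # key as a logical path name
store.put("user:1", {"choice": "KitKat"})        # put on an existing key updates it
```

Note that the store can only be queried by key: finding every user whose choice is "KitKat" would require scanning all values, which is the first limitation listed in the next part.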
Limitations of key-value store architectural pattern
are:
(i) No indexes are maintained on values, thus a subset of values is not
searchable.
(ii) Key-value store does not provide traditional database capabilities, such as
atomicity of transactions, or consistency when multiple transactions are
executed simultaneously. The application needs to implement such capabilities.
(iii) Maintaining unique values as keys may become more difficult when the
volume of data increases. One cannot retrieve a single result when a key-value
pair is not uniquely identified.
• Some other widely used key-value stores among NoSQL DBs are Amazon's
DynamoDB, Redis (often referred to as a data structure server), Memcached and
its flavours, Berkeley DB, upscaledb (used for embedded databases), Project
Voldemort and Couchbase.
3.3.2 Document Store
Characteristics of Document Data Store are:
4. Querying is easy. For example, using section number, sub-section number and
figure caption and table headings to retrieve document partitions.
JSON represents object-oriented and hierarchical data records, object, and resource
arrays in JavaScript.
Example
Assume Preeti took examinations in Semester 1 in 1995 in four subjects. She took
examinations in five subjects in Semester 2 and so on in each subsequent semester.
Another student, Kirti, took examinations in Semester 1 in 2016 in three subjects,
out of which one was a theory subject and two were practical subjects. Presume the
subject names and grades awarded to them.
(i) Write two CSV files for cumulative grade-sheets for both the students. Point
out the difficulty during processing of data in these two files.
SOLUTION
CSV file for Preeti consists of the following nine lines each with four
• (5 x 5) = 25 for kirtiGradeSheet.csv.
• Therefore, when processing student records, merger of both files into a single file
will need a program to extract the key-value pairs separately, and then prepare a
single file.
(ii) Write a file in JSON format with each student grade-sheet as an object instance.
How does the object-oriented and hierarchical data record in JSON make processing
easier?
SOLUTION
JSON gives an advantage of creating a single file with multiple instances and
inheritances of an object.
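A single JSON file holding both grade-sheets might look like the sketch below, written with Python's standard json module. The subject names, grades and field names are placeholders (the example only asks us to presume them), and only Semester 1 is shown for each student.

```python
import json

# Hypothetical grade-sheets: one object instance per student, with a nested
# list of semesters and a nested list of subjects inside each semester.
grade_sheets = {
    "students": [
        {
            "name": "Preeti",
            "semesters": [
                {
                    "semester": 1,
                    "year": 1995,
                    "subjects": [
                        {"name": "Subject 1", "grade": "A"},
                        {"name": "Subject 2", "grade": "B"},
                        {"name": "Subject 3", "grade": "A"},
                        {"name": "Subject 4", "grade": "C"},
                    ],
                }
            ],
        },
        {
            "name": "Kirti",
            "semesters": [
                {
                    "semester": 1,
                    "year": 2016,
                    "subjects": [
                        {"name": "Theory 1", "type": "theory", "grade": "B"},
                        {"name": "Practical 1", "type": "practical", "grade": "A"},
                        {"name": "Practical 2", "type": "practical", "grade": "A"},
                    ],
                }
            ],
        },
    ]
}

# Serialize to JSON text and parse it back, as a program reading the file would.
text = json.dumps(grade_sheets, indent=2)
loaded = json.loads(text)

# The same traversal works for both students although their records differ
# in the number of subjects and in the fields present.
subject_counts = {
    student["name"]: len(student["semesters"][0]["subjects"])
    for student in loaded["students"]
}
```

Because every value is reached through named keys, the differing number of subjects and the extra "type" field in one record need no special handling, unlike two CSV files with different row and column counts.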
XML is widely used in data store and data exchanges over the network.
XML is semi-structured.
Document JSON Format CouchDB Database
Its features are:
2. CouchDB uses the JSON Data Store model for documents. Each document maintains
separate data and metadata (schema).
3. CouchDB is a multi-master application. A write does not require field locking when
controlling the concurrency during multi-master application.
5. CouchDB accesses the documents using an HTTP API. The HTTP methods are Get, Put and
Delete.
6. CouchDB data replication is the distribution model that results in fault tolerance
and reliability.
Document JSON Format—MongoDB Database
MongoDB Document database provides a rich query language and constructs,
such as database indexes allowing easier handling of Big Data.
• For example, it is possible to search the document where student's first name is
"Ashish".
• Document store can also provide the search value's exact location.
• Since the document stores are schema-less, adding fields to documents (XML
or JSON) becomes a simple task.
XML document architecture pattern
• An XML document architecture pattern is a document fragment and
document tree structure.
• Each branch has a related path expression that provides a way to navigate
from the root to any given branch, sub-branch or value.
XQuery and XPath are query languages for finding and extracting elements and
attributes from XML documents.
XPath treats XML document as a tree of nodes. XPath queries are expressed in the
form of XPath expressions.
Example
Give examples of XPath expressions. Let the outermost element of the XML document be a.
SOLUTION
• An XPath expression /a/b/c selects c elements that are children of b elements that
are children of element a that forms the outermost element of the XML document.
• An XPath expression /a/b[c=5] selects those b elements that are children of a and
that have a child element c whose value is 5.
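The two expressions can be tried with Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath. The sample document below is made up for illustration. Two adaptations are assumptions of this sketch: paths are written relative to the root element a (ElementTree does not accept absolute paths on an element), and the predicate is written as [c='5'] because ElementTree compares text content.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document whose outermost element is <a>.
doc = """<a>
  <b><c>5</c></b>
  <b><c>7</c></b>
  <b><c>5</c><d>x</d></b>
</a>"""

root = ET.fromstring(doc)  # root is the element a

# Counterpart of /a/b/c: c elements that are children of b elements under a.
c_elements = root.findall("b/c")
c_values = [c.text for c in c_elements]

# Counterpart of /a/b[c=5]: b elements that have a child c with text "5".
b_with_c5 = root.findall("b[c='5']")
```

Here c_values holds the text of all three c elements, while b_with_c5 holds only the first and third b elements; the result of the predicate form is b elements, not c elements.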
• XML is used to describe structured data and does not include arrays,
whereas JSON includes arrays.
• JSON has basically key-value pairs and is easier to parse from JavaScript.
• The concise syntax of JSON for defining lists of elements makes it preferable
for serialization of text format objects.
Benefits of Document Collection
1. Group the documents together, similar to a directory structure in a file
system. (A directory consists of grouping of file folders.)
A row-head field may be used as a key which accesses and retrieves multiple values from
the successive columns in that row.
In-memory row-based data is an example of row-oriented data, in which a key
in the first column of the row is at a memory address, and the values in successive
columns are at successive memory addresses.
That makes OLTP easier. All fields of a row are accessed together during OLTP.
Column-based Tabular Data:
In in-memory column-based data, the key in the first row is the key of each column.
The values of that column in the successive rows after the key are at successive
memory addresses.
All fields of a column can be accessed together. All fields of a set of columns may
also be accessed together during OLAP.
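The difference between the two layouts can be sketched with plain Python lists. The table contents are hypothetical, and Python lists only imitate contiguous memory, so the sketch shows the access pattern rather than real memory addresses.

```python
# Row-oriented layout: all fields of one record are stored together.
rows = [
    (1, "Asha", "Sales"),
    (2, "Ravi", "HR"),
    (3, "Meena", "Sales"),
]

# Column-oriented layout: all values of one column are stored together,
# keyed by the column name.
columns = {
    "ID": [row[0] for row in rows],
    "NAME": [row[1] for row in rows],
    "Department": [row[2] for row in rows],
}

# OLTP-style access: fetch every field of a single record in one step.
record = rows[1]

# OLAP-style access: scan a single column without touching the others.
sales_headcount = columns["Department"].count("Sales")
names = columns["NAME"]
```

Fetching a whole record is one lookup in the row layout, whereas an aggregate such as the head count per department reads only one list in the column layout.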
Example
Solution
Advantages of column stores are:
1. Scalability: The database uses row IDs and column names to locate a column
and values at the column fields. The back-end system can distribute queries over
a large number of processing nodes without performing any Join operations.
6. Querying all the field values in a column in a family, all columns in the family
or a group of column-families, is fast in in-memory column-family data store.
8. No optimization for Join: Column-family data stores are similar to sparse matrix
data. The data store is not optimized for Join operations.
Examples of widely used column-family data store:
Google's BigTable, HBase and Cassandra.
3. Compatibility with MapReduce, HBase APIs which are open-source Big Data platforms.
9. BigTable, being Google's cloud service, has global availability and its service is seamless.
3.3.3.4 ORC File Format
• ORC stands for Optimized Row Columnar.
• ORC is an intelligent Big Data file format for HDFS and Hive.
• The columnar layout in each ORC file optimizes for compression and enables
skipping of data in columns.
• The throughput increases because unneeded data is skipped and only the required
fields are read.
To understand the Parquet file format in Hadoop better, let us first see what a
columnar format is.
Take the same record schema as mentioned above, having three fields: ID
(int), NAME (varchar) and Department (varchar).
row wise storage format :
For this table in a row wise storage format the data will be stored as follows-
• Suppose you want only the NAME column. In a row storage format, each record in the
dataset has to be loaded and parsed into fields, and then the data for NAME is
extracted.
• With a column-oriented format, the query can go directly to the NAME column, as all
the values for that column are stored together, and get those values. There is no need
to go through the whole record.
• A column-oriented format increases the query performance, as less seek time is
required to reach the required columns and less I/O is required, since only the
columns whose data is required are read.
• Another benefit is reduced storage. Compression works better if the data is of the
same type. With a column-oriented format, columns of the same type are stored
together, resulting in better compression.
Parquet format
Parquet file format is also a column oriented format so it brings the same
benefit of improved performance and better compression.
One of the unique features of Parquet is that it can also store data with nested
structures in columnar fashion.
3.3.4 Object Data Store
An object store refers to a repository which stores the:
(i) scalability,
(ii) indexing,
(v) Transactions,
(vi) data replication for high availability, data distribution model, data integration
(such as with relational database, XML, custom code),
(vii) schema evolution,
(viii) persistency,
Amazon S3 (Simple Storage Service): S3 refers to the Amazon web service on the cloud named S3.
The Object Store differs from the block and file-based cloud storage.
The service has two storage classes: Standard and infrequent access.
S3 uses include web hosting, image hosting and storage for backup systems.
The yearly sales are computed by path traversals from the nodes for weekly sales to the
nodes for yearly sales data.
(iii) Solution:
The path traversals exhibit BASE properties because during the intermediate paths,
consistency is not maintained. Eventually when all the path traversals complete,
the data becomes consistent.
Typical uses of graph databases are:
• They are difficult to scale out on multiple servers. This is due to the close
connectivity feature of each node in the graph.
• Write operations to multiple servers and graph queries that span multiple nodes, can be
complex to implement.
Examples of graph DBs are Neo4j, AllegroGraph, HyperGraphDB, InfiniteGraph, Titan
and FlockDB.
3.4 NOSQL TO MANAGE BIG DATA
Using NoSQL to Manage Big Data
NoSQL
(i) Limits the support for Join queries and supports sparse-matrix-like column
families.
(ii) Has characteristics of easy creation, high processing speed, scalability and
storability of a much higher magnitude of data (terabytes and petabytes).
(iii) Sacrifices the support of ACID properties, and instead supports CAP and
BASE properties.
5. Usages of open-source tools: NoSQL data stores are cheap and open source.
Database implementation is easy and typically uses cheap servers to manage the exploding
data and transaction while RDBMS databases are expensive and use big servers and storage
systems. So, cost per gigabyte data store and processing of that data can be many times less
than the cost of RDBMS.
6. Support to schema-less data model: NoSQL data store is schema less, so data can
be inserted in a NoSQL data store without any predefined schema. So, the format or data
model can be changed any time, without disruption of application. Managing the changes is a
difficult problem in SQL.
7. Support to integrated caching: NoSQL data stores support caching in
system memory. That increases output performance. An SQL database needs a
separate infrastructure for that.
8. No inflexibility: unlike SQL/RDBMS, NoSQL DBs are flexible (not rigid)
and have no fixed structure for storing and manipulating data. SQL stores data in the
form of tables consisting of rows and columns. NoSQL data stores have flexibility in
following ACID rules.
3.4.1.2 Types of Big Data Problems
The following types of problems are faced when using Big Data solutions.
1. Big Data needs scalable storage and the use of distributed servers together as a
cluster. Therefore, the solutions must drop support for database Joins.
2. A NoSQL database is open source, which is its greatest strength but also its
greatest weakness, because there are not many defined standards for NoSQL data stores.
Hence, no two NoSQL data stores are equal.
For example:
(ii) GUI mode tools to access the data store are not available in the market
(iv) NoSQL data stores sacrifice ACID compliance for flexibility and processing speed.
NoSQL vs RDBMS
NOSQL RDBMS
SHARED-NOTHING ARCHITECTURE FOR BIG DATA TASKS
• The columns of two RDBMS tables relate by a relationship.
• Shared nothing (SN) is a cluster architecture. A node does not share data
with any other node.
4. No network contention
3.5.1 Choosing the Distribution Models
Big Data requires distribution of data on multiple data nodes in clusters.
Distributed software components give the advantage of parallel processing, providing
horizontal scalability.
Distribution gives
(iii) A resource manager manages, allocates, and schedules the resources of each
processor, memory and network connection.
(iv) Distribution increases the availability when a network slows or link fails.
Four distribution models for data stores:
cluster.
• In case of a link failure, the application can migrate the shard DB to another
node.
3.5.1.3 Master-Slave Distribution Model
A process uses the slaves for read operations, while writes are done on the master.
(1) All replication nodes accept read request and send the responses.
(2) All replicas function equally [read support and write support also].
(3) Node failures do not cause loss of write capability, as other replicated node responds.
Benefits:
• Since the nodes both read and write, a replicated node also has updated data.
Therefore, the biggest advantage of the model is consistency.
Peer-to-Peer Distribution Model [PPD Model]:
Shards replicate on the nodes, each of which performs both read and write
operations.
3.5.2 Ways of Handling Big Data Problems
Four ways for handling Big Data problems:
1. Evenly distribute the data on a cluster using the hash rings:
• Uses the hashing algorithm which generates the pointer to the data
collection.
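The hash-ring idea can be sketched as follows. This is one common way to build such a ring (consistent hashing with virtual nodes), not a description of any specific product; the class name HashRing, the node names and the choice of MD5 as the hash function are assumptions of the sketch.

```python
import bisect
import hashlib


def ring_hash(key):
    """Map a string to a position on the ring (MD5 chosen only for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring that maps keys to data nodes (hypothetical sketch)."""

    def __init__(self, nodes, vnodes=100):
        self._ring = []  # sorted list of (ring position, node name)
        for node in nodes:
            self.add_node(node, vnodes)

    def add_node(self, node, vnodes=100):
        # Several virtual positions per node spread the data more evenly.
        for i in range(vnodes):
            bisect.insort(self._ring, (ring_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(pos, name) for pos, name in self._ring if name != node]

    def node_for(self, key):
        """Return the node owning the first ring position after the key's hash."""
        if not self._ring:
            raise LookupError("the ring has no nodes")
        index = bisect.bisect_right(self._ring, (ring_hash(key), ""))
        return self._ring[index % len(self._ring)][1]  # wrap around the ring


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"record-{i}" for i in range(1000)]
owners_before = {key: ring.node_for(key) for key in keys}

ring.remove_node("node-c")  # simulate the loss of one data node
owners_after = {key: ring.node_for(key) for key in keys}
```

The design choice worth noting is that removing a node only relocates the keys that node owned; every other key keeps its owner, so rebalancing moves a small fraction of the data.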
(i) non-relational,
(ii) NoSQL,
(iii) distributed,
(vi) cross-platform,
(vii) Scalable,
(ix) Indexed,
3. Document model is well defined. The structure of a document is clear. A document is the
unit of storing data in a MongoDB database. Documents are analogous to the records of an
RDBMS table. Insert, update and delete operations can be performed on a collection.
Documents use the JSON (JavaScript Object Notation) approach for storing data. JSON is a
lightweight, self-describing format used to exchange data between various applications.
JSON data basically has key-value pairs. Documents have a dynamic schema.
4. MongoDB is a document data store in which one collection holds
different documents. Data is stored in the form of JSON-style documents. The number
of fields, content and size of the document can differ from one document to
another.
5. Storing of data is flexible, and the data store consists of JSON-like documents. This
implies that the fields can vary from document to document and the data
structure can be changed over time. JSON has a standard structure and is a scalable
way of describing hierarchical data.
9. No complex Joins.
A new primary node can be chosen among the secondary nodes at the time of
automatic failover or maintenance.
The failed node when recovered can join the replica set as secondary node again.
Following are the commands used for replication (Recoverability means even on
occurrences of failures; the transactions ensure consistency).
Auto-sharding
• Sharding is a method for distributing data across multiple machines in a distributed
application environment.
• Vertical scaling by increasing the resources of a single machine is quite expensive. Thus,
horizontal scaling of the data can be achieved using sharding mechanism where
more database servers can be added to support data growth and the demands of more read
and write operations.
• Sharding automatically balances the data and load across various servers. Sharding provides
additional write capability by distributing the write load over a number of mongod (MongoDB
Server) instances.
• If a DB has a 1 terabyte dataset distributed evenly among 20 shards, then each shard
contains only about 50 gigabytes of data.
Data types which MongoDB documents support
MongoDB Querying Commands
To Create database:
For example, the command use lego creates a database named lego. (A sample database is
created to demonstrate subsequent queries.)
The command show dbs shows the names of all the databases.
To drop database:
To create a collection
• Cassandra is basically a column family database that stores and handles massive
data of any format including structured, semi-structured and unstructured data.
(ii) scalable
(iii) non-relational
(iv) NoSQL
(v) Distributed
(vii) decentralized,
4. Is fast and easily scalable as write operations spread across the cluster.
The cluster does not have a master-node, so any read and write can be
handled by any node in the cluster.
6. Uses the PPD (peer-to-peer distribution) model for data distribution.
Data Replication
• Cassandra stores data on multiple nodes (data replication) and thus has no single
point of failure, and ensures availability, a requirement in CAP theorem.
• Cassandra returns the most recent value of the data to the client.
• If it has detected that some of the nodes responded with a stale value, Cassandra
performs a read repair in the background to update the stale values.
Components of Cassandra
Scalability:
Cassandra provides linear scalability, which increases the throughput and decreases
the response time as the number of nodes in the cluster increases.
Replication Option:
Specifies either of the two replica placement strategies. The strategy names are
Simple Strategy and Network Topology Strategy.
2. Network Topology Strategy: Allows setting the replication factor for each
data center independently.
Data types built into Cassandra, their usage and
description
Cassandra Data Model consists of four main
components:
(i) Cluster: Made up of multiple nodes
and keyspaces,
(iv) Column-family: multiple columns with row key reference. Cassandra does
keyspace management using partitioning of keys
DESCRIBE SCHEMA
DESCRIBE KEYSPACES
DESCRIBE TABLES
DESCRIBE FUNCTIONS
DESCRIBE AGGREGATES
ALL, ANY, ONE, TWO, THREE, QUORUM, LOCAL_ONE, LOCAL_QUORUM, EACH_QUORUM, SERIAL and LOCAL_SERIAL.
1. ALL: Highly consistent. A write must be written to commitlog and memtable on all replica nodes in the
cluster.
2. EACH_QUORUM: A write must be written to commitlog and memtable on quorum of replica nodes in all
data centers.
3. LOCAL_QUORUM: A write must be written to commitlog and memtable on quorum of replica nodes in the
same data center.
4. ONE: A write must be written to commitlog and memtable of at least one replica node.
5. TWO, THREE: Same as One but at least two and three replica nodes, respectively.
6. LOCAL_ONE: A write must be written for at least one replica node in the
local data center.
9. LOCAL_SERIAL: Same as SERIAL but restricted to the local data center.
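The number of replica acknowledgements each level waits for can be worked out from the replication factors. The sketch below assumes the usual quorum definition, floor(RF / 2) + 1, where RF is the replication factor; the function names, the data-center names and the replication factors are hypothetical, and ANY, SERIAL and LOCAL_SERIAL are left out because they are not simple counts.

```python
def quorum(replication_factor):
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def required_acks(level, rf_per_dc, local_dc):
    """Replica acknowledgements needed for a write at the given level.

    rf_per_dc maps a data-center name to its replication factor (assumed input).
    """
    total_rf = sum(rf_per_dc.values())
    fixed = {"ONE": 1, "LOCAL_ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        return fixed[level]
    if level == "ALL":
        return total_rf
    if level == "QUORUM":
        return quorum(total_rf)              # majority across all data centers
    if level == "LOCAL_QUORUM":
        return quorum(rf_per_dc[local_dc])   # majority in the local data center
    if level == "EACH_QUORUM":
        return sum(quorum(rf) for rf in rf_per_dc.values())
    raise ValueError(f"level not covered by this sketch: {level}")


# Hypothetical cluster: two data centers with three replicas each.
rf = {"dc1": 3, "dc2": 3}
acks = {
    level: required_acks(level, rf, local_dc="dc1")
    for level in ("ONE", "LOCAL_QUORUM", "QUORUM", "EACH_QUORUM", "ALL")
}
```

Lower levels such as ONE favour availability and latency, while ALL favours consistency; this is the tunable consistency that the levels above provide.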
Keyspaces
Keyspaces: A keyspace (or key space) in a NoSQL data store is an object that contains all
column-family data as a bundle.
The CREATE KEYSPACE statement has the attribute replication, with the options class and
replication_factor, and the attribute durable_writes.
The default value of the durable_writes property of a keyspace is true. This instructs
Cassandra to use the commit log for updates on the current keyspace. The option is not
compulsory.
1. ALTER KEYSPACE command changes (alters) properties, such as the number of replicas and the
durable_writes of a keyspace:
4. Re-executing the drop command to drop the same keyspace will result in a configuration
exception.