bda module 3
Introduction:
Big Data uses distributed systems. A distributed system consists of multiple data nodes organized
in clusters. Tasks execute in parallel on the data held at the nodes in the clusters. The computing
nodes communicate with the applications through a network.
The merits of distributed computing:
1. Increased reliability and fault tolerance. If a segment of machines in a cluster fails then the
rest of the machines continue work. When the datasets replicate at a number of data nodes, the
fault tolerance increases further.
2. Flexibility makes it very easy to install, implement and debug new services in a distributed
3. Sharding is storing the different parts of data onto different sets of data nodes, clusters or
servers. For example, a huge database of university students divides, on sharding, into smaller
databases called shards. Each shard may correspond to a database for an individual course and year.
4. Speed: Computing power increases in a distributed computing system as shards run in parallel
on individual data nodes in clusters independently.
5. Scalability: Consider sharding of a large database into a number of shards, distributed for
computing in different systems. When the database expands further, adding more machines
and increasing the number of shards provides horizontal scalability. Increasing the computing power
of the same machines, and running a larger number of algorithms on them, provides vertical scalability.
6. Resource sharing: Shared resources of memory, machines and network architecture reduce the
cost.
8. Performance: The collection of processors in the system provides higher performance than a
centralized computer, due to the lower cost of communication among machines.
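The sharding idea in item 3 above can be sketched in a few lines. This is a minimal illustration, assuming the shard key is the (course, year) pair mentioned in the example; the record fields and the function names shard_records and count_per_shard are hypothetical and not part of any particular data store.

```python
from collections import defaultdict

# Hypothetical student records; the field names are assumptions for illustration.
students = [
    {"id": 1, "name": "Asha", "course": "CS", "year": 1},
    {"id": 2, "name": "Ravi", "course": "CS", "year": 2},
    {"id": 3, "name": "Meena", "course": "EE", "year": 1},
    {"id": 4, "name": "John", "course": "CS", "year": 1},
]


def shard_records(records):
    """Group records into shards keyed by (course, year)."""
    shards = defaultdict(list)
    for record in records:
        shards[(record["course"], record["year"])].append(record)
    return dict(shards)


def count_per_shard(shards):
    """Process each shard on its own, as an independent data node would."""
    return {key: len(rows) for key, rows in shards.items()}


shards = shard_records(students)
counts = count_per_shard(shards)
print(counts)
```

Because each shard is processed without reference to the others, the per-shard work could be handed to separate machines, which is where the speed and horizontal scalability described above would come from.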
The demerits of distributed computing:
Transactions on SQL databases exhibit ACID properties. ACID stands for atomicity,
consistency, isolation and durability.
Atomicity of a transaction means that all operations in the transaction must complete and, if
interrupted, must be undone (rolled back). For example, when a customer withdraws an
amount, the bank enters the withdrawn amount in a table in the first operation and, in the
next operation, updates the balance to the new amount available. Atomicity means both operations
must complete, or else both must be undone if interrupted in between.
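The withdrawal example can be sketched with Python's built-in sqlite3 module. This is a minimal sketch under stated assumptions: the table names account and withdrawal, the CHECK constraint on the balance and the function withdraw are all hypothetical, chosen only to show both operations committing together or rolling back together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the CHECK constraint makes an overdraft fail.
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("CREATE TABLE withdrawal (account_id INTEGER, amount INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100)")
conn.commit()


def withdraw(conn, account_id, amount):
    """Record the withdrawal and update the balance as one atomic unit."""
    try:
        # As a context manager, the connection commits on success and
        # rolls back every statement in the block if an exception is raised.
        with conn:
            conn.execute(
                "INSERT INTO withdrawal VALUES (?, ?)", (account_id, amount)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, account_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = withdraw(conn, 1, 30)       # both operations succeed
failed = withdraw(conn, 1, 500)  # UPDATE violates CHECK, so the INSERT is undone too
balance = conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]
n_withdrawals = conn.execute("SELECT COUNT(*) FROM withdrawal").fetchone()[0]
print(ok, failed, balance, n_withdrawals)
```

After the failed call, the withdrawal table still holds only the first entry, which is the rolled-back behaviour that atomicity requires.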
Consistency in transactions means that a transaction must maintain the integrity constraints
and follow the consistency principle. For example, the sum of deposited amounts minus the sum of
withdrawn amounts in a bank account must equal the last balance. All three data items need to
be consistent.
Isolation of transactions means two transactions of the database must be isolated from each
other and done separately.
Trigger is a special stored procedure. A trigger executes when a specific action occurs within
a database, such as a change in table data through UPDATE, INSERT or DELETE.
View refers to a logical construct used in query statements. A View saves a division of a complex
query statement, which reduces the query complexity. Viewing a division is
similar to viewing a table.
Schedule refers to the order of instructions, which is maintained during the transaction. Scheduling
enables execution of multiple transactions in allotted time intervals.
Join refers to a clause which combines tuples. Combining the products (AND operations) follows
the selection process. A Join operation pairs two tuples obtained from different
relational expressions, if and only if they satisfy a given Join condition.
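The trigger, view and Join concepts above can be seen together in a small sqlite3 session. This is an illustrative sketch only: the tables programme, subject and audit_log, the trigger log_subject and the view subject_by_programme are hypothetical names invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE programme (prog_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE subject (subj_id INTEGER PRIMARY KEY, prog_id INTEGER, name TEXT);
CREATE TABLE audit_log (note TEXT);

-- Trigger: runs automatically after every row inserted into subject.
CREATE TRIGGER log_subject AFTER INSERT ON subject
BEGIN
    INSERT INTO audit_log VALUES ('added ' || NEW.name);
END;

-- View: stores the Join query under a name so later queries stay simple.
CREATE VIEW subject_by_programme AS
    SELECT p.title AS title, s.name AS name
    FROM programme AS p JOIN subject AS s ON p.prog_id = s.prog_id;

INSERT INTO programme VALUES (1, 'BE Computer Science'), (2, 'BE Electronics');
INSERT INTO subject VALUES
    (10, 1, 'Big Data Analytics'), (11, 1, 'Databases'), (12, 2, 'Signals');
""")

# Querying the view looks like querying a table; the Join runs underneath.
rows = conn.execute(
    "SELECT title, name FROM subject_by_programme ORDER BY name"
).fetchall()
log = [r[0] for r in conn.execute("SELECT note FROM audit_log ORDER BY note")]
print(rows)
print(log)
```

The Join pairs a programme tuple with a subject tuple only when the condition p.prog_id = s.prog_id holds, which matches the "if and only if" wording above.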
SQL compliant format means that database tables are constructed using SQL and that they enable
processing of queries written in SQL. The term 'NoSQL' conveys two different meanings:
(i) does not follow SQL compliant formats,
(ii) "Not only SQL": uses SQL compliant formats along with a variety of other querying and access
methods.
NoSQL
A new category of data stores is NoSQL (meaning Not Only SQL) data stores. NoSQL is an
altogether new approach to thinking about databases, with features such as schema flexibility, simple
relationships, dynamic schemas, auto sharding, replication, integrated caching, horizontal
scalability of shards, distributable tuples, semi-structured data and flexibility in approach.
Issues with NoSQL data stores are a lack of standardization in approaches and processing difficulties
for complex queries.
NoSQL data stores hold semi-structured data. A Big Data store uses NoSQL. NoSQL,
or Not Only SQL, is a class of non-relational data storage systems with flexible data models and
multiple schemas. NoSQL data store characteristics are as follows:
1. NoSQL is a class of non-relational data storage systems with a flexible data model. Examples
of NoSQL data-architecture patterns of datasets are key-value pairs, name/value pairs, column-family
Big Data stores, tabular data stores, Cassandra (used in Facebook/Apache), HBase, hash
table [Dynamo (Amazon S3)], unordered keys using JSON (CouchDB), JSON (PNUTS),
JSON (MongoDB), graph stores, object stores, ordered keys and semi-structured data storage
systems.
2. NoSQL data stores do not necessarily have a fixed schema, such as a table, and do not use the
concept of Joins. Data written at one node can be replicated to multiple nodes. The data store is
thus fault-tolerant. The store can be partitioned into unshared shards.
CAP Theorem: Among C, A and P, at least two are present for the application/service/process.
Consistency means all copies have the same value, as in a traditional DB. Availability means at least
one copy is available in case a partition becomes inactive or fails. Partition means parts which are
active but may not cooperate (share), as in distributed DBs.
1. Consistency in a distributed database means that all nodes observe the same data at the same time.
Therefore, the operations in one partition of the database should reflect in other related partitions
of the distributed database. For example, operations which change the sales data from a specific
showroom in a table should also be reflected as changes in the related tables which use that sales data.
2. Availability means that during the transactions, the field values must be available in other
partitions of the database so that each request receives a response on success as well as failure.
Replication ensures availability.
3. Partition means division of a large database into different databases without affecting the
operations on them, by adopting specified procedures.
Partition tolerance refers to continuation of operations as a whole even in case of message loss,
node failure or an unreachable node.
Brewer's CAP (Consistency, Availability and Partition Tolerance) theorem demonstrates that a
distributed system cannot guarantee C, A and P together.
The CAP theorem implies that, when a network partition occurs, the choices of consistency and
availability are mutually exclusive. CA means consistency and availability, AP means availability
and partition tolerance, and CP means consistency and partition tolerance. Figure 3.1 shows the
CAP theorem usage in Big Data solutions.
Figure 3.1 CAP theorem in Big Data solution
Schema-less Models
Schema of a database system refers to the design of a structure for datasets and the data structures
for storing them in the database. NoSQL data does not necessarily have a fixed table schema. The
systems do not use the concept of Join (between distributed datasets). Cluster-based, highly
distributed nodes manage a single large data store with a NoSQL DB.
Data written at one node replicates to multiple nodes. Therefore, the replicas are identical, and the
store is fault-tolerant and partitioned into shards. Distributed databases can store and process a set
of information on more than one computing node.
The NoSQL data model offers relaxation in one or more of the ACID properties (atomicity,
consistency, isolation and durability) of the database. Distribution follows the CAP theorem, which
states that out of the three properties, at least two must be present for the
application/service/process.
Figure 3.2 shows characteristics of the schema-less model for data stores. ER stands for entity-relation
modelling. Relations in a database build the connections between various tables of data. For
example, a table of subjects offered in an academic programme can be connected to a table of
programmes offered in the academic institution. NoSQL data stores use non-mathematical
relations but store this information as an aggregate called metadata.
Metadata refers to data describing and specifying an object or objects. Metadata is a record with
all the information about a particular dataset and its inter-linkages. Metadata helps in selecting an
object, in specifying the data, and in deciding where and when the data is used. Metadata specifies
access permissions and attributes of the objects, and enables addition of an attribute layer to the
objects. Files, tables, documents and images are also objects.
Now consider a students' admission database, which follows a fixed schema. Later, additional data is
added as the course progresses. NoSQL data stores are schema-less, so the additional
data need not be structured or follow a fixed schema. A NoSQL data store thus offers
greater flexibility for data manipulation: new attributes can be added to the database
over time.
BASE is a flexible model for NoSQL data stores. Provisions of BASE increase flexibility.
BASE Properties:
1. Basic availability (BA) is ensured by distribution of shards (many partitions of a huge data store)
across many data nodes with a high degree of replication. Then, a segment failure does not
necessarily mean complete unavailability of the data store.
2. Soft state ensures processing even in the presence of inconsistencies, while achieving consistency
eventually.
Key-value pair:
The simplest way to implement a schema-less data store is to use key-value pairs. The data store
characteristics are high performance, scalability and flexibility. Data retrieval is fast in a key-value
pair data store. A simple string, called the key, maps to a large data string or BLOB (Binary Large
Object). Key-value store accesses use a primary key for accessing the values. Therefore, the store
can be easily scaled up for very large data.
Advantages of a key-value store are as follows:
1. The data store can store any data type in a value field. The key-value system stores the information
as a BLOB of data (such as text, hypertext, images, video and audio).
2. A query just requests the values and returns the values as a single item.
5. Returned values on queries can be converted into lists, table columns, data-frame fields
and columns.
6. Have (i) scalability, (ii) reliability, (iii) portability and (iv) low operational cost.
The key-value store lets a client read and write values using a key as follows:
Put(key, value): associates the value with the key, and updates the value if this key is already
present.
iii. Maintaining unique values as keys may become more difficult when the volume of data
increases.
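The Put operation described above can be sketched as a toy in-memory store. The class KeyValueStore is hypothetical, and its get and delete methods are assumptions added for illustration (only Put appears in these notes); a real key-value store would persist and distribute the data.

```python
class KeyValueStore:
    """Toy in-memory key-value store; values are opaque bytes (BLOBs)."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        # Associates the value with the key; overwrites if the key exists.
        self._data[key] = value

    def get(self, key: str):
        # Assumed read operation: returns the value, or None if absent.
        return self._data.get(key)

    def delete(self, key: str) -> bool:
        # Assumed delete operation: True if the key was present.
        return self._data.pop(key, None) is not None


store = KeyValueStore()
store.put("user:1:name", b"Ashish")
store.put("user:1:name", b"Ashish K")     # same key, so the value is updated
store.put("user:1:photo", b"\x89PNG...")  # any data type fits in the value field

print(store.get("user:1:name"))
```

The store never inspects the value, so a query can only fetch it by key and return it as a single item, as stated in the advantages above.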
Document Store:
3. Data is stored in nested hierarchies. Hierarchical information is stored in a single unit called a
document tree.
4. Querying is easy.
5. No object-relational mapping is needed, which enables easy search by following paths from the
root of the document tree.
6. Transactions on the document store exhibit ACID properties.
The demerits in Document Store are incompatibility with SQL and complexity for
implementation. Examples of Document Data Stores are CouchDB and MongoDB.
CSV is a format for records. CSV does not represent object-oriented databases or
hierarchical data records. JSON and XML represent semi-structured data, object-oriented records
and hierarchical data records. JSON (JavaScript Object Notation) is a language format for
semi-structured data.
JSON example:
{
  "employee": {
    "name": "sonoo",
    "salary": 56000,
    "married": true
  }
}
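The employee record above can be read and extended with Python's standard json module. This is a small sketch; the added "department" field is a hypothetical attribute used only to show that a field can be added without any schema change.

```python
import json

text = '{"employee": {"name": "sonoo", "salary": 56000, "married": true}}'

# Parsing turns the JSON text into nested dictionaries (a hierarchy).
record = json.loads(text)
name = record["employee"]["name"]

# Schema-less: a new field is simply added to the document.
record["employee"]["department"] = "sales"

# Serializing converts the nested structure back to JSON text.
round_trip = json.dumps(record, sort_keys=True)
print(name)
print(round_trip)
```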
The document store allows querying the data based on the contents as well. For example, it is
possible to search for the documents where the student's first name is "Ashish". The document store
can also provide the exact location of the searched value. The search uses the document path. A type
of key accesses the leaf values in the tree structure. Since the document stores are schema-less, adding
fields to documents (XML or JSON) becomes a simple task.
The document store follows a tree-like structure (similar to the directory structure in a file system).
Beneath the root element there are multiple branches. Each branch has a related path expression
that provides a way to navigate from the root to any given branch, sub-branch or value.
XQuery and XPath are query languages for finding and extracting elements and attributes from
XML documents. The query commands use sub-trees and attributes of documents. The querying
is similar to that in SQL for databases. XPath treats an XML document as a tree of nodes. XPath
queries are expressed in the form of XPath expressions.
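Path expressions of this kind can be tried with Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath. The XML document below is a hypothetical example built around the "Ashish" search mentioned earlier; the element and attribute names are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: a root element with one branch per student.
xml_text = """
<students>
  <student id="1"><first>Ashish</first><course>CS</course></student>
  <student id="2"><first>Meena</first><course>EE</course></student>
</students>
"""

root = ET.fromstring(xml_text)

# Path from the root to every <first> leaf under a <student> branch.
first_names = [e.text for e in root.findall("./student/first")]

# Predicate: select the student whose <first> child has the text 'Ashish'.
matches = root.findall("./student[first='Ashish']")
ids = [m.get("id") for m in matches]

print(first_names)
print(ids)
```

ElementTree covers only part of XPath, so fuller XPath or XQuery features would need a dedicated library; the navigation from root to branch to leaf is the same idea.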
➢ XML is used to describe structured data and does not include arrays, whereas JSON includes
arrays.
➢ JSON basically consists of key-value pairs and is easier to parse from JavaScript.
➢ The concise syntax of JSON for defining lists of elements makes it preferable for serialization
of text format objects.
Document Collection
A collection can be used in many ways for managing a large document store. Three uses of a
document collection are:
2. Enables navigating through document hierarchies, logically grouping similar documents and
storing business rules such as permissions, indexes and triggers.
Columnar Data Store: A way to implement a schema is the division into columns. The successive
values of each column are stored at successive memory addresses. In-memory analytics processing (AP)
uses columnar storage in memory. A pair of row-head and column-head is a key pair.
The pair accesses a field in the table.
Examples of columnar family data stores are HBase, BigTable, HyperTable and Cassandra.
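The difference between row and column layout can be sketched with plain Python lists. This is an illustration only, assuming a small hypothetical sales table; the helper names to_columns and field are invented for the example, and Python lists merely stand in for successive memory addresses.

```python
# Hypothetical row-oriented records.
rows = [
    {"id": 1, "city": "Pune", "sales": 120},
    {"id": 2, "city": "Delhi", "sales": 80},
    {"id": 3, "city": "Pune", "sales": 50},
]


def to_columns(rows):
    """Transpose row records into one list of successive values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def field(columns, row_index, column_name):
    """Access one field using the (row, column) pair as the key."""
    return columns[column_name][row_index]


columns = to_columns(rows)

# An analytic query touches only the column it needs.
total_sales = sum(columns["sales"])

print(columns["sales"], total_sales)
print(field(columns, 1, "city"))
```

Summing one column scans a single list and never reads the id or city values, which is why column layout suits analytics processing.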
Column Families: Two or more columns in a data store group into one column family.
Grouping of Column Families: Two or more column families in a data store form a super group,
called a super column.
Characteristics of Columnar Family Data Store
1. Scalability: The back-end system can distribute queries over a large number of processing nodes
without performing any Join operations.
3. Availability: The cost of replication is lower since the system scales on distributed nodes
efficiently.
6. Querying all the held values in a column in a family, all columns in the family, or a group of
column-families is fast in an in-memory column-family data store.
7. Replication of columns: HDFS-compatible column-family data stores replicate each data store
with default replication factor = 3.
8. No optimization for Join: Column-family data stores are similar to sparse matrix data. The data
is not optimized for Join operations.
Typical uses of a column store are: (i) web crawling, (ii) large sparsely populated tables and
(iii) systems that have high variance.
3. Compatibility with MapReduce, HBase APIs which are open-source Big Data platforms.
4. Key for a field uses not only row_ID and Column_ID but also timestamp and attributes. Values
are ordered bytes. Therefore, multiple versions of values may be present in the BigTable.
8. APIs include security and permissions
9. BigTable, being Google's cloud service, has global availability and its service is seamless.
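The idea in item 4, a key built from row ID, column ID and timestamp so that several versions of a value coexist, can be sketched as follows. This is a toy model and not the BigTable API: the class VersionedTable and its put and get methods are hypothetical.

```python
class VersionedTable:
    """Toy table whose cells keep every (timestamp, value) version."""

    def __init__(self):
        # (row_id, column_id) -> list of (timestamp, value), oldest first
        self._cells = {}

    def put(self, row_id, column_id, timestamp, value: bytes) -> None:
        versions = self._cells.setdefault((row_id, column_id), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda version: version[0])

    def get(self, row_id, column_id, timestamp=None):
        """Return the latest value, or the latest at or before timestamp."""
        versions = self._cells.get((row_id, column_id), [])
        if timestamp is not None:
            versions = [v for v in versions if v[0] <= timestamp]
        return versions[-1][1] if versions else None


table = VersionedTable()
table.put("row1", "name", 1, b"Asha")
table.put("row1", "name", 3, b"Asha R")

latest = table.get("row1", "name")
as_of_2 = table.get("row1", "name", timestamp=2)
print(latest, as_of_2)
```

Values are kept as bytes here to mirror the statement that values are ordered bytes; reading with a timestamp returns the version that was current at that time.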
2. System metadata, which provides information such as filename, creation date, last modified,
language used (such as Java, C, C#, C++, Smalltalk, Python), access permissions, supported query
3. Custom metadata, which provides information such as subject, category and sharing permissions.
Eleven Functions Supporting APIs: An Object data store consists of functions supporting APIs for
(i) scalability, (ii) indexing, (iii) large collections, (iv) querying language, processing and
optimization(s), (v) transactions, (vi) data replication for high availability, data distribution
model, data integration, (vii) schema evolution, (viii) persistency, (ix) persistent object life cycle,
(x) adding modules and (xi) locking and caching strategy.
Amazon S3 (Simple Storage Service): S3 refers to the Amazon web service on the cloud named S3.
S3 provides the Object Store. The Object Store differs from the block and file-based cloud
storage. Objects, along with their metadata, are stored as files. S3 assigns an ID
number to each stored object. The service has two storage classes: Standard and Infrequent Access.
Interfaces for the S3 service are REST, SOAP and BitTorrent. S3 uses include web hosting, image
hosting and storage for backup systems. S3 is a scalable storage infrastructure, the same as used in
the Amazon e-commerce service. S3 may store trillions of objects.
Object relational mapping of HTML document and XML web service store with the tabular data
store:
Graph Database:
A characteristic of a graph is high flexibility. Any number of nodes and any number of edges can be
added to expand a graph. The complexity is high and the performance is variable with scalability.
Data is stored as a series of interconnected nodes. A graph with interconnected data nodes provides
one of the best database systems when relationships and relationship types have critical values. The
data store focuses on modeling the interconnected structure of data. Nodes represent entities or
objects. Edges encode relationships between nodes. Some operations become simpler to perform using
graph models. An example of graph model usage is a social network of connected people. The
connections to related persons become easier to model when using the graph model.
The following example explains the graph database application in describing entities
relationships and relationship types
Characteristics of graph databases are:
2. Create a database system which models the data in a completely different way than the key-
values, document, columnar and object data store models.
4. Consists of a collection of small data size records, which have complex interactions between
graph-nodes and hypergraph nodes.
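Nodes, typed edges and a friend-of-friend query can be sketched with a small adjacency structure. This is a minimal illustration and not the API of any graph database: the class Graph, the relationship labels FRIEND and COLLEAGUE and the people in the example are all hypothetical.

```python
from collections import defaultdict


class Graph:
    """Toy undirected graph whose edges carry a relationship type."""

    def __init__(self):
        # node -> set of (relationship, neighbour)
        self._edges = defaultdict(set)

    def add_edge(self, a, relationship, b):
        self._edges[a].add((relationship, b))
        self._edges[b].add((relationship, a))

    def neighbours(self, node, relationship):
        return {n for rel, n in self._edges[node] if rel == relationship}

    def friends_of_friends(self, person):
        """People two FRIEND hops away who are not already direct friends."""
        friends = self.neighbours(person, "FRIEND")
        result = set()
        for friend in friends:
            result |= self.neighbours(friend, "FRIEND")
        return result - friends - {person}


g = Graph()
g.add_edge("Asha", "FRIEND", "Ravi")
g.add_edge("Ravi", "FRIEND", "Meena")
g.add_edge("Meena", "FRIEND", "John")
g.add_edge("Asha", "COLLEAGUE", "John")

fof = g.friends_of_friends("Asha")
print(fof)
```

The query follows edges directly from node to node, so no Join between tables is needed; the relationship type on each edge decides which connections count.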
(i) link analysis, (ii) friend of friend queries, (iii) rules and inference, (iv) rule induction
and (v) pattern matching. Link analysis is needed to perform searches and look for
patterns and relationships in situations such as social networking, telephone or email records.
Rules and inference are used to run queries on
complex structures such as class libraries, taxonomies and rule-based systems. Examples of
graph DBs are Neo4j, AllegroGraph, HyperGraph, Infinite Graph, Titan and FlockDB. Neo4j
(i) limits the support for Join queries, supports sparse matrix like columnar-family,
(ii) characteristics of easy creation and high processing speed, scalability and storability of
much higher magnitude of data (terabytes and petabytes).
NoSQL sacrifices the support of ACID properties, and instead supports CAP and BASE
properties. NoSQL data processing scales horizontally as well as vertically.
A Big Data solution needs scalable storage of terabytes and petabytes, dropping of support for
database Joins, and storing data differently on several distributed servers (data nodes) together
as a cluster.
1. High and easy scalability: NoSQL data stores are designed to expand horizontally.
Horizontal scaling means scaling out by adding more machines as data nodes (servers)
into the pool of resources (processing, memory, network connections). The design scales out
using multi-utility cloud services.
2. Support to replication: Multiple copies of data are stored across multiple nodes of a cluster. This
ensures high availability, partition tolerance, reliability and fault tolerance.
4. Usage of NoSQL servers, which are less expensive: NoSQL data stores require less
management effort. They support many features like automatic repair, easier data distribution
and simpler data models, which make database administrator (DBA) and tuning requirements
less stringent.
5. Usage of open-source tools: NoSQL data stores are cheap and open source. Database
implementation is easy and typically uses cheap servers to manage the exploding data and
transactions, while RDBMS databases are expensive and use big servers and storage systems.
6. Support to schema-less data model: A NoSQL data store is schema-less, so data can be inserted
in a NoSQL data store without any predefined schema.
7. Support to integrated caching: NoSQL data stores support caching in system memory,
which increases output performance. An SQL database needs a separate infrastructure for that.
8. No inflexibility: Unlike SQL/RDBMS, NoSQL DBs are flexible (not rigid) and have no
fixed structure for storing and manipulating data.
Shared-Nothing Architecture for Big Data Tasks:
Shared nothing (SN) is a cluster architecture in which a node does not share data with any other
node. It is used in distributed computing: each node is independent, and the nodes are
interconnected by a network.
A Big Data store uses the SN architecture and, therefore, easily partitions into shards.
A partition processes the different queries on data of the different users at each node independently.
Thus, data processes run in parallel at the nodes. Data of different data stores partition among the
nodes. Processing may require every node to maintain its own copy of the application's
data, using a coordination protocol.
• Independence: Each node shares no memory and thus possesses computational self-sufficiency.
• Each node functions as a shard: Each node stores a shard (a partition of large DBs).
• No network contention.
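The shared-nothing properties listed above can be sketched in one process. This is only an illustration under loud assumptions: the class Node is hypothetical, threads stand in for separate machines, and a real cluster would exchange the partial results over a network.

```python
from concurrent.futures import ThreadPoolExecutor


class Node:
    """A node owns its shard exclusively; nothing is shared with other nodes."""

    def __init__(self, name, shard):
        self.name = name
        self._shard = list(shard)  # private copy of this node's partition

    def query_total(self):
        # Each node answers the query using only its own data.
        return sum(self._shard)


# Hypothetical partitioning of one large dataset into three shards.
nodes = [
    Node("n1", [10, 20]),
    Node("n2", [5]),
    Node("n3", [7, 8, 9]),
]

# Threads stand in for machines here; the nodes run the query in parallel.
with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
    partials = list(pool.map(Node.query_total, nodes))

# Only the small partial results are combined, not the underlying data.
total = sum(partials)
print(partials, total)
```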
Big Data requires distribution on multiple data nodes at clusters. Distributed software components
give advantage of parallel processing; thus providing horizontal scalability. Distribution gives
The simplest distribution option for NoSQL data store and access is Single Server Distribution
(SSD) of an application. A graph database processes the relationships between nodes at a
server. The SSD model suits graph DBs well. An application processes the data sequentially
on a single server.
Features of MongoDB:
1. MongoDB data store is a physical container for collections. Each DB gets its own set of files
on the file system. A number of DBs can run on a single MongoDB server. DB is the default DB in
MongoDB, which is stored within a data folder. The database server of MongoDB is mongod and the
client is mongo.
3. Document model is well defined. Structure of a document is clear. A document is the unit of
storing data in a MongoDB database. Documents are analogous to the records of an RDBMS table.
Insert, update and delete operations can be performed on a collection. Documents use the
JSON (JavaScript Object Notation) approach for storing data.
4. MongoDB is a document data store in which one collection holds different documents. Data
is stored in the form of JSON-style documents.
5. Storing of data is flexible, and the data store consists of JSON-like documents. This implies that
the fields can vary from document to document and the data structure can be changed over time.
7. Querying, indexing and real-time aggregation allow accessing and analyzing the data
efficiently.
9. No complex Joins.
Cassandra Databases:
Cassandra was developed by Facebook and released by Apache. Cassandra is named after the
Trojan mythological prophet Cassandra, with classical allusions to a curse on an oracle. Later on,
IBM also released an enhancement of Cassandra as an open-source version.
• Cassandra is basically a column family database that stores and handles massive data of any
format including structured, semi-structured and unstructured data.
• Apache Cassandra DBMS contains a set of programs. They create and manage databases.
• Cassandra provides functions (commands) for querying the data and accessing the required
information.
• Functions do the viewing, querying and changing (update, insert or append or delete), visualizing
and perform transactions on the DB.
• Cassandra is written in Java. Big organizations, such as Facebook, IBM, Twitter, Cisco,
Rackspace, eBay and Netflix have adopted Cassandra.
(i) open source, (ii) scalable, (iii) non-relational, (iv) NoSQL, (v) distributed, (vi) column
based, (vii) decentralized, (viii) fault tolerant and (ix) tuneable consistency.
• Uses Classes consisting of ordered keys and semi-structured data storage systems
• Is fast and easily scalable with write operations spread across the cluster.
• Is a distributed DBMS designed for handling a high volume of structured data across multiple
cloud servers
• Has peer-to-peer distribution in the system across its nodes, and the data is distributed among
all the nodes in a cluster.
Data Replication: Cassandra stores data on multiple nodes (data replication) and thus has no
single point of failure, and ensures availability, a requirement in the CAP theorem. Data replication
uses a replication strategy. The replication factor determines the total number of replicas placed
on different nodes. Cassandra returns the most recent value of the data to the client. If it
detects that some of the nodes responded with a stale value, Cassandra performs a read repair
in the background to update the stale values.
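The read-repair behaviour described above can be sketched with plain dictionaries. This is a simplified model and not Cassandra's implementation or API: the function read_with_repair is hypothetical, each replica is a dictionary of (timestamp, value) cells, and the repair is done immediately rather than in the background.

```python
def read_with_repair(replicas, key):
    """Return the newest value for key and overwrite stale replica copies."""
    versions = [replica[key] for replica in replicas if key in replica]
    if not versions:
        return None
    # Each cell is (timestamp, value); the highest timestamp is most recent.
    newest = max(versions, key=lambda cell: cell[0])
    for replica in replicas:
        if replica.get(key) != newest:
            replica[key] = newest  # repair a stale or missing copy
    return newest[1]


# Three hypothetical replicas (replication factor 3): one stale, one missing.
replicas = [
    {"user:1": (3, "Pune")},
    {"user:1": (5, "Delhi")},
    {},
]

value = read_with_repair(replicas, "user:1")
print(value)
print(replicas)
```

After the read, every replica holds the same most recent cell, so a later read from any node returns the same value.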
Scalability: Cassandra provides linear scalability, which increases the throughput and decreases
the response time as the number of nodes in the cluster increases.
Transaction Support: Supports ACID properties (Atomicity, Consistency, Isolation and Durability).
Replication Option: Specifies either of the two replica placement strategy names. The strategy
names are Simple Strategy and Network Topology Strategy. The replica placement strategies
are:
2. Network Topology Strategy: Allows setting the replication factor for each data center
independently.
Cassandra Query Language (CQL) Table gives the CQL commands and their
functionalities