BDA (18CS72) Module-III
Download & Share VTU Connect App Now From Google Play Store
1 Big Data Analytics (18CS72)
Module -3
NoSQL
3.1 Introduction
Big Data uses distributed systems. A distributed system consists of multiple data nodes in
clusters of machines, together with distributed software components. Tasks execute in parallel
on the data held at the nodes in the clusters. The computing nodes communicate with the
applications through a network. Such an architecture has the following features:
1. Increased reliability and fault tolerance: The important advantage of a distributed computing
system is reliability. If a segment of machines in a cluster fails, the rest of the machines
continue to work. When datasets are replicated at a number of data nodes, fault tolerance
increases further. The replicas in the remaining segments continue the same computations that
were being done on the machines of the failed segment.
2. Flexibility: It is easy to install, implement and debug new services in a distributed
environment.
3. Sharding: Sharding means storing different parts of the data on different sets of data nodes,
clusters or servers. For example, sharding divides a huge database of university students into
smaller databases called shards. Each shard may correspond to the database for an individual
course and year. Each shard is stored at a different node or server.
4. Speed: Computing power increases in a distributed computing system because shards run in
parallel and independently on individual data nodes in clusters (no data sharing between shards).
5. Scalability: Consider sharding of a large database into a number of shards, distributed for
computing in different systems. When the database expands further, adding more machines
and increasing the number of shards provides horizontal scalability. Increasing the computing
power and running a number of algorithms on the same machines provides vertical scalability.
6. Resource sharing: Shared resources of memory, machines and network architecture reduce the
cost.
7. Performance: The collection of processors in the system provides higher performance than
a centralized computer, due to the lower cost of communication among machines (cost here means
the time taken up in communication).
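The sharding example above (one shard per course and year, each shard on its own node) can be sketched roughly as follows. The student records, the field names and the node names are made up for illustration, and the round-robin placement is only one possible assumption about how shards map to nodes.

```python
from collections import defaultdict

# Hypothetical student records; field names and values are illustrative only.
students = [
    {"id": 1, "name": "Asha", "course": "CS", "year": 2},
    {"id": 2, "name": "Ravi", "course": "EC", "year": 1},
    {"id": 3, "name": "Meera", "course": "CS", "year": 2},
    {"id": 4, "name": "John", "course": "ME", "year": 3},
]


def shard_key(record):
    """One shard per (course, year), as in the university example."""
    return (record["course"], record["year"])


def build_shards(records):
    """Group the records into shards according to their shard key."""
    shards = defaultdict(list)
    for record in records:
        shards[shard_key(record)].append(record)
    return dict(shards)


def assign_nodes(shards, nodes):
    """Place each shard on a node, round-robin over the sorted shard keys."""
    return {key: nodes[i % len(nodes)] for i, key in enumerate(sorted(shards))}


shards = build_shards(students)
placement = assign_nodes(shards, ["node-a", "node-b"])
```

Adding a node name to the list passed to `assign_nodes` spreads the same shards over more machines, which is the horizontal scaling described in item 5.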
Consistency in transactions means that a transaction must maintain the integrity constraints and
follow the consistency principle. For example, the sum of deposited amounts minus the sum of
withdrawn amounts in a bank account must equal the last balance. All three data need to be
consistent.
Isolation of transactions means two transactions of the database must be isolated from each
other and done separately.
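The bank-account consistency rule can be illustrated with a small check. This is a sketch under simple assumptions: the account is a plain dictionary, and the function names are hypothetical rather than taken from any database API.

```python
def is_consistent(account):
    """Consistency rule: sum of deposits minus sum of withdrawals equals the balance."""
    return sum(account["deposits"]) - sum(account["withdrawals"]) == account["balance"]


def deposit(account, amount):
    # Record the deposit and update the balance together, so the rule keeps holding.
    account["deposits"].append(amount)
    account["balance"] += amount


def withdraw(account, amount):
    # Record the withdrawal and update the balance together.
    account["withdrawals"].append(amount)
    account["balance"] -= amount


account = {"deposits": [], "withdrawals": [], "balance": 0}
deposit(account, 500)
withdraw(account, 120)
```

If the balance were changed without recording the matching deposit or withdrawal, `is_consistent(account)` would return `False`, which is the situation a consistent transaction must never leave behind.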
NOSQL
A new category of data stores is NoSQL (meaning Not Only SQL). NoSQL is an altogether new way
of thinking about databases, with features such as schema flexibility, simple
relationships, dynamic schemas, auto sharding, replication, integrated caching, horizontal
scalability of shards, distributable tuples, semi-structured data and flexibility in approach.
Issues with NoSQL data stores are the lack of standardization in approaches, processing
difficulties for complex queries, and dependence on eventually consistent results in place of
consistency in all states.
NoSQL records are held in non-relational data store systems. They use flexible data models, and
the records may use multiple schemas. NoSQL data stores hold semi-structured data. Big
Data stores use NoSQL.
1. NoSQL is a class of non-relational data storage systems with a flexible data model.
Examples of NoSQL data-architecture patterns of datasets are key-value pairs,
name/value pairs, column family, Big-data store, tabular data store, Cassandra (used in
Facebook/Apache), HBase, hash table [Dynamo (Amazon S3)], unordered keys using
JSON (CouchDB), JSON (PNUTS), JSON (MongoDB), graph store, object store,
ordered keys and semi-structured data storage systems.
2. NoSQL does not necessarily have a fixed schema, such as a table, and does not use the concept
of joins (in distributed data storage systems). Data written at one node can be replicated to
multiple nodes, so the data store is fault-tolerant. The store can be partitioned into
unshared shards.
Big Data NoSQL Solutions
NoSQL DBs are needed for Big Data solutions. They play an important role in handling Big Data
challenges. Table 3.1 gives examples of widely used NoSQL data stores.
Apache's CouchDB: A project of Apache which is also a widely used database for the web.
CouchDB consists of a document store. It uses the JSON data exchange format to store its
documents, JavaScript for indexing, combining and transforming documents, and HTTP APIs.
Oracle NoSQL: A step towards a NoSQL data store; a distributed key-value data store; provides
transactional semantics for data manipulation, horizontal scalability, and simple
administration and monitoring.
CAP Theorem
The CAP theorem states that, of consistency (C), availability (A) and partition tolerance (P),
a distributed application/service/process can guarantee only two at the same time.
Consistency means all copies have the same value, as in traditional DBs. Availability means
at least one copy is available in case a partition becomes inactive or fails. For example, in
web applications, the other copy in the other partition is available. Partition means parts
which are active but may not cooperate (share), as in distributed DBs.
1. Consistency in distributed databases means that all nodes observe the same data at the
same time. Therefore, the operations in one partition of the database should be
reflected in the other related partitions of a distributed database. Operations
which change the sales data from a specific showroom in a table should also
be reflected in the related tables which use that sales data.
2. Availability means that during transactions, the field values must be available
in other partitions of the database, so that each request receives a response on
success as well as on failure. (On failure, the response to the request comes from a
replica of the data.) Distributed databases require transparency between one another. Network
1. Consistency: All nodes observe the same data at the same time.
• The database must answer, even though the answer may be old or wrong data (AP).
• The database should not answer unless it has received the latest copy of the data (CP).
The CAP theorem implies that, for a system subject to network partitions, consistency
and availability are mutually exclusive choices. CA means consistency and availability, AP
means availability and partition tolerance, and CP means consistency and partition tolerance.
Figure 3.1 shows the CAP theorem usage in Big Data solutions.
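The AP and CP choices can be made concrete with a toy two-copy store. This is only a sketch: the class `ToyStore`, its `partitioned` flag and its method names are hypothetical, and a real distributed database detects partitions and replicates data in far more involved ways.

```python
class ToyStore:
    """Two copies of one value; a flag simulates a network partition between them."""

    def __init__(self, mode, initial):
        self.mode = mode            # "AP" or "CP" (assumed labels for this sketch)
        self.primary = initial
        self.secondary = initial
        self.partitioned = False    # True means the copies cannot communicate

    def write(self, value):
        # Writes reach the primary; replication to the secondary needs the network.
        self.primary = value
        if not self.partitioned:
            self.secondary = value

    def read_secondary(self):
        # CP: refuse to answer when the latest copy cannot be confirmed.
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable during partition")
        # AP: always answer, even if the answer may be old data.
        return self.secondary


ap_store = ToyStore("AP", initial=10)
ap_store.partitioned = True
ap_store.write(20)
stale_answer = ap_store.read_secondary()   # answers with the old value

cp_store = ToyStore("CP", initial=10)
cp_store.partitioned = True
cp_store.write(20)
try:
    cp_store.read_secondary()
    cp_answered = True
except RuntimeError:
    cp_answered = False                    # no answer rather than a stale one
```

The AP store stays available but returns old data, while the CP store gives up availability to avoid returning anything other than the latest copy.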
Schema of a database system refers to the design of a structure for datasets and of the data
structures for storing them in the database. NoSQL data do not necessarily have a fixed table
schema. The systems do not use the concept of Join (between distributed datasets). A cluster
of highly distributed nodes manages a single large data store with a NoSQL DB. Data written at
one node replicates to multiple nodes. Therefore, the replicas are identical, and the store is
fault-tolerant and partitioned into shards. Distributed databases can store and process a set
of information on more than one computing node.
NoSQL data stores possess the characteristic of increasing flexibility for data manipulation.
New attributes can be added to the database incrementally. Late binding of them is also
permitted.
BASE Properties
BA stands for basic availability, S stands for soft state and E stands for eventual
consistency.
1. Basic availability is ensured by distribution of shards (many partitions of a huge data
store) across many data nodes with a high degree of replication. Then, a segment failure does
not necessarily mean that the complete data store is unavailable.
2. Soft state ensures processing even in the presence of inconsistencies, while achieving
consistency eventually. A program suitably takes into account the inconsistency found
during processing. NoSQL database design does not require consistency at all times
during processing.
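Soft state and eventual consistency can be sketched with a store whose replicas receive updates late. The class and method names below are hypothetical, and the explicit `sync` call stands in for whatever background replication a real NoSQL store performs.

```python
class EventuallyConsistentStore:
    """Writes land on replica 0 first and reach the other replicas only on sync()."""

    def __init__(self, n_replicas=3):
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = []   # updates not yet delivered: (replica_index, key, value)

    def write(self, key, value):
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key, replica_index):
        # A read may see an old state (soft state) until replication catches up.
        return self.replicas[replica_index].get(key)

    def sync(self):
        # Deliver all pending updates in order; afterwards the replicas agree.
        for index, key, value in self.pending:
            self.replicas[index][key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("x", 1)
before_sync = store.read("x", replica_index=2)   # replica 2 has not seen the write
store.sync()
after_sync = store.read("x", replica_index=2)    # now consistent with replica 0
```

A program reading from replica 2 has to tolerate the missing value before `sync`, which is the kind of inconsistency item 2 says the program must take into account.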
The simplest way to implement a schema-less data store is to use key-value pairs.
The data store characteristics are high performance, scalability and flexibility. Data retrieval
is fast in a key-value pairs data store. A simple string, called the key, maps to a large data
string or BLOB (binary large object). Key-value store accesses use a primary key for accessing
the values. Therefore, the store can be easily scaled up for very large data. The concept is
similar to a hash table where a unique key points to a particular item(s) of data. Figure 3.4
shows the key-value pairs architectural pattern and an example of a students' database as
key-value pairs.
2. A query just requests the values and returns the values as a single item. Values can
5. Values returned by queries can be converted into lists, table columns, and data-frame
fields and columns.
6. Have (i) scalability, (ii) reliability, (iii) portability and (iv) low operational cost.
7. The key can be synthetic or auto-generated. The key is flexible and can be represented
in many formats: (i) artificially generated strings created from a hash of a value, (ii)
logical path names to images or files, (iii) REST web-service calls (request-response
cycles), and (iv) SQL queries.
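A key-value store behaves much like a hash table, so a minimal sketch can wrap a Python dictionary. The class `KeyValueStore`, its `put`/`get`/`delete` methods, the helper `hashed_key` and the example keys are all assumptions made for illustration; they are not the interface of any particular product.

```python
import hashlib


class KeyValueStore:
    """A toy in-memory key-value store backed by a dictionary (a hash table)."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        self._items[key] = value

    def get(self, key):
        # The value comes back as a single opaque item; the store does not parse it.
        return self._items.get(key)

    def delete(self, key):
        self._items.pop(key, None)


def hashed_key(value: bytes) -> str:
    """Key format (i): an artificially generated string made from a hash of the value."""
    return hashlib.sha256(value).hexdigest()


store = KeyValueStore()

# Key format (ii): a logical path name used as the key (hypothetical path).
store.put("students/2018/roll-42/photo.png", b"\x89PNG...image bytes...")

# Key format (i): the key is derived from the value itself.
blob = b"large data string or BLOB"
blob_key = hashed_key(blob)
store.put(blob_key, blob)
```

Because every access goes through one key, lookups do not depend on the structure of the stored value.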
Table 3.2 Traditional relational data model vs. the key-value store model
Document Store
Characteristics of a Document Data Store are high performance and flexibility. Scalability
varies, depending on the stored contents. Complexity is low compared to tabular, object and
graph data stores.
3. Data is stored in nested hierarchies, for example in the JSON data model [Example
3.3(ii)], the XML document object model (DOM), or machine-readable data as one BLOB.
Hierarchical information is stored in a single unit called a document tree. Logically related
data is stored together in a unit.
4. Querying is easy. For example, section number, sub-section number, figure captions
and table headings can be used to retrieve document partitions.
5. No object-relational mapping is needed, which enables easy search by following paths from
the root of the document tree.
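Following a path from the root of a document tree can be sketched with a nested Python dictionary standing in for a stored document. The document contents and the helper names `get_path` and `find_section` are hypothetical.

```python
# A hypothetical document stored as one nested unit (a document tree).
document = {
    "title": "Module 3: NoSQL",
    "sections": [
        {
            "number": "3.1",
            "heading": "Introduction",
            "figures": [{"caption": "CAP theorem usage in Big Data solutions"}],
        },
        {"number": "3.2", "heading": "Document Store", "figures": []},
    ],
}


def get_path(doc, path):
    """Follow a list of keys and list indexes from the root of the document tree."""
    node = doc
    for step in path:
        node = node[step]
    return node


def find_section(doc, number):
    """Retrieve a document partition by its section number, or None if absent."""
    for section in doc["sections"]:
        if section["number"] == number:
            return section
    return None


caption = get_path(document, ["sections", 0, "figures", 0, "caption"])
section = find_section(document, "3.2")
```

No mapping between objects and relational tables is involved: the query simply walks the nesting of the document itself.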
Typical uses of a document store are: (i) office documents, (ii) inventory store,
(iii) forms data, (iv) document exchange and (v) document search.
CSV and JSON File Formats
CSV is a format for records. CSV does not represent object-oriented databases or hierarchical
data records. JSON and XML represent semistructured data, object-oriented records and
hierarchical data records. JSON (JavaScript Object Notation) is a language format for
semistructured data. JSON represents object-oriented and hierarchical data records, objects,
and resource arrays in JavaScript.
JSON Files
JSON files hold semi-structured data: object-oriented records and hierarchical data records.
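The difference between flat CSV records and hierarchical JSON records can be seen by parsing a small sample of each with Python's standard `csv` and `json` modules. The sample data and field names are made up for illustration.

```python
import csv
import io
import json

# CSV: flat records, one row per record, every field read as a plain string.
csv_text = "roll,name,course\n1,Asha,CS\n2,Ravi,EC\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: one record can nest objects and arrays, and keeps numbers as numbers.
json_text = """
{
  "roll": 1,
  "name": "Asha",
  "courses": [
    {"code": "CS", "year": 2, "subjects": ["BDA", "ML"]}
  ]
}
"""
record = json.loads(json_text)

first_subject = record["courses"][0]["subjects"][0]
```

Representing the nested `courses` list in CSV would need extra columns or a second file, whereas JSON stores the hierarchy in the record itself.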
XML
An extensible, simple and scalable language. Its self-describing format describes structure and
contents in an easy-to-understand format.
XML is widely used. The document model consists of a root element and its sub-elements.
The XML document model has a hierarchical structure and has features of object-oriented
records. The XML format finds wide use in data stores.
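The root element and sub-element structure can be explored with Python's standard `xml.etree.ElementTree` module. The XML sample below, including its element and attribute names, is hypothetical.

```python
import xml.etree.ElementTree as ET

# A hypothetical XML document: one root element with nested sub-elements.
xml_text = (
    "<university>"
    "<student roll='1'><name>Asha</name><course>CS</course></student>"
    "<student roll='2'><name>Ravi</name><course>EC</course></student>"
    "</university>"
)

root = ET.fromstring(xml_text)

# The tags are self-describing: each sub-element names the data it contains.
names = [student.findtext("name") for student in root.findall("student")]
first_roll = root.find("student").get("roll")
```

Each `student` element behaves like a small record with named fields, which is the object-oriented flavour of the XML document model.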
Tabular data stores use rows and columns. The row-head field may be used as a key which
accesses and retrieves multiple values from the successive columns in that row. OLTP is
fast on in-memory row-format data.
Columnar Data Store
A way to implement a schema is to divide the data into columns. Successive values of each
column are stored at successive memory addresses. In-memory analytics processing (AP) uses
columnar storage in memory. A pair of row-head and column-head is a key-pair. The pair
accesses a field in the table.
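The contrast between row format and columnar format can be sketched with plain Python lists. The table contents are made up, and Python lists only imitate the idea of successive values sitting together; they do not control actual memory addresses.

```python
# Row format: each record keeps all of its fields together (suits OLTP lookups).
rows = [
    {"roll": 1, "name": "Asha", "marks": 82},
    {"roll": 2, "name": "Ravi", "marks": 74},
    {"roll": 3, "name": "Meera", "marks": 90},
]

# Columnar format: the successive values of each column are kept together.
columns = {field: [row[field] for row in rows] for field in rows[0]}

# OLTP-style access: fetch one complete row.
second_row = rows[1]

# Analytics-style access: aggregate over a single column only.
average_marks = sum(columns["marks"]) / len(columns["marks"])


def field(row_index, column_name):
    """Access one field using a (row-head, column-head) key-pair."""
    return columns[column_name][row_index]
```

Computing `average_marks` touches only the `marks` column, which is why columnar storage suits analytics, while fetching `second_row` touches one record, which suits transaction processing.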