BDA (18CS72) Module-III

The document covers key concepts in Big Data Analytics, focusing on distributed computing, NoSQL data stores, and the CAP theorem. It highlights the advantages and limitations of NoSQL systems, including flexibility, scalability, and eventual consistency.

Big Data Analytics (18CS72)

Module -3
NoSQL
3.1 Introduction
Big Data uses distributed systems. A distributed system consists of multiple data nodes in
clusters of machines, together with distributed software components. Tasks execute in parallel
on the data held at the nodes in the clusters. The computing nodes communicate with the
applications through a network.

Following are the features of distributed-computing architecture (Chapter

1. Increased reliability and fault tolerance: The important advantage of a distributed computing
system is reliability. If a segment of machines in a cluster fails, the rest of the machines
continue to work. When the datasets are replicated at a number of data nodes, the fault tolerance
increases further. The replicas in the remaining segments continue the same computations that
were being done at the failed segment's machines.

2. Flexibility: It is easy to install, implement and debug new services in a distributed
environment.

3. Sharding: Sharding is storing the different parts of data on different sets of data nodes,
clusters or servers. For example, a huge database of university students divides, on sharding,
into smaller databases called shards. Each shard may correspond to a database for an individual
course and year. Each shard is stored at different nodes or servers.

4. Speed: Computing power increases in a distributed computing system because shards run in
parallel on individual data nodes in clusters, independently of each other (no data sharing
between shards).

5. Scalability: Consider sharding of a large database into a number of shards, distributed for
computing in different systems. When the database expands further, adding more machines
and increasing the number of shards provides horizontal scalability. Increasing the computing
power of the same machines, and running a larger number of algorithms on them, provides
vertical scalability.

Resource sharing: Shared resources of memory, machines and network architecture reduce the cost.

Open system: An open system makes the service accessible to all nodes.

6. Performance: The collection of processors in the system provides higher performance than
a centralized computer, due to the lower cost of communication among machines (cost here means
the time taken up in communication).

SUNIL G L, A.P, DEPT. OF CSE, SVIT, BENGALURU
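The sharding described in item 3 of the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node names, the student records and the choice of a CRC32 hash to place shards are hypothetical and do not come from the notes.

```python
import zlib
from collections import defaultdict

# Hypothetical cluster; the node names are illustrative only.
NODES = ["node-0", "node-1", "node-2"]

def shard_key(student):
    """One shard per course and year, as in the university example."""
    return f"{student['course']}-{student['year']}"

def node_for(key, nodes=NODES):
    """Place a shard on a node using a stable hash of the shard key."""
    return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]

def shard(students):
    """Group records into shards, then assign each shard to a node."""
    shards = defaultdict(list)
    for s in students:
        shards[shard_key(s)].append(s)
    placement = {key: node_for(key) for key in shards}
    return shards, placement

# Hypothetical student records.
students = [
    {"usn": "S001", "course": "CSE", "year": 2021},
    {"usn": "S002", "course": "CSE", "year": 2021},
    {"usn": "S003", "course": "ECE", "year": 2021},
    {"usn": "S004", "course": "CSE", "year": 2022},
]
shards, placement = shard(students)
for key, node in placement.items():
    print(key, "->", node, [s["usn"] for s in shards[key]])
```

Adding a node to `NODES` and creating more shards corresponds to the horizontal scalability of item 5. A plain modulo placement moves many shards when the node count changes, which is why production systems usually prefer schemes such as consistent hashing.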

3.2 NOSQL DATA STORE

SQL is a declarative language based on relational algebra, and it defines the data schema. SQL
is used to create and query databases in relational database management systems (RDBMSs). An
RDBMS uses a tabular data store with relational algebra: precisely defined operators that take
relations as their operands. A relation is a set of tuples, and each tuple is a set of named
attribute values. A tuple is identified uniquely by keys called candidate keys.

ACID Properties in SQL Transactions


Atomicity of a transaction means all operations in the transaction must complete and, if
interrupted, must be undone (rolled back). For example, when a customer withdraws an amount,
the bank in the first operation enters the withdrawn amount in the table and in the next operation
modifies the balance to the new amount available. Atomicity means both operations are completed,
or both are undone if the transaction is interrupted in between.

Consistency in transactions means that a transaction must maintain the integrity constraints and
follow the consistency principle. For example, the difference between the sum of deposited amounts
and the sum of withdrawn amounts in a bank account must equal the last balance. All three data
items need to be consistent.

Isolation of transactions means two transactions on the database must be isolated from each
other: each executes as if it were running separately, without seeing the other's intermediate
results.

Durability means the results of a transaction must persist once it has completed.
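The bank withdrawal example used for atomicity can be tried out with Python's built-in sqlite3 module. The sketch below is illustrative only: the table names, column names and amounts are assumptions made for the example, not part of the notes.

```python
import sqlite3

def withdraw(conn, account_id, amount):
    """Record the withdrawal and update the balance as one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO withdrawals (account_id, amount) VALUES (?, ?)",
                (account_id, amount),
            )
            cur = conn.execute(
                "UPDATE accounts SET balance = balance - ? "
                "WHERE id = ? AND balance >= ?",
                (amount, account_id, amount),
            )
            if cur.rowcount == 0:
                # Second operation failed, so the first must be undone too.
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False

# Hypothetical schema and data, held in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("CREATE TABLE withdrawals (account_id INTEGER, amount INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.commit()

ok = withdraw(conn, 1, 30)       # both operations succeed
failed = withdraw(conn, 1, 500)  # balance check fails, both are rolled back
balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
n_withdrawals = conn.execute("SELECT COUNT(*) FROM withdrawals").fetchone()[0]
print(ok, failed, balance, n_withdrawals)
```

After the failed call, the withdrawals table holds only the first entry, which shows that the interrupted transaction left no partial result behind.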

NOSQL

A new category of data stores is NoSQL (meaning Not Only SQL) data stores. NoSQL is an
altogether new approach to thinking about databases. Its features include schema flexibility,
simple relationships, dynamic schemas, auto sharding, replication, integrated caching, horizontal
scalability of shards, distributable tuples, semi-structured data and flexibility in approach.

Issues with NoSQL data stores are the lack of standardization in approaches, processing
difficulties for complex queries, and dependence on eventually consistent results in place of
consistency in all states.

Big Data NoSQL

NoSQL records are held in non-relational data store systems. They use flexible data models, and
the records can use multiple schemas. The data in NoSQL stores is generally treated as
semi-structured. Big Data stores use NoSQL.

NoSQL data store characteristics are as follows:

1. NoSQL is a class of non-relational data storage systems with a flexible data model.
Examples of NoSQL data-architecture patterns of datasets are key-value pairs,
name/value pairs, column family, Big-data store, tabular data store, Cassandra (used in
Facebook/Apache), HBase, hash table [Dynamo (Amazon S3)], unordered keys using
JSON (CouchDB), JSON (PNUTS), JSON (MongoDB), graph store, object store,
ordered keys and semi-structured data storage systems.

2. NoSQL does not necessarily have a fixed schema, such as a table, and does not use the concept
of joins (in distributed data storage systems). Data written at one node can be replicated to
multiple nodes, so the data store is fault-tolerant. The store can be partitioned into
unshared shards.

Features in NoSQL Transactions

NoSQL transactions have the following features:

1. They relax one or more of the ACID properties.

2. They are characterized by two out of the three properties (consistency, availability and
partition tolerance) of the CAP theorem: two of them are present for the
application/service/process.

3. They can be characterized by BASE properties.

Big Data NoSQL Solutions

NoSQL DBs are needed for Big Data solutions. They play an important role in handling Big Data
challenges. Table 3.1 gives examples of widely used NoSQL data stores.

Table 3.1 NoSQL data stores and their characteristic features

Apache's HBase: HDFS compatible, open-source and non-relational data store written in Java; a
column-family based NoSQL data store providing BigTable-like capabilities (Sections 2.6 and
3.3.3.2); scalability, strong consistency, versioning, configuring and maintaining data store
characteristics.

MongoDB: HDFS compatible; master-slave distribution model (Section 3.5.1.3); document-oriented
data store with JSON-like documents and dynamic schemas; open-source, NoSQL, scalable and
non-relational database; used by the websites Craigslist, eBay and Foursquare at the backend.

Apache's Cassandra: HDFS compatible DBs; decentralized peer-to-peer distribution model
(Section 3.5.1.4); open source; NoSQL; scalable, non-relational, column-family based,
fault-tolerant, with tuneable consistency (Section 3.7); used by Facebook and Instagram.

Apache's CouchDB: A project of Apache which is also a widely used database for the web. CouchDB
consists of a Document Store. It uses the JSON data exchange format to store its documents,
JavaScript for indexing, combining and transforming documents, and HTTP APIs.

Oracle NoSQL: A step towards a NoSQL data store; distributed key-value data store; provides
transactional semantics for data manipulation, horizontal scalability, simple administration
and monitoring.

Riak: An open-source key-value store written in Erlang; high availability (using the
replication concept), fault tolerance, operational simplicity and scalability.

CAP Theorem

Among C, A and P, only two can be guaranteed together for an application/service/process.
Consistency means all copies have the same value, as in traditional DBs. Availability means
at least one copy is available in case a partition becomes inactive or fails. For example,
in web applications, the other copy in the other partition is available. Partition means
parts which are active but may not cooperate (share), as in distributed DBs.

1. Consistency in distributed databases means that all nodes observe the same data at the
same time. Therefore, the operations in one partition of the database should be
reflected in the other related partitions of a distributed database. Operations
which change the sales data from a specific showroom in a table should also be
reflected as changes in the related tables which use that sales data.

2. Availability means that during the transactions, the field values must be available
in other partitions of the database so that each request receives a response on
success as well as on failure. (On failure, the response to the request comes from a
replica of the data.) Distributed databases require transparency between one another.
Network failure may lead to data unavailability in a certain partition if there is no
replication. Replication ensures availability.

3. Partition means division of a large database into different databases without
affecting the operations on them, by adopting specified procedures.

4. Partition tolerance refers to continuation of operations as a whole even in case of
message loss, node failure or a node not being reachable.

Brewer's CAP (Consistency, Availability and Partition Tolerance) theorem demonstrates
that a distributed system cannot guarantee C, A and P together.

1. Consistency: All nodes observe the same data at the same time.

2. Availability: Each request receives a response on success/failure.

3. Partition tolerance: The system continues to operate as a whole even in case of
message loss, node failure or a node not being reachable.

Partition tolerance cannot be overlooked when achieving reliability in a distributed
database system. Thus, in case of a network failure, the choice can be:

• The database must answer, and that answer may be old or wrong data (AP).

• The database should not answer unless it has received the latest copy of the data (CP).

The CAP theorem implies that, for a system under network partition, consistency and
availability are mutually exclusive choices. CA means consistency and availability, AP means
availability and partition tolerance, and CP means consistency and partition tolerance.
Figure 3.1 shows the CAP theorem usage in Big Data solutions.
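The AP and CP choices above can be illustrated with a toy replica that is cut off from its primary. This is a deliberately simplified sketch: the `Replica` class, the `partitioned` flag and the exception name are all hypothetical, and real systems detect partitions through timeouts and quorum protocols rather than a boolean flag.

```python
class StaleReadRefused(Exception):
    """Raised by a CP replica that cannot confirm it holds the latest copy."""

class Replica:
    def __init__(self, mode):
        self.mode = mode          # "AP" or "CP"
        self.value = None
        self.partitioned = False  # True when cut off from the primary

    def replicate(self, value):
        """Apply an update from the primary; it is lost during a partition."""
        if not self.partitioned:
            self.value = value

    def read(self):
        if self.partitioned and self.mode == "CP":
            # CP: refuse to answer rather than return possibly old data.
            raise StaleReadRefused("cannot reach primary")
        # AP: always answer, possibly with old data.
        return self.value

ap, cp = Replica("AP"), Replica("CP")
for replica in (ap, cp):
    replica.replicate("v1")
    replica.partitioned = True   # network failure
    replica.replicate("v2")      # this update never arrives

ap_answer = ap.read()            # answers with the old value
try:
    cp_answer = cp.read()
except StaleReadRefused:
    cp_answer = None             # no answer until the partition heals
print(ap_answer, cp_answer)
```

The AP replica stays available but serves the old value, while the CP replica stays consistent by giving up availability for the duration of the partition.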

Schema Less Database

The schema of a database system refers to the design of a structure for the datasets and data
structures to be stored in the database. NoSQL data does not necessarily have a fixed table
schema. The systems do not use the concept of join (between distributed datasets). A cluster of
highly distributed nodes manages a single large data store with a NoSQL DB. Data written at one
node replicates to multiple nodes. Therefore, the replicas are identical, and the store is
fault-tolerant and can be partitioned into shards. Distributed databases can store and process
a set of information on more than one computing node.

Increasing Flexibility for Data Manipulation

A NoSQL data store possesses the characteristic of increasing flexibility for data
manipulation. New attributes can be added to the database incrementally. Late binding of
attributes is also permitted.
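A small Python sketch can make this flexibility concrete. The records and attribute names are hypothetical, and treating late binding as "decide at read time how to handle a missing attribute" is one reading of the term, offered here as an assumption.

```python
# Records in the same collection need not share a schema (hypothetical data).
students = [
    {"usn": "S001", "name": "Asha"},
    {"usn": "S002", "name": "Ravi", "email": "ravi@example.com"},
]

# A new attribute is added to one record without altering the others.
students[0]["hostel"] = "Block A"

# The reader decides at read time how to treat a missing attribute.
hostels = [s.get("hostel", "not assigned") for s in students]
print(hostels)
```

In a relational table the same change would require altering the table definition so that every row carries the new column.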

BASE Properties

BA stands for basic availability, S stands for soft state and E stands for eventual
consistency.

1. Basic availability is ensured by distributing shards (many partitions of a huge data store)
across many data nodes with a high degree of replication. A segment failure then does not
necessarily mean complete unavailability of the data store.

2. Soft state ensures processing even in the presence of inconsistencies, with consistency
achieved eventually. A program suitably takes into account any inconsistency found
during processing. NoSQL database design does not require consistency throughout the
processing time.

3. Eventual consistency means that the consistency requirement in NoSQL databases is met
at some point of time in the future. Data converges eventually to a consistent state, with no
time-frame specified for achieving that. ACID rules require consistency throughout the
processing, on completion of each transaction. BASE does not have that requirement and so has
more flexibility.
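Eventual consistency can be sketched with two replicas that exchange their contents from time to time. The version numbers and the "highest version wins" rule below are assumptions chosen to keep the example short; they are one common way to reconcile replicas, not a method described in the notes.

```python
class Replica:
    def __init__(self):
        self.store = {}  # key -> (version, value)

    def write(self, key, value, version):
        """Keep the entry only if it is newer than what is already held."""
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None

def sync(a, b):
    """Exchange entries in both directions so both keep the newest version."""
    for src, dst in ((a, b), (b, a)):
        for key, (version, value) in list(src.store.items()):
            dst.write(key, value, version)

r1, r2 = Replica(), Replica()
r1.write("balance", 100, version=1)
sync(r1, r2)                         # both replicas now agree
r1.write("balance", 70, version=2)   # update reaches r1 only
stale = r2.read("balance")           # soft state: r2 still sees the old value
sync(r1, r2)                         # a later exchange brings r2 up to date
converged = r2.read("balance")
print(stale, converged)
```

Between the write and the second exchange the replicas disagree, which is the soft state of item 2. Nothing in the code bounds how long that window lasts, matching the absence of a time-frame in item 3.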
3.3 NOSQL DATA ARCHITECTURE PATTERNS

3.3.1 Key-Value Store

The simplest way to implement a schema-less data store is to use key-value pairs.
The data store characteristics are high performance, scalability and flexibility. Data retrieval
is fast in a key-value pair data store. A simple string, called the key, maps to a large data
string or BLOB (Binary Large Object). Key-value store accesses use a primary key for accessing
the values. Therefore, the store can be easily scaled up for very large data. The concept is
similar to a hash table, where a unique key points to a particular item (or items) of data.
Figure 3.4 shows the key-value pairs architectural pattern and an example of a students'
database as key-value pairs.

Advantages of a key-value store are as follows:

1. The data store can store any data type in a value field. The key-value system
stores the information as a BLOB of data (such as text, hypertext, images, video
and audio) and returns the same BLOB when the data is retrieved. Storage is like
an English dictionary: a query for a word retrieves the meanings, usages and different
forms as a single item in the dictionary. Similarly, querying for a key retrieves the
values.

2. A query just requests the values and returns the values as a single item. Values can
be of any data type.

3. A key-value store is eventually consistent.

4. A key-value data store may be hierarchical or may be an ordered key-value store.

5. Returned values on queries can be converted into lists, table columns, data-frame
fields and columns.

6. Key-value stores have (i) scalability, (ii) reliability, (iii) portability and (iv) low
operational cost.

7. The key can be synthetic or auto-generated. The key is flexible and can be represented
in many formats: (i) artificially generated strings created from a hash of a value, (ii)
logical path names to images or files, (iii) REST web-service calls (request-response
cycles), and (iv) SQL queries.

Limitations of the key-value store architectural pattern are:

1. No indexes are maintained on values, thus a subset of values is not searchable.
2. A key-value store does not provide traditional database capabilities, such as atomicity of
transactions, or consistency when multiple transactions are executed simultaneously.
The application needs to implement such capabilities.
3. Maintaining unique values as keys may become more difficult when the volume of data
increases. One cannot retrieve a single result when a key-value pair is not uniquely
identified.
4. Queries cannot be performed on individual values. There is no clause like 'where' in a
relational database that filters a result set.

Table 3.2 Traditional relational data model vs. the key-value store model

Traditional relational model                     Key-value store model
Result set based on row values                   Queries return a single item
Values of rows for large datasets are indexed    No indexes on values
Same data type values in columns                 Any data type values
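The contrast in Table 3.2 can be seen in a minimal in-memory key-value store. The sketch below is only an illustration: the `KeyValueStore` class, its method names and the sample keys are hypothetical and do not correspond to any particular product's API.

```python
class KeyValueStore:
    """Toy key-value store: values are opaque blobs addressed by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value  # any data type; the store does not inspect it

    def get(self, key):
        return self._data.get(key)  # a query returns a single item

    def scan(self, predicate):
        """No index on values: filtering must visit every pair."""
        return [k for k, v in self._data.items() if predicate(v)]

store = KeyValueStore()
store.put("student:S001", b'{"name": "Asha", "course": "CSE"}')
store.put("img/logo.png", b"\x89PNG...")  # a logical path name as the key
store.put("student:S002", b'{"name": "Ravi", "course": "ECE"}')

blob = store.get("student:S001")
cse_keys = store.scan(lambda v: b'"course": "CSE"' in v)
print(blob, cse_keys)
```

`get` is a direct lookup, while anything resembling a 'where' clause has to be written by the application as a full scan over all values, which is the limitation noted in items 1 and 4 above.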

Typical uses of a key-value store are:

(i) Image store,

(ii) Document or file store,

(iii) Lookup table, and

(iv) Query cache.

Riak is an open-source data store written in the Erlang language. It is a key-value data store
system. Data auto-distributes and replicates in Riak. It is thus fault tolerant and reliable.
Some other widely used key-value stores among NoSQL DBs are Amazon's DynamoDB, Redis (often
referred to as a data structure server), Memcached and its flavours, Berkeley DB, upscaledb
(used for embedded databases), Project Voldemort and Couchbase.

Document Store
Characteristics of a document data store are high performance and flexibility. Scalability
varies and depends on the stored contents. Complexity is low compared to tabular, object and
graph data stores.

The following are the features of a document store:

1. A document store holds unstructured data.

2. Storage has similarity with an object store.

3. Data is stored in nested hierarchies, for example in the JSON-format data model [Example
3.3(ii)], the XML document object model (DOM), or machine-readable data as one BLOB.
Hierarchical information is stored in a single unit called a document tree. Logically related
data is stored together in a unit.

4. Querying is easy. For example, section number, sub-section number, figure caption and
table headings can be used to retrieve document partitions.

5. No object-relational mapping is needed, which enables easy search by following paths from
the root of the document tree.

6. Transactions on the document store exhibit ACID properties.

Typical uses of a document store are: (i) office documents, (ii) inventory store,
(iii) forms data, (iv) document exchange and (v) document search.

Examples of document data stores are CouchDB and MongoDB.
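The idea of a document tree searched by following paths from its root can be sketched with Python's standard json module. The document contents and the `get_path` helper are hypothetical, written only for this illustration; document databases such as CouchDB and MongoDB provide their own query mechanisms.

```python
import json

# A hypothetical student document with nested hierarchies.
doc = json.loads("""
{
  "usn": "S001",
  "name": {"first": "Asha", "last": "Rao"},
  "courses": [
    {"code": "18CS72", "title": "Big Data Analytics"},
    {"code": "HYP01", "title": "Hypothetical Course"}
  ]
}
""")

def get_path(node, path):
    """Follow a path of keys and list indexes from the root of the tree."""
    for step in path:
        node = node[step]
    return node

first_name = get_path(doc, ["name", "first"])
second_code = get_path(doc, ["courses", 1, "code"])
print(first_name, second_code)
```

All the information about one student is stored together in a single unit, so no join or object-relational mapping is needed to reach a nested field.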

CSV and JSON File Formats

CSV is a format for storing records. CSV does not represent object-oriented databases or
hierarchical data records. JSON and XML represent semi-structured data, object-oriented records
and hierarchical data records.

JSON Files
• JSON (JavaScript Object Notation) is a language format for semi-structured data.
• JSON represents object-oriented records and hierarchical data records.
• JSON represents objects and resource arrays in JavaScript.

Document JSON Format CouchDB Database

Apache CouchDB is an open-source database. Its features are:
• CouchDB provides mapping functions during querying, combining and filtering of
information.
• CouchDB deploys the JSON data store model for documents. Each document maintains separate
data and metadata (schema).
• CouchDB is a multi-master application. A write does not require field locking when
controlling concurrency in a multi-master application.
• The CouchDB querying language is JavaScript.

XML

• XML is an extensible, simple and scalable language. Its self-describing format describes
structure and contents in an easy-to-understand format.
• XML is widely used. The document model consists of a root element and its sub-elements.
• The XML document model has a hierarchical structure and has features of object-oriented
records. The XML format finds wide use in data stores.

Tabular data stores use rows and columns. The row-head field may be used as a key which
accesses and retrieves multiple values from the successive columns in that row. OLTP is
fast on in-memory row-format data.

Columnar Data Store

A way to implement a schema is to divide the data into columns. Each column's successive
values are stored at successive memory addresses. In-memory analytics processing (AP) uses
columnar storage in memory. A pair of row-head and column-head is a key-pair. The pair
accesses a field in the table.
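The difference between row format and columnar format can be sketched with plain Python lists. The student records are hypothetical, and Python lists only approximate the "successive memory addresses" of a real columnar engine, so this illustrates the layout rather than the performance.

```python
# Row format: each record keeps all of its fields together (hypothetical data).
rows = [
    {"usn": "S001", "marks": 81, "year": 2021},
    {"usn": "S002", "marks": 67, "year": 2021},
    {"usn": "S003", "marks": 92, "year": 2022},
]

# Columnar format: each column's successive values are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An analytics query touches only the one column it needs.
average_marks = sum(columns["marks"]) / len(columns["marks"])

def field(columns, row_index, column_head):
    """A (row-head, column-head) pair addresses one field in the table."""
    return columns[column_head][row_index]

print(average_marks, field(columns, 2, "usn"))
```

Reading one whole record is natural in the row format, which suits OLTP, while aggregating one attribute over all records is natural in the columnar format, which suits analytics.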

Column-Family Data Store

A column-family data store has a group of columns as a column family. A combination of
row head, column-family head and table-column head can also be a key to access a field in a
column of the table during querying. A combination of row head, column-families head,
column-family head and column head for values in column fields can also be a key to access
fields of a column. A column-family head is also called a super-column head.
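Access by a combination of row head, column-family head and column head can be sketched with nested dictionaries. The table contents, the family names "personal" and "academic", and the `get_field` helper are hypothetical; stores such as HBase and Cassandra expose their own APIs for this.

```python
# Hypothetical table: row head -> column-family head -> column head -> value.
table = {
    "S001": {
        "personal": {"name": "Asha", "city": "Bengaluru"},
        "academic": {"course": "CSE", "year": 2021},
    },
    "S002": {
        "personal": {"name": "Ravi"},
        "academic": {"course": "ECE", "year": 2022, "backlogs": 1},
    },
}

def get_field(table, row_head, family_head, column_head, default=None):
    """Access one field by the composite key (row, column family, column)."""
    return table.get(row_head, {}).get(family_head, {}).get(column_head, default)

course = get_field(table, "S002", "academic", "course")
missing = get_field(table, "S001", "academic", "backlogs")
print(course, missing)
```

Rows in the same column family need not carry the same columns, so a field that is absent for one row simply yields the default value.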
