
THE ANJUMAN-I-ISLAM'S
M. H. SABOO SIDDIK COLLEGE OF ENGINEERING
Department of Computer Science & Engineering (AI & ML)
CSC702
BIG DATA ANALYTICS

Subject I/c: Prof. Arshi Khan
Module 3: NoSQL
⚫ Introduction to NoSQL
⚫ NoSQL Business Drivers
⚫ NoSQL Data Architecture Patterns: Key-value stores, Graph stores, Column family (Bigtable) stores, Document stores, Variations of NoSQL architectural patterns, NoSQL Case Study
⚫ NoSQL solution for big data
⚫ Understanding the types of big data problems; Analyzing big data with a shared-nothing architecture; Choosing distribution models: master-slave versus peer-to-peer
⚫ Four ways that NoSQL systems handle big data problems
Introduction to NoSQL
⚫ NoSQL databases (aka "not only SQL") are non-tabular databases that store data differently from relational tables. NoSQL databases come in a variety of types based on their data model. The main types are document, key-value, wide-column, and graph. They provide flexible schemas and scale easily with large amounts of data and high user loads.
⚫ We are familiar with the concept of relational databases that store data in rows, form relationships between the tables, and query the data using SQL. However, a new type of database, NoSQL, started to rise in popularity in the early 21st century.

⚫ NoSQL is short for "not-only SQL", but is also commonly called "non-relational" or "non-SQL". Any database technology that stores data differently from relational databases can be categorized as a NoSQL database.
⚫ It wasn't until around the late 1960s that the first implementation of a computerized database came into existence.
⚫ Relational databases gained popularity in the 1970s and have remained a staple in the database world ever since.
⚫ However, as datasets became exponentially larger and more complex, developers began to seek a flexible and more scalable database solution. This is where NoSQL came in.
Advantages of NoSQL databases
⚫ Scalability: NoSQL can be an excellent choice for massive datasets that need to be distributed across multiple servers and locations.
⚫ Flexibility: Unlike a relational database, NoSQL databases don't require a schema. This means that NoSQL can handle unstructured or semi-structured data in different formats.
⚫ Developer Experience: NoSQL requires less organization and thus lets developers focus more on using the data than on figuring out how to store it.
Drawbacks
⚫ Data Integrity: Relational databases are typically ACID compliant, ensuring high data integrity. NoSQL databases follow BASE principles (basically available, soft state, eventual consistency) and can often sacrifice integrity for increased data distribution and availability. However, some NoSQL databases do offer ACID compliance.
⚫ Language Standardization: While some NoSQL databases do use the Structured Query Language (SQL), typically, each database uses its own unique language to set up, manage, and query data.
Types of NoSQL Databases

Key-Value
A key-value database consists of individual records organized via key-value pairs. In this model, keys and values can be any type of data, ranging from numbers to complex objects. However, keys must be unique. This means this type of database is best when data is attributed to a unique key, like an ID number. Ideally, the data is also simple, and we are looking to prioritize fast queries over fancy features.
⚫ For example, let’s say we wanted to store
shopping cart information for customers who
shop in an e-commerce store. Our key-value
database might look like this:
⚫ Amazon DynamoDB and Redis are popular
options for developers looking to work with
key-value databases.
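As a minimal illustration of the key-value pattern above, the sketch below uses a plain in-memory Python class standing in for a store such as Redis or DynamoDB. The key names (e.g. "cart:1001") and cart contents are invented for the example.

```python
# A toy key-value store: records are looked up only by their unique key.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Keys must be unique; a put overwrites any existing value.
        self._data[key] = value

    def get(self, key):
        # Fast lookup by key; no querying on the value's contents.
        return self._data.get(key)

store = KeyValueStore()
store.put("cart:1001", {"items": ["hiking boots", "water bottle"], "total": 89.98})
store.put("cart:1002", {"items": ["tent"], "total": 129.00})

print(store.get("cart:1001")["total"])  # → 89.98
```

Note that all access goes through the key: there is no way to ask "which carts contain a tent?" without scanning, which is exactly the trade-off the text describes.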
Document

A document-based (also called document-oriented) database consists of data stored in hierarchical structures. Some supported document formats include JSON, BSON, XML, and YAML. The document-based model is considered an extension of the key-value database and provides querying capabilities not solely based on unique keys. Documents are considered very flexible and can evolve to fit an application's needs. They can even model relationships!
⚫ For example, let’s say we wanted to store
product information for customers who shop in
our e-commerce store. A products document
might look like this:
⚫ MongoDB is a popular option for developers
looking to work with a document database.
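A hypothetical "products" collection can be sketched with nested Python dicts standing in for JSON documents (the product names and fields are invented). The point is what document stores add over key-value stores: querying on fields other than the key.

```python
# Each document is a hierarchical (nested) structure, as in MongoDB-style JSON.
products = [
    {"_id": 1, "name": "Hiking Backpack", "price": 79.99,
     "specs": {"volume_l": 40, "color": "green"},
     "tags": ["hiking", "outdoor"]},
    {"_id": 2, "name": "Trail Camera", "price": 149.99,
     "specs": {"megapixels": 20},      # documents need not share one schema
     "tags": ["cameras", "outdoor"]},
]

# Query by a non-key field -- something a pure key-value store cannot do:
outdoor = [p["name"] for p in products if "outdoor" in p["tags"]]
print(outdoor)  # → ['Hiking Backpack', 'Trail Camera']
```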
Graph

A graph database stores data using a graph structure. In a graph structure, data is stored in individual nodes (also called vertices) and establishes relationships via edges (also called links or lines). The advantage of the relationships built using a graph database, as opposed to a relational database, is that they are much simpler to set up, manage, and query.
⚫ For example, let’s say we wanted to build a recommendation
engine for our e-commerce store. We could establish
relationships between similar items our customers searched
for to create recommendations.
⚫ In the graph above, we can see that there are
four nodes: “Neo”, “Hiking”, “Cameras”, and
“Hiking Camera Backpack”. Because the user,
“Neo”, searched for “Hiking” and “Cameras”,
there are edges connecting all 3 nodes. More
edges are created after the search, linking a
new node, “Hiking Camera Backpack”.

⚫ Neo4j is a popular option for developers looking to work with a graph database.
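The recommendation example above can be sketched with a plain adjacency structure (real graph databases like Neo4j add a query language and indexing on top; this toy version only shows the node/edge idea).

```python
from collections import defaultdict

# Undirected graph: each node maps to the set of nodes it shares an edge with.
edges = defaultdict(set)

def add_edge(a, b):
    edges[a].add(b)
    edges[b].add(a)

# "Neo" searched for "Hiking" and "Cameras" ...
add_edge("Neo", "Hiking")
add_edge("Neo", "Cameras")
# ... and those interests link to a related product node.
add_edge("Hiking", "Hiking Camera Backpack")
add_edge("Cameras", "Hiking Camera Backpack")

# Recommend: neighbours-of-neighbours Neo isn't already connected to.
recs = set()
for interest in edges["Neo"]:
    recs |= edges[interest] - edges["Neo"] - {"Neo"}
print(recs)  # → {'Hiking Camera Backpack'}
```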
Column Oriented

A column-oriented NoSQL database stores data similar to a relational database. However, instead of storing data as rows, it is stored as columns. Column-oriented databases aim to provide faster read speeds by being able to quickly aggregate data for a specific column.
⚫ For example, take a look at the following e-commerce database of products:
⚫ If we wanted to analyze the total sales for all the products, all we would need to do is aggregate data from the sales column.
⚫ This is in contrast to a relational model that would have to pull data from each row. We would also be pulling adjacent data (like size information in the above example) that isn't relevant to our query.
⚫ Amazon's Redshift is a popular option for developers looking to work with a column-oriented database.
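The row-versus-column contrast can be sketched directly (the products and sales figures are invented for illustration):

```python
# Row layout: each record stored together; aggregating sales touches
# every field of every row, including irrelevant ones like size.
rows = [
    {"product": "T-shirt", "size": "M",   "sales": 120},
    {"product": "Hoodie",  "size": "L",   "sales": 80},
    {"product": "Cap",     "size": "One", "sales": 45},
]

# Column layout: each attribute stored contiguously; aggregating one
# column reads only that column's values.
columns = {
    "product": ["T-shirt", "Hoodie", "Cap"],
    "size":    ["M", "L", "One"],
    "sales":   [120, 80, 45],
}

total_sales = sum(columns["sales"])  # touches only the sales column
print(total_sales)  # → 245
```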
NoSQL Business Drivers
⚫ Businesses have found value in rapidly
capturing and analyzing large amounts of
variable data, and making immediate
changes in their businesses based on the
information they receive.

⚫ The figure shows how the demands of volume, velocity, variability, and agility play a key role in the emergence of NoSQL solutions. As each of these drivers applies pressure to the single-processor relational model, its foundation becomes less stable and in time no longer meets the organization's needs.
VOLUME
⚫ Without a doubt, the key factor pushing organizations to look at alternatives to their current RDBMSs is a need to query Big Data using clusters of commodity processors.
⚫ Until around 2005, performance concerns were resolved by purchasing faster processors. In time, however, the ability to increase processing speed was no longer an option: as chip density increased, heat could no longer dissipate fast enough to prevent chips from overheating.
⚫ This phenomenon, known as the Power Wall, forced systems designers to shift their focus from increasing speed on a single chip to using more processors working together.
⚫ The need to scale out (also known as horizontal scaling), rather than scale up (faster processors), moved organizations from serial to parallel processing, where data problems are split into separate paths and sent to separate processors to divide and conquer the work.
Velocity
⚫ While Big Data problems are a consideration for many organizations moving away from RDBMS systems, the ability of a single-processor system to rapidly read and write data is also key.
⚫ Many single-processor RDBMS systems are unable to keep up with the demands of real-time inserts and online queries to the database made by public-facing websites.
⚫ RDBMS systems frequently index many columns of every new row, a process that decreases system performance.
⚫ When single-processor RDBMSs are used as a back end to a web storefront, random bursts in web traffic slow down response for everyone, and tuning these systems can be costly when both high read and write throughput is desired.
Variability
⚫ Companies that want to capture and report on exception data struggle when attempting to use the rigid database schema structures imposed by RDBMS systems.
⚫ For example, if a business unit wants to capture a few custom fields for a particular customer, all customer rows within the database need to store this information even though it doesn't apply to them.
⚫ Adding new columns to an RDBMS requires the system to be shut down and ALTER TABLE commands to be run. When a database is large, this process can impact system availability, losing time and money in the process.
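By contrast, a schemaless document store lets one record gain a custom field without touching the others. A minimal Python sketch (customer records and field names invented):

```python
# Document-style records: no fixed schema shared by all "rows".
customers = {
    "c1": {"name": "Asha", "city": "Mumbai"},
    "c2": {"name": "Ravi", "city": "Pune"},
}

# Only this customer gains the custom field -- no ALTER TABLE, no downtime,
# and no wasted storage on records the field doesn't apply to.
customers["c1"]["loyalty_tier"] = "gold"

print("loyalty_tier" in customers["c2"])  # → False
```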
Agility
⚫ The most complex part of building applications using RDBMSs is the process of putting data into and getting data out of the database.
⚫ If your data has nested and repeated subgroups of data structures, you need to include an object-relational mapping layer. The responsibility of this layer is to generate the correct combination of INSERT, UPDATE, DELETE, and SELECT SQL statements to move object data to and from the RDBMS persistence layer.
⚫ This process is not simple and is associated with the largest barrier to rapid change when developing new or modifying existing applications.
⚫ Generally, object-relational mapping requires experienced software developers who are familiar with object-relational frameworks such as Java Hibernate (or NHibernate for .NET systems).
⚫ Even with experienced staff, small change requests can cause slowdowns in development and testing schedules.
Case Study: How a bank turned challenges into opportunities to serve its customers using NoSQL Database
⚫ Financial services industries are at a crossroads and are experiencing massive changes in response to shifting customer demands. With the increasing adoption of cloud technologies, digital-only enterprises are offering innovative solutions at the lowest cost.
⚫ Customer experience is a strategic
imperative for most organizations
today, but delivering an engaging
experience across the growing
number of digital customer
touchpoints can be challenging,
especially if they have an aging
technology stack.
⚫ Additionally, organizations have to
navigate these transformational changes
while managing vast volumes of digital
transactions, a variety of data, and velocity
without straining their business systems,
experiencing data loss, breaches, and/or
downtime.

⚫ The graphic below shows the IT priorities of financial services institutions; it is no surprise that 25% of them want to modernize their systems.
Some of the bank's challenges:
⚫ Exceeding customer expectations: India has more than 50% of its population below age 25 and more than 65% below age 35. Bank customers are increasingly comparing banking experiences to other areas of their digital lives. These digital natives aren't just looking to check their balances and deposit checks; they are looking for more meaningful online experiences.
⚫ The bank was looking for a system that can provide an engaging and personalized digital customer experience in real time.
⚫ Ability to provide comprehensive services: Provide 'always-on' digital services and delight customers by assisting them through chatbot interactions. Additionally, they want to experiment and deliver new services, such as enhanced payment and blockchain technologies, valued by their customers.
⚫ Provide a customer 360 experience: Customers want a consistent experience, regardless of the business division they are interacting with or the device they use in the process. Delivering an engaging and personalized customer experience with a single customer view and a unified view of all interactions encompassing each touchpoint with the bank is challenging.
⚫ Managing change without disruption: The bank needed agility to launch new services and make their development staff more productive. They want to minimize outages with high availability built into the system.
⚫ Choosing the right data management strategy
A comprehensive data management strategy sets the stage for establishing a deeper understanding of customer experience.
It can offer a single view by collecting all the customer's structured and unstructured data from across the organization and other relevant external sources into one place.
A NoSQL database is an ideal choice. It can store personal and demographic information and customer interactions with the company, including calls, chats, emails, texts, social media responses, product/service activity history, and past and present purchases.
McKinsey's study suggests that data-driven companies tend to be 19X more profitable when they use data as a differentiator, as they tend to acquire 23X more customers and retain 6X more customers.
Why Oracle NoSQL Database
⚫ Support for a flexible data model:
⚫ The bank can localize all data for a given entity – such as a financial asset class or user class – into a single document, rather than spreading it across multiple relational tables.
⚫ Customers can access entire documents in a single database operation, rather than joining separate tables spread across the database.
⚫ As a result of this data localization, application performance is often much higher when using Oracle NoSQL Database, which can be the decisive factor.
⚫ Predictable scalability with always-on availability
⚫ An Oracle NoSQL cluster can be expanded horizontally online without incurring any application downtime, remaining one hundred percent transparent to the application. Oracle NoSQL Database maintains multiple copies of data for high availability purposes.
⚫ Scale-out architecture for business
continuity
⚫ Oracle NoSQL Database supports
active-active architecture with
multi-region tables. A multi-region
architecture is two or more independent,
geographically distributed Oracle NoSQL
Database clusters bridged by
bi-directional replication, ensuring the
customers always have fast access to
services and the latest data.
⚫ Simplify application development with
rich query and APIs
⚫ Oracle NoSQL provides a rich query
language and extensive secondary indexes
giving users fast and flexible access to data
with any query pattern. This can range
from simple key-value lookups to complex
search, traversals, and aggregations across
rich data structures, including embedded
sub-documents and arrays.
High-level architecture of the proposed solution
⚫ Critical components in the architecture include:
Applications Layer:
This layer manages all user input applications, e.g., loan or credit card applications. The applications are based on forms technology, allowing the developers to create adaptive and responsive documents to capture information. The forms have a notion of fragments that allows for pulling out standard segments such as personal details like name and address, family details, income details, etc. The application layer is responsible for doing all the "application plumbing": interacting with the database, enforcing validation at event points, etc. It interacts with the bank's backend system through the API gateway and doesn't store any personal or sensitive information.
Database Layer:
A CRM system is used primarily for lead generation to target customers. Also available in this layer is the ELK stack (Elasticsearch, Logstash, Kibana), which is primarily used to audit the data stored in the NoSQL Database. Oracle NoSQL Database has an out-of-box integration with Elasticsearch. Oracle NoSQL Database also feeds the user drop-off log (incomplete form activity) data to the orchestration framework, primarily used for retargeting the users.
Marketing Layer:
This layer hosts various servers that drive the business decision process. It comprises servers and tools used for customer segmentation (identifying groups of individuals who are similar in attitudes, demographic profile, etc.) and customer journey analysis (the sum of all customer experiences with the bank). Additionally, it handles personalization (showing the product or service a customer would be interested in buying) and retargeting (persuading potential customers to reconsider the bank's products and services after they left or dropped off from the app), based on the drop-off campaign data coming out of the Oracle NoSQL Database.
Banking experience re-imagined
⚫ A typical user's journey, e.g., loan processing, starts with a user interacting with the bank's loan processing applications via the web, a mobile device, email, or even a branch. The application is served off the forms in the application layer. At this stage, the user fills in details and submits the scanned supporting documents.
⚫ These scanned forms are classified, information is extracted, and the data is sent to the NoSQL Database store. The data is sent to the processing system that triggers the underwriting process.
⚫ Depending on the underwriting process results, an application will be approved, denied, or sent back to the user for additional information. If the application is approved, the loan amount is deposited into the user's account.
⚫ Suppose the user drops off at any point while filling in the form. In that case, this drop-off information is stored in the NoSQL Database and feeds into the orchestration system to kick-start the retargeting campaign that allows the bank to re-engage the user.
⚫ The process is repeated with specific ads, emails, or WhatsApp messages retargeting the customers. In the event the customer returns, they can start the journey where they left off.

⚫ In conclusion, one of India's leading private banks modernized and expedited its digital presence and provided an enhanced experience for its customers using Oracle NoSQL Database.
NoSQL solution for big data
1. Queries should be moved to the data rather than moving data to queries:
⚫ When a general query must be run against all nodes holding information, it is more efficient to send the query to every node than to move a huge set of data to a central processor.
⚫ This basic rule helps explain why NoSQL databases show dramatic performance benefits over systems that were not designed to distribute queries to nodes.
⚫ The entire record is kept inside a node in document form, which means just the query and its result need to move over the network, keeping big data queries fast.
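The rule above can be sketched as scatter-gather: the small query predicate travels to each node's data, runs locally, and only matching results cross the network (the node contents and fields here are invented):

```python
# Three "nodes", each holding its own shard of documents locally.
nodes = [
    [{"id": 1, "country": "IN"}, {"id": 2, "country": "US"}],
    [{"id": 3, "country": "IN"}],
    [{"id": 4, "country": "UK"}, {"id": 5, "country": "IN"}],
]

def run_on_node(docs, predicate):
    # Executed on the node that owns the data: only matching
    # results (not the raw shard) are sent back over the network.
    return [d for d in docs if predicate(d)]

query = lambda d: d["country"] == "IN"          # the query moves to the data
results = [d for node in nodes for d in run_on_node(node, query)]
print([d["id"] for d in results])  # → [1, 3, 5]
```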
2. Hash rings should be used for even distribution of data:
⚫ Figuring out a reliable approach to assigning a record to a processing node is perhaps the most difficult issue with distributed databases.
⚫ With the help of a randomly generated 40-character key, the hash ring method evenly distributes a large amount of data across numerous servers, which is a good way to distribute network load uniformly.
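A common way to implement a hash ring is consistent hashing; the sketch below is an assumption about that technique (node names and keys are invented). Each node sits on the ring at the hash of its name, and a key belongs to the first node clockwise from the key's own hash.

```python
import hashlib
from bisect import bisect_right

def ring_hash(s):
    # SHA-1 yields the 40-hex-character key the text mentions.
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

nodes = ["node-a", "node-b", "node-c"]
ring = sorted((ring_hash(n), n) for n in nodes)   # nodes placed on the ring
positions = [pos for pos, _ in ring]

def owner(key):
    # First node at or after the key's position, wrapping around the ring.
    idx = bisect_right(positions, ring_hash(key)) % len(ring)
    return ring[idx][1]

for k in ("user:1", "user:2", "user:3"):
    print(k, "->", owner(k))
```

A useful property of this scheme is that adding or removing one node only moves the keys adjacent to it on the ring, rather than reshuffling everything.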
3. For scaling read requests, replication should be used:
⚫ Databases use replication to make backup copies of data in real time. Read requests can be scaled horizontally with the help of replication, and this strategy works admirably in most cases.
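A toy sketch of read scaling via replicas follows. The synchronous copying and round-robin read routing are simplifications for illustration; real systems typically replicate asynchronously, which is where eventual consistency comes from.

```python
from itertools import cycle

# One primary plus three replicas; reads round-robin across replicas
# so read throughput scales with the number of copies.
primary = {"balance:42": 100}
replicas = [dict(primary) for _ in range(3)]
reader = cycle(range(len(replicas)))

def write(key, value):
    primary[key] = value
    for r in replicas:          # replication (synchronous here for simplicity)
        r[key] = value

def read(key):
    # Each read is served by the next replica in turn.
    return replicas[next(reader)].get(key)

write("balance:42", 150)
print(read("balance:42"))  # → 150, served by a replica
```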
4. Distribution of queries to nodes should be done by the database:
⚫ Separating the evaluation of a query from its execution is important for getting higher performance from queries traversing numerous nodes. The NoSQL database moves the query to the data instead of moving the data to the query.
Understanding the types of big
data problems
⚫ Storage
⚫ With vast amounts of data generated daily, the
greatest challenge is storage (especially when the
data is in different formats) within legacy
systems. Unstructured data cannot be stored in
traditional databases.
⚫ Processing
⚫ Processing big data refers to the reading,
transforming, extraction, and formatting of useful
information from raw information. The input and
output of information in unified formats continue
to present difficulties.
⚫ Security
⚫ Security is a big concern for organizations.
Non-encrypted information is at risk of theft or
damage by cyber-criminals. Therefore, data security
professionals must balance access to data against
maintaining strict security protocols.

⚫ Finding and Fixing Data Quality Issues
⚫ When dealing with data, the utmost importance is its accuracy. After all, every insight you glean from data will depend on the data itself. It all begins during the data collection phase. At this time, you want to be sure that you're collecting data from the right sources at the right time if you're going to apply the data for outputs.
⚫ Long Response Times from the System
⚫ Clean and accurate data is just as important as data being accessible when you need it. If you're using a data tool that's slow, then by the time your data is available for use, it could be considered outdated and old.
⚫ Confusion with Big Data Tool Selection
⚫ To overcome this challenge, it's best to take time performing research and not jump too quickly into a specific tool.
⚫ Real-Time Big Data Problems
⚫ Data is constantly changing and evolving, which impacts the insights you glean from it.
⚫ Technically, this requires a tool that can provide up-to-date filtering and remove redundant or irrelevant data from the picture when you're applying it.
⚫ Lack of Understanding
⚫ Companies can leverage data to boost performance in many areas.
⚫ Some of the best use cases for data are to decrease expenses, create innovation, launch new products, grow the bottom line, and increase efficiency, to name a few.
⚫ Despite the benefits, companies have been slow to adopt data technology or put a plan in place for how to create a data-centric culture.
⚫ High Cost of Data Solutions
⚫ After understanding how your business will benefit most from implementing data solutions, you're likely to find that buying and maintaining the necessary components can be expensive.
⚫ Along with hardware like servers and storage, and software, there also comes the cost of human resources and time.
⚫ Complex Systems for Managing Data
⚫ Moving from a legacy data management system and integrating a new solution is a challenge in itself.
⚫ Furthermore, with data coming from multiple sources, and IT teams creating their own data while managing data, systems can become complex quickly.
⚫ Sharing and Accessing Data:
⚫ Perhaps the most frequent challenge in big data efforts is the inaccessibility of data sets from external sources.
⚫ Sharing data can cause substantial challenges.
⚫ These include the need for inter- and intra-institutional legal documents.
⚫ Accessing data from public repositories leads to multiple difficulties.
⚫ It is necessary for the data to be available in an accurate, complete, and timely manner.
⚫ Big Data Skills
⚫ Running Big Data tools requires expertise that is possessed by data scientists, data engineers, and data analysts.
⚫ They have the skills to handle Big Data challenges and come up with valuable insights for the company they work in. The problem is not the demand but the lack of such skills, which, in turn, becomes a challenge.
Analyzing big data with a shared-nothing architecture
⚫ Parallel database systems have great advantages for online transaction processing and decision support applications. Parallel processing divides a large task into multiple tasks, and each task is performed concurrently on several nodes. This allows a large task to complete more quickly.
⚫ Architectural Models
⚫ There are several architectural models for parallel machines, which are given below:
Shared memory
Shared disk
Shared nothing
Hierarchical
⚫ Shared nothing architecture − In this architecture, each node has its own mass storage as well as main memory. The processor at one node may communicate with a processor at another node via a high-speed interconnection network. Each node functions as the server for the data on the disk or disks it owns, as each processor has its own copy of the OS, DBMS, and data.
⚫ Examples − Teradata, Gamma, Bubba.
⚫ It requires careful partitioning of the data across multiple disk nodes. Furthermore, the addition of new nodes to the system presumably requires reorganizing and repartitioning the database to deal with load balancing issues.
⚫ Finally, fault tolerance is more difficult than with shared-disk, seeing as a failed node will make its data on disk unavailable, thus requiring data replication. It is due to its scalability advantage that shared-nothing was first adopted for OLAP workloads, in particular data warehousing, as it is easier to parallelize read-only queries.
Advantages
⚫ These architectures are more scalable and easily support a large number of processors.
⚫ It overcomes the disadvantage of requiring all I/O to go through a single intercommunication network.
⚫ It provides linear speed-up and linear scale-up; that is, the time taken for operations decreases in proportion to the increase in the number of CPUs and disks.
Disadvantages
⚫ CPU-to-CPU communication is very slow.
⚫ The costs of communication and non-local disk access are higher than in shared-memory or shared-disk architectures because sending data involves software interaction on both sides.
⚫ Shared nothing architecture is difficult to load balance.
Choosing distribution models: master-slave versus peer-to-peer
⚫ What Is a Distributed System?
⚫ A distributed system consists of multiple components, possibly across geographical boundaries, that communicate and coordinate their actions through message passing.
⚫ To an actor outside this system, it appears as if it is a single coherent system.
⚫ Decentralized systems are distributed systems where no specific component owns the decision making.
⚫ While every component owns its part of the decision, none of them has complete information. Hence, the outcome of any decision depends upon some sort of consensus between all components.
⚫ In parallel computing, we use multiple processors on a single machine to perform multiple tasks simultaneously, possibly with shared memory. However, in distributed computing, we use multiple autonomous machines with no shared memory, communicating by message passing.
Distributed System Architecture
Master-slave:
⚫ In this model, one node of the distributed system plays the role of master. Here, the master node has complete information about the system and controls the decision making. The rest of the nodes act as slaves and perform tasks assigned to them by the master. Further, for fault tolerance, the master node can have redundant standbys.
Peer-to-peer:
⚫ There is no single master designated amongst the nodes in a distributed system in this model. All the nodes equally share the responsibility of the master.
⚫ Hence, we also know this as the multi-master or master-less model. At the cost of increased complexity and communication overhead, this model provides better system resiliency.
⚫ While both these architectures have their
own pros and cons, it’s unnecessary to
choose only one. Many of the distributed
systems actually create an architecture
that combines elements of both models.
⚫ A peer-to-peer model can provide data
distribution, while a master-slave model
can provide data replication in the same
architecture.