IDS Unit3
UNIT-III
NoSQL movement for handling Big Data: distributing data storage and processing
with the Hadoop framework, case study on risk assessment for loan sanctioning, ACID principle
of relational databases, CAP theorem, BASE principle of NoSQL databases, types of NoSQL
databases, case study on disease diagnosis and profiling
1. HDFS:
HDFS (Hadoop Distributed File System) is a distributed storage system designed for
large-scale data processing. It divides files into blocks (128 MB by default in Hadoop 2.x
and later, often configured to 256 MB) and distributes them across multiple nodes in a
cluster. Each block is replicated (typically three times) to ensure data redundancy and fault
tolerance. If a node fails, the blocks it held are still available from the replicas stored on
other nodes.
Example: Suppose a 500 MB file, File1.txt, is stored in HDFS with a block size of 128 MB.
It would be divided into 4 blocks: Block 1, Block 2 and Block 3 of 128 MB each, and
Block 4 holding the remaining 116 MB.
Each block would be replicated three times and stored on different nodes for fault
tolerance.
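The block arithmetic in the example above can be checked with a short Python sketch. The function names, the node names, and the round-robin placement are assumptions made for illustration; real HDFS uses a rack-aware placement policy that is not modelled here.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the size in MB of each block a file would be split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds whatever is left
    return sizes


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block + 1] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement


blocks = split_into_blocks(500)
print(blocks)  # [128, 128, 128, 116]

# Hypothetical four-node cluster.
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
for block, holders in placement.items():
    print("Block", block, "->", holders)
```

With a 256 MB block size the same file would occupy only two blocks, which is why the block size has to be stated before counting blocks.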
2. MapReduce:
Distributed Processing: Hadoop processes data using the MapReduce model, which
breaks the task into two phases:
1. Map Phase: The input dataset is divided into smaller chunks, known as splits, which
are processed in parallel across multiple nodes.
The mapper function processes each split to produce key-value pairs. In a word count,
for instance, it emits (word, 1) for each word.
Example: The word "hadoop" appears once in File1 and once in File2. After mapping, you
would have key-value pairs like:
(hadoop, 1)
(hadoop, 1)
2. Reduce Phase: After the Map phase, the Reduce phase aggregates the processed
data. The reducer takes these key-value pairs and sums the counts of each word:
Example: For the word "hadoop", the output would be:
(hadoop, 2)
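The two phases above can be imitated on a single machine with a minimal Python sketch. The file contents and the function names map_phase and reduce_phase are assumptions for illustration; a real Hadoop job would run many mappers and reducers on different nodes.

```python
from collections import defaultdict


def map_phase(text):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in text.split()]


def reduce_phase(pairs):
    """Sum the counts of all pairs that share the same key."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)


# Hypothetical contents of File1 and File2.
file1 = "hadoop stores data"
file2 = "hadoop processes data"

pairs = map_phase(file1) + map_phase(file2)
print(pairs[0], pairs[3])  # ('hadoop', 1) ('hadoop', 1)

counts = reduce_phase(pairs)
print(counts["hadoop"])  # 2
```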
JobTracker and TaskTracker: In older versions of Hadoop (1.x), the JobTracker assigns tasks
to nodes running TaskTrackers and monitors their execution. In newer versions (Hadoop 2.x
onward), YARN takes over: the ResourceManager and the NodeManagers handle resource
management.
Figure: A simplified example of a MapReduce flow for counting the colors in input texts
As the name suggests, the process roughly boils down to two big phases:
■ Mapping phase—The documents are split up into key-value pairs. Until we reduce, we can
have many duplicates.
■ Reduce phase—It’s not unlike a SQL “group by.” The different unique occurrences are
grouped together, and depending on the reducing function, a different result can be created.
Here we wanted a count per color, so that’s what the reduce function returns. In reality it’s a
bit more complicated than this, though.
Figure: An example of a MapReduce flow for counting the colors in input texts
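One of the complications is the intermediate step that groups pairs by key before reducing, which is where the similarity to a SQL "group by" comes from. The sketch below makes that step explicit for the color count. The input texts are invented for illustration, and sorting in memory stands in for the shuffle-and-sort that Hadoop performs across nodes.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input texts.
documents = ["green red blue", "blue red", "green green"]

# Map: every color occurrence becomes a (color, 1) pair, duplicates included.
mapped = [(color, 1) for doc in documents for color in doc.split()]

# Shuffle and sort: bring pairs with the same key next to each other.
shuffled = sorted(mapped, key=itemgetter(0))

# Reduce: one result per unique key, like GROUP BY color with COUNT(*).
reduced = {
    color: sum(count for _, count in group)
    for color, group in groupby(shuffled, key=itemgetter(0))
}
print(reduced)  # {'blue': 2, 'green': 3, 'red': 2}
```

Swapping the sum for another reducing function (for example max, or collecting values into a list) would produce a different result from the same grouped pairs.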
Introduction to NoSQL
NoSQL databases are designed to manage large-scale data across distributed systems and
provide more flexible ways to model data, depending on the use case. Unlike relational
databases that strictly follow predefined structures, NoSQL databases allow the data model to
fit the needs of the application.
In order to understand NoSQL, we first need to explore the core principles of relational
databases, known as ACID, and how NoSQL rewrites these principles into BASE to better
suit distributed environments. Additionally, we’ll look at the CAP theorem, which explains
the challenges of distributing databases across multiple nodes and how ACID and BASE
handle these challenges differently.
Figure: Sharding. Each shard can function as a self-sufficient database, but the shards also
work together as a whole. The example represents two nodes, each containing four shards: two
main shards and two replicas. If one node fails, the other node backs it up.
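The layout in the figure can be sketched in Python as follows. The node names, the shard numbering, and the CRC32-based routing are assumptions for illustration, not the scheme of any particular database.

```python
import zlib

NUM_SHARDS = 4

# Each node holds two main (primary) shards and replicas of the other two.
layout = {
    "node_a": {"primary": {0, 1}, "replica": {2, 3}},
    "node_b": {"primary": {2, 3}, "replica": {0, 1}},
}


def shard_for(key):
    """Map a key to a shard with a hash that is stable between runs."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


def node_for(key, alive):
    """Pick a live node for the key, preferring the main copy of its shard."""
    shard = shard_for(key)
    for role in ("primary", "replica"):
        for node, shards in layout.items():
            if node in alive and shard in shards[role]:
                return node, role
    raise RuntimeError("no live copy of shard %d" % shard)


key = "customer:42"  # hypothetical key
print(node_for(key, alive={"node_a", "node_b"}))  # served by a main shard
print(node_for(key, alive={"node_b"}))            # node_a is down, node_b answers
```

Because every shard exists on both nodes, either node alone can still answer for all four shards.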
2. Soft State:
o The state of the database may change over time, even without new input, due to the
eventual consistency model. This means the system doesn't guarantee immediate
consistency after every transaction.
o Example: Data in one node might say "X" and another node might say "Y"
temporarily, but this will be resolved later when the nodes synchronize their data.
3. Eventual Consistency:
o The database will become consistent over time, but it might allow for temporary
inconsistencies. Eventually, after all updates are synchronized, every node will hold
the same data.
o Example: If two customers purchase the last item in stock at the same time, the
database may show inconsistent results for a short period, but it will eventually
reconcile the conflict and decide who gets the item.
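A minimal Python sketch of soft state and eventual consistency follows. The Replica class, the logical timestamps, and the last-write-wins rule are assumptions chosen for illustration; real systems use a variety of reconciliation strategies, and this sketch does not handle two writes with equal timestamps.

```python
class Replica:
    """One node's copy of the data: key -> (timestamp, value)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, timestamp):
        # Last write wins: keep the value with the newest timestamp.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def synchronize(a, b):
    """Exchange entries in both directions so the replicas converge."""
    for key in set(a.data) | set(b.data):
        for src, dst in ((a, b), (b, a)):
            if key in src.data:
                timestamp, value = src.data[key]
                dst.write(key, value, timestamp)


node1, node2 = Replica("node1"), Replica("node2")
node1.write("last_item_buyer", "customer_x", timestamp=1)
node2.write("last_item_buyer", "customer_y", timestamp=2)

# Soft state: the nodes temporarily disagree.
print(node1.read("last_item_buyer"), node2.read("last_item_buyer"))

synchronize(node1, node2)

# Eventual consistency: after synchronizing, both nodes hold the same value.
print(node1.read("last_item_buyer"), node2.read("last_item_buyer"))
```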
ACID versus BASE
The BASE principles are somewhat contrived to fit acid and base from chemistry: an acid
is a fluid with a low pH value. A base is the opposite and has a high pH value. We won’t
go into the chemistry details here, but the figure shows a mnemonic for those familiar with
the chemistry equivalents of acid and base.
Chandrika Surya
Asst. Professor
Dept. of AIML
Graph-based databases
1. Document-Based Database:
The document-based database is a nonrelational database. Instead of storing data in
rows and columns (tables), it stores data as documents, typically in JSON, BSON, or XML
format.
Documents can be stored and retrieved in a form that is much closer to the data objects used in
applications, which means less translation is required to use the data in an application. In
a document database, particular elements can be indexed so that they can be queried faster.
Collections are groups of documents with similar contents. The documents in a collection
are not required to share the same schema, because document databases have a flexible
schema.
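The following Python sketch illustrates a collection of JSON-like documents with a flexible schema and a simple index. The patient records, the field names, and the find helper are invented for illustration and do not correspond to the API of any particular document database.

```python
import json

# A collection: similar documents that need not share the same fields.
patients = [
    {"_id": 1, "name": "Asha", "age": 42, "symptoms": ["fever", "cough"]},
    {"_id": 2, "name": "Ravi", "allergies": ["penicillin"],
     "address": {"city": "Hyderabad"}},
]

# An index on _id lets a document be fetched without scanning the collection.
by_id = {doc["_id"]: doc for doc in patients}


def find(collection, **criteria):
    """Return the documents whose fields match every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


# The stored form is close to the object the application works with.
print(json.dumps(by_id[2], indent=2))
print([doc["name"] for doc in find(patients, age=42)])  # ['Asha']
```

The second document has no age or symptoms field at all, which a relational table would only allow through nullable columns or a schema change.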
2. Key-Value Stores:
A key-value store is like a relational database table with only two columns: the key and
the value.
Key features of the key-value store:
Simplicity.
Scalability.
Speed.
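A key-value store can be sketched in Python as a thin wrapper around a dictionary. The class and method names below are assumptions for illustration, not the interface of a specific product.

```python
class KeyValueStore:
    """A toy in-memory key-value store: every value is reached by its key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)  # deleting a missing key is not an error


store = KeyValueStore()
store.put("session:101", {"user": "rohan", "cart": ["item-7"]})
print(store.get("session:101"))
store.delete("session:101")
print(store.get("session:101"))  # None
```

The store never inspects the value, which is what keeps the model simple and lookups fast: all access goes through the key.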
3. Column Oriented Databases (Or) Column Family Data stores (Or) Wide column
data stores
Columnar databases store data by column rather than by row, so queries that touch only a
few columns can read data more efficiently and return results faster. They are used to store
large amounts of data.
Key features of column-oriented databases:
Scalability.
Compression.
Very responsive.
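The difference between row and column layout, and why columns compress well, can be sketched in Python. The sample rows and the use of run-length encoding are assumptions for illustration; actual column stores use a range of encodings.

```python
from itertools import groupby

# Row-oriented layout: one record per row.
rows = [
    {"id": 1, "city": "Pune", "amount": 100},
    {"id": 2, "city": "Pune", "amount": 250},
    {"id": 3, "city": "Delhi", "amount": 75},
]

# Column-oriented layout: all values of one column are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one column reads only that column.
total = sum(columns["amount"])
print(total)  # 425


def run_length_encode(values):
    """Compress runs of repeated values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


print(run_length_encode(columns["city"]))  # [('Pune', 2), ('Delhi', 1)]
```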
4. Graph-Based databases:
Graph-based databases focus on the relationships between elements. They store the
data in the form of nodes, and the connections between the nodes are
called links or relationships.
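Nodes and relationships can be sketched in Python with a dictionary and a list of triples. The node names, relationship types, and the neighbors helper are invented for illustration; graph databases provide their own query languages for this kind of traversal.

```python
# Nodes carry properties; relationships connect two nodes and have a type.
nodes = {
    "alice": {"label": "Person"},
    "bob": {"label": "Person"},
    "acme": {"label": "Company"},
}
relationships = [
    ("alice", "FRIEND_OF", "bob"),
    ("alice", "WORKS_AT", "acme"),
    ("bob", "WORKS_AT", "acme"),
]


def neighbors(node, rel_type=None):
    """Follow outgoing relationships from a node, optionally filtered by type."""
    return [
        target
        for source, rel, target in relationships
        if source == node and (rel_type is None or rel == rel_type)
    ]


print(neighbors("alice"))              # ['bob', 'acme']
print(neighbors("alice", "WORKS_AT"))  # ['acme']
```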
Borrower: A potential customer, Mr. Rohan Sharma, a 32-year-old software engineer, seeking a
personal loan of ₹500,000 (approximately $6,000 USD) to purchase a car.