Unit 1
Next Generation Technologies
MODULE-1: Big Data, NoSQL, Introducing MongoDB
This is to certify that the e-book titled "Big Data, NoSQL, Introducing MongoDB" comprises all
elementary learning tools for a better understanding of the relevant concepts. This e-book is comprehensively
compiled as per the predefined eight parameters and guidelines.
Signature
Ms. Seema Bhatkar
Assistant Professor Date: 13-06-2019
Department of IT
DISCLAIMER: The information contained in this e-book is compiled and distributed for educational purposes
only. This e-book has been designed to help learners understand relevant concepts with a more dynamic
interface. The compiler of this e-book and Vidyalankar Institute of Technology give full and due credit to the
authors of the contents, developers and all websites from wherever information has been sourced. We
acknowledge our gratitude towards the websites YouTube, Wikipedia, and Google search engine. No
commercial benefits are being drawn from this project.
Unit I Big Data, NoSQL, Introducing MongoDB
Contents :
Big Data: Getting Started, Big Data, Facts About Big Data, Big Data Sources, Three Vs of Big Data,
Volume, Variety, Velocity, Usage of Big Data, Visibility, Discover and Analyze Information,
Segmentation and Customizations, Aiding Decision Making, Innovation, Big Data Challenges, Policies
and Procedures, Access to Data, Technology and Techniques, Legacy Systems and Big Data, Structure
of Big Data, Data Storage, Data Processing, Big Data Technologies
NoSQL: SQL, NoSQL, Definition, A Brief History of NoSQL, ACID vs. BASE, CAP Theorem (Brewer's
Theorem), The BASE, NoSQL Advantages and Disadvantages, Advantages of NoSQL, Disadvantages
of NoSQL, SQL vs. NoSQL Databases, Categories of NoSQL Databases
Introducing MongoDB: History, MongoDB Design Philosophy, Speed, Scalability, and Agility, Non-
Relational Approach, JSON-Based Document Store, Performance vs. Features, Running the Database
Anywhere, SQL Comparison
Recommended Books :
1. Practical MongoDB by Shakuntala Gupta Edward and Navin Sabharwal, published by Apress
2. Beginning jQuery by Jack Franklin and Russ Ferguson second edition published by Apress
3. Next Generation Databases by Guy Harrison published by Apress
4. Beginning JSON by Ben Smith published by Apress
Prerequisites/linkages
Unit: Big Data, NoSQL, Introducing MongoDB
Pre-requisites (drawn from Sem. I–VI): WP, Python, DBMS, CJ, Project
Chapter 1
Big Data
Big data is data that has high volume, is generated at high velocity, and comes in multiple
varieties. Let's look at a few facts and figures about big data.
Source:- https://www.youtube.com/watch?v=eVSfJhssXUA
Three Vs of Big Data
1. Volume
Volume in big data means the size of the data. As businesses become more transaction-
oriented, the ever-increasing number of transactions generates huge amounts of data. This
huge volume of data is the biggest challenge for big data technologies. The storage and
processing power needed to store, process, and make the data accessible in a timely and cost-
effective manner is massive.
2. Variety
The data generated from various devices and sources follows no fixed format or structure.
Unlike traditional text, CSV, or RDBMS data, it ranges from text files, log files, streaming
videos, photos, meter readings, stock ticker data, PDFs, and audio to various other
unstructured formats.
New sources and structures of data are being created at a rapid pace. So the onus is on
technology to find a solution to analyze and visualize the huge variety of data that is out
there. As an example, to provide alternate routes for commuters, a traffic analysis application
needs data feeds from millions of smartphones and sensors to provide accurate analytics on
traffic conditions and alternate routes.
3. Velocity
Velocity in big data is the speed at which data is created and the speed at which it must be
processed. If data cannot be processed at the required speed, it loses its significance.
With data streaming in from social media sites, sensors, tickers, metering, and monitoring,
it is important for organizations to process data speedily, both while it is in motion and
while it is at rest.
5. Innovation
Big data enables the innovation of new ideas in the form of products and services. It enables
innovation in existing ones in order to reach out to large segments of people. Using data
gathered from actual products, manufacturers can not only innovate to create the next-
generation product, but they can also innovate their sales offerings.
As an example, real-time data from machines and vehicles can be analyzed to provide insight
into maintenance schedules; wear and tear on machines can be monitored to make more
resilient machines; fuel consumption monitoring can lead to higher efficiency engines. Real-
time traffic information is already making life easier for commuters by providing them
options to take alternate routes.
2. Data Storage
Legacy systems use big servers and NAS and SAN systems to store the data. As the data
increases, the server size and the backend storage size have to be increased. Traditional legacy
systems typically work in a scale-up model, where more and more compute, memory, and
storage must be added to a single server to meet the increased data needs. Hence the processing
time increases steeply, which defeats the other important requirement of big data,
which is velocity.
3. Data Processing
The algorithms in legacy systems are designed to work with structured data such as strings
and integers. They are also limited by the size of the data. Thus, legacy systems are not capable
of processing unstructured data, huge volumes of such data, or the speed at which the
processing needs to be performed.
As a result, to capture value from big data, we need to deploy newer technologies in the field
of storing, computing, and retrieving, and we need new techniques for analyzing the data.
Big Data Technologies
The recent technology advancements that enable organizations to make the most of their big
data are the following:
1. New storage and processing technologies designed specifically for large
unstructured data
2. Parallel processing
3. Clustering
4. Large grid environments
5. High connectivity and high throughput
6. Cloud computing and scale-out architectures
Chapter 2
NoSQL
SQL
SQL databases guarantee the ACID properties for every transaction:
• Atomic implies either all changes of a transaction are applied completely or not applied at
all.
• Consistent means the data is in a consistent state after the transaction is applied. This
means after a transaction is committed, the queries fetching a particular data will see the
same result.
• Isolated means the transactions that are applied to the same set of data are independent of
each other. Thus, one transaction will not interfere with another transaction.
• Durable means the changes are permanent in the system and will not be lost in case of any
failures.
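The atomicity property can be demonstrated with Python's built-in sqlite3 module. This is only an illustrative sketch; the accounts table and its values are invented for the example. A transfer is attempted in which the second statement deliberately fails, so the whole transaction is rolled back and no partial change survives.

```python
import sqlite3

# In-memory database with a simple, invented accounts table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

# Transfer 30 from alice to bob; the second UPDATE is deliberately broken,
# so the whole transaction must be rolled back (atomicity: all or nothing).
try:
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    conn.execute("UPDATE no_such_table SET balance = balance + 30")  # fails
    conn.commit()
except sqlite3.OperationalError:
    conn.rollback()

# Neither change was applied: alice still has her original balance.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 50}
```

Without the rollback, the database would be left in an inconsistent state where money had been debited but never credited, which is exactly what atomicity forbids.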
NoSQL
NoSQL is a term used to refer to non-relational databases. Thus, it encompasses the majority of
data stores that are not based on conventional RDBMS principles and are used for
handling large data sets on an Internet scale.
NoSQL is an umbrella term for data stores that don’t follow the RDBMS principles.
The CAP theorem states that at any point in time a distributed system can fulfil only two of the
three guarantees of consistency, availability, and partition tolerance.
The BASE
Eric Brewer coined the BASE acronym. BASE can be explained as follows:
Basically Available means the system will be available in terms of the CAP theorem.
Soft state indicates that even if no input is provided to the system, the state will change over
time. This is in accordance with eventual consistency.
Eventual consistency means the system will attain consistency in the long run, provided no
input is sent to the system during that time.
You have seen that NoSQL databases are eventually consistent, but the eventual consistency
implementation may vary across different NoSQL databases.
NRW is the notation used to describe how the eventual consistency model is implemented across
NoSQL databases, where
N is the number of data copies the database maintains,
R is the number of copies that a read operation consults, and
W is the number of copies that a write operation must update before it is marked successful.
Using these NRW configurations, the databases implement the model of eventual consistency.
Write Operations
N=W implies that the write operation will update all data copies before returning the control
to the client and marking the write operation as successful. This is similar to how the
traditional RDBMS databases work when implementing synchronous replication. This setting
will slow down the write performance.
If write performance is a concern, meaning you want writes to happen fast, you can set
W=1, R=N. This implies that a write will update just one copy and be marked successful,
but whenever a user issues a read request, all copies are read to return the result. If any
copy is outdated, it is brought up to date first, and only then does the read succeed. This
implementation slows down read performance.
Hence most NoSQL implementations use N>W>1. This implies that more than one node
needs to be updated successfully; however, not all nodes need to be updated at the same time.
Read Operations
If R is set to 1, the read operation will read any one data copy, which may be outdated.
If R>1, more than one copy is read, and the most recent value among them is returned.
However, this can slow down the read operation.
Using N<W+R always ensures that a read operation retrieves the latest value. This is because
the number of copies written plus the number of copies read is greater than the total number of
copies, so the read set always overlaps the write set in at least one copy holding the latest version.
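The overlap guarantee can be sketched with a toy replica set in Python. This is a model of the idea only, not any real database's implementation; the values and the choice of which copies are written or read are invented for illustration. With N=3, W=2, R=2 we have W+R > N, so even a read that samples the copies written last must overlap the write set in one fresh copy.

```python
# Toy replica set: N copies, each holding a (value, version) pair.
N, W, R = 3, 2, 2                      # W + R > N guarantees overlap

replicas = [("old", 0)] * N

def write(value, version):
    # A write updates only W of the N copies before returning success.
    for i in range(W):
        replicas[i] = (value, version)

def read():
    # A read consults R copies -- here deliberately the ones least likely
    # to be fresh -- and returns the value with the highest version.
    sampled = replicas[N - R:]
    return max(sampled, key=lambda pair: pair[1])[0]

write("new", 1)
print(read())  # new
```

Even though one replica still holds the stale ("old", 0) pair, the read quorum of 2 copies necessarily includes one of the 2 freshly written copies, so the latest value wins by version number.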
Advantages of NoSQL
1. High scalability : RDBMSs scale up by moving to bigger and bigger servers, an approach that
fails when transaction rates and fast response requirements increase. In contrast to this, the
new generation of NoSQL databases is designed to scale out (i.e., to expand horizontally
using low-end commodity servers).
2. Manageability and administration : NoSQL databases are designed to mostly work with
automated repairs, distributed data, and simpler data models, leading to lower manageability
and administration overhead.
3. Low cost : NoSQL databases are typically designed to work with a cluster of cheap
commodity servers, enabling the users to store and process more data at a low cost.
4. Flexible data models : NoSQL databases have a very flexible data model, enabling them to
work with any type of data; they don’t comply with the rigid RDBMS data models. As a
result, any application changes that involve updating the database schema can be easily
implemented.
Disadvantages of NoSQL
1. Maturity: Most NoSQL databases are pre-production versions with key features still to
be implemented. Thus, when deciding on a NoSQL database, you should analyze the
product properly to ensure its features are fully implemented and not still on the to-do list.
2. Support: Support is one limitation that you need to consider. Most NoSQL databases come
from start-ups and are open source. As a result, support is minimal compared to that of
enterprise software companies and may not have global reach or support resources.
3. Limited Query Capabilities: Since NoSQL databases are generally developed to meet the
scaling requirements of web-scale applications, they provide limited querying
capabilities. Even a simple querying requirement may involve significant programming expertise.
5. Expertise : Since NoSQL is an evolving area, expertise on the technology is limited in the
developer and administrator community.
SQL vs. NoSQL Databases

Types
SQL: All types support the SQL standard.
NoSQL: Multiple types exist, such as document stores, key-value stores, column databases, etc.

Development History
SQL: Developed in the 1970s.
NoSQL: Developed in the 2000s.

Examples
SQL: SQL Server, Oracle, MySQL.
NoSQL: MongoDB, HBase, Cassandra.

Data Storage Model
SQL: Data is stored in rows and columns in a table, where each column is of a specific type. The tables are generally created on principles of normalization. Joins are used to retrieve data from multiple tables.
NoSQL: The data model depends on the database type; for example, data is stored as key-value pairs in key-value stores, while in document-based databases it is stored as documents. The data model is flexible, in contrast to the rigid table model of the RDBMS.

Schemas
SQL: Fixed structure and schema, so any change to the schema involves altering the database.
NoSQL: Dynamic schema; new data types or structures can be accommodated by expanding or altering the current schema. New fields can be added dynamically.

Scalability
SQL: A scale-up approach is used; as the load increases, bigger, more expensive servers are bought to accommodate the data.
NoSQL: A scale-out approach is used; the data load is distributed across inexpensive commodity servers.

Supports Transactions
SQL: Supports ACID and transactions.
NoSQL: Supports partitioning and availability, and compromises on transactions. Transactions exist at certain levels, such as the database level or the document level.

Consistency
SQL: Strong consistency.
NoSQL: Depends on the product. A few choose to provide strong consistency, whereas others provide eventual consistency.

Support
SQL: A high level of enterprise support is provided.
NoSQL: Open source model; support comes through third parties or companies building on the open source products.

Maturity
SQL: Has been around for a long time.
NoSQL: Some products are mature; others are evolving.

Querying Capabilities
SQL: Available through easy-to-use GUI interfaces.
NoSQL: Querying may require programming expertise and knowledge. Rather than a UI, the focus is on functionality and programming interfaces.

Expertise
SQL: A large community of developers has been leveraging the SQL language and RDBMS concepts to architect and develop applications.
NoSQL: A small community of developers works on these open source tools.
Chapter 3
Introducing MongoDB
History
In the later part of 2007, Dwight Merriman, Eliot Horowitz, and their team decided to
develop an online service. The intent of the service was to provide a platform for developing,
hosting, and auto-scaling web applications, much in line with products such as the Google App
Engine or Microsoft Azure. Soon they realized that no open source database platform suited the
requirements of the service.
A year later, the database for the service was ready to use. The service itself was never
released, but the team decided in 2009 to open source the database as MongoDB. In March
2010, the release of MongoDB 1.4.0 was considered production-ready. The latest production
release at the time of writing is 3.0, released in March 2015. MongoDB was built under the
sponsorship of 10gen, a New York–based startup.
example:
{
  "Name": "ABC",
  "Phone": ["1111111", "222222"],
  "Fax": ""
}
As mentioned, keys and values come in pairs. The value of a key in a document can be left
blank. In the above example, the document has three keys, namely "Name", "Phone", and "Fax".
The "Fax" key has no value.
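Because MongoDB documents are JSON-based, such a document can be built and inspected with Python's standard json module. This is a minimal sketch reusing the example's field names; a key with no value is represented here by an empty string, since JSON itself requires every key to carry some value.

```python
import json

# The example document: "Fax" is present but carries no value.
text = '{"Name": "ABC", "Phone": ["1111111", "222222"], "Fax": ""}'
doc = json.loads(text)

print(doc["Name"])        # ABC
print(len(doc["Phone"]))  # 2
print(doc["Fax"] == "")   # True
```

Note that "Phone" holds an array as its value, something a single relational column cannot do; this multi-value capability is revisited in the SQL comparison below.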
4. Performance vs. Features
To make MongoDB high-performance and fast, certain features commonly
available in RDBMS systems are not available in MongoDB. MongoDB is a document-oriented
DBMS where data is stored as documents. It does not support JOINs, and it does not have fully
generalized transactions. However, it does provide support for secondary indexes, it enables
users to query using query documents, and it provides support for atomic updates at the per-
document level. It provides replica sets, a form of master-slave replication with
automated failover, and it has built-in horizontal scaling.
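The idea of querying with query documents can be illustrated with a simplified matcher in Python. This is only a sketch of the concept, not MongoDB's query engine: the real engine supports comparison operators, nested fields, and indexes, whereas this toy version (with invented sample data) handles only exact equality on top-level fields.

```python
# A query document lists field/value pairs a document must match exactly.
def matches(doc, query):
    return all(doc.get(field) == value for field, value in query.items())

users = [
    {"Name": "ABC", "City": "Mumbai"},
    {"Name": "XYZ", "City": "Pune"},
]

# Equivalent in spirit to db.users.find({"City": "Mumbai"}) in the mongo shell.
result = [d for d in users if matches(d, {"City": "Mumbai"})]
print(result)  # [{'Name': 'ABC', 'City': 'Mumbai'}]
```

The key point is that the query itself is just another document, so the same data format serves for both storage and querying.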
SQL Comparison
The following are the ways in which MongoDB is different from SQL.
1. MongoDB uses documents for storing its data, which offer a flexible schema (documents in the
same collection can have different fields). This enables users to store nested or multi-
value fields such as arrays, hashes, etc. In contrast, RDBMS systems offer a fixed schema
where a column's values must all be of the same data type. Also, it's not possible to store arrays
or nested values in a cell.
2. MongoDB doesn't provide support for JOIN operations like SQL does. However, it enables the
user to store all relevant data together in a single document, which avoids the need for
JOINs at the periphery; this embedding of related data is its workaround for the lack of JOINs.
3. MongoDB doesn’t provide support for transactions in the same way as SQL. However, it
guarantees atomicity at the document level. Also, it uses an isolation operator to isolate write
operations that affect multiple documents, but it does not provide “all-or-nothing” atomicity
for multi-document write operations.
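The embedding workaround from point 2 can be sketched in Python with an invented order example: what a relational design splits across two joined tables lives in one document, so a single lookup (and, in MongoDB, a single atomic update) touches all the related data.

```python
# Relational style: two tables linked by order_id, combined via a JOIN.
orders = [{"order_id": 1, "customer": "ABC"}]
items = [
    {"order_id": 1, "product": "pen", "qty": 2},
    {"order_id": 1, "product": "book", "qty": 1},
]

# Document style: the related rows are embedded in a single document,
# so no JOIN is needed and the whole order can be updated atomically.
order_doc = {
    "order_id": 1,
    "customer": "ABC",
    "items": [
        {"product": "pen", "qty": 2},
        {"product": "book", "qty": 1},
    ],
}

# Retrieving an order with its items is a single lookup, not a JOIN.
print(len(order_doc["items"]))  # 2
```

The trade-off is that data embedded this way is harder to share across documents, which is why document modeling favors data that is read and written together.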
Source:- https://www.youtube.com/watch?v=EE8ZTQxa0AM
Questions
1. What is Big Data? What are the different sources of Big Data? (Nov-2018)
2. Explain the three Vs of Big Data. (Nov-2018)
3. Compare ACID vs. BASE. (Nov-2018)
4. With the help of a neat diagram, explain the CAP theorem. (Nov-2018)
5. What are the advantages and disadvantages of NoSQL databases? (Nov-2018)
6. What are the different categories of NoSQL database? Explain each with an example.
(Nov-2018)
7. What are the different challenges that Big Data poses? (May-2019)
8. Differentiate between SQL and NoSQL Databases. (May-2019)
9. What is MongoDB Design Philosophy? Explain. (May-2019)
10. Write a short note on Non-Relational Approach. (May-2019)
11. Discuss the various applications of Big Data. (May-2019)
5. MongoDB is a _________ database that provides high performance, high availability, and
easy scalability.
a) Graph
b) key value
c) Document
d) all of the mentioned
9. NoSQL databases are used mainly for handling large volumes of ______________ data.
a) Unstructured
b) Structured
c) semi-structured
d) all of the mentioned
10. MongoDB uses a ____________ lock that allows concurrent read access to a database but
exclusive write access to a single write operation.
a) Readers
b) readers-writer
c) writer
d) none of the mentioned