BD Notes 5
• Hadoop is a framework that manages big data storage by means of parallel and distributed
processing.
• Hadoop comprises various tools and frameworks dedicated to different aspects of data
management, such as storing, processing, and analyzing data.
• The Hadoop ecosystem covers Hadoop itself and various other related big data tools.
• Hadoop is a programming framework used in the world of big data to solve significant
big data challenges such as storing and processing.
• The Hadoop ecosystem consists of tools and frameworks that can integrate with
Hadoop. There are a lot of tools that come under the Hadoop ecosystem and each of
them has its own functionalities.
• HDFS -> Distributed storage
• MapReduce and YARN -> Distributed processing and resource management
• Apache Spark -> In-memory data processing
• Sqoop and Flume -> Data collection and ingestion
• Hive and Pig -> Query-based processing
• HBase and MongoDB -> NoSQL databases
• Mahout and Spark MLlib -> Machine learning algorithms
• Solr and Lucene -> Searching and indexing
• Zookeeper -> Cluster coordination
• Oozie -> Job scheduling
Apache Spark:
It is a platform that handles processing-intensive tasks such as batch processing,
interactive or iterative real-time processing, graph processing, and visualization.
It performs computations in memory, and hence is faster than MapReduce in terms of
optimization.
Spark is best suited for real-time data whereas Hadoop is best suited for structured
data and batch processing; hence, the two are used together in most companies.
• Apache Spark's streaming APIs allow for real-time data ingestion, while Hadoop
(HDFS and MapReduce) stores and batch-processes data within the same architecture.
• Spark can then be used to perform real-time stream processing or batch processing
on the data stored in Hadoop
• Apache Spark is a framework for real-time data analytics in a distributed
computing environment.
• Spark is written in Scala and was originally developed at the University of
California, Berkeley.
• It executes in-memory computations to increase the speed of data processing over
MapReduce.
• It is up to 100x faster than Hadoop for large-scale data processing, by exploiting
in-memory computation and other optimizations. Consequently, it requires more
memory and processing power than MapReduce.
• Spark comes packed with high-level libraries, including support for R, SQL,
Python, Scala, Java, etc.
• These standard libraries make seamless integration into complex workflows easier.
• On top of this, Spark also allows various services such as MLlib, GraphX,
SQL + DataFrames, and Streaming to integrate with it, which increases its capabilities.
• When we combine Apache Spark's strengths, i.e. high processing speed, advanced
analytics and multiple integration support, with Hadoop's low-cost operation on
commodity hardware, it gives the best results.
• That is the reason why Spark and Hadoop are used together by many companies for
processing and analyzing their Big Data stored in HDFS.
• Spark stores intermediate processing data in memory.
• Supports multiple languages − Spark provides built-in APIs in Java, Scala and Python,
so you can write applications in different languages.
• Spark provides over 80 high-level operators for interactive querying (a minimal
sketch follows).
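Example (Python, illustrative): a minimal PySpark sketch of these high-level operators. It assumes a local Spark installation; the application name, sample data and column names (user, category, amount) are invented for this sketch and do not come from the notes.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Entry point for the DataFrame API (local mode, just for the sketch).
spark = SparkSession.builder.appName("operators-demo").master("local[*]").getOrCreate()

# A tiny DataFrame standing in for real data.
df = spark.createDataFrame(
    [("alice", "books", 12.0), ("bob", "books", 7.5), ("alice", "music", 3.0)],
    ["user", "category", "amount"],
)

# A few of Spark's high-level operators chained together.
result = (df.filter(F.col("amount") > 5)
            .groupBy("category")
            .agg(F.sum("amount").alias("total")))
result.show()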
Features of Spark:
• In-memory processing
• Tight integration of components
• Easy and inexpensive
• A powerful processing engine that makes it fast
• Spark Streaming provides a high-level library for stream processing
Introduction
Apache Spark has many features which make it a great choice as a big data processing engine. Many of
these features establish the advantages of Apache Spark over other Big Data processing engines. Let us
look into details of some of the main features which distinguish it from its competition.
Fault tolerance
Dynamic In Nature
Lazy Evaluation
Speed
Reusability
Advanced Analytics
In Memory Computing
Cost efficient
1. Fault Tolerance: Apache Spark is designed to handle worker node failures. It achieves this fault
tolerance by using DAG and RDD (Resilient Distributed Datasets). DAG contains the lineage of all the
transformations and actions needed to complete a task. So in the event of a worker node failure, the
same results can be achieved by rerunning the steps from the existing DAG.
2. Dynamic nature: Spark offers over 80 high-level operators that make it easy to build parallel apps.
3. Lazy Evaluation: Spark does not evaluate any transformation immediately. All
transformations are lazily evaluated: they are added to the DAG, and the final
computation or result is produced only when an action is called. This gives Spark the
ability to make optimization decisions, because all the transformations are visible to
the Spark engine before any action is performed (see the sketch below).
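A short sketch of lazy evaluation, reusing the SparkSession (spark) assumed in the earlier sketch; the numbers are arbitrary.

rdd = spark.sparkContext.parallelize(range(1_000_000))

# Transformations only extend the lineage/DAG; nothing executes yet.
evens = rdd.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

# Only this action triggers execution, after Spark has seen the whole plan.
print(squares.reduce(lambda a, b: a + b))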
4. Real Time Stream Processing: Spark Streaming brings Apache Spark's language-integrated API to
stream processing, letting you write streaming jobs the same way you write batch jobs
(illustrated in the sketch below).
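A hedged Structured Streaming sketch of the same idea: the streaming word count below is written with the ordinary DataFrame API. It reuses the assumed spark session; the socket source, host and port are placeholders (real jobs typically read from Kafka or files).

from pyspark.sql import functions as F

lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")   # placeholder source
              .option("port", 9999)
              .load())

# Split each line into words and keep a running count per word.
words = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
               .outputMode("complete")
               .format("console")
               .start())
query.awaitTermination()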
5. Speed: Spark enables applications running on Hadoop to run up to 100x faster in memory and up to
10x faster on disk. Spark achieves this by minimizing disk read/write operations for intermediate
results. It stores in memory and performs disk operations only when essential. Spark achieves this
using DAG, query optimizer and highly optimized physical execution engine.
6. Reusability: Spark code can be reused for batch processing, for joining streaming data against
historical data, and for running ad-hoc queries on streaming state.
7. Advanced Analytics: Apache Spark has rapidly become the de facto standard for big data processing
and data science across multiple industries. Spark provides both machine learning and graph
processing libraries, which companies across sectors leverage to tackle complex problems, and all of
this is easily done using the power of Spark and highly scalable clustered computers. Databricks
provides a commercial, managed platform built around Apache Spark.
8. In-Memory Computing: Unlike Hadoop MapReduce, Apache Spark is capable of processing tasks in
memory and is not required to write intermediate results back to disk. This gives Spark massive
processing speed. Over and above this, Spark can also cache intermediate results so that they can be
reused in the next iteration. This gives Spark an added performance boost for iterative and repetitive
processes, where results from one step can be used later or a common dataset is queried repeatedly.
9. Supporting Multiple Languages: Spark comes with built-in multi-language support. Most of its
APIs are available in Java, Scala, Python and R, and there are additional features available with R
for data analytics. Spark also comes with Spark SQL, which offers SQL-like functionality, so SQL
developers find it very easy to use and the learning curve is greatly reduced (a short example follows).
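A small Spark SQL sketch of this SQL-like interface. It reuses the hypothetical df from the first sketch; the view name "purchases" and the query are illustrative only.

df.createOrReplaceTempView("purchases")

# The same aggregation as before, expressed in SQL instead of DataFrame operators.
spark.sql("""
    SELECT category, SUM(amount) AS total
    FROM purchases
    GROUP BY category
""").show()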
10. Integrated with Hadoop: Apache Spark integrates very well with the Hadoop file system (HDFS)
and supports multiple file formats such as Parquet, JSON, CSV, ORC and Avro. Existing Hadoop
clusters can be easily leveraged to run Spark, for example through YARN.
11. Cost Efficient: Apache Spark is open source software, so it carries no licensing fee; users only
have to worry about hardware cost. Spark also reduces many other costs, as stream processing, ML
and graph processing come built in. Spark does not lock you in with any vendor, which makes it very
easy for organizations to pick and choose the Spark components they need.
Conclusion
After looking at the features above, it can easily be said that Apache Spark is the most advanced and
popular Apache product for Big Data processing. It has different modules for machine learning,
streaming, graph processing and SQL.
Advantage of Spark:
1. Perfect for interactive processing, iterative processing and event stream processing
2. Flexible and powerful
3. Supports for sophisticated analytics
4. Executes batch processing jobs faster than MapReduce
5. Run on Hadoop alongside other tools in the Hadoop ecosystem
Disadvantage of Spark:
1. Consumes a lot of memory
2. Issues with small files
3. Fewer algorithms in MLlib
4. Higher latency compared to Apache Flink
Apache Spark is a lightning-fast unified analytics engine for big data and machine
learning, and one of the largest open-source projects in data processing. Since its release,
it has met enterprises' expectations for querying, data processing and generating analytics
reports in a better and faster way. Internet giants like Yahoo, Netflix, and eBay have used
Spark at large scale. Apache Spark is considered the future of the Big Data platform.
Apache Spark has huge potential to contribute to the big data-related business in the
industry. Let’s now have a look at some of the common benefits of Apache Spark:
1. Speed
2. Ease of Use
3. Advanced Analytics
4. Dynamic in Nature
5. Multilingual
6. Apache Spark is powerful
7. Increased access to Big data
8. Demand for Spark Developers
9. Open-source community
1. Speed:
When it comes to Big Data, processing speed always matters, and Apache Spark is hugely
popular with data scientists because of its speed. Spark can be up to 100x faster than Hadoop
for large-scale data processing, because Spark performs in-memory (RAM) computation
whereas Hadoop MapReduce writes intermediate data to disk. Spark can handle multiple
petabytes of data on clusters of more than 8,000 nodes at a time.
2. Ease of Use:
Apache Spark carries easy-to-use APIs for operating on large datasets. It offers over 80
high-level operators that make it easy to build parallel apps.
3. Advanced Analytics:
Spark does not only support map and reduce. It also supports machine learning (ML),
graph algorithms, streaming data, SQL queries, etc. (a small MLlib sketch follows).
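A tiny MLlib sketch, just to show the shape of the API; it reuses the assumed spark session, and the feature vectors and labels are made-up toy values.

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

train = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.1]), 0.0),
     (Vectors.dense([2.0, 1.0]), 1.0),
     (Vectors.dense([2.0, 1.3]), 1.0)],
    ["features", "label"],
)

# Fit a simple classifier and score the same toy data.
model = LogisticRegression(maxIter=10).fit(train)
model.transform(train).select("features", "prediction").show()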
4. Dynamic in Nature:
With Apache Spark, you can easily develop parallel applications. Spark offers you over 80
high-level operators.
5. Multilingual:
Apache Spark supports many languages for code writing such as Python, Java, Scala, etc.
6. Apache Spark is powerful:
Apache Spark can handle many analytics challenges because of its low-latency in-memory
data processing capability. It has well-built libraries for graph analytics algorithms and
machine learning.
7. Increased access to Big data:
Apache Spark is opening up various opportunities for big data work and making big data
skills more accessible. For example, IBM has announced that it will educate more than
1 million data engineers and data scientists on Apache Spark.
8. Demand for Spark Developers:
Apache Spark not only benefits your organization but you as well. Spark developers are so
in demand that companies offer attractive benefits and flexible work timings just to hire
experts skilled in Apache Spark. As per PayScale, the average salary for a Data Engineer
with Apache Spark skills is $100,362. People who want to make a career in big data
technology can learn Apache Spark. There are various ways to bridge the skills gap for
getting data-related jobs, but the best way is to take formal training that provides
hands-on work experience and lets you learn through hands-on projects.
9. Open-source community:
The best thing about Apache Spark is, it has a massive Open-source community behind it.
Apache Spark is a lightning-fast cluster computing technology designed for fast
computation, and it is widely used across industries. But on the other side, it also has
some ugly aspects. Here are some challenges related to Apache Spark that developers face
when working on Big Data with Apache Spark.
Let’s read out the following limitations of Apache Spark in detail so that you can make an
informed decision whether this platform will be the right choice for your upcoming big
data project.
1. No automatic code optimization:
In the case of Apache Spark, you need to optimize the code manually since it doesn't have
any automatic code optimization process. This turns into a disadvantage when most other
technologies and platforms are moving towards automation.
2. No file management system:
Apache Spark doesn't come with its own file management system. It depends on other
platforms like Hadoop or other cloud-based platforms.
3. Fewer Algorithms:
Apache Spark's machine learning library, MLlib, offers fewer algorithms than some
alternatives; it lags behind in terms of the number of available algorithms.
4. Small file issue:
One more complaint about Apache Spark is the issue with small files. Developers come
across small-file problems when using Apache Spark along with Hadoop, because the Hadoop
Distributed File System (HDFS) is designed to handle a limited number of large files
rather than a large number of small files.
5. Window Criteria:
Data in Apache Spark Streaming is divided into small batches of a predefined time interval,
so Spark does not support record-based window criteria; rather, it offers time-based
window criteria.
6. Not fit for multi-user environments:
Apache Spark does not fit well in a multi-user environment; it is not capable of handling
high user concurrency.
Conclusion
To sum up, in light of the good, the bad and the ugly, Spark is a compelling tool when
viewed from the outside. We have seen a drastic improvement in performance and a decrease
in failures across various projects executed in Spark. Many applications are being moved
to Spark for the efficiency it offers to developers. Using Apache Spark can give any
business a boost and help foster its growth. It is sure that you will also have a bright
future!
Need for RDD (Resilient Distributed Datasets):
RDDs were motivated by workloads that reuse data across computations, for example:
• Iterative algorithms.
• Interactive data mining tools.
• DSM (Distributed Shared Memory) is a very general abstraction, but this generality
makes it harder to implement in an efficient and fault-tolerant manner on commodity
clusters. Here the need for RDDs comes into the picture.
• In a conventional distributed computing system, intermediate data is stored in a stable
distributed store such as HDFS or Amazon S3, which makes data reuse across jobs slow
(a caching sketch follows this list).
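A caching sketch for this data-reuse point, again assuming the spark session from the earlier sketches; the data and the loop are purely illustrative.

# Keep the dataset in memory so repeated passes do not re-read stable storage.
data = spark.sparkContext.parallelize(range(100_000)).cache()

for step in range(3):
    # Each pass launches a new job, but `data` is served from memory.
    multiples = data.filter(lambda x: x % (step + 2) == 0).count()
    print(step, multiples)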
APACHE ZOOKEEPER
• Apache Zookeeper is an open-source distributed coordination service that helps to
manage a large set of hosts.
• Management and coordination in a distributed environment is tricky.
• Zookeeper automates this process and allows developers to focus on building software
features rather than worrying about its distributed nature.
• Zookeeper helps you maintain configuration information, naming, and group services
for distributed applications.
• It implements different protocols on the cluster so that applications do not have to
implement them on their own.
• Apache Zookeeper is the coordinator of any Hadoop job which includes a combination
of various services in a Hadoop Ecosystem.
• Before Zookeeper, it was very difficult and time consuming to coordinate between
different services in the Hadoop ecosystem.
• The services earlier had many problems with interactions, such as sharing common
configuration while synchronizing data.
• Even when the services are configured, changes in the configuration of a service make
coordination complex and difficult to handle.
• Grouping and naming were also time-consuming.
• Due to the above problems, Zookeeper was introduced.
• It saves a lot of time by performing synchronization, configuration maintenance,
grouping and naming.
• Although it’s a simple service, it can be used to build powerful solutions.
• Zookeeper in Hadoop can be considered a centralized repository where distributed
applications can put data into and retrieve data from.
• It makes a distributed system work together as a whole using its synchronization,
serialization, and coordination goals.
• For clarity, Zookeeper can be thought of as a file system where we have nodes (znodes)
that store data, instead of files or directories that store data (a minimal client
sketch follows this list).
• Zookeeper is a Hadoop Admin tool used to manage jobs in a cluster.
• For example, Apache Storm, which Twitter uses to store machine state data, has
Apache Zookeeper as a coordinator between machines.
• A deadlock occurs when two or more machines try to access each other's resources at
the same time: neither system releases the resource it holds, but each waits for the
other to release its resource. Synchronization in Zookeeper helps resolve such
deadlocks.
• Zookeeper handles this with atomicity, meaning either the entire process completes,
or nothing is left after a failure.
• So Zookeeper is an important part of Hadoop that takes care of these small but
important matters so that the developer can focus more on the application’s
functionality.
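A minimal client sketch of the "file system of data nodes" idea, using the third-party kazoo Python library; the connection string, znode paths, payload and lock identifier are all assumptions made for illustration.

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")   # assumed ZooKeeper ensemble address
zk.start()

# Store a small piece of configuration in a znode, then read it back.
zk.ensure_path("/app/config")
zk.set("/app/config", b"batch.size=128")
value, stat = zk.get("/app/config")
print(value.decode(), stat.version)

# A distributed lock recipe: only one client at a time enters the block,
# which is how ZooKeeper-style synchronization avoids conflicting access.
lock = zk.Lock("/app/locks/resource-1", "worker-1")
with lock:
    pass  # critical section

zk.stop()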
APACHE SQOOP
• Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data
between Apache Hadoop and structured datastores such as relational databases.
• Sqoop can import structured data from an RDBMS or enterprise data warehouse into
HDFS, and export it back again.
• When we submit a Sqoop command, the main task gets divided into subtasks, each
handled internally by an individual Map Task. A Map Task is the subtask that imports
part of the data into the Hadoop ecosystem; collectively, all Map Tasks import the
whole data set. Export works in a similar manner (see the command sketch after this list).
• When we submit an export job, it is mapped into Map Tasks that bring chunks of data
from HDFS. These chunks are exported to a structured data destination, and combining
all the exported chunks gives the whole data set at the destination, which in most
cases is an RDBMS (MySQL/Oracle/SQL Server).
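Sqoop is normally driven from the shell; the sketch below simply assembles a typical sqoop import command from Python and runs it. The JDBC URL, credentials file, table name, HDFS path and mapper count are all hypothetical.

import subprocess

cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",   # assumed source database
    "--username", "etl_user",
    "--password-file", "/user/etl/.password",   # avoids a plaintext password
    "--table", "orders",                        # hypothetical table
    "--target-dir", "/data/raw/orders",         # HDFS destination directory
    "--num-mappers", "4",                       # four parallel Map Tasks
]
subprocess.run(cmd, check=True)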
Apache SQOOP
Sqoop is a tool used to perform data transfer operations between relational database
management systems and Hadoop.
Sqoop Driver :
• Basically, a Sqoop "driver" simply refers to a JDBC driver. JDBC is a standard Java API
for accessing relational databases and some data warehouses.
Sqoop Connectors :
• Although there is a standard prescribing how the language should look, every database
has its own dialect of SQL.
• A dialect is a particular variant of the language.
• With Sqoop connectors, Sqoop can overcome the differences in SQL dialects supported by
various databases, while also providing optimized data transfer. To be more specific, a
connector is a pluggable piece (pluggable means included as part of the runtime).
• We use connectors to fetch metadata about the transferred data (columns, associated data
types, …).
• The basic connector shipped with Sqoop is the Generic JDBC Connector. As the name
suggests, it uses only the JDBC interface for accessing metadata and transferring data.
• SQL is a very general query processing language, so it can be used both for importing
data into and exporting data out of the database server.
APACHE PIG
To perform a particular task, programmers using Pig need to write a Pig script in the
Pig Latin language and execute it using one of the execution mechanisms (Grunt shell,
UDFs, Embedded). After execution, these scripts go through a series of transformations
applied by the Pig framework to produce the desired output.
Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus
makes the programmer's job easy. The major components of Apache Pig are described below.
Apache Pig Components
As shown in the figure, there are various components in the Apache Pig framework. Let us
take a look at the major components.
Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script, does type
checking, and other miscellaneous checks. The output of the parser will be a DAG (directed
acyclic graph), which represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes and the data flows
are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out logical
optimizations such as projection pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in a sorted order, where they are
executed to produce the desired results.
The data model of Pig Latin is fully nested, and it allows complex non-atomic data types
such as map and tuple. The main constructs of Pig Latin's data model are described below.
Atom
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is
stored as a string and can be used as a string or a number. int, long, float, double,
chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic
value is known as a field.
Example − ‘raja’ or ‘30’
Tuple
A record that is formed by an ordered set of fields is known as a tuple, the fields can be of
any type. A tuple is similar to a row in a table of RDBMS.
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no
guarantee that tuples are processed in any particular order). A rough illustration of
these constructs follows.
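A rough Python analogue of these constructs, purely for intuition; Pig itself is scripted in Pig Latin, not Python, and the sample values merely extend the 'raja'/'30' example above.

atom = "raja"                          # Atom: a single value / field
row = ("raja", 30)                     # Tuple: an ordered set of fields (like a row)
bag = {("raja", 30), ("rani", 25)}     # Bag/relation: an unordered collection of tuples
pig_map = {"name": "raja", "age": 30}  # Map: key-value pairs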
APACHE HIVE:
• Apache Hive is a data warehouse system for Hadoop that runs SQL like queries called
HQL (Hive query language) which gets internally converted to MapReduce jobs. Hive
was developed by Facebook. It supports Data definition Language, Data Manipulation
Language and user defined functions.
• Basically, HIVE is a data warehousing component which performs reading, writing
and managing large data sets in a distributed environment using SQL-like interface.
HIVE + SQL = HQL
• The query language of Hive is called Hive Query Language (HQL), and it is very
similar to SQL.
• Hive has two basic components: the Hive command line and the JDBC/ODBC driver.
• The Hive command line interface is used to execute HQL commands.
• Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) are used to
establish connections to the data storage.
• Hive is also highly scalable, as it can serve both purposes: large data set processing
(batch query processing) and real-time processing (interactive query processing).
• It supports all primitive data types of SQL.
• You can use predefined functions, or write tailored user-defined functions (UDFs), to
accomplish your specific needs (a short HQL-from-Python sketch follows).
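A sketch of running HQL from Python through HiveServer2 (the Thrift service behind the JDBC/ODBC path described above), using the third-party PyHive package. The host, port, database and the sales table are assumptions.

from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# HQL looks like SQL; Hive turns it into MapReduce (or Tez/Spark) jobs.
cursor.execute("CREATE TABLE IF NOT EXISTS sales (item STRING, amount DOUBLE)")
cursor.execute("SELECT item, SUM(amount) AS total FROM sales GROUP BY item")
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()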
Features of Hive
• Hive is fast and scalable.
• It provides SQL-like queries (i.e., HQL) that are implicitly transformed to
MapReduce or Spark jobs.
• It is capable of analyzing large datasets stored in HDFS.
• It allows different storage types such as plain text, RCFile, and HBase.
• It uses indexing to accelerate queries.
• It can operate on compressed data stored in the Hadoop ecosystem.
• It supports user-defined functions (UDFs), so users can plug in their own functionality.
Limitations of Hive
• Hive is not capable of handling real-time data.
• It is not designed for online transaction processing.
• Hive queries have high latency.
What is Hive
Hive is a data warehouse infrastructure tool to process structured
data in Hadoop. It resides on top of Hadoop to summarize Big
Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook, later the Apache
Software Foundation took it up and developed it further as an
open source under the name Apache Hive. It is used by different
companies. For example, Amazon uses it in Amazon Elastic
MapReduce.
Hive is not:
• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates
Features of Hive
• It stores schema in a database and processed data into HDFS.
• It is designed for OLAP.
• It provides an SQL-type language for querying called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
Architecture of Hive
The major components of the Hive architecture are:
• HiveQL Process Engine: HiveQL is similar to SQL for querying on schema information in
the Metastore. It is one of the replacements of the traditional approach for MapReduce
programs: instead of writing a MapReduce program in Java, we can write a query for the
MapReduce job and process it.
• Execution Engine: The conjunction of the HiveQL Process Engine and MapReduce is the
Hive Execution Engine. The execution engine processes the query and generates the same
results as MapReduce; it uses the flavor of MapReduce.
• HDFS or HBASE: The Hadoop Distributed File System or HBase is the data storage
technique used to store data in the file system.
Working of Hive
The workflow between Hive and Hadoop proceeds in the following steps:
1. Execute Query: The Hive interface, such as the command line or Web UI, sends the query
to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2. Get Plan: The driver takes the help of the query compiler, which parses the query to
check the syntax and the query plan, or the requirements of the query.
3. Get Metadata: The compiler sends a metadata request to the Metastore (any database).
4. Send Metadata: The Metastore sends the metadata as a response to the compiler.
5. Send Plan: The compiler checks the requirements and resends the plan to the driver.
Up to here, the parsing and compiling of the query is complete.
6. Execute Plan: The driver sends the execute plan to the execution engine.
7. Execute Job: Internally, the execution process is a MapReduce job. The execution engine
sends the job to the JobTracker, which is in the Name node, and it assigns this job to the
TaskTracker, which is in the Data node. Here, the query executes as a MapReduce job.
7.1 Metadata Ops: Meanwhile, during execution, the execution engine can perform metadata
operations with the Metastore.
8. Fetch Result: The execution engine receives the results from the Data nodes.
9. Send Results: The execution engine sends those resultant values to the driver.
10. Send Results: The driver sends the results to the Hive interfaces.
Introduction to NoSQL
NoSQL ("not only SQL") databases store and retrieve data using models other than the
relational tables of a traditional RDBMS. However, NoSQL databases may not be suitable
for all applications, as they may not provide the same level of data consistency and
transactional guarantees as traditional relational databases. It is important to carefully
evaluate the specific needs of an application when choosing a database management system.
NoSQL databases are used in real-time web applications and big data, and their use is
increasing over time.
Key Features of NoSQL:
1. Dynamic schema: NoSQL databases do not have a fixed schema and
can accommodate changing data structures without the need for
migrations or schema alterations.
2. Horizontal scalability: NoSQL databases are designed to scale out
by adding more nodes to a database cluster, making them well-suited
for handling large amounts of data and high levels of traffic.
3. Document-based: Some NoSQL databases, such as MongoDB, use a
document-based data model, where data is stored in a semi-structured
format, such as JSON or BSON.
4. Key-value-based: Other NoSQL databases, such as Redis, use a key-
value data model, where data is stored as a collection of key-value
pairs.
5. Column-based: Some NoSQL databases, such as Cassandra, use a
column-based data model, where data is organized into columns
instead of rows.
6. Distributed and high availability: NoSQL databases are often
designed to be highly available and to automatically handle node
failures and data replication across multiple nodes in a database
cluster.
7. Flexibility: NoSQL databases allow developers to store and retrieve
data in a flexible and dynamic manner, with support for multiple data
types and changing data structures.
8. Performance: NoSQL databases are optimized for high performance
and can handle a high volume of reads and writes, making them
suitable for big data and real-time applications.
Types of NoSQL database: Types of NoSQL databases and the name of the
database system that falls in that category are:
1. Graph Databases: Examples – Amazon Neptune, Neo4j
2. Key value store: Examples – Memcached, Redis, Coherence
3. Column: Examples – Hbase, Big Table, Accumulo
4. Document-based: Examples – MongoDB, CouchDB, Cloudant
When should NoSQL be used:
1. When a huge amount of data needs to be stored and retrieved.
2. The relationship between the data you store is not that important
3. The data changes over time and is not structured.
4. Support of Constraints and Joins is not required at the database level
5. The data is growing continuously and you need to scale the database
regularly to handle the data.
In conclusion, NoSQL databases offer several benefits over traditional
relational databases, such as scalability, flexibility, and cost-effectiveness.
However, they also have several drawbacks, such as a lack of standardization,
lack of ACID compliance, and lack of support for complex queries. When
choosing a database for a specific application, it is important to weigh the
benefits and drawbacks carefully to determine the best fit.
MongoDB:
MongoDB history
MongoDB was created by Dwight Merriman and Eliot Horowitz, who encountered
development and scalability issues with traditional relational database approaches while
building web applications at DoubleClick, an online advertising company that is now owned
by Google Inc. The name of the database was derived from the word humongous to represent
the idea of supporting large amounts of data.
Merriman and Horowitz helped form 10Gen Inc. in 2007 to commercialize MongoDB and
related software. The company was renamed MongoDB Inc. in 2013 and went public in
October 2017 under the ticker symbol MDB.
The DBMS was released as open source software in 2009 and has been kept updated since.
Organizations like the insurance company MetLife have used MongoDB for customer service
applications, while other websites like Craigslist have used it for archiving data. The CERN
physics lab has used it for data aggregation and discovery. Additionally, The New York
Times has used MongoDB to support a form-building application for photo submissions.
Instead of using tables and rows as in relational databases, as a NoSQL database, the
MongoDB architecture is made up of collections and documents. Documents are made up of
Key-value pairs -- MongoDB's basic unit of data. Collections, the equivalent of SQL tables,
contain document sets. MongoDB offers support for many programming languages, such as
C, C++, C#, Go, Java, Python, Ruby and Swift.
Documents contain the data the user wants to store in the MongoDB database.
Documents are composed of field and value pairs. They are the basic unit of data in
MongoDB. The documents are similar to JavaScript Object Notation (JSON) but use
a variant called Binary JSON (BSON). The benefit of using BSON is that it
accommodates more data types. The fields in these documents are like the columns
in a relational database. Values contained can be a variety of data types, including
other documents, arrays and arrays of documents, according to the MongoDB user
manual. Documents will also incorporate a primary key as a unique identifier. A
document's structure is changed by adding or deleting new or existing fields.
Sets of documents are called collections, which function as the equivalent of
relational database tables. Collections can contain any type of data, but the
restriction is the data in a collection cannot be spread across different databases.
Users of MongoDB can create multiple databases with multiple collections.
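A minimal PyMongo sketch of the document/collection structure described above. The connection string, database (shop), collection (orders) and the documents themselves are hypothetical; note the two documents deliberately do not share the same fields.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local server
db = client["shop"]      # database
orders = db["orders"]    # collection (the analogue of a table)

# Documents are key-value pairs; no fixed schema is required.
orders.insert_one({"user": "alice", "items": ["book"], "total": 12.0})
orders.insert_one({"user": "bob", "total": 7.5, "coupon": "WELCOME"})

for doc in orders.find({"total": {"$gt": 5}}):
    print(doc["_id"], doc["user"])  # _id is the automatically added unique identifier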
The NoSQL DBMS uses a single master architecture for data consistency, with
secondary databases that maintain copies of the primary database. Operations are
automatically replicated to those secondary databases for automatic failover.
Features of MongoDB
Data integration. MongoDB integrates data for applications, including hybrid and multi-
cloud applications.
Replication. A replica set is two or more MongoDB instances used to provide high
availability. Replica sets are made of primary and secondary servers. The primary
MongoDB server performs all the read and write operations, while the secondary replica
keeps a copy of the data. If a primary replica fails, the secondary replica is then used.
Scalability. MongoDB supports vertical and horizontal scaling. Vertical scaling works by
adding more power to an existing machine, while horizontal scaling works by adding
more machines to a user's resources.
Load balancing. MongoDB handles load balancing without the need for a separate,
dedicated load balancer, through either vertical or horizontal scaling.
Document. Data in MongoDB is stored in documents with key-value pairs instead of rows
and columns, which makes the data more flexible when compared to SQL databases.
Advantages of MongoDB
Third-party support. MongoDB supports several storage engines and provides pluggable
storage engine APIs that let third parties develop their own storage engines for
MongoDB.
Aggregation. The DBMS also has built-in aggregation capabilities, which lets users
run MapReduce code directly on the database rather than running MapReduce
on Hadoop. MongoDB also includes its own file system called GridFS, akin to the Hadoop
Distributed File System. The use of the file system is primarily for storing files larger than
BSON's size limit of 16 MB per document. These similarities let MongoDB be used
instead of Hadoop, though the database software does integrate with
Hadoop, Spark and other data processing frameworks.
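As a small illustration of the built-in aggregation mentioned above, the hedged PyMongo sketch below runs an aggregation pipeline; it assumes the hypothetical orders collection from the earlier MongoDB sketch.

# Group the matching documents by user and sum their totals on the server side.
pipeline = [
    {"$match": {"total": {"$gt": 5}}},
    {"$group": {"_id": "$user", "spent": {"$sum": "$total"}}},
]
for row in orders.aggregate(pipeline):
    print(row)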
Disadvantages of MongoDB
Though there are some valuable benefits to MongoDB, there are some downsides to it as
well.
Continuity. With its automatic failover strategy, a user sets up just one master node in a
MongoDB cluster. If the master fails, another node will automatically convert to the new
master. This switch promises continuity, but it isn't instantaneous -- it can take up to a
minute. By comparison, the Cassandra NoSQL database supports multiple master nodes.
If one master goes down, another is standing by, creating a highly available database
infrastructure.
Write limits. MongoDB's single master node also limits how fast data can be written to
the database. Data writes must be recorded on the master, and writing new information
to the database is limited by the capacity of that master node.
Data consistency. MongoDB doesn't provide full referential integrity through the use of
foreign-key constraints, which could affect data consistency.
One of the main differences between MongoDB and RDBMS is that RDBMS is a relational
database while MongoDB is nonrelational. Likewise, while most RDBMS systems use SQL
to manage stored data, MongoDB uses BSON for data storage -- a type of NoSQL database.
While RDBMS uses tables and rows, MongoDB uses documents and collections. In RDBMS
a table -- the equivalent to a MongoDB collection -- stores data as columns and rows.
Likewise, a row in RDBMS is the equivalent of a MongoDB document but stores data as
structured data items in a table. A column denotes sets of data values, which is the equivalent
to a field in MongoDB.
MongoDB platforms
A graphical user interface (GUI) named MongoDB Compass gives users a way to work with
document structure, conduct queries, index data and more. The MongoDB Connector for BI
lets users connect the NoSQL database to their business intelligence tools to visualize data
and create reports using SQL queries.
Following in the footsteps of other NoSQL database providers, MongoDB Inc. launched
a cloud database as a service named MongoDB Atlas in 2016. Atlas runs on AWS, Microsoft
Azure and Google Cloud Platform. Later, MongoDB released a platform named Stitch for
application development on MongoDB Atlas, with plans to extend it to on-premises
databases.
The company also added support for multi-document atomicity, consistency, isolation, and
durability (ACID) transactions as part of MongoDB 4.0 in 2018. Complying with the ACID
properties across multiple documents expands the types of transactional workloads that
MongoDB can handle with guaranteed accuracy and reliability.