
HADOOP ECOSYSTEM

• The Hadoop Ecosystem is neither a programming language nor a service; it is a platform or framework which solves big data problems.
• It can be considered a suite which encompasses a number of services (ingesting, storing, analyzing and maintaining data) inside it.
• It includes Apache projects and various commercial tools and solutions.
• There are four major elements of Hadoop, i.e. HDFS, MapReduce, YARN, and Hadoop Common utilities.
• Most of the other tools or solutions are used to supplement or support these major elements.
• All these tools work collectively to provide services such as ingestion, analysis, storage and maintenance of data.

Difference between Hadoop vs Hadoop Ecosystem

• Hadoop is a framework that manages big data storage by means of parallel and distributed processing.
• The Hadoop ecosystem is comprised of various tools and frameworks that are dedicated to different sections of data management, like storing, processing, and analyzing.
• The Hadoop ecosystem covers Hadoop itself and various other related big data tools.
• Hadoop is a programming framework used in the world of big data to solve significant big data challenges such as storing and processing.
• The Hadoop ecosystem consists of tools and frameworks that can integrate with Hadoop. There are a lot of tools that come under the Hadoop ecosystem, and each of them has its own functionality.

Some of the tools are:

• HDFS
• MapReduce and YARN
• Apache Spark for in-memory data processing
• Sqoop and Flume for data collection and ingestion
• Hive and Pig for query-based processing
• HBase and MongoDB as NoSQL databases
• Mahout and Spark MLlib for machine learning algorithms
• Solr and Lucene for searching and indexing
• Zookeeper for managing the cluster
• Oozie for job scheduling

Apache Spark:

 It is a platform that handles all the process-intensive tasks like batch processing, interactive or iterative real-time processing, graph conversions, visualization, etc.
 It relies on in-memory resources and is thus faster than the prior option (MapReduce) in terms of optimization.
 Spark is best suited for real-time data, whereas Hadoop is best suited for structured data or batch processing; hence both are used together in most companies.

• Apache Spark's streaming APIs allow for real-time data ingestion, while Hadoop can store the data in HDFS and process it with MapReduce within the same architecture.
• Spark can then be used to perform real-time stream processing or batch processing on the data stored in Hadoop.
• Apache Spark is a framework for real-time data analytics in a distributed computing environment.
• Spark is written in Scala and was originally developed at the University of California, Berkeley.
• It executes in-memory computations to increase the speed of data processing over MapReduce.
• It is up to 100x faster than Hadoop for large-scale data processing because it exploits in-memory computations and other optimizations. It therefore requires more memory and processing power than MapReduce.
• Spark comes packed with high-level libraries, including support for R, SQL, Python, Scala, Java, etc.
• These standard libraries enable seamless integration in complex workflows.
• It also allows various sets of services to integrate with it, such as MLlib, GraphX, SQL + DataFrames and Streaming services, to increase its capabilities.
• When we combine Apache Spark's abilities, i.e. high processing speed, advanced analytics and multiple integration support, with Hadoop's low-cost operation on commodity hardware, we get the best results.
• That is the reason why Spark and Hadoop are used together by many companies for processing and analyzing their Big Data stored in HDFS.
• It stores intermediate processing data in memory.
• Supports multiple languages: Spark provides built-in APIs in Java, Scala and Python, so you can write applications in different languages.
• Spark comes with over 80 high-level operators for interactive querying.
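
As a small illustration of the points above, here is a minimal PySpark sketch (assuming PySpark is installed and a local master is used; all names and values are illustrative) that distributes a collection and processes it in memory using the Python API:

    from pyspark.sql import SparkSession

    # Start a local Spark session; "local[*]" uses all local cores (illustrative setup).
    spark = SparkSession.builder.master("local[*]").appName("intro-sketch").getOrCreate()
    sc = spark.sparkContext

    # Distribute a small collection across the workers and process it in memory.
    numbers = sc.parallelize(range(1, 1_000_001))
    total = numbers.map(lambda x: x * x).reduce(lambda a, b: a + b)
    print("sum of squares:", total)

    spark.stop()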

Features of Spark:
• In-memory processing
• Tight integration of components
• Easy and inexpensive
• A powerful processing engine makes it fast
• Spark Streaming provides a high-level library for stream processing

Features of Apache Spark

Introduction
Apache Spark has many features which make it a great choice as a big data processing engine. Many of these features establish the advantages of Apache Spark over other Big Data processing engines. Let us look into the details of some of the main features which distinguish it from its competition.

 Fault tolerance

 Dynamic In Nature

 Lazy Evaluation

 Real-Time Stream Processing

 Speed

 Reusability

 Advanced Analytics

 In Memory Computing

 Supporting Multiple languages

 Integrated with Hadoop

 Cost efficient

1. Fault Tolerance: Apache Spark is designed to handle worker node failures. It achieves this fault tolerance by using the DAG and RDDs (Resilient Distributed Datasets). The DAG contains the lineage of all the transformations and actions needed to complete a task, so in the event of a worker node failure, the same results can be achieved by rerunning the steps from the existing DAG.

2. Dynamic nature: Spark offers over 80 high-level operators that make it easy to build parallel apps.

3. Lazy Evaluation: Spark does not evaluate any transformation immediately. All the transformations are lazily evaluated: they are added to the DAG, and the final computation or results are available only when actions are called. This gives Spark the ability to make optimization decisions, as all the transformations become visible to the Spark engine before performing any action.
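
A minimal PySpark sketch of lazy evaluation (assuming a local PySpark installation; names and data are illustrative): the map and reduceByKey calls only record lineage in the DAG, and nothing executes until the collect action is called.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("lazy-demo").getOrCreate()
    sc = spark.sparkContext

    words = sc.parallelize(["spark", "hadoop", "hive", "spark"])

    # Transformations: only recorded in the DAG, nothing runs yet.
    pairs = words.map(lambda w: (w, 1))
    counts = pairs.reduceByKey(lambda a, b: a + b)

    # Action: triggers execution of the whole optimized plan at once.
    print(counts.collect())   # e.g. [('spark', 2), ('hadoop', 1), ('hive', 1)]

    spark.stop()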
4. Real Time Stream Processing: Spark Streaming brings Apache Spark's language-integrated API to stream processing, letting you write streaming jobs the same way you write batch jobs.

5. Speed: Spark enables applications running on Hadoop to run up to 100x faster in memory and up to 10x faster on disk. Spark achieves this by minimizing disk read/write operations for intermediate results: it stores them in memory and performs disk operations only when essential. It does this using the DAG, a query optimizer and a highly optimized physical execution engine.

6. Reusability: Spark code can be used for batch processing, joining streaming data against historical data, as well as running ad-hoc queries on streaming state.

7. Advanced Analytics: Apache Spark has rapidly become the de facto standard for big data processing and data science across multiple industries. Spark provides both machine learning and graph processing libraries, which companies across sectors leverage to tackle complex problems. And all this is easily done using the power of Spark and highly scalable clustered computers. Databricks provides an Advanced Analytics platform with Spark.

8. In Memory Computing: Unlike Hadoop MapReduce, Apache Spark is capable of processing tasks in memory and is not required to write intermediate results back to disk. This feature gives massive speed to Spark processing. Over and above this, Spark is also capable of caching the intermediate results so that they can be reused in the next iteration. This gives Spark an added performance boost for any iterative and repetitive process, where results from one step can be used later, or where there is a common dataset which can be used across multiple tasks.
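
The caching behaviour described above can be sketched in PySpark as follows (a rough example assuming a local installation; data and names are illustrative): cache() keeps the computed partitions in memory so later actions reuse them instead of recomputing the lineage.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("cache-demo").getOrCreate()
    sc = spark.sparkContext

    data = sc.parallelize(range(1_000_000)).map(lambda x: x % 100)

    # Keep the computed partitions in memory for reuse across actions.
    data.cache()

    print(data.count())             # first action: computes and caches the RDD
    print(data.sum())               # second action: served from the in-memory copy
    print(data.distinct().count())  # reuses the cached parent partitions

    data.unpersist()
    spark.stop()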

9. Supporting Multiple Languages: Spark comes with built-in multi-language support. It has most of its APIs available in Java, Scala, Python and R, and there are advanced features available with the R language for data analytics. Spark also comes with Spark SQL, which offers SQL-like functionality, so SQL developers find it very easy to use and the learning curve is greatly reduced.

10. Integrated with Hadoop: Apache Spark integrates very well with the Hadoop file system HDFS. It offers support for multiple file formats like Parquet, JSON, CSV, ORC, Avro, etc. Hadoop can easily be leveraged with Spark as an input data source or destination.

11. Cost efficient: Apache Spark is open source software, so it does not have any licensing fee associated with it. Users only have to worry about the hardware cost. Apache Spark also reduces a lot of other costs, as it comes with built-in stream processing, ML and graph processing. Spark does not lock you in with any vendor, which makes it very easy for organizations to pick and choose Spark features as per their use case.

Conclusion
After looking at these features, it can easily be said that Apache Spark is the most advanced and popular product from Apache which caters to Big Data processing. It has different modules for Machine Learning, Streaming, and Structured and Unstructured data processing.

There are five main components of Apache Spark:


• Apache Spark Core: It is responsible for functions like scheduling, input and
output operations, task dispatching, etc.
• Spark SQL: This component is used to query and process structured data using SQL-style queries.
• Spark Streaming: This component enables the processing of live data streams.
• Machine Learning Library: The goal of this component is scalability and to make
machine learning more accessible.
• GraphX: This has a set of APIs that are used for facilitating graph analytics tasks.
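
To give a flavour of the Spark Streaming component listed above, here is a hedged Structured Streaming sketch in PySpark (assuming a local installation; the built-in "rate" test source and all option values are illustrative, not a production setup):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("stream-sketch").getOrCreate()

    # The built-in "rate" source generates (timestamp, value) rows for testing.
    stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

    # A simple transformation applied continuously to the live stream.
    doubled = stream.selectExpr("timestamp", "value * 2 AS doubled")

    # Print each micro-batch to the console.
    query = doubled.writeStream.format("console").outputMode("append").start()
    query.awaitTermination(10)   # let it run for about 10 seconds
    query.stop()
    spark.stop()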

Advantage of Spark:
1. Perfect for interactive processing, iterative processing and event stream processing
2. Flexible and powerful
3. Support for sophisticated analytics
4. Executes batch processing jobs faster than MapReduce
5. Runs on Hadoop alongside other tools in the Hadoop ecosystem

Disadvantage of Spark:
1. Consumes a lot of memory
2. Issues with small files
3. Fewer algorithms
4. Higher latency compared to Apache Flink

Apache Spark: The New ‘King’ of Big Data

Apache Spark is a lightning-fast unified analytics engine for big data and machine learning. It is the largest open-source project in data processing. Since its release, it has met enterprise expectations for querying, data processing and, moreover, generating analytics reports in a better and faster way. Internet giants like Yahoo, Netflix and eBay have used Spark at large scale. Apache Spark is considered the future of the Big Data platform.


Pros and Cons of Apache Spark


Advantages                       Disadvantages

Speed                            No automatic optimization process
Ease of Use                      File Management System
Advanced Analytics               Fewer Algorithms
Dynamic in Nature                Small Files Issue
Multilingual                     Window Criteria
Apache Spark is powerful         Doesn't suit a multi-user environment
Increased access to Big data     -
Demand for Spark Developers      -

Apache Spark has transformed the world of Big Data. It is the most active big data tool
reshaping the big data market. This open-source distributed computing platform offers
more powerful advantages than any other proprietary solutions. The diverse advantages of
Apache Spark make it a very attractive big data framework.

Apache Spark has huge potential to contribute to the big data-related business in the
industry. Let’s now have a look at some of the common benefits of Apache Spark:

Benefits of Apache Spark:

1. Speed
2. Ease of Use
3. Advanced Analytics
4. Dynamic in Nature
5. Multilingual
6. Apache Spark is powerful
7. Increased access to Big data
8. Demand for Spark Developers
9. Open-source community
1. Speed:

When it comes to Big Data, processing speed always matters. Apache Spark is wildly popular with data scientists because of its speed. Spark can be up to 100x faster than Hadoop for large-scale data processing. Apache Spark uses an in-memory (RAM) computing system, whereas Hadoop reads and writes data on local disk storage. Spark can handle multiple petabytes of data clustered across more than 8,000 nodes at a time.
2. Ease of Use:

Apache Spark carries easy-to-use APIs for operating on large datasets. It offers over 80
high-level operators that make it easy to build parallel apps.


3. Advanced Analytics:

Spark supports not only 'map' and 'reduce'. It also supports machine learning (ML), graph algorithms, streaming data, SQL queries, etc.

4. Dynamic in Nature:

With Apache Spark, you can easily develop parallel applications. Spark offers you over 80
high-level operators.

5. Multilingual:

Apache Spark supports many languages for code writing such as Python, Java, Scala, etc.

6. Apache Spark is powerful:

Apache Spark can handle many analytics challenges because of its low-latency in-memory
data processing capability. It has well-built libraries for graph analytics algorithms and
machine learning.
7. Increased access to Big data:

Apache Spark is opening up various opportunities for big data. For instance, IBM has announced that it will educate more than 1 million data engineers and data scientists on Apache Spark.

8. Demand for Spark Developers:

Apache Spark not only benefits your organization but you as well. Spark developers are so in demand that companies offer attractive benefits and flexible work timings just to hire experts skilled in Apache Spark. As per PayScale, the average salary for a Data Engineer with Apache Spark skills is $100,362. People who want to make a career in big data technology can learn Apache Spark. You will find various ways to bridge the skills gap for getting data-related jobs, but the best way is to take formal training, which will give you hands-on work experience and let you learn through hands-on projects.

9. Open-source community:

The best thing about Apache Spark is, it has a massive Open-source community behind it.

Apache Spark is Great, but it’s not perfect - How?

Apache Spark is a lightning-fast cluster computing technology designed for fast computation, and it is widely used by industry. But on the other side, it also has some ugly aspects. Here are some challenges related to Apache Spark that developers face when working on Big Data with Apache Spark.

Let’s read out the following limitations of Apache Spark in detail so that you can make an
informed decision whether this platform will be the right choice for your upcoming big
data project.

1. No automatic optimization process


2. File Management System
3. Fewer Algorithms
4. Small Files Issue
5. Window Criteria
6. Doesn’t suit for a multi-user environment
1. No automatic optimization process:

In the case of Apache Spark, you need to optimize the code manually since it doesn’t have
any automatic code optimization process. This will turn into a disadvantage when all the
other technologies and platforms are moving towards automation.

2. File Management System:

Apache Spark doesn’t come with its own file management system. It depends on some
other platforms like Hadoop or other cloud-based platforms.

3. Fewer Algorithms:

There are fewer algorithms present in Apache Spark's machine learning library, Spark MLlib. It lags behind in terms of the number of available algorithms.

4. Small Files Issue:

One more reason to blame Apache Spark is the issue with small files. Developers come across issues with small files when using Apache Spark along with Hadoop. The Hadoop Distributed File System (HDFS) is designed to handle a limited number of large files rather than a large number of small files.

5. Window Criteria:

Data in Apache Spark is divided into small batches of a predefined time interval, so Spark does not support record-based window criteria. Rather, it offers time-based window criteria.

6. Doesn’t suit for a multi-user environment:

Yes, Apache Spark does not fit a multi-user environment well. It is not capable of handling higher user concurrency.

Conclusion

To sum up, in light of the good, the bad and the ugly, Spark looks like a conquering tool when we view it from the outside. We have seen a drastic change in performance and a decrease in failures across various projects executed in Spark. Many applications are being moved to Spark for the efficiency it offers to developers. Using Apache Spark can give any business a boost and help foster its growth. It is sure that you will also have a bright future!

Resilient Distributed Datasets (RDD)


• It is the fundamental data structure of Spark.
• It is an immutable (unchanging) distributed collection of objects.
• Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
• It is a key tool for data computation.
• It acts as an interface for immutable data and enables data to be recomputed in the event of a failure.
• There are two types of operations on RDDs:
o Transformations
o Actions.
Resilient, i.e. fault-tolerant with the help of the RDD lineage graph (DAG), and so able to recompute missing or damaged partitions due to node failures.
Directed acyclic graph (DAG)
• It is a conceptual representation of a series of activities.
• The order of the activities is depicted by a graph, which is visually presented as a set of circles, each one representing an activity, some of which are connected by lines, which represent the flow from one activity to another.
Distributed, since the data resides on multiple nodes.
Dataset
• It represents the records of the data you work with.
• The user can load the dataset externally; it can be a JSON file, CSV file, text file or a database accessed via JDBC, with no specific data structure required.
There are three ways to create RDDs in Spark

• data in stable storage,
• other existing RDDs, and
• parallelizing an already existing collection in the driver program.
One can also operate on Spark RDDs in parallel with a low-level API that offers transformations and actions.
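
These creation paths can be sketched in PySpark as follows (a minimal example assuming a local installation; the HDFS path is a placeholder):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
    sc = spark.sparkContext

    # 1) Parallelize an existing collection in the driver program.
    rdd1 = sc.parallelize([1, 2, 3, 4, 5])

    # 2) Load data from stable storage (path is a placeholder).
    # rdd2 = sc.textFile("hdfs:///data/input.txt")

    # 3) Derive a new RDD from an existing one via a transformation.
    rdd3 = rdd1.filter(lambda x: x % 2 == 0)

    print(rdd3.collect())   # [2, 4]
    spark.stop()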

Necessity of RDD in Spark :

• Iterative algorithms.
• Interactive data mining tools.
• DSM (Distributed Shared Memory) is a very general abstraction, but this generality makes it harder to implement in an efficient and fault-tolerant manner on commodity clusters. Here the need for RDDs comes into the picture.
• In distributed computing systems, intermediate data is stored in a stable distributed store such as HDFS or Amazon S3.

APACHE ZOOKEEPER
• Apache Zookeeper is an open-source distributed coordination service that helps to
manage a large set of hosts.
• Management and coordination in a distributed environment is tricky.
• Zookeeper automates this process and allows developers to focus on building software features rather than worrying about its distributed nature.
• Zookeeper helps you maintain configuration information, naming and group services for distributed applications.
• It implements different protocols on the cluster so that applications do not have to implement them on their own.
• Apache Zookeeper is the coordinator of any Hadoop job which includes a combination
of various services in a Hadoop Ecosystem.
• Before Zookeeper, it was very difficult and time consuming to coordinate between
different services in Hadoop Ecosystem.
• Earlier, the services had many problems with interactions, such as sharing common configuration while synchronizing data.
• Even when the services were configured, changes in the configurations of the services made them complex and difficult to handle.
• Grouping and naming were also time-consuming.
• Due to the above problems, Zookeeper was introduced.
• It saves a lot of time by performing synchronization, configuration maintenance,
grouping and naming.
• Although it’s a simple service, it can be used to build powerful solutions.
• Zookeeper in Hadoop can be considered a centralized repository where distributed
applications can put data into and retrieve data from.
• It makes a distributed system work together as a whole using its synchronization,
serialization, and coordination goals.
• For clarity, Zookeeper can be thought of as a file system in which nodes (znodes) store data, instead of files or directories storing data.
• Zookeeper is a Hadoop Admin tool used to manage jobs in a cluster.
• For example, Apache Storm, which Twitter uses to store machine state data, has
Apache Zookeeper as a coordinator between machines.

Why Apache Zookeeper?


Here are important reasons behind the popularity of Zookeeper:
• It allows for mutual exclusion and cooperation between server processes.
• It ensures that your application runs consistently.
• A transaction is never partially complete; it is given the status of either success or failure. The distributed state can be held up, but it is never wrong.
• Irrespective of the server it connects to, a client will see the same view of the service. It provides a single coherent view of multiple machines.
• It helps you encode data as per a specific set of rules.
• It helps maintain a standard hierarchical namespace, similar to files and directories.
• Computers which run as a single system can be locally or geographically connected.
• It allows nodes to join or leave the cluster and reports node status in real time.
• You can increase performance by deploying more machines.
• It allows you to elect a node as a leader for better coordination.
• ZooKeeper works fast with workloads where reads of the data are more common than writes.

Why do we need Zookeeper in Hadoop?


• Distributed applications are difficult to coordinate and work with because they are
much more prone to errors due to the large number of machines connected to the
network.
• Because many machines are involved, race conditions and deadlocks are common
problems when implementing distributed applications.
• A race condition occurs when a machine tries to perform two or more operations at
once, and this can be resolved with Zookeeper’s serialization feature.

• A deadlock occurs when two or more computers attempt to access the same shared
resource simultaneously. More precisely, they try to access each other’s resources,
which leads to a deadlock because neither system releases the resource but waits for
the other system to release it. Synchronization in Zookeeper helps resolve
deadlocks.

• Another major problem with a distributed application is partial process failure, leading to data inconsistency.

• Zookeeper handles this with atomicity, meaning either the entire process terminates,
or nothing is left after failure.

• So Zookeeper is an important part of Hadoop that takes care of these small but
important matters so that the developer can focus more on the application’s
functionality.

How does Zookeeper Architecture work in Hadoop?

• The Hadoop Zookeeper architecture is a distributed application that follows a simple client-server model, where clients are the nodes that consume the service and servers are the nodes that provide the service.
• Multiple server nodes are collectively called a ZooKeeper ensemble.
• A Zookeeper client connects to one server at a given time.
The master node is dynamically selected by consensus within the ensemble, so the ensemble usually contains an odd number of servers so that a majority of votes is always possible. If the master node fails, another master node is instantly selected and takes over from the previous master node.
In addition to masters and slaves, there are also observers in Zookeeper. Observers were introduced to address the scaling issue: adding more slaves affected write performance because the voting process was expensive. Observers are therefore slaves that do not participate in voting but otherwise have duties similar to the other slaves.
Writes in Zookeeper Architecture
All writes in the Zookeeper architecture go through the master node, so all writes are guaranteed to be sequential. When a client performs a write operation, the server it is connected to forwards the request to the master, and the data is then stored on the master and replicated to the other servers; this keeps all servers updated. However, it also means that concurrent writes cannot be performed. The linear write guarantee can be problematic if Zookeeper is used for a write-heavy workload.
Zookeeper in Hadoop is ideally used to coordinate message exchange between clients, which involves fewer writes and more reads. Zookeeper is useful as long as the data is shared, but if the application has concurrent data writes, Zookeeper can get in the way and impose a strict order of operations.
Reads in Zookeeper Architecture
Zookeeper is best at reading, because reads can be concurrent. Concurrent reads are possible because each client may be connected to a different server, and all clients can read from the servers simultaneously. However, concurrent reads result in eventual consistency because the master server is not involved; there may be cases where a client has an outdated view that is updated with a small delay.
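
For a concrete feel of znodes, writes and reads, here is a hedged Python sketch using the third-party kazoo client library (assuming a ZooKeeper server reachable at 127.0.0.1:2181; paths and values are illustrative):

    from kazoo.client import KazooClient

    # Connection string is an assumption; point it at your ZooKeeper ensemble.
    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # Writes are routed through the leader and applied in a total order.
    zk.ensure_path("/app/config")
    if not zk.exists("/app/config/db_url"):
        zk.create("/app/config/db_url", b"jdbc:mysql://db:3306/app")

    # Reads can be served by whichever server the client is connected to.
    value, stat = zk.get("/app/config/db_url")
    print(value.decode(), "version:", stat.version)

    zk.stop()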

APACHE SQOOP
• Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
• Sqoop can import as well as export structured data from an RDBMS or enterprise data warehouse to HDFS, and vice versa.
• When we submit a Sqoop command, our main task gets divided into sub-tasks, each handled by an individual Map task internally. A Map task is the sub-task that imports part of the data into the Hadoop Ecosystem; collectively, all Map tasks import the whole data. Export works in a similar manner.
• When we submit an export job, it is mapped into Map tasks which bring chunks of data from HDFS. These chunks are exported to a structured data destination. Combining all these exported chunks of data, we receive the whole data at the destination, which in most cases is an RDBMS (MySQL/Oracle/SQL Server).

Apache SQOOP
Sqoop is defined as the tool used to perform data transfer operations between relational database management systems and the Hadoop server.
Sqoop Driver:
• Basically, a Sqoop "driver" simply refers to a JDBC driver. JDBC is a standard Java API for accessing relational databases and some data warehouses.
Sqoop Connectors:
• Although there is a standard prescribing what the language should look like, every database has its own dialect of SQL. (A dialect is a particular variant of a language.)
• Using Sqoop connectors, Sqoop can overcome the differences in SQL dialects supported by various databases while providing optimized data transfer. More specifically, a connector is a pluggable piece (pluggable means it is included as part of the runtime).
• We use connectors to fetch metadata about the transferred data (columns, associated data types, …).
• The basic connector shipped with Sqoop is the Generic JDBC Connector which, as the name suggests, uses only the JDBC interface for accessing metadata and transferring data.
• Since SQL is a very general query processing language, the Generic JDBC Connector can be used for importing data into, or exporting data out of, most database servers.
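
A typical import invocation (wrapped here in Python's subprocess module purely for illustration) might look like the sketch below; it assumes Sqoop is installed and on the PATH, and the JDBC connection string, credentials file, table name and target directory are placeholders:

    import subprocess

    # Each mapper imports one slice of the table; flags are standard Sqoop import options.
    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost:3306/sales",   # placeholder JDBC URL
        "--username", "etl_user",                        # placeholder credentials
        "--password-file", "/user/etl/.sqoop_pwd",
        "--table", "orders",
        "--target-dir", "/data/raw/orders",              # HDFS destination
        "--num-mappers", "4",
    ], check=True)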

Apache Pig - Architecture


The language used to analyze data in Hadoop using Pig is known as Pig Latin. It is a high-
level data processing language which provides a rich set of data types and operators to
perform various operations on the data.

To perform a particular task, programmers using Pig need to write a Pig script in the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, Embedded). After execution, these scripts go through a series of transformations applied by the Pig framework to produce the desired output.

Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus, it
makes the programmer’s job easy. The architecture of Apache Pig is shown below.
Apache Pig Components

As shown in the figure, there are various components in the Apache Pig framework. Let us
take a look at the major components.

Parser

Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script, does type
checking, and other miscellaneous checks. The output of the parser will be a DAG (directed
acyclic graph), which represents the Pig Latin statements and logical operators.

In the DAG, the logical operators of the script are represented as the nodes and the data flows
are represented as edges.

Optimizer

The logical plan (DAG) is passed to the logical optimizer, which carries out the logical
optimizations such as projection and pushdown.

Compiler

The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine

Finally the MapReduce jobs are submitted to Hadoop in a sorted order. Finally, these
MapReduce jobs are executed on Hadoop producing the desired results.

Pig Latin Data Model

The data model of Pig Latin is fully nested and it allows complex non-atomic datatypes such
as map and tuple. Given below is the diagrammatical representation of Pig Latin’s data
model.

Atom
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’
Tuple

A record that is formed by an ordered set of fields is known as a tuple, the fields can be of
any type. A tuple is similar to a row in a table of RDBMS.

Example − (Raja, 30)


Bag

A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is


known as a bag. Each tuple can have any number of fields (flexible schema). A bag is
represented by ‘{}’. It is similar to a table in RDBMS, but unlike a table in RDBMS, it is not
necessary that every tuple contain the same number of fields or that the fields in the same
position (column) have the same type.

Example − {(Raja, 30), (Mohammad, 45)}


A bag can be a field in a relation; in that context, it is known as inner bag.
Example − {Raja, 30, {9848022338, [email protected],}}
Map
A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and
should be unique. The value might be of any type. It is represented by ‘[]’
Example − [name#Raja, age#30]
Relation

A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee
that tuples are processed in any particular order).
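
As a rough analogy (not Pig Latin itself), the same shapes can be written with ordinary Python data structures. The values are illustrative, and Python types only approximate Pig's semantics (for example, a Pig bag is unordered and may contain duplicate tuples, which a Python list only loosely models):

    # Atom: a single value (stored as a string or number in Pig); a piece of data is a field.
    atom = "raja"

    # Tuple: an ordered set of fields, like a row in an RDBMS table.
    row = ("Raja", 30)

    # Bag: a collection of tuples with a flexible schema; tuples may differ in field count.
    bag = [("Raja", 30), ("Mohammad", 45), ("Kiran", 33, "Hyderabad")]

    # Map: key-value pairs where the keys are chararrays (strings).
    record = {"name": "Raja", "age": 30}

    # Relation: a bag of tuples, the outermost structure a Pig script operates on.
    relation = bag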

APACHE HIVE:
• Apache Hive is a data warehouse system for Hadoop that runs SQL-like queries called HQL (Hive Query Language) which get internally converted to MapReduce jobs.
• Hive was developed by Facebook.
• It supports Data Definition Language, Data Manipulation Language and user-defined functions.
• Basically, HIVE is a data warehousing component which performs reading, writing and managing large data sets in a distributed environment using an SQL-like interface. HIVE + SQL = HQL
• The query language of Hive is called Hive Query Language (HQL), which is very similar to SQL.
• It has 2 basic components: the Hive command line and the JDBC/ODBC driver.
• The Hive command line interface is used to execute HQL commands.
• Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) are used to establish a connection to the data storage.
• Hive is highly scalable, as it can serve both purposes, i.e. large dataset processing (batch query processing) and real-time processing (interactive query processing).
• It supports all primitive data types of SQL.
• You can use predefined functions, or write tailored user-defined functions (UDFs) to accomplish your specific needs.
Features of Hive
• Hive is fast and scalable.
• It provides SQL-like queries (i.e., HQL) that are implicitly transformed to
MapReduce or Spark jobs.
• It is capable of analyzing large datasets stored in HDFS.
• It allows different storage types such as plain text, RCFile, and HBase.
• It uses indexing to accelerate queries.
• It can operate on compressed data stored in the Hadoop ecosystem.
• It supports user-defined functions (UDFs), with which users can plug in their own functionality.
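
HQL-style queries can also be run programmatically. The sketch below uses PySpark's Hive support rather than the Hive CLI (assuming Spark is configured to reach an existing Hive metastore; the table and column names are illustrative):

    from pyspark.sql import SparkSession

    # enableHiveSupport() lets Spark read and write Hive tables via the metastore.
    spark = (SparkSession.builder
             .appName("hive-sketch")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("CREATE TABLE IF NOT EXISTS employees (name STRING, dept STRING, salary DOUBLE)")
    spark.sql("""
        SELECT dept, AVG(salary) AS avg_salary
        FROM employees
        GROUP BY dept
    """).show()

    spark.stop()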
Limitations of Hive
• Hive is not capable of handling real-time data.
• It is not designed for online transaction processing.
• Hive queries have high latency.
Differences between Hive and Pig
What is Hive
Hive is a data warehouse infrastructure tool to process structured
data in Hadoop. It resides on top of Hadoop to summarize Big
Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook, later the Apache
Software Foundation took it up and developed it further as an
open source under the name Apache Hive. It is used by different
companies. For example, Amazon uses it in Amazon Elastic
MapReduce.

Hive is not
 A relational database
 A design for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates

Features of Hive
 It stores schema in a database and processed data into
HDFS.
 It is designed for OLAP.
 It provides SQL type language for querying called HiveQL or
HQL.
 It is familiar, fast, scalable, and extensible.

Architecture of Hive
The following component diagram depicts the architecture of
Hive:

This component diagram contains different units. The following table describes each unit:

User Interface: Hive is data warehouse infrastructure software that can create interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HD Insight (on Windows Server).

Meta Store: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.

HiveQL Process Engine: HiveQL is similar to SQL for querying schema information in the Metastore. It is one of the replacements of the traditional approach for MapReduce programs. Instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.

Execution Engine: The conjunction part of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates results the same as MapReduce results. It uses the flavor of MapReduce.

HDFS or HBASE: The Hadoop Distributed File System or HBase are the data storage techniques used to store data in the file system.

Working of Hive
The following diagram depicts the workflow between Hive and
Hadoop.

The following table defines how Hive interacts with the Hadoop framework:

1. Execute Query: The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2. Get Plan: The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan or the requirements of the query.
3. Get Metadata: The compiler sends a metadata request to the Metastore (any database).
4. Send Metadata: The Metastore sends the metadata as a response to the compiler.
5. Send Plan: The compiler checks the requirements and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
6. Execute Plan: The driver sends the execute plan to the execution engine.
7. Execute Job: Internally, the process of executing the job is a MapReduce job. The execution engine sends the job to the JobTracker, which is in the Name node, and it assigns this job to the TaskTracker, which is in the Data node. Here, the query executes the MapReduce job.
7.1. Metadata Ops: Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.
8. Fetch Result: The execution engine receives the results from the Data nodes.
9. Send Results: The execution engine sends those resultant values to the driver.
10. Send Results: The driver sends the results to the Hive interfaces.

Introduction to NoSQL


NoSQL is a type of database management system (DBMS) that is designed to handle and store large volumes of unstructured and semi-structured data. Unlike traditional relational databases that use tables with pre-defined schemas to store data, NoSQL databases use flexible data models that can adapt to changes in data structures and are capable of scaling horizontally to handle growing amounts of data.
The term NoSQL originally referred to “non-SQL” or “non-relational”
databases, but the term has since evolved to mean “not only SQL,” as NoSQL
databases have expanded to include a wide range of different database
architectures and data models.
NoSQL databases are generally classified into the following main categories:
1. Document databases: These databases store data as semi-structured
documents, such as JSON or XML, and can be queried using
document-oriented query languages.
2. Key-value stores: These databases store data as key-value pairs, and
are optimized for simple and fast read/write operations.
3. Column-family stores: These databases store data as column
families, which are sets of columns that are treated as a single entity.
They are optimized for fast and efficient querying of large amounts of
data.
4. Graph databases: These databases store data as nodes and edges, and
are designed to handle complex relationships between data.
5. Time series — A time series database is designed to store and retrieve data
records that are sequenced by time, which are sets of data points that are
associated with timestamps and stored in time sequence order. Time series
databases make it easy to measure how measurements or events change over
time; for example, temperature readings from weather sensors or intraday stock
prices. AWS offers Amazon Timestream as a managed time series database
service.
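
To make the categories above concrete, here is a rough Python sketch of the shape of data in each model; the values are purely illustrative:

    # Document model: a self-describing, nested JSON-like record.
    document = {"_id": "u42", "name": "Asha", "orders": [{"sku": "A1", "qty": 2}]}

    # Key-value model: an opaque value looked up by a single key.
    kv_store = {"session:u42": "opaque-session-blob"}

    # Column-family model: a row key mapping to named families of columns.
    wide_row = {"u42": {"profile": {"name": "Asha"}, "activity": {"last_login": "2024-01-05"}}}

    # Graph model: nodes plus edges that describe relationships.
    nodes = {"u42": {"type": "user"}, "p7": {"type": "product"}}
    edges = [("u42", "BOUGHT", "p7")]

    # Time series model: data points ordered by timestamp.
    readings = [("2024-01-05T10:00:00Z", 21.4), ("2024-01-05T10:01:00Z", 21.6)]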
NoSQL databases are often used in applications where there is a high volume of
data that needs to be processed and analyzed in real-time, such as social media
analytics, e-commerce, and gaming. They can also be used for other
applications, such as content management systems, document management, and
customer relationship management.

However, NoSQL databases may not be suitable for all applications, as they
may not provide the same level of data consistency and transactional guarantees
as traditional relational databases. It is important to carefully evaluate the
specific needs of an application when choosing a database management system.

NoSQL, originally referring to "non-SQL" or "non-relational", is a database that provides a mechanism for the storage and retrieval of data. This data is modeled in means other than the tabular relations used in relational databases. Such databases came into existence in the late 1960s, but did not obtain the NoSQL moniker until a surge of popularity in the early twenty-first century.

NoSQL databases are used in real-time web applications and big data and
their use are increasing over time.
Key Features of NoSQL:
1. Dynamic schema: NoSQL databases do not have a fixed schema and
can accommodate changing data structures without the need for
migrations or schema alterations.
2. Horizontal scalability: NoSQL databases are designed to scale out
by adding more nodes to a database cluster, making them well-suited
for handling large amounts of data and high levels of traffic.
3. Document-based: Some NoSQL databases, such as MongoDB, use a document-based data model, where data is stored in a semi-structured format, such as JSON or BSON.
4. Key-value-based: Other NoSQL databases, such as Redis, use a key-
value data model, where data is stored as a collection of key-value
pairs.
5. Column-based: Some NoSQL databases, such as Cassandra, use a
column-based data model, where data is organized into columns
instead of rows.
6. Distributed and high availability: NoSQL databases are often
designed to be highly available and to automatically handle node
failures and data replication across multiple nodes in a database
cluster.
7. Flexibility: NoSQL databases allow developers to store and retrieve
data in a flexible and dynamic manner, with support for multiple data
types and changing data structures.
8. Performance: NoSQL databases are optimized for high performance
and can handle a high volume of reads and writes, making them
suitable for big data and real-time applications.

Advantages of NoSQL: There are many advantages of working with NoSQL databases such as MongoDB and Cassandra. The main advantages are high scalability and high availability.
1. High scalability: NoSQL databases use sharding for horizontal scaling. Sharding is partitioning the data and placing it on multiple machines in such a way that the order of the data is preserved. Vertical scaling means adding more resources to the existing machine, whereas horizontal scaling means adding more machines to handle the data. Vertical scaling is not that easy to implement, but horizontal scaling is easy to implement. Examples of horizontally scaling databases are MongoDB, Cassandra, etc. NoSQL can handle a huge amount of data because of this scalability; as the data grows, NoSQL scales itself to handle that data in an efficient manner.
2. Flexibility: NoSQL databases are designed to handle unstructured or
semi-structured data, which means that they can accommodate
dynamic changes to the data model. This makes NoSQL databases a
good fit for applications that need to handle changing data
requirements.
3. High availability: The auto-replication feature in NoSQL databases makes them highly available, because in case of any failure the data replicates itself back to the last consistent state.
4. Scalability: NoSQL databases are highly scalable, which means that
they can handle large amounts of data and traffic with ease. This
makes them a good fit for applications that need to handle large
amounts of data or traffic
5. Performance: NoSQL databases are designed to handle large
amounts of data and traffic, which means that they can offer
improved performance compared to traditional relational databases.
6. Cost-effectiveness: NoSQL databases are often more cost-effective
than traditional relational databases, as they are typically less
complex and do not require expensive hardware or software.
7. Agility: Ideal for agile development.

Disadvantages of NoSQL: NoSQL has the following disadvantages.


1. Lack of standardization: There are many different types of NoSQL
databases, each with its own unique strengths and weaknesses. This
lack of standardization can make it difficult to choose the right
database for a specific application
2. Lack of ACID compliance: NoSQL databases are not fully ACID-
compliant, which means that they do not guarantee the consistency,
integrity, and durability of data. This can be a drawback for
applications that require strong data consistency guarantees.
3. Narrow focus: NoSQL databases have a very narrow focus as it is
mainly designed for storage but it provides very little functionality.
Relational databases are a better choice in the field of Transaction
Management than NoSQL.
4. Open-source: NoSQL is an open-source database. There is no reliable standard for NoSQL yet; in other words, two database systems are likely to be unequal.
5. Lack of support for complex queries: NoSQL databases are not
designed to handle complex queries, which means that they are not a
good fit for applications that require complex data analysis or
reporting.
6. Lack of maturity: NoSQL databases are relatively new and lack the
maturity of traditional relational databases. This can make them less
reliable and less secure than traditional databases.
7. Management challenge: The purpose of big data tools is to make
the management of a large amount of data as simple as possible. But
it is not so easy. Data management in NoSQL is much more complex
than in a relational database. NoSQL, in particular, has a reputation
for being challenging to install and even more hectic to manage on a
daily basis.
8. GUI is not available: GUI mode tools to access the database are not
flexibly available in the market.
9. Backup: Backup is a great weak point for some NoSQL databases
like MongoDB. MongoDB has no approach for the backup of data in
a consistent manner.
10.Large document size: Some database systems like MongoDB and
CouchDB store data in JSON format. This means that documents are
quite large (BigData, network bandwidth, speed), and having
descriptive key names actually hurts since they increase the
document size.

Types of NoSQL database: Types of NoSQL databases and the name of the
database system that falls in that category are:
1. Graph Databases: Examples – Amazon Neptune, Neo4j
2. Key value store: Examples – Memcached, Redis, Coherence
3. Column: Examples – HBase, BigTable, Accumulo
4. Document-based: Examples – MongoDB, CouchDB, Cloudant
When should NoSQL be used:
1. When a huge amount of data needs to be stored and retrieved.
2. The relationship between the data you store is not that important
3. The data changes over time and is not structured.
4. Support of Constraints and Joins is not required at the database level
5. The data is growing continuously and you need to scale the database
regularly to handle the data.
In conclusion, NoSQL databases offer several benefits over traditional
relational databases, such as scalability, flexibility, and cost-effectiveness.
However, they also have several drawbacks, such as a lack of standardization,
lack of ACID compliance, and lack of support for complex queries. When
choosing a database for a specific application, it is important to weigh the
benefits and drawbacks carefully to determine the best fit.

MongoDB:
MongoDB history

MongoDB was created by Dwight Merriman and Eliot Horowitz, who encountered
development and scalability issues with traditional relational database approaches while
building web applications at DoubleClick, an online advertising company that is now owned
by Google Inc. The name of the database was derived from the word humongous to represent
the idea of supporting large amounts of data.

Merriman and Horowitz helped form 10Gen Inc. in 2007 to commercialize MongoDB and
related software. The company was renamed MongoDB Inc. in 2013 and went public in
October 2017 under the ticker symbol MDB.

The DBMS was released as open source software in 2009 and has been kept updated since.

Organizations like the insurance company MetLife have used MongoDB for customer service
applications, while other websites like Craigslist have used it for archiving data. The CERN
physics lab has used it for data aggregation and discovery. Additionally, The New York
Times has used MongoDB to support a form-building application for photo submissions.

Instead of using tables and rows as in relational databases, as a NoSQL database, the
MongoDB architecture is made up of collections and documents. Documents are made up of
Key-value pairs -- MongoDB's basic unit of data. Collections, the equivalent of SQL tables,
contain document sets. MongoDB offers support for many programming languages, such as
C, C++, C#, Go, Java, Python, Ruby and Swift.

How does MongoDB work?

MongoDB environments provide users with a server to create databases with MongoDB. MongoDB stores data as records that are made up of collections and documents.

Documents contain the data the user wants to store in the MongoDB database.
Documents are composed of field and value pairs. They are the basic unit of data in
MongoDB. The documents are similar to JavaScript Object Notation (JSON) but use
a variant called Binary JSON (BSON). The benefit of using BSON is that it
accommodates more data types. The fields in these documents are like the columns
in a relational database. Values contained can be a variety of data types, including
other documents, arrays and arrays of documents, according to the MongoDB user
manual. Documents will also incorporate a primary key as a unique identifier. A
document's structure is changed by adding or deleting new or existing fields.
Sets of documents are called collections, which function as the equivalent of
relational database tables. Collections can contain any type of data, but the
restriction is the data in a collection cannot be spread across different databases.
Users of MongoDB can create multiple databases with multiple collections.

The mongo shell is a standard component of the open-source distributions of MongoDB. Once MongoDB is installed, users connect the mongo shell to their running MongoDB instances. The mongo shell acts as an interactive JavaScript interface to MongoDB, which allows users to query or update data and conduct administrative operations.
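
The same operations can also be performed programmatically. Below is a hedged sketch using the official PyMongo driver (assuming a MongoDB server at localhost:27017; the database, collection and field names are placeholders):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["shop"]            # database (created lazily on first write)
    products = db["products"]      # collection, the equivalent of an SQL table

    # Insert a document (a BSON key-value record); an _id is added automatically.
    products.insert_one({"name": "keyboard", "price": 35.0, "tags": ["usb", "wired"]})

    # Query with a filter document.
    for doc in products.find({"price": {"$lt": 50}}):
        print(doc["name"], doc["price"])

    client.close()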

A binary representation of JSON-like documents is provided by the BSON document storage and data interchange format. Automatic sharding is another key feature that enables data in a MongoDB collection to be distributed across multiple systems for horizontal scalability, as data volumes and throughput requirements increase.

The NoSQL DBMS uses a single master architecture for data consistency, with
secondary databases that maintain copies of the primary database. Operations are
automatically replicated to those secondary databases for automatic failover.

Why is MongoDB used?

An organization might want to use MongoDB for the following:


 Storage. MongoDB can store large structured and unstructured data volumes and is
scalable vertically and horizontally. Indexes are used to improve search performance.
Searches are also done by field, range and expression queries.

 Data integration. This integrates data for applications, including for hybrid and multi-
cloud applications.

 Complex data structure descriptions. Document databases enable the embedding of documents to describe nested structures (a structure within a structure) and can tolerate variations in data.

 Load balancing. MongoDB can be used to run over multiple servers.

Features of MongoDB

Features of MongoDB include the following:

 Replication. A replica set is two or more MongoDB instances used to provide high
availability. Replica sets are made of primary and secondary servers. The primary
MongoDB server performs all the read and write operations, while the secondary replica
keeps a copy of the data. If a primary replica fails, the secondary replica is then used.

 Scalability. MongoDB supports vertical and horizontal scaling. Vertical scaling works by
adding more power to an existing machine, while horizontal scaling works by adding
more machines to a user's resources.

 Load balancing. MongoDB handles load balancing without the need for a separate,
dedicated load balancer, through either vertical or horizontal scaling.

 Schema-less. MongoDB is a schema-less database, which means the database can manage data without the need for a blueprint.

 Document. Data in MongoDB is stored in documents with key-value pairs instead of rows
and columns, which makes the data more flexible when compared to SQL databases.

Advantages of MongoDB

MongoDB offers several potential benefits:

 Schema-less. Like other NoSQL databases, MongoDB doesn't require predefined schemas. It stores any type of data. This gives users the flexibility to create any number of fields in a document, making it easier to scale MongoDB databases compared to relational databases.

 Document-oriented. One of the advantages of using documents is that these objects map to native data types in several programming languages. Having embedded documents also reduces the need for database joins, which can lower costs.

 Scalability. A core function of MongoDB is its horizontal scalability, which makes it a useful database for companies running big data applications. In addition, sharding lets the database distribute data across a cluster of machines. MongoDB also supports the creation of zones of data based on a shard key.

 Third-party support. MongoDB supports several storage engines and provides pluggable
storage engine APIs that let third parties develop their own storage engines for
MongoDB.

 Aggregation. The DBMS also has built-in aggregation capabilities, which lets users
run MapReduce code directly on the database rather than running MapReduce
on Hadoop. MongoDB also includes its own file system called GridFS, akin to the Hadoop
Distributed File System. The use of the file system is primarily for storing files larger than
BSON's size limit of 16 MB per document. These similarities let MongoDB be used
instead of Hadoop, though the database software does integrate with
Hadoop, Spark and other data processing frameworks.
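
The aggregation capability mentioned above can be sketched with PyMongo's aggregation pipeline (an illustrative example reusing the hypothetical "shop.products" collection from earlier; the stage contents are placeholders):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    pipeline = [
        {"$match": {"price": {"$lt": 100}}},                                 # filter stage
        {"$group": {"_id": "$category", "avg_price": {"$avg": "$price"}}},   # group/aggregate stage
        {"$sort": {"avg_price": -1}},                                        # order the groups
    ]
    for row in client["shop"]["products"].aggregate(pipeline):
        print(row)
    client.close()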

Disadvantages of MongoDB

Though there are some valuable benefits to MongoDB, there are some downsides to it as
well.

 Continuity. With its automatic failover strategy, a user sets up just one master node in a
MongoDB cluster. If the master fails, another node will automatically convert to the new
master. This switch promises continuity, but it isn't instantaneous -- it can take up to a
minute. By comparison, the Cassandra NoSQL database supports multiple master nodes.
If one master goes down, another is standing by, creating a highly available database
infrastructure.
 Write limits. MongoDB's single master node also limits how fast data can be written to
the database. Data writes must be recorded on the master, and writing new information
to the database is limited by the capacity of that master node.

 Data consistency. MongoDB doesn't provide full referential integrity through the use of
foreign-key constraints, which could affect data consistency.

 Security. In addition, user authentication isn't enabled by default in MongoDB databases. However, malicious hackers have targeted large numbers of unsecured MongoDB systems in attacks, which led to the addition of a default setting that blocks networked connections to databases if they haven't been configured by a database administrator.

MongoDB vs. RDBMS: What are the differences?

A relational database management system (RDBMS) is a collection of programs and capabilities that let IT teams and others create, update, administer and otherwise interact with a relational database. RDBMSes store data in the form of tables and rows. Although it is not necessary, an RDBMS most commonly uses SQL.

One of the main differences between MongoDB and RDBMS is that RDBMS is a relational
database while MongoDB is nonrelational. Likewise, while most RDBMS systems use SQL
to manage stored data, MongoDB uses BSON for data storage -- a type of NoSQL database.

While RDBMS uses tables and rows, MongoDB uses documents and collections. In RDBMS
a table -- the equivalent to a MongoDB collection -- stores data as columns and rows.
Likewise, a row in RDBMS is the equivalent of a MongoDB document but stores data as
structured data items in a table. A column denotes sets of data values, which is the equivalent
to a field in MongoDB.

MongoDB is also better suited for hierarchical storage.

MongoDB platforms

MongoDB is available in community and commercial versions through the vendor MongoDB Inc. MongoDB Community Edition is the open source release, while MongoDB Enterprise Server brings added security features, an in-memory storage engine, administration and authentication features, and monitoring capabilities through Ops Manager.

A graphical user interface (GUI) named MongoDB Compass gives users a way to work with
document structure, conduct queries, index data and more. The MongoDB Connector for BI
lets users connect the NoSQL database to their business intelligence tools to visualize data
and create reports using SQL queries.

Following in the footsteps of other NoSQL database providers, MongoDB Inc. launched
a cloud database as a service named MongoDB Atlas in 2016. Atlas runs on AWS, Microsoft
Azure and Google Cloud Platform. Later, MongoDB released a platform named Stitch for
application development on MongoDB Atlas, with plans to extend it to on-premises
databases.

NoSQL databases often include document, graph, key-value or wide-column store-based databases.

The company also added support for multi-document atomicity, consistency, isolation, and
durability (ACID) transactions as part of MongoDB 4.0 in 2018. Complying with the ACID
properties across multiple documents expands the types of transactional workloads that
MongoDB can handle with guaranteed accuracy and reliability.
