BIG DATA

1. FOUR V'S?

Volume: Volume refers to the vast amount of data that is generated and collected. Big data
involves working with data sets that are so large and complex that they cannot be easily
managed or processed using traditional database systems. The volume of data can range
from terabytes to petabytes or even larger.

Velocity: Velocity represents the speed at which data is generated, collected, and processed.
Big data often involves real-time or near-real-time data streams that flow rapidly and
continuously. Examples of high-velocity data sources include social media feeds, sensor data
from Internet of Things (IoT) devices, financial transactions, and website clickstreams.

Variety: Variety refers to the diverse types and formats of data that are included in big data.
This includes structured data (e.g., data stored in traditional databases), unstructured data
(e.g., text documents, emails, social media posts), semi-structured data (e.g., XML or JSON
files), multimedia data (e.g., images, videos), and more. Big data systems need to be able to
handle and analyze this wide range of data formats.

Veracity: Veracity relates to the reliability and trustworthiness of the data. Big data often
includes data from various sources, and the quality and accuracy of the data can vary
significantly. Veracity refers to the challenges of dealing with uncertain, incomplete, or
inconsistent data. It involves ensuring data integrity, addressing data quality issues, and
making accurate interpretations despite the presence of noise or errors in the data.

Note: It's worth mentioning that some discussions on big data include additional Vs such as
value and variability, which emphasize the importance of extracting value from data and
handling data changes over time. However, the original concept of the "4 Vs" focuses on
volume, velocity, variety, and veracity.

2.APPLICATIONS OF BIG DATA?

1. Tracking Customer Spending Habits and Shopping Behavior: Big retail
stores (like Amazon, Walmart, Big Bazaar, etc.) have to keep data on customers'
spending habits (which products they spend on, which brands they prefer, how
frequently they spend), their shopping behavior, and their most-liked products (so that
they can keep those products in the store). Based on which products are searched or
sold the most, the production/procurement rate of those products is fixed.
The banking sector uses its customers' spending-behavior data to offer a particular
customer a discount or cashback for buying a product they like with the bank's credit
or debit card. In this way, the right offer can be sent to the right person at the right time.
2. Recommendation: By tracking customers' spending habits and shopping
behavior, big retail stores provide recommendations to customers. E-commerce
sites like Amazon, Walmart, and Flipkart do product recommendation: they track
which products a customer searches for and, based on that data, recommend
products of that type to the customer.
For example, suppose a customer searches for a bed cover on Amazon. Amazon
now has data indicating that the customer may be interested in buying a bed cover,
so the next time that customer visits a Google page, advertisements for various bed
covers are shown. In this way, the advertisement for the right product can be sent to
the right customer.
YouTube also recommends videos based on the types of videos a user has
previously liked or watched, and shows advertisements relevant to the content of the
video being watched. For example, if someone is watching a Big Data tutorial video,
advertisements for other big data courses may be shown during that video.
3. Smart Traffic System: Data about traffic conditions on different roads is
collected through cameras placed beside the roads and at the entry and exit points
of the city, and from GPS devices placed in vehicles (Ola and Uber cabs, etc.). All
this data is analyzed, and jam-free or less congested, less time-consuming routes
are recommended. In this way a smart traffic system can be built in a city through
big data analysis. An added benefit is that fuel consumption can be reduced.
4. Secure Air Traffic System: Sensors are present at various parts of an aircraft
(such as the propellers). These sensors capture data like flight speed, moisture,
temperature, and other environmental conditions. Based on the analysis of this
data, environmental parameters within the flight are set and adjusted.
By analyzing the flight's machine-generated data, it can also be estimated how long
the machine can operate flawlessly and when it should be replaced or repaired.
5. Self-Driving Cars: Big data analysis helps drive a car without human
intervention. Cameras and sensors placed at various spots on the car gather data
such as the size of surrounding vehicles and obstacles and the distance from them.
This data is analyzed, and calculations such as how many degrees to turn, what the
speed should be, and when to stop are carried out. These calculations help the car
take action automatically.
6. Virtual Personal Assistant Tools: Big data analysis helps virtual personal
assistant tools (like Siri on Apple devices, Cortana on Windows, and Google
Assistant on Android) provide answers to the various questions asked by users.
These tools track the user's location, local time, season, and other data related to
the question asked. By analyzing all this data, they provide an answer.
3. BIG DATA ANALYTICS LIFE CYCLE?
ANS:

The Big Data Analytics Life Cycle is divided into nine phases:
1. Business Case/Problem Definition
2. Data Identification
3. Data Acquisition and filtration
4. Data Extraction
5. Data Munging(Validation and Cleaning)
6. Data Aggregation & Representation(Storage)
7. Exploratory Data Analysis
8. Data Visualization(Preparation for Modeling and Assessment)
9. Utilization of analysis results.
Phase I: Business Problem Definition –
In this stage, the team learns about the business domain, which presents the
motivation and goals for carrying out the analysis. The problem is identified, and
assumptions are made about how much potential gain the company will make after
carrying out the analysis. Important activities in this step include framing the
business problem as an analytics challenge that can be addressed in subsequent
phases. It helps the decision-makers understand the business resources that will be
required, thereby determining the underlying budget needed to carry out the project.
Moreover, it can be determined whether the identified problem is a Big Data problem
or not, based on the business requirements in the business case. To qualify as a big
data problem, the business case should be directly related to one (or more) of the
characteristics of volume, velocity, or variety.

Phase II: Data Identification –
Once the business case is identified, it is time to find the appropriate datasets to
work with. In this stage, analysis is also done to see what other companies have
done for a similar case.
Depending on the business case and the scope of analysis of the project being
addressed, the sources of datasets can be either internal or external to the
company. Internal datasets include data collected from internal sources, such as
feedback forms or existing software; external datasets include datasets from
third-party providers.

Phase III: Data Acquisition and Filtration –
Once the sources of data are identified, it is time to gather the data from those
sources. This kind of data is mostly unstructured. It is then subjected to filtration,
such as the removal of corrupt or irrelevant data that is of no scope to the analysis
objective. Here, corrupt data means data that may have missing records or
incompatible data types.
After filtration, a copy of the filtered data is stored and compressed, as it can be of
use in the future for some other analysis.

Phase IV: Data Extraction –
Now the data is filtered, but some of the entries might still be incompatible. To
rectify this issue, a separate phase, known as the data extraction phase, is created.
In this phase, the data that does not match the underlying scope of the analysis is
extracted and transformed into a compatible form.

Phase V: Data Munging (Validation and Cleaning) –
As mentioned in Phase III, the data is collected from various sources, which results
in the data being unstructured. There is a possibility that the data has unsuitable
constraints, which can lead to false results. Hence, there is a need to clean and
validate the data.
This includes removing invalid data and establishing complex validation rules. There
are many ways to validate and clean the data. For example, a dataset might contain
a few rows with null entries. If a similar dataset is present, those entries are copied
from that dataset; otherwise, those rows are dropped.

Phase VI: Data Aggregation & Representation –
The data is cleansed and validated against certain rules set by the enterprise. But
the data might be spread across multiple datasets, and it is not advisable to work
with multiple datasets. Hence, the datasets are joined together. For example, if there
are two datasets, one for the Student Academic section and one for Student
Personal Details, both can be joined together via a common field, i.e. the roll
number.
This phase calls for intensive operations, since the amount of data can be very
large. Automation can be brought in so that these tasks are executed without any
human intervention.

Phase VII: Exploratory Data Analysis –
Here comes the actual analysis task. Depending on the nature of the big data
problem, analysis is carried out. Data analysis can be classified as confirmatory
analysis or exploratory analysis. In confirmatory analysis, the cause of a
phenomenon is assumed in advance; this assumption is called the hypothesis. The
data is then analyzed to prove or disprove the hypothesis. This kind of analysis
provides definitive answers to specific questions and confirms whether an
assumption was true or not. In exploratory analysis, the data is explored to obtain
information about why a phenomenon occurred. This type of analysis answers the
"why" of a phenomenon; it does not provide definitive answers, but it enables the
discovery of patterns.

Phase VIII: Data Visualization –
Now we have answers to some questions, using the information from the data in the
datasets. But these answers are still in a form that can't be presented to business
users. Some form of representation is required to obtain value or conclusions from
the analysis. Hence, various tools are used to visualize the data in graphic form,
which can easily be interpreted by business users.
Visualization is said to influence the interpretation of the results. Moreover, it allows
users to discover answers to questions that are yet to be formulated.

Phase IX: Utilization of Analysis Results –
The analysis is done and the results are visualized; now it is time for the business
users to make decisions utilizing the results. The results can be used for
optimization, to refine the business process. They can also be used as an input to
systems to enhance performance.
(Diagram: https://media.geeksforgeeks.org/wp-content/uploads/20210903092456/BigDataAnalyticsLifeCycle.jpg)

5. CLOUD AND BIG DATA?

Introduction

You’ve likely heard the terms “Big Data” and “Cloud Computing” before. If you’re
involved with cloud application development, you may even have experience with
them. The two go hand-in-hand, with many public cloud services performing big data
analytics.
With Software as a Service (SaaS) becoming increasingly popular, keeping up-to-
date with cloud infrastructure best practices and the types of data that can be stored
in large quantities is crucial. We’ll take a look at the differences between cloud
computing and big data, the relationship between them, and why the two are a
perfect match, bringing us lots of new, innovative technologies, such as artificial
intelligence.

The Difference Between Big Data & Cloud Computing

Before discussing how the two go together, it’s important to form a clear distinction
between “Big Data” and “Cloud Computing”. Although they are technically different
terms, they’re often seen together in literature because they interact synergistically
with one another.
 Big Data: This simply refers to the very large sets of data that are
output by a variety of programs. It can refer to any of a large variety
of types of data, and the data sets are usually far too large to peruse
or query on a regular computer.
 Cloud Computing: This refers to the processing of anything,
including Big Data Analytics, on the “cloud”. The “cloud” is just a set
of high-powered servers from one of many providers. They can often
view and query large data sets much more quickly than a standard
computer could.

Essentially, "Big Data" refers to the large sets of data collected, while "Cloud
Computing" refers to the mechanism that remotely takes in and processes that data.

The Roles & Relationship Between Big Data & Cloud Computing

Cloud Computing providers often utilize a "software as a service" model to allow
customers to easily process data. Typically, a console that can take in specialized
commands and parameters is available, but everything can also be done from the
site's user interface. Some products that are usually part of this package include
database management systems, cloud-based virtual machines and containers,
identity management systems, machine learning capabilities, and more.
In turn, Big Data is often generated by large, network-based systems. It can be in
either a standard or non-standard format. If the data is in a non-standard format,
artificial intelligence from the Cloud Computing provider may be used in addition to
machine learning to standardize the data.
From there, the data can be harnessed through the Cloud Computing platform and
utilized in a variety of ways. For example, it can be searched, edited, and used for
future insights.
This cloud infrastructure allows for real-time processing of Big Data. It can take huge
“blasts” of data from intensive systems and interpret it in real-time. Another common
relationship between Big Data and Cloud Computing is that the power of the cloud
allows Big Data analytics to occur in a fraction of the time it used to.

Big Data & Cloud Computing: A Perfect Match

As you can see, there are infinite possibilities when we combine Big Data and Cloud
Computing! If we simply had Big Data alone, we would have huge data sets that
have a huge amount of potential value just sitting there. Using our computers to
analyze them would be either impossible or impractical due to the amount of time it
would take.
However, Cloud Computing allows us to use state-of-the-art infrastructure and only
pay for the time and power that we use! Cloud application development is also fueled
by Big Data. Without Big Data, there would be far fewer cloud-based applications,
since there wouldn’t be any real necessity for them. Remember, Big Data is often
collected by cloud-based applications, as well!
In short, Cloud Computing services largely exist because of Big Data. Likewise, the
only reason that we collect Big Data is because we have services that are capable of
taking it in and deciphering it, often in a matter of seconds. The two are a perfect
match, since neither would exist without the other!

Conclusion

Finally, it’s important to note that both Big Data and Cloud Computing play a huge
role in our digital society. The two linked together allow people with great ideas but
limited resources a chance at business success. They also allow established
businesses to utilize data that they collect but previously had no way of analyzing.
6.MOBILE BUSINESS INTELLIGENCE IN BIG DATA?
Mobile Business Intelligence (Mobile BI) refers to the use of mobile devices, such as
smartphones and tablets, to access and analyze business intelligence data. When combined
with big data, Mobile BI allows users to access and derive insights from large volumes of
data while on the go.

Here's how Mobile BI and big data can intersect:

1. Data Accessibility: Big data platforms and technologies enable the storage and processing
of vast amounts of data. Mobile BI leverages this capability by providing users with remote
access to big data sources through mobile applications. Users can retrieve data from data
warehouses, data lakes, or real-time streaming sources and view it on their mobile devices.

2. Data Visualization: Mobile BI tools provide interactive and visually appealing data
visualizations optimized for smaller screens. These visualizations, such as charts, graphs, and
dashboards, help users understand and analyze complex big data sets. Mobile BI applications
often offer touch-friendly interfaces and intuitive navigation for a seamless user experience.

3. Real-Time Analytics: Big data platforms enable the processing of high-velocity data
streams, including real-time data. Mobile BI applications can tap into these real-time data
sources and provide users with up-to-date insights and analytics on their mobile devices.
This empowers decision-makers to make informed choices based on the latest data,
regardless of their location.

4. Collaboration and Sharing: Mobile BI solutions enable users to collaborate and share
insights with their teams, even when they are not physically present. Team members can
access shared reports, collaborate on data analysis, and provide feedback using their mobile
devices. This fosters data-driven decision-making and improves productivity in a mobile
work environment.

5. Location-Based Analytics: Mobile devices have built-in GPS capabilities, allowing Mobile BI
applications to leverage location data. When combined with big data, location-based
analytics can provide valuable insights into customer behavior, market trends, and
operational efficiency. For example, analyzing location data from mobile devices can help
retailers optimize store layouts, target specific customer segments, and personalize
marketing campaigns.

Overall, Mobile BI in the context of big data enables users to access, analyze, and share
insights from large and complex data sets using their mobile devices. It enhances data-
driven decision-making, facilitates collaboration, and provides real-time access to critical
information, empowering users to make informed choices regardless of their location.

7. HADOOP ECOSYSTEM?

Introduction: The Hadoop Ecosystem is a platform or a suite which provides various
services to solve big data problems. It includes Apache projects and various
commercial tools and solutions. There are four major elements of
Hadoop, i.e. HDFS, MapReduce, YARN, and Hadoop Common. Most of the other tools
or solutions are used to supplement or support these major elements. All these tools
work collectively to provide services such as ingestion, analysis, storage, and
maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:
 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 PIG, HIVE: Query based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm libraries
 Solr, Lucene: Searching and Indexing
 Zookeeper: Managing cluster
 Oozie: Job Scheduling

Note: Apart from the above-mentioned components, there are many other
components too that are part of the Hadoop ecosystem.
All these toolkits or components revolve around one thing, i.e. data. That is the
beauty of Hadoop: it revolves around data, and hence makes processing it easier.
HDFS:

 HDFS is the primary or major component of the Hadoop ecosystem and is
responsible for storing large data sets of structured or unstructured data
across various nodes, thereby maintaining the metadata in the form of
log files.
 HDFS consists of two core components, i.e.
1. Name Node
2. Data Node
 The Name Node is the prime node which contains the metadata (data about
data), requiring comparatively fewer resources than the Data Nodes, which
store the actual data. These Data Nodes are commodity hardware in the
distributed environment, which undoubtedly makes Hadoop cost effective.
 HDFS maintains all the coordination between the clusters and hardware,
thus working at the heart of the system.
YARN:

 Yet Another Resource Negotiator, as the name implies: YARN is the one
which helps to manage the resources across the clusters. In short, it
performs scheduling and resource allocation for the Hadoop system.
 It consists of three major components, i.e.
1. Resource Manager
2. Node Manager
3. Application Manager
 The Resource Manager has the privilege of allocating resources for the
applications in the system, whereas Node Managers work on the allocation of
resources such as CPU, memory, and bandwidth per machine and later
acknowledge the Resource Manager. The Application Manager works as an
interface between the Resource Manager and Node Managers and performs
negotiations as per the requirements of the two.
MapReduce:

 By making use of distributed and parallel algorithms, MapReduce
makes it possible to carry over the processing logic and helps to write
applications which transform big data sets into manageable ones.
 MapReduce makes use of two functions, i.e. Map() and Reduce(),
whose tasks are:
1. Map() performs sorting and filtering of data and thereby
organizes it in the form of groups. Map generates key-value
pair based results which are later processed by the
Reduce() method.
2. Reduce(), as the name suggests, does the summarization by
aggregating the mapped data. In short, Reduce() takes the
output generated by Map() as input and combines those tuples
into a smaller set of tuples.
PIG:
Pig was developed by Yahoo. It works on Pig Latin, a query-based language
similar to SQL.
 It is a platform for structuring the data flow, and for processing and analyzing
huge data sets.
 Pig does the work of executing commands, and in the background all the
activities of MapReduce are taken care of. After the processing, Pig stores
the result in HDFS.
 The Pig Latin language is specially designed for this framework, which runs on
the Pig Runtime, just the way Java runs on the JVM.
 Pig helps to achieve ease of programming and optimization and hence is a
major segment of the Hadoop Ecosystem.
HIVE:

 With the help of an SQL-like methodology and interface, HIVE performs reading
and writing of large data sets. Its query language is called
HQL (Hive Query Language).
 It is highly scalable, as it allows both real-time processing and batch processing.
Also, all the SQL data types are supported by Hive, making
query processing easier.
 Similar to other query-processing frameworks, HIVE comes with two
components: JDBC Drivers and the HIVE Command Line.
 JDBC, along with ODBC drivers, works on establishing data storage
permissions and connections, whereas the HIVE Command Line helps in the
processing of queries.
Mahout:

 Mahout allows machine learnability for a system or application. Machine
learning, as the name suggests, helps the system to develop itself based on
patterns, user/environment interaction, or algorithms.
 It provides various libraries and functionalities such as collaborative
filtering, clustering, and classification, which are nothing but concepts of
machine learning. It allows invoking algorithms as per our need with the
help of its own libraries.
Apache Spark:

 It is a platform that handles all the process-consumptive tasks like batch
processing, interactive or iterative real-time processing, graph
conversions, visualization, etc.
 It consumes in-memory resources, thus being faster than the prior
(disk-based) processing in terms of optimization.
 Spark is best suited for real-time data, whereas Hadoop is best suited for
structured data or batch processing; hence both are used in most
companies interchangeably.
Apache HBase:

 It is a NoSQL database which supports all kinds of data and is thus capable
of handling anything within a Hadoop database. It provides the capabilities of
Google's BigTable, and is thus able to work on big data sets effectively.
 At times when we need to search for or retrieve the occurrences of something
small in a huge database, the request must be processed within a short,
quick span of time. At such times, HBase comes in handy, as it gives us a
tolerant way of storing and retrieving such data.

8. MOVING DATA IN AND OUT OF HADOOP?


Moving data in and out of Hadoop involves transferring data between the Hadoop
Distributed File System (HDFS) and external data sources. Here are some common
methods for moving data in and out of Hadoop:

1. Hadoop Command-Line Interface (CLI): Hadoop provides a command-line
interface that allows you to interact with HDFS. You can use commands like
`hadoop fs -put` to upload data from the local file system to HDFS, and `hadoop fs -
get` to download data from HDFS to the local file system.

2. Hadoop Distributed Copy (DistCp): DistCp is a Hadoop utility specifically
designed for efficient copying of large amounts of data between Hadoop clusters or
between HDFS and other file systems. It preserves data locality and can handle data
transfers in parallel, making it suitable for large-scale data movement.

3. Hadoop File APIs: Hadoop provides programming APIs (such as Java or Python)
to interact with HDFS programmatically. You can use these APIs to read data from
external sources and write it into HDFS or vice versa. For example, you can use the
HDFS API to write data from an application directly to HDFS or read data from
HDFS and process it in your application (a short Java sketch is shown at the end of
this answer).

4. Hadoop Ecosystem Tools: Various tools in the Hadoop ecosystem, such as
Apache Sqoop, Apache Flume, and Apache Kafka, are specifically designed to
facilitate data movement in and out of Hadoop.

- Apache Sqoop: Sqoop is a tool used for efficiently transferring structured data
between Hadoop and relational databases. It supports importing data from databases
into HDFS or exporting data from HDFS back to databases.

- Apache Flume: Flume is a distributed, reliable, and scalable system for
collecting, aggregating, and moving large amounts of streaming data into Hadoop. It
can collect data from various sources and write it to HDFS.

- Apache Kafka: Kafka is a distributed streaming platform that can be used to
publish and subscribe to streams of data. It can act as a data ingestion layer,
collecting data from various sources and making it available for consumption by
Hadoop.
These methods provide different options for moving data into and out of Hadoop,
allowing you to integrate Hadoop with external systems, import data from various
sources into HDFS, export data from HDFS to external systems, or transfer data
between Hadoop clusters. The choice of method depends on your specific use case,
data sources, and requirements.
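As a concrete illustration of the Hadoop File API method (option 3 above), here is a
minimal Java sketch that copies a file from the local file system into HDFS and back
out again. The file paths and the fs.defaultFS address are placeholders chosen for
illustration; on a real cluster the address normally comes from core-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; usually picked up from core-site.xml
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        FileSystem fs = FileSystem.get(conf);

        // Move data INTO Hadoop: local file -> HDFS (similar to `hadoop fs -put`)
        fs.copyFromLocalFile(new Path("/tmp/sales.csv"),
                             new Path("/data/raw/sales.csv"));

        // Move data OUT of Hadoop: HDFS -> local file (similar to `hadoop fs -get`)
        fs.copyToLocalFile(new Path("/data/raw/sales.csv"),
                           new Path("/tmp/sales_copy.csv"));

        fs.close();
    }
}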

9. UNDERSTANDING INPUT AND OUTPUT OF MAPREDUCE?


In the MapReduce programming model, the input and output refer to the data flow
within the MapReduce process. Let's understand the input and output components in
MapReduce:

Input in MapReduce:
1. Input Data: The input data for a MapReduce job is typically stored in the Hadoop
Distributed File System (HDFS) or any other compatible file system accessible by
Hadoop. The input data can be a collection of files, directories, or other data sources.
Each input data source is divided into input splits, which are processed by individual
map tasks.

2. Input Format: The input format specifies how the input data is read and split into
input records for processing by the map tasks. Hadoop provides various input
formats for different types of input data, such as TextInputFormat for plain text files,
SequenceFileInputFormat for binary files, or custom input formats tailored for
specific data formats (e.g., CSV, JSON).

3. RecordReader: The RecordReader is responsible for reading and parsing the input
data splits into key-value pairs, known as input records. It defines how to interpret
the data within each input split and generates key-value pairs to be processed by the
map tasks. The RecordReader provides the input records to the map function.

Output in MapReduce:
1. Mapper Output: The map tasks process the input records and generate
intermediate key-value pairs as their output. Each map task can produce multiple
key-value pairs. The output of the map tasks is collected and sorted by the
MapReduce framework based on the keys.

2. Partitioner: The Partitioner determines which reducer task will receive each
intermediate key-value pair. It partitions the map output based on the keys, ensuring
that all key-value pairs with the same key go to the same reducer. The partitioner
uses the hash value of the key to make the partitioning decision.

3. Reducer Input: The intermediate key-value pairs generated by the mappers are
sent to the reducer tasks as their input. Each reducer receives a subset of the
intermediate data based on the partitioning performed by the partitioner. Reducers
process the data and generate the final output.
4. Output Format: The output format defines how the final output of the MapReduce
job is written. Hadoop provides various output formats such as TextOutputFormat
for writing text files, SequenceFileOutputFormat for binary files, or custom output
formats based on specific requirements. The output format determines the
organization and structure of the final output.

It's important to note that the MapReduce framework abstracts away the
complexities of data distribution, parallel processing, and fault tolerance, handling
the distribution of input data to mappers, the collection and sorting of mapper
outputs, and the distribution of intermediate data to reducers automatically.
Developers need to focus on implementing the map and reduce functions, while the
framework takes care of the input and output aspects.
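To make the input and output pieces above concrete, the sketch below shows how a
job's input format, output format, and paths are typically wired up, using the newer
org.apache.hadoop.mapreduce API. The class name and the use of command-line
arguments for the paths are illustrative assumptions; no mapper or reducer is set, so
Hadoop's identity Mapper and Reducer are used, since the point here is only the
input/output wiring.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class InputOutputWiring {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "io-wiring-demo");
        job.setJarByClass(InputOutputWiring.class);

        // Input side: TextInputFormat splits files into lines; its RecordReader
        // hands each map task <byte offset, line> pairs.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Output side: TextOutputFormat writes each final <key, value> pair
        // as one tab-separated line in the output directory.
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // With the identity Mapper/Reducer, keys are byte offsets and values are lines.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}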


10. HADOOP ARCHITECTURE?

As we all know, Hadoop is a framework written in Java that utilizes a large cluster of
commodity hardware to maintain and store big data. Hadoop works on the
MapReduce programming algorithm that was introduced by Google. Today lots of
big brand companies are using Hadoop in their organizations to deal with big data,
e.g. Facebook, Yahoo, Netflix, eBay, etc. The Hadoop architecture mainly consists
of 4 components:

 MapReduce
 HDFS(Hadoop Distributed File System)
 YARN(Yet Another Resource Negotiator)
 Common Utilities or Hadoop Common
1. MapReduce

MapReduce is nothing but an algorithm or data structure that is based on
the YARN framework. The major feature of MapReduce is that it performs
distributed processing in parallel in a Hadoop cluster, which makes Hadoop work
so fast. When you are dealing with Big Data, serial processing is no longer of any use.
MapReduce has mainly 2 tasks which are divided phase-wise:
in the first phase Map is utilized, and in the next phase Reduce is utilized.
Here, the input is provided to the Map() function, then its output is
used as an input to the Reduce() function, and after that we receive our final output.
Let's understand what Map() and Reduce() do.
As we can see, an input is provided to Map(); since we are dealing with Big Data,
the input is a set of data blocks. The Map() function breaks these data blocks
into tuples, which are nothing but key-value pairs. These key-value pairs are then sent
as input to Reduce(). The Reduce() function combines these tuples or
key-value pairs based on their key value, forms a set of tuples, and performs some
operation like sorting, a summation-type job, etc., which is then sent to the final output
node. Finally, the output is obtained.
The data processing in the Reducer always depends upon the business
requirement of that industry. This is how first Map() and then Reduce() are utilized,
one after the other.
Let’s understand the Map Task and Reduce Task in detail.
Map Task:

 RecordReader: The purpose of the RecordReader is to break the records. It is
responsible for providing key-value pairs to the Map() function. The key is
actually its locational information and the value is the data associated with
it.
 Map: A map is nothing but a user-defined function whose work is to
process the tuples obtained from the RecordReader. The Map() function either
does not generate any key-value pair or generates multiple pairs of these
tuples.
 Combiner: The Combiner is used for grouping the data in the Map workflow.
It is similar to a local reducer. The intermediate key-values that are
generated in the Map are combined with the help of this combiner. Using a
combiner is not necessary, as it is optional.
 Partitioner: The Partitioner is responsible for fetching the key-value pairs
generated in the Mapper phase. The partitioner generates the shards
corresponding to each reducer. The hash code of each key is also fetched by
this partitioner. The partitioner then takes that hash code modulo the
number of reducers (key.hashCode() % numberOfReducers); a sketch of such a
partitioner is shown right after this list.
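A minimal sketch of such a partitioner is shown below, written against the newer
org.apache.hadoop.mapreduce API and assuming Text keys and IntWritable values;
Hadoop's built-in HashPartitioner works essentially the same way.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends each key to a reducer based on (hash code % number of reducers),
// masking the sign bit so the result is never negative.
public class HashModPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

Such a class would be registered on a job with job.setPartitionerClass(HashModPartitioner.class).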
Reduce Task:

 Shuffle and Sort: The task of the Reducer starts with this step. The process in
which the Mapper generates the intermediate key-values and transfers them
to the Reducer task is known as Shuffling. Using the shuffling process, the
system can sort the data by its key value.
Once some of the mapping tasks are done, shuffling begins; that is why it
is a faster process and does not wait for the completion of the tasks
performed by the Mapper.
 Reduce: The main function or task of Reduce is to gather the tuples
generated from Map and then perform some sorting and aggregation sort
of process on those key-values depending on the key element.
 OutputFormat: Once all the operations are performed, the key-value
pairs are written into a file with the help of the record writer, each record on
a new line, with the key and value in a space-separated manner.
2. HDFS

HDFS (Hadoop Distributed File System) is utilized for storage. It is
mainly designed for working on commodity hardware devices (inexpensive devices),
working on a distributed file system design. HDFS is designed in such a way that it
believes more in storing the data in large chunks of blocks rather than storing small
data blocks.
HDFS in Hadoop provides fault tolerance and high availability to the storage layer
and the other devices present in that Hadoop cluster. Data storage nodes in HDFS:

 NameNode(Master)
 DataNode(Slave)
NameNode: The NameNode works as the Master in a Hadoop cluster and guides the
DataNodes (Slaves). The NameNode is mainly used for storing the metadata, i.e. the
data about the data. The metadata can be the transaction logs that keep track of the
user's activity in the Hadoop cluster.
The metadata can also be the name of the file, its size, and the information about the
location (block number, block IDs) of the DataNodes, which the NameNode stores to
find the closest DataNode for faster communication. The NameNode instructs the
DataNodes with operations like delete, create, replicate, etc.
DataNode: DataNodes work as Slaves. DataNodes are mainly utilized for storing
the data in a Hadoop cluster; the number of DataNodes can be from 1 to 500 or even
more than that. The more DataNodes there are, the more data the Hadoop cluster
will be able to store. It is therefore advised that DataNodes should have high storage
capacity to store a large number of file blocks.
High Level Architecture Of Hadoop

File Block in HDFS: Data in HDFS is always stored in terms of blocks. So a
single file is divided into multiple blocks of size 128MB, which is the default,
and you can also change it manually.
Let's understand this concept of breaking a file into blocks with an example.
Suppose you upload a file of 400MB to HDFS: the file gets divided into blocks of
128MB + 128MB + 128MB + 16MB = 400MB, meaning 4 blocks are created, each of
128MB except the last one. Hadoop doesn't know or care about what data is stored
in these blocks, so it considers the final file block a partial record, as it does not have
any idea regarding it. In the Linux file system, the size of a file block is about 4KB,
which is very much less than the default size of file blocks in the Hadoop file system.
As we all know, Hadoop is mainly configured for storing large data sets, at petabyte
scale; this is what makes the Hadoop file system different from other file systems, as
it can be scaled. Nowadays, file blocks of 128MB to 256MB are used in Hadoop.
Replication in HDFS: Replication ensures the availability of the data. Replication is
making a copy of something, and the number of times you make a copy of that
particular thing is expressed as its Replication Factor. As we have seen in File
Blocks, HDFS stores the data in the form of various blocks, and at the same time
Hadoop is also configured to make copies of those file blocks.
By default, the Replication Factor for Hadoop is set to 3, which is configurable,
meaning you can change it manually as per your requirement. In the above example
we made 4 file blocks, which means that 3 replicas or copies of each file block are
made, i.e. a total of 4 x 3 = 12 blocks for the backup purpose.
This is because for running Hadoop we are using commodity hardware (inexpensive
system hardware) which can crash at any time. We are not using a supercomputer
for our Hadoop setup. That is why we need such a feature in HDFS which can make
copies of those file blocks for backup purposes; this is known as fault tolerance.
One more thing to notice is that by making so many replicas of our file blocks we
are using a lot of extra storage, but for big brand organizations the data is much
more important than the storage, so nobody minds this extra storage. You can
configure the Replication Factor in your hdfs-site.xml file.
Rack Awareness: A rack is nothing but the physical collection of nodes in our
Hadoop cluster (maybe 30 to 40). A large Hadoop cluster consists of many racks.
With the help of this rack information, the NameNode chooses the closest DataNode
to achieve maximum performance while performing read/write operations, which
reduces network traffic.
HDFS Architecture

3. YARN (Yet Another Resource Negotiator)

YARN is the framework on which MapReduce works. YARN performs 2 operations,
namely job scheduling and resource management. The purpose of the job scheduler
is to divide a big task into small jobs so that each job can be assigned to various
slaves in a Hadoop cluster and processing can be maximized. The job scheduler also
keeps track of which job is important, which job has more priority, dependencies
between the jobs, and all other information like job timing, etc. And the use of the
Resource Manager is to manage all the resources that are made available for
running a Hadoop cluster.
Features of YARN

 Multi-Tenancy
 Scalability
 Cluster-Utilization
 Compatibility

4. Hadoop Common or Common Utilities

Hadoop Common or Common Utilities are nothing but our Java library and Java files,
or we can say the Java scripts, that we need for all the other components present in a
Hadoop cluster. These utilities are used by HDFS, YARN, and MapReduce for
running the cluster. Hadoop Common assumes that hardware failure in a Hadoop
cluster is common, so it needs to be handled automatically in software by the Hadoop
framework.

12. MAPREDUCE FRAMEWORK?

What is MapReduce?

MapReduce is a programming model or pattern within the Hadoop framework that is
used to access big data stored in the Hadoop File System (HDFS). It is a core
component, integral to the functioning of the Hadoop framework. MapReduce
facilitates concurrent processing by splitting petabytes of data into smaller chunks,
and processing them in parallel on Hadoop commodity servers. In the end, it
aggregates all the data from multiple servers to return a consolidated output back to
the application.

For example, a Hadoop cluster with 20,000 inexpensive commodity
servers and a 256MB block of data on each can process around 5TB of
data at the same time. This reduces the processing time as compared
to sequential processing of such a large data set.

With MapReduce, rather than sending data to where the application or
logic resides, the logic is executed on the server where the data already
resides, to expedite processing. Data access and storage is disk-based:
the input is usually stored as files containing structured, semi-structured,
or unstructured data, and the output is also stored in files.

MapReduce was once the only method through which the data stored
in the HDFS could be retrieved, but that is no longer the case. Today,
there are other query-based systems such as Hive and Pig that are
used to retrieve data from the HDFS using SQL-like statements.
How MapReduce Works
At the crux of MapReduce are two functions: Map and Reduce. They are sequenced
one after the other.

The Map function takes input from the disk as <key,value> pairs, processes them,
and produces another set of intermediate <key,value> pairs as output.

The Reduce function also takes inputs as <key,value> pairs, and produces
<key,value> pairs as output

The types of keys and values differ based on the use case. All inputs and outputs are stored
in the HDFS. While the map is a mandatory step to filter and sort the initial data, the reduce
function is optional.

<k1, v1> -> Map() -> list(<k2, v2>)

<k2, list(v2)> -> Reduce() -> list(<k3, v3>)

Mappers and Reducers are the Hadoop servers that run the Map and Reduce functions
respectively. It doesn’t matter if these are the same or different servers.

Map

The input data is first split into smaller blocks. Each block is then assigned to a
mapper for processing.
For example, if a file has 100 records to be processed, 100 mappers can run together
to process one record each. Or maybe 50 mappers can run together to process two
records each. The Hadoop framework decides how many mappers to use, based on
the size of the data to be processed and the memory block available on each mapper
server.

Reduce

After all the mappers complete processing, the framework shuffles and sorts the
results before passing them on to the reducers. A reducer cannot start while a
mapper is still in progress. All the map output values that have the same key are
assigned to a single reducer, which then aggregates the values for that key.

Combine and Partition

There are two intermediate steps between Map and Reduce.

Combine is an optional process. The combiner is a reducer that runs individually on
each mapper server. It reduces the data on each mapper further to a simplified form
before passing it downstream.

This makes shuffling and sorting easier as there is less data to work with. Often, the
combiner class is set to the reducer class itself, due to the cumulative and associative
functions in the reduce function. However, if needed, the combiner can be a separate
class as well.

Partition is the process that translates the <key, value> pairs resulting from mappers
to another set of <key, value> pairs to feed into the reducer. It decides how the data
has to be presented to the reducer and also assigns it to a particular reducer.

The default partitioner determines the hash value for the key, resulting from the
mapper, and assigns a partition based on this hash value. There are as many
partitions as there are reducers. So, once the partitioning is complete, the data from
each partition is sent to a specific reducer.

A MapReduce Example

Consider an ecommerce system that receives a million requests every day to process
payments. There may be several exceptions thrown during these requests such as
"payment declined by a payment gateway," "out of inventory," and "invalid
address." A developer wants to analyze last four days' logs to understand which
exception is thrown how many times.
Example Use Case

The objective is to isolate use cases that are most prone to errors, and to take
appropriate action. For example, if the same payment gateway is frequently throwing
an exception, is it because of an unreliable service or a badly written interface? If the
"out of inventory" exception is thrown often, does it mean the inventory calculation
service has to be improved, or do the inventory stocks need to be increased for
certain products?

The developer can ask relevant questions and determine the right course of action.
To perform this analysis on logs that are bulky, with millions of records,
MapReduce is an apt programming model. Multiple mappers can process these logs
simultaneously: one mapper could process a day's log or a subset of it based on the
log size and the memory block available for processing in the mapper server.

Map

For simplification, let's assume that the Hadoop framework runs just four mappers.
Mapper 1, Mapper 2, Mapper 3, and Mapper 4.

The value input to the mapper is one record of the log file. The key could be a text
string such as "file name + line number." The mapper, then, processes each record of
the log file to produce key value pairs. Here, we will just use a filler for the value as
'1.' The output from the mappers look like this:

Mapper 1 -> <Exception A, 1>, <Exception B, 1>, <Exception A, 1>, <Exception C, 1>, <Exception A, 1>

Mapper 2 -> <Exception B, 1>, <Exception B, 1>, <Exception A, 1>, <Exception A, 1>

Mapper 3 -> <Exception A, 1>, <Exception C, 1>, <Exception A, 1>, <Exception B, 1>, <Exception A, 1>

Mapper 4 -> <Exception B, 1>, <Exception C, 1>, <Exception C, 1>, <Exception A, 1>
Assuming that there is a combiner running on each mapper—Combiner 1 …
Combiner 4—that calculates the count of each exception (which is the same function
as the reducer), the input to Combiner 1 will be:

<Exception A, 1>, <Exception B, 1>, <Exception A, 1>, <Exception C, 1>, <Exception A, 1>

Combine

The output of Combiner 1 will be:

<Exception A, 3>, <Exception B, 1>, <Exception C, 1>

The output from the other combiners will be:

Combiner 2: <Exception A, 2> <Exception B, 2>

Combiner 3: <Exception A, 3> <Exception B, 1> <Exception C, 1>

Combiner 4: <Exception A, 1> <Exception B, 1> <Exception C, 2>

Partition

After this, the partitioner allocates the data from the combiners to the reducers. The
data is also sorted for the reducer.

The input to the reducers will be as below:

Reducer 1: <Exception A> {3,2,3,1}

Reducer 2: <Exception B> {1,2,1,1}

Reducer 3: <Exception C> {1,1,2}

If there were no combiners involved, the input to the reducers will be as below:

Reducer 1: <Exception A> {1,1,1,1,1,1,1,1,1}

Reducer 2: <Exception B> {1,1,1,1,1}

Reducer 3: <Exception C> {1,1,1,1}

Here, the example is a simple one, but when there are terabytes of data involved, the
combiner process’ improvement to the bandwidth is significant.
Reduce

Now, each reducer just calculates the total count of the exceptions as:

Reducer 1: <Exception A, 9>

Reducer 2: <Exception B, 5>

Reducer 3: <Exception C, 4>

The data shows that Exception A is thrown more often than others and requires more
attention. When there are more than a few weeks' or months' of data to be processed
together, the potential of the MapReduce program can be truly exploited.

Java code

public static void main(String[] args) throws Exception {

    // Configure the job; ExceptionCount is the enclosing job class
    JobConf conf = new JobConf(ExceptionCount.class);
    conf.setJobName("exceptioncount");

    // Final output types: exception name (Text) -> count (IntWritable)
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // Mapper, reducer, and combiner (the combiner reuses the reducer logic)
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    conf.setCombinerClass(Reduce.class);

    // Plain text in, plain text out
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // Input and output paths come from the command line
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Submit the job and wait for it to finish
    JobClient.runJob(conf);
}
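The driver above references Map and Reduce classes that are not shown in the
original snippet. Below is a minimal sketch of what they could look like for the
exception-count example, using the same older org.apache.hadoop.mapred API as
the driver. The parsing in map() is an assumption: it treats each input line as
carrying just the exception label (e.g. "Exception A"); a real job would extract the
label from a full log record.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class ExceptionCount {

    // Mapper: emits <exception label, 1> for every non-empty log line.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // Assumption: each line already contains only the exception label.
            String label = value.toString().trim();
            if (!label.isEmpty()) {
                output.collect(new Text(label), ONE);
            }
        }
    }

    // Reducer (also used as the combiner): sums the counts for each exception label.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    // The main() driver shown above would sit inside this class as well.
}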

13. HBASE?
What is HBase
HBase is an open-source, sorted map datastore built on top of Hadoop. It is
column-oriented and horizontally scalable.

It is based on Google's Bigtable. It has a set of tables which keep data in
key-value format. HBase is well suited for sparse data sets, which are very
common in big data use cases. HBase provides APIs enabling
development in practically any programming language. It is a part of the
Hadoop ecosystem that provides random real-time read/write access to
data in the Hadoop File System.

Why HBase
o An RDBMS becomes exponentially slower as the data grows large
o It expects data to be highly structured, i.e. able to fit in a well-defined schema
o Any change in schema might require downtime
o For sparse datasets, there is too much overhead in maintaining NULL values

Features of HBase
o Horizontally scalable: capacity grows by adding more nodes to the cluster, and new
columns can be added to a column family at any time.
o Automatic failover: if a server fails, data handling is automatically switched to a
standby server without manual intervention by the administrator.
o Integration with the MapReduce framework: HBase tables can act as the source and
sink of MapReduce jobs, and HBase is built over the Hadoop Distributed File System.
o It is a sparse, distributed, persistent, multidimensional sorted map, which is indexed
by row key, column key, and timestamp.
o It is often referred to as a key-value store, a column-family-oriented database, or a
store of versioned maps of maps.
o Fundamentally, it is a platform for storing and retrieving data with random access.
o It doesn't care about datatypes (you can store an integer in one row and a string in
another for the same column).
o It doesn't enforce relationships within your data.

What is HBase?
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an
open-source project and is horizontally scalable.
HBase's data model is similar to Google's Bigtable and is designed to provide quick random
access to huge amounts of structured data. It leverages the fault tolerance provided by the
Hadoop Distributed File System (HDFS) and, as part of the Hadoop ecosystem, provides random
real-time read/write access to that data.
One can store data in HDFS either directly or through HBase. A data consumer reads/accesses
the data in HDFS randomly using HBase, which sits on top of the Hadoop File System and
provides read and write access.

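As a minimal sketch of this random read/write access, the snippet below uses the standard HBase Java client API to write and then read one cell. The table name ("users"), column family ("info"), row key, and value are assumptions for illustration, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "user1", column family "info", qualifier "city"
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Pune"));
            table.put(put);

            // Random read by row key
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
            System.out.println("city = " + Bytes.toString(value));
        }
    }
}

Note how the row key is the primary access path: reads and writes address an individual row directly, which is what gives HBase its quick random access on top of HDFS.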
14. HIVE IN DETAIL?


Apache Hive is a distributed, fault-tolerant data warehouse system that
enables analytics at a massive scale. A data warehouse provides a
central store of information that can easily be analyzed to make
informed, data driven decisions. Hive allows users to read, write, and
manage petabytes of data using SQL.
Hive is built on top of Apache Hadoop, which is an open-source
framework used to efficiently store and process large datasets. As a
result, Hive is closely integrated with Hadoop, and is designed to work
quickly on petabytes of data. What makes Hive unique is the ability to
query large datasets, leveraging Apache Tez or MapReduce, with a
SQL-like interface.

How does Hive work?


Hive was created to allow non-programmers familiar with SQL to work
with petabytes of data, using a SQL-like interface called HiveQL.
Traditional relational databases are designed for interactive queries on
small to medium datasets and do not process huge datasets well. Hive
instead uses batch processing so that it works quickly across a very
large distributed database. Hive transforms HiveQL queries into
MapReduce or Tez jobs that run on Apache Hadoop’s distributed job
scheduling framework, Yet Another Resource Negotiator (YARN). It
queries data stored in a distributed storage solution, like the Hadoop
Distributed File System (HDFS) or Amazon S3. Hive stores its database
and table metadata in a metastore, which is a database or file backed
store that enables easy data abstraction and discovery.
Hive includes HCatalog, which is a table and storage management layer
that reads data from the Hive metastore to facilitate seamless integration
between Hive, Apache Pig, and MapReduce. By using the metastore,
HCatalog allows Pig and MapReduce to use the same data structures as
Hive, so that the metadata doesn’t have to be redefined for each engine.
Custom applications or third party integrations can use WebHCat, which
is a RESTful API for HCatalog to access and reuse Hive metadata.
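As a rough illustration of this SQL-like interface, a HiveQL query can be submitted from Java through the HiveServer2 JDBC driver; Hive then compiles it into MapReduce or Tez jobs on YARN as described above. The host, port, credentials, and the app_logs table used below are placeholders invented for this sketch, and the hive-jdbc driver is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host, port, database, and credentials are placeholders
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL looks like SQL; Hive turns this into a batch MapReduce/Tez job
            ResultSet rs = stmt.executeQuery(
                    "SELECT exception_type, COUNT(*) AS cnt " +
                    "FROM app_logs GROUP BY exception_type ORDER BY cnt DESC");
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}

Because the query is executed as a batch job on the cluster, results take longer to return than in an interactive relational database, but the same statement scales to petabytes of data stored in HDFS or Amazon S3.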

Benefits of Hive
FAST
Hive is designed to quickly handle petabytes of data using batch processing.

FAMILIAR
Hive provides a familiar, SQL-like interface that is accessible to non-programmers.

SCALABLE
Hive is easy to distribute and scale based on your needs.

15. PIG IN DETAIL?

Introduction to Apache Pig


Pig represents Big Data as data flows. Pig is a high-level platform or tool used to
process large datasets. It provides a high level of abstraction over MapReduce and a
high-level scripting language, known as Pig Latin, which is used to develop data
analysis code. To process the data stored in HDFS, programmers write scripts using
the Pig Latin language. Internally, the Pig Engine (a component of Apache Pig)
converts these scripts into MapReduce tasks, but this is hidden from the programmers
in order to provide a high level of abstraction. Pig Latin and the Pig Engine are the
two main components of the Apache Pig tool. The result of Pig is always stored in HDFS.
Note: the Pig Engine has two types of execution environment, i.e. a local execution
environment in a single JVM (used when the dataset is small) and a distributed
execution environment on a Hadoop cluster (a small sketch at the end of this section
illustrates both modes).
Need of Pig: One limitation of MapReduce is that the development cycle is very
long. Writing the mapper and reducer, compiling and packaging the code, submitting
the job, and retrieving the output is a time-consuming task. Apache Pig reduces the
development time by using a multi-query approach. Pig is also beneficial for
programmers who do not come from a Java background: 200 lines of Java code can be
written in only 10 lines of Pig Latin. Programmers who already know SQL need less
effort to learn Pig Latin.
 It uses a multi-query approach, which reduces the length of the code.
 Pig Latin is an SQL-like language.
 It provides many built-in operators.
 It provides nested data types (tuples, bags, maps).
Evolution of Pig: Apache Pig was developed by Yahoo's researchers in 2006. At that
time, the main idea behind Pig was to execute MapReduce jobs on extremely large
datasets. In 2007, it moved to the Apache Software Foundation (ASF), which made it
an open-source project. The first version (0.1) of Pig came out in 2008, and the most
recent major release, 0.17, came out in 2017.
Features of Apache Pig:
 Apache Pig provides a rich set of operators for performing operations such as
filtering, joining, sorting, and aggregation.
 It is easy to learn, read, and write; for SQL programmers especially, Apache Pig
is a boon.
 Apache Pig is extensible, so you can write your own user-defined
functions (UDFs) in Python, Java, or other programming languages.
 Join operation is easy in Apache Pig.
 Fewer lines of code.
 Apache Pig allows splits in the pipeline.
 By integrating with other components of the Apache Hadoop ecosystem,
such as Apache Hive, Apache Spark, and Apache ZooKeeper, Apache Pig
enables users to take advantage of these components’ capabilities while
transforming data.
 The data structure is multivalued, nested, and richer.
 Pig can handle the analysis of both structured and unstructured data.
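Tying several of these points together, the short sketch below runs Pig Latin from Java through the PigServer API, using the local execution environment mentioned in the note above (switching to ExecType.MAPREDUCE runs the same script on a Hadoop cluster). The file paths, field layout, and filter condition are invented for this illustration.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigLatinExample {
    public static void main(String[] args) throws Exception {
        // Local execution environment (single JVM); ExecType.MAPREDUCE runs on a Hadoop cluster
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Pig Latin script: load a tab-separated log file, keep only ERROR lines, store the result;
        // behind the scenes the Pig Engine turns these statements into map and reduce tasks
        pig.registerQuery("logs = LOAD 'app.log' USING PigStorage('\\t') AS (level:chararray, msg:chararray);");
        pig.registerQuery("errors = FILTER logs BY level == 'ERROR';");
        pig.store("errors", "error_output");
    }
}

The three Pig Latin statements replace what would otherwise be a full mapper/reducer implementation plus a driver in Java, which is the development-time saving the "Need of Pig" paragraph describes.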

16. SUPERVISED AND UNSUPERVISED LEARNING?


Supervised learning: Supervised learning, as the name indicates, involves the presence of
a supervisor acting as a teacher. Basically, supervised learning is when we teach or train the
machine using data that is well labelled, which means some data is already tagged
with the correct answer. After that, the machine is provided with a new set of
examples (data) so that the supervised learning algorithm analyses the training data (the
set of training examples) and produces a correct outcome from the labeled data.
For instance, suppose you are given a basket filled with different kinds of fruits. Now
the first step is to train the machine with all the different fruits one by one like this:

 If the shape of the object is rounded and has a depression at the top, is red
in color, then it will be labeled as –Apple.
 If the shape of the object is a long curving cylinder having Green-Yellow
color, then it will be labeled as –Banana.
Now suppose that, after training, the machine is given a new fruit from the basket, say
a banana, and is asked to identify it.
Since the machine has already learned from the previous data, it now has to use that
knowledge wisely: it will first classify the fruit by its shape and color, confirm the
fruit's name as BANANA, and put it in the Banana category. Thus the machine learns
from the training data (the basket of fruits) and then applies that knowledge to the
test data (the new fruit).
Supervised learning is classified into two categories of algorithms:
 Classification: A classification problem is when the output variable is a
category, such as “Red” or “blue” , “disease” or “no disease”.
 Regression: A regression problem is when the output variable is a real
value, such as “dollars” or “weight”.
Supervised learning deals with or learns with “labeled” data. This implies that some
data is already tagged with the correct answer.
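To make the fruit example concrete, here is a toy supervised classifier written as a nearest-neighbour lookup over labelled examples. The numeric encoding of shape and colour is invented purely for illustration; this is a sketch of the idea, not a production classifier.

import java.util.List;

public class FruitClassifier {

    // A labelled training example: supervised learning starts from data tagged with the correct answer
    static class Example {
        final double roundness; // 1.0 = perfectly round, 0.0 = long and cylindrical
        final double redness;   // 1.0 = red, 0.0 = green-yellow
        final String label;
        Example(double roundness, double redness, String label) {
            this.roundness = roundness;
            this.redness = redness;
            this.label = label;
        }
    }

    // 1-nearest-neighbour: predict the label of the closest labelled example
    static String classify(List<Example> training, double roundness, double redness) {
        Example best = null;
        double bestDistance = Double.MAX_VALUE;
        for (Example e : training) {
            double d = Math.pow(e.roundness - roundness, 2) + Math.pow(e.redness - redness, 2);
            if (d < bestDistance) {
                bestDistance = d;
                best = e;
            }
        }
        return best.label;
    }

    public static void main(String[] args) {
        List<Example> training = List.of(
                new Example(0.9, 0.9, "Apple"),    // round and red
                new Example(0.1, 0.1, "Banana")); // long and green-yellow
        // A new, unseen fruit from the basket
        System.out.println(classify(training, 0.2, 0.15)); // prints Banana
    }
}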
Types:-
Regression
Logistic Regression
Classification
Naive Bayes Classifiers
K-NN (k nearest neighbors)
Decision Trees
Support Vector Machine
Advantages:-
 Supervised learning allows collecting data and produces data output from
previous experiences.
 Helps to optimize performance criteria with the help of experience.
 Supervised machine learning helps to solve various types of real-world
computation problems.
 It performs classification and regression tasks.
 It allows estimating or mapping the result to a new sample.
 We have complete control over choosing the number of classes we want in
the training data.
Disadvantages:-
 Classifying big data can be challenging.
 Training a supervised learning model needs a lot of computation time.
 Supervised learning cannot handle all complex tasks in Machine Learning.
 It requires a labelled data set.
 It requires a training process.

Unsupervised learning
Unsupervised learning is the training of a machine using information that is neither
classified nor labeled and allowing the algorithm to act on that information without
guidance. Here the task of the machine is to group unsorted information according to
similarities, patterns, and differences without any prior training of data.
Unlike supervised learning, no teacher is provided, which means no training is given
to the machine. Therefore, the machine must find the hidden structure in unlabeled
data by itself.
For instance, suppose it is given an image containing both dogs and cats, which it has
never seen before.

The machine has no idea about the features of dogs and cats, so it cannot label them
as 'dogs' and 'cats'. But it can group them according to their similarities, patterns,
and differences; that is, it can easily divide the picture into two parts: the first part
may contain all the pictures that have dogs in them, and the second part may contain
all the pictures that have cats in them. The machine has not learned anything
beforehand, which means there is no training data and there are no examples.
It allows the model to work on its own to discover patterns and information that was
previously undetected. It mainly deals with unlabelled data.
Unsupervised learning is classified into two categories of algorithms:
 Clustering: A clustering problem is where you want to discover the
inherent groupings in the data, such as grouping customers by purchasing
behavior.
 Association: An association rule learning problem is where you want to
discover rules that describe large portions of your data, such as people that
buy X also tend to buy Y.
Types of Unsupervised Learning:-
Clustering
1. Exclusive (partitioning)
2. Agglomerative
3. Overlapping
4. Probabilistic
Clustering Types:-
1. Hierarchical clustering
2. K-means clustering
3. Principal Component Analysis
4. Singular Value Decomposition
5. Independent Component Analysis
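Since K-means clustering appears in the list above, the sketch below runs a bare-bones K-means on a handful of made-up, unlabelled 2-D points. The data, the choice of k = 2, the naive initialisation, and the fixed number of iterations are all assumptions kept simple for illustration.

import java.util.Arrays;

public class SimpleKMeans {

    public static void main(String[] args) {
        // Unlabelled 2-D points; no "correct answers" are provided to the algorithm
        double[][] points = {
                {1.0, 1.1}, {0.9, 1.0}, {1.2, 0.8},   // one natural group
                {8.0, 8.2}, {8.1, 7.9}, {7.8, 8.3} }; // another natural group
        int k = 2;
        double[][] centroids = { points[0].clone(), points[3].clone() }; // naive initialisation
        int[] assignment = new int[points.length];

        for (int iter = 0; iter < 10; iter++) {
            // Assignment step: attach each point to its nearest centroid
            for (int i = 0; i < points.length; i++) {
                assignment[i] = nearest(points[i], centroids);
            }
            // Update step: move each centroid to the mean of the points assigned to it
            for (int c = 0; c < k; c++) {
                double sumX = 0, sumY = 0;
                int count = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assignment[i] == c) { sumX += points[i][0]; sumY += points[i][1]; count++; }
                }
                if (count > 0) { centroids[c][0] = sumX / count; centroids[c][1] = sumY / count; }
            }
        }
        System.out.println("Cluster assignments: " + Arrays.toString(assignment));
    }

    // Index of the centroid closest to point p (squared Euclidean distance)
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDistance = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = Math.pow(p[0] - centroids[c][0], 2) + Math.pow(p[1] - centroids[c][1], 2);
            if (d < bestDistance) { bestDistance = d; best = c; }
        }
        return best;
    }
}

The algorithm never sees a label; it only groups points by similarity, which is exactly the behaviour described for unsupervised learning above.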
Supervised vs. Unsupervised Machine Learning:
 Input data: supervised algorithms are trained using labeled data; unsupervised
algorithms are used against data that is not labeled.
 Computational complexity: supervised learning is the simpler method; unsupervised
learning is computationally complex.
 Accuracy: supervised learning is highly accurate; unsupervised learning is less accurate.
 Number of classes: known in supervised learning; not known in unsupervised learning.
 Data analysis: supervised learning uses offline analysis; unsupervised learning uses
real-time analysis of data.
 Algorithms used: supervised learning uses linear and logistic regression, random
forest, Support Vector Machines, neural networks, etc.; unsupervised learning uses
K-means clustering, hierarchical clustering, the Apriori algorithm, etc.
 Output: the desired output is given in supervised learning; it is not given in
unsupervised learning.
 Training data: supervised learning uses training data to infer a model; unsupervised
learning uses no training data.
 Complex models: supervised learning cannot learn models as large and complex as
unsupervised learning can.
 Model testing: a supervised model can be tested; an unsupervised model cannot.
 Also called: supervised learning is also called classification; unsupervised learning
is also called clustering.
 Example: optical character recognition (supervised); finding a face in an image
(unsupervised).

17. SOCIAL MEDIA ANALYTICS?


Overview of social media analytics
Practitioners and analysts alike know social media by its many websites and channels:
Facebook, YouTube, Instagram, Twitter, LinkedIn, Reddit and many others.

Social media analytics is the ability to gather and find meaning in data gathered from
social channels to support business decisions — and measure the performance of
actions based on those decisions through social media.

Social media analytics is broader than metrics such as likes, follows, retweets, previews,
clicks, and impressions gathered from individual channels. It also differs from reporting
offered by services that support marketing campaigns such as LinkedIn or Google
Analytics.

Social media analytics uses specifically designed software platforms that work similarly
to web search tools. Data about keywords or topics is retrieved through search queries
or web ‘crawlers’ that span channels. Fragments of text are returned, loaded into a
database, categorized and analyzed to derive meaningful insights.

Social media analytics includes the concept of social listening. Listening is monitoring
social channels for problems and opportunities. Social media analytics tools typically
incorporate listening into more comprehensive reporting that involves listening and
performance analysis.
Why is social media analytics important?

IBM points out that with the prevalence of social media: “News of a great product can
spread like wildfire. And news about a bad product — or a bad experience with a
customer service rep — can spread just as quickly. Consumers are now holding
organizations to account for their brand promises and sharing their experiences with
friends, co-workers and the public at large.”

Social media analytics helps companies address these experiences and use them to:

 Spot trends related to offerings and brands
 Understand conversations: what is being said and how it is being received
 Derive customer sentiment towards products and services
 Gauge response to social media and other communications
 Identify high-value features for a product or service
 Uncover what competitors are saying and its effectiveness
 Map how third-party partners and channels may affect performance

These insights can be used not only to make tactical adjustments, like addressing an
angry tweet, but also to help drive strategic decisions. In fact, IBM finds social media
analytics is now "being brought into the core discussions about how businesses develop
their strategies."

These strategies affect a range of business activity:

 Product development - Analyzing an aggregate of Facebook posts, tweets and
Amazon product reviews can deliver a clearer picture of customer pain points,
shifting needs and desired features. Trends can be identified and tracked to
shape the management of existing product lines as well as guide new product
development.
 Customer experience - An IBM study discovered “organizations are evolving
from product-led to experience-led businesses.” Behavioral analysis can be
applied across social channels to capitalize on micro-moments to delight
customers and increase loyalty and lifetime value.
 Branding - Social media may be the world's largest focus group. Natural language
processing and sentiment analysis can continually monitor positive or negative
expectations to maintain brand health, refine positioning and develop new brand
attributes.
 Competitive Analysis - Understanding what competitors are doing and how
customers are responding is always critical. For example, a competitor may
indicate that they are foregoing a niche market, creating an opportunity. Or a
spike in positive mentions for a new product can alert organizations to market
disruptors.
 Operational efficiency – Deep analysis of social media can help organizations
improve how they gauge demand. Retailers and others can use that information
to manage inventory and suppliers, reduce costs and optimize resources.
Key capabilities of effective social media analytics
The first step for effective social media analytics is developing a goal. Goals can range
from increasing revenue to pinpointing service issues. From there, topics or keywords
can be selected and parameters such as date range can be set. Sources also need to be
specified — responses to YouTube videos, Facebook conversations, Twitter arguments,
Amazon product reviews, comments from news sites. It is important to select sources
pertinent to a given product, service or brand.

Typically, a data set will be established to support the goals, topics, parameters and
sources. Data is retrieved, analyzed and reported through visualizations that make it
easier to understand and manipulate.

These steps are typical of a general social media analytics approach that can be made
more effective by capabilities found in social media analytics platforms.

 Natural language processing and machine learning technologies identify
entities and relationships in unstructured data (information not pre-formatted
to work with data analytics). Virtually all social media content is unstructured.
These technologies are critical to deriving meaningful insights.
 Segmentation is a fundamental need in social media analytics. It categorizes
social media participants by geography, age, gender, marital status, parental
status and other demographics. It can help identify influencers in those
categories. Messages, initiatives and responses can be better tuned and targeted
by understanding who is interacting on key topics.
 Behavior analysis is used to understand the concerns of social media
participants by assigning behavioral types such as user, recommender,
prospective user and detractor. Understanding these roles helps develop
targeted messages and responses to meet, change or deflect their perceptions.
 Sentiment analysis measures the tone and intent of social media comments. It
typically involves natural language processing technologies to help understand
entities and relationships to reveal positive, negative, neutral or ambivalent
attributes.
 Share of voice analyzes prevalence and intensity in conversations regarding
brand, products, services, reputation and more. It helps determine key issues
and important topics. It also helps classify discussions as positive, negative,
neutral or ambivalent.
 Clustering analysis can uncover hidden conversations and unexpected insights.
It makes associations between keywords or phrases that appear together
frequently and derives new topics, issues and opportunities. The people that
make baking soda, for example, discovered new uses and opportunities using
clustering analysis.
 Dashboards and visualization charts, graphs, tables and other presentation
tools summarize and share social media analytics findings — a critical capability
for communicating and acting on what has been learned. They also enable users
to grasp meaning and insights more quickly and look deeper into specific
findings without advanced technical skills.
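As a toy illustration of the sentiment-analysis idea only (real platforms rely on natural language processing models rather than hand-made word lists), the sketch below scores short posts against a tiny lexicon; every word and weight in it is invented for this example.

import java.util.Map;

public class ToySentimentScorer {

    // A tiny hand-made lexicon: positive words score +1, negative words score -1
    private static final Map<String, Integer> LEXICON = Map.of(
            "great", 1, "love", 1, "fast", 1,
            "bad", -1, "broken", -1, "slow", -1);

    // Sum the scores of the words in a post: >0 positive, <0 negative, 0 neutral/ambivalent
    static int score(String post) {
        int total = 0;
        for (String word : post.toLowerCase().split("\\W+")) {
            total += LEXICON.getOrDefault(word, 0);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(score("Love the new app, support was great"));     // positive
        System.out.println(score("Checkout is broken and the site is slow")); // negative
    }
}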

18. MOBILE ANALYTICS?
Mobile analytics captures data from mobile app, website, and web app
visitors to identify unique users, track their journeys, record their
behavior, and report on the app’s performance. Similar to traditional web
analytics, mobile analytics are used to improve conversions, and are the
key to crafting world-class mobile experiences.

Why do companies use mobile analytics?
Mobile analytics gives companies unparalleled insights into the otherwise hidden
lives of app users. Analytics usually comes in the form of software that integrates
into companies’ existing websites and apps to capture, store, and analyze the
data. This data is vitally important to marketing, sales, and product management
teams who use it to make more informed decisions. Without a mobile analytics
solution, companies are left flying blind. They’re unable to tell what users engage
with, who those users are, what brings them to the site or app, and why they leave.
Companies in this situation must rely on intuition or domain expertise and often
underperform compared to their peers according to Gartner.

Why are mobile analytics important?


Mobile usage surpassed that of desktop in 2015 and smartphones are fast becoming
consumers’ preferred portal to the internet. Consumers spend 70 percent of their
media consumption and screen time on mobile devices, and most of that time in
mobile apps. This is a tremendous opportunity for companies to reach their
consumers, but it’s also a highly saturated market. There are more than 6.5 million
apps in the major mobile app stores, millions of web apps, and more than a billion
websites in existence. Companies use mobile analytics platforms to gain a
competitive edge in building mobile experiences that stand out. Mobile analytics
tools also give teams a much-needed edge in advertising. Mobile advertising now
accounts for nearly 70 percent of all digital advertising according to eMarketer—
some $135 billion—and growing. As more businesses compete for customers on
mobile, teams need to understand how their ads perform in detail, and whether app
users who interact with ads end up purchasing.

70% of 'screen time' is spent on mobile – ComScore
92% of time on mobile is spent in apps – TechCrunch

How are mobile analytics different from web analytics?
In the past, companies treated mobile and non-mobile devices as separate, and
even used a separate vendor for their web analytics, but this is becoming a
rarity. Most modern product analytics platforms track users on both mobile and
desktop devices. That said, the physical differences in screen sizes and aspect
ratios lead to slightly different mobile and non-mobile user experiences. On mobile,
users have less screen real estate (4 to 7 inches) and interact by touching, swiping,
and holding. As a result, mobile app and site pages are more spartan with fewer
navigation options. Fonts are larger, and users take relatively fewer actions. On a
desktop, users have larger screens (10 to 17 inches) and interact by clicking, double-
clicking, and using key commands. Desktop tracking generally involves more
interactions, more content, larger menus, and more links per page. A good mobile
analytics platform will account for these device disparities and provide one single,
centralized dashboard that recognizes unique individuals and their behaviors across
devices.

How do mobile analytics work?


Mobile analytics track unique users to record their demographics and behaviors. The
tracking technology varies between websites, which use either JavaScript or cookies,
and apps, which require a software development kit (SDK). Each time a website or
app visitor takes an action, the application fires off data which is recorded in the
mobile analytics platform (a minimal sketch of such an event call follows the list
below). Mobile analytics typically track:

 Page views
 Visits
 Visitors
 Source data
 Strings of actions
 Location
 Device information
 Login / logout
 Custom event data
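As a minimal sketch of the event-firing mechanism just described, the snippet below sends one hypothetical analytics event as JSON to a made-up collector endpoint. The URL, field names, and values are assumptions for illustration; a real SDK would also batch events, retry on failure, and attach a stable device or user identifier.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AnalyticsEventSender {

    // Hypothetical collector endpoint of a mobile analytics platform
    private static final String COLLECTOR_URL = "https://analytics.example.com/collect";

    // Fire one event each time the visitor takes an action
    static void track(String userId, String eventName, String screen) throws Exception {
        String payload = String.format(
                "{\"userId\":\"%s\",\"event\":\"%s\",\"screen\":\"%s\",\"ts\":%d}",
                userId, eventName, screen, System.currentTimeMillis());

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(COLLECTOR_URL))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();

        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.discarding());
    }

    public static void main(String[] args) throws Exception {
        track("user-123", "page_view", "checkout");
    }
}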

Companies use this data to figure out what users want in order to deliver a more
satisfying user experience. For example, they’re able to see:

 What draws visitors to the mobile site or app
 How long visitors typically stay
 What features visitors interact with
 Where visitors encounter problems
 What factors are correlated with outcomes like purchases
 What factors lead to higher usage and long-term retention

With mobile analytics data, product and marketing teams can create positive
feedback loops. As they update their site or app, launch campaigns, and release
new features, they can A/B test the impact of these changes upon their audience.
Based on how audiences respond, teams can make further changes which yield
even more data and lead to more testing. This creates a virtuous cycle which
polishes the product. Mobile apps and sites that undergo this process are far more
effective at serving their users' needs.

How different teams use mobile analytics:

 Marketing: Tracks campaign ROI, segments users, automates marketing
 UX/UI: Tracks behaviors, tests features, measures user experience
 Product: Tracks usage, A/B tests features, debugs, sets alerts
 Technical teams: Track performance metrics such as app crashes
