BIG DATA
Volume: Volume refers to the vast amount of data that is generated and collected. Big data
involves working with data sets that are so large and complex that they cannot be easily
managed or processed using traditional database systems. The volume of data can range
from terabytes to petabytes or even larger.
Velocity: Velocity represents the speed at which data is generated, collected, and processed.
Big data often involves real-time or near-real-time data streams that flow rapidly and
continuously. Examples of high-velocity data sources include social media feeds, sensor data
from Internet of Things (IoT) devices, financial transactions, and website clickstreams.
Variety: Variety refers to the diverse types and formats of data that are included in big data.
This includes structured data (e.g., data stored in traditional databases), unstructured data
(e.g., text documents, emails, social media posts), semi-structured data (e.g., XML or JSON
files), multimedia data (e.g., images, videos), and more. Big data systems need to be able to
handle and analyze this wide range of data formats.
Veracity: Veracity relates to the reliability and trustworthiness of the data. Big data often
includes data from various sources, and the quality and accuracy of the data can vary
significantly. Veracity refers to the challenges of dealing with uncertain, incomplete, or
inconsistent data. It involves ensuring data integrity, addressing data quality issues, and
making accurate interpretations despite the presence of noise or errors in the data.
Note: It's worth mentioning that some discussions on big data include additional Vs such as
value and variability, which emphasize the importance of extracting value from data and
handling data changes over time. However, the original concept of the "4 Vs" focuses on
volume, velocity, variety, and veracity.
The Big Data Analytics lifecycle is divided into nine phases:
1. Business Case/Problem Definition
2. Data Identification
3. Data Acquisition and filtration
4. Data Extraction
5. Data Munging(Validation and Cleaning)
6. Data Aggregation & Representation(Storage)
7. Exploratory Data Analysis
8. Data Visualization(Preparation for Modeling and Assessment)
9. Utilization of analysis results.
Phase I: Business Problem Definition
In this stage, the team learns about the business domain, which provides the
motivation and goals for carrying out the analysis. The problem is identified,
and assumptions are made about how much potential gain the company stands to
realize from the analysis. Important activities in this step include framing
the business problem as an analytics challenge that can be addressed in
subsequent phases. This helps the decision-makers understand the business
resources that will need to be utilized, thereby determining the underlying
budget required to carry out the project.
Moreover, it can be determined whether the identified problem is a Big Data
problem or not, based on the business requirements in the business case. To
qualify as a big data problem, the business case should be directly related to
one (or more) of the characteristics of volume, velocity, or variety.
Introduction
You’ve likely heard the terms “Big Data” and “Cloud Computing” before. If you’re
involved with cloud application development, you may even have experience with
them. The two go hand-in-hand, with many public cloud services performing big data
analytics.
With Software as a Service (SaaS) becoming increasingly popular, keeping up-to-
date with cloud infrastructure best practices and the types of data that can be stored
in large quantities is crucial. We’ll take a look at the differences between cloud
computing and big data, the relationship between them, and why the two are a
perfect match, bringing us lots of new, innovative technologies, such as artificial
intelligence.
Before discussing how the two go together, it’s important to form a clear distinction
between “Big Data” and “Cloud Computing”. Although they are technically different
terms, they’re often seen together in literature because they interact synergistically
with one another.
Big Data: This simply refers to the very large sets of data that are
output by a variety of programs. It can refer to any of a large variety
of types of data, and the data sets are usually far too large to peruse
or query on a regular computer.
Cloud Computing: This refers to the processing of anything,
including Big Data Analytics, on the “cloud”. The “cloud” is just a set
of high-powered servers from one of many providers. They can often
view and query large data sets much more quickly than a standard
computer could.
Essentially, “Big Data” refers to the large sets of data collected, while “Cloud
Computing” refers to the mechanism that remotely takes in and processes that data.
As you can see, there are infinite possibilities when we combine Big Data and Cloud
Computing! If we simply had Big Data alone, we would have huge data sets that
have a huge amount of potential value just sitting there. Using our computers to
analyze them would be either impossible or impractical due to the amount of time it
would take.
However, Cloud Computing allows us to use state-of-the-art infrastructure and only
pay for the time and power that we use! Cloud application development is also fueled
by Big Data. Without Big Data, there would be far fewer cloud-based applications,
since there wouldn’t be any real necessity for them. Remember, Big Data is often
collected by cloud-based applications, as well!
In short, Cloud Computing services largely exist because of Big Data. Likewise, the
only reason that we collect Big Data is because we have services that are capable of
taking it in and deciphering it, often in a matter of seconds. The two are a perfect
match, since neither would exist without the other!
Conclusion
Finally, it’s important to note that both Big Data and Cloud Computing play a huge
role in our digital society. The two linked together allow people with great ideas but
limited resources a chance at business success. They also allow established
businesses to utilize data that they collect but previously had no way of analyzing.
6. MOBILE BUSINESS INTELLIGENCE IN BIG DATA?
Mobile Business Intelligence (Mobile BI) refers to the use of mobile devices, such as
smartphones and tablets, to access and analyze business intelligence data. When combined
with big data, Mobile BI allows users to access and derive insights from large volumes of
data while on the go.
1. Data Accessibility: Big data platforms and technologies enable the storage and processing
of vast amounts of data. Mobile BI leverages this capability by providing users with remote
access to big data sources through mobile applications. Users can retrieve data from data
warehouses, data lakes, or real-time streaming sources and view it on their mobile devices.
2. Data Visualization: Mobile BI tools provide interactive and visually appealing data
visualizations optimized for smaller screens. These visualizations, such as charts, graphs, and
dashboards, help users understand and analyze complex big data sets. Mobile BI applications
often offer touch-friendly interfaces and intuitive navigation for a seamless user experience.
3. Real-Time Analytics: Big data platforms enable the processing of high-velocity data
streams, including real-time data. Mobile BI applications can tap into these real-time data
sources and provide users with up-to-date insights and analytics on their mobile devices.
This empowers decision-makers to make informed choices based on the latest data,
regardless of their location.
4. Collaboration and Sharing: Mobile BI solutions enable users to collaborate and share
insights with their teams, even when they are not physically present. Team members can
access shared reports, collaborate on data analysis, and provide feedback using their mobile
devices. This fosters data-driven decision-making and improves productivity in a mobile
work environment.
5. Location-Based Analytics: Mobile devices have built-in GPS capabilities, allowing Mobile BI
applications to leverage location data. When combined with big data, location-based
analytics can provide valuable insights into customer behavior, market trends, and
operational efficiency. For example, analyzing location data from mobile devices can help
retailers optimize store layouts, target specific customer segments, and personalize
marketing campaigns.
Overall, Mobile BI in the context of big data enables users to access, analyze, and share
insights from large and complex data sets using their mobile devices. It enhances data-
driven decision-making, facilitates collaboration, and provides real-time access to critical
information, empowering users to make informed choices regardless of their location.
Note: Apart from the above-mentioned components, there are many other
components too that are part of the Hadoop ecosystem.
All these toolkits or components revolve around one term, i.e. data. That's the
beauty of Hadoop: it revolves around data, which makes its synthesis easier.
YARN:
Yet Another Resource Negotiator, as the name implies, YARN helps to manage the
resources across the cluster. In short, it performs scheduling and resource
allocation for the Hadoop system.
It consists of three major components, i.e.
1. Resource Manager
2. Node Manager
3. Application Manager
The Resource Manager has the privilege of allocating resources for the
applications in the system, whereas Node Managers work on the allocation of
resources such as CPU, memory, and bandwidth per machine and later acknowledge
the Resource Manager. The Application Manager works as an interface between the
Resource Manager and Node Managers and performs negotiations as per the
requirements of the two.
Hive:
With the help of an SQL methodology and interface, Hive performs reading and
writing of large data sets. Its query language is called HQL (Hive Query
Language).
It is highly scalable, as it allows both real-time and batch processing. All
the SQL data types are supported by Hive, making query processing easier.
Similar to other query-processing frameworks, Hive comes with two components:
JDBC Drivers and the Hive Command Line.
JDBC, along with ODBC drivers, works on establishing the data-storage
permissions and connection, whereas the Hive Command Line helps in the
processing of queries.
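Since Hive exposes this SQL-like interface, queries can be written directly in HQL. The sketch below is a minimal, hypothetical example (the sales table, its columns, and the HDFS path are made up for illustration):

-- Define a table over comma-separated records.
CREATE TABLE sales (item STRING, region STRING, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Load a file that already sits in HDFS into the table (path is illustrative).
LOAD DATA INPATH '/data/sales.csv' INTO TABLE sales;

-- A familiar SQL-style aggregation, executed by Hive over the cluster.
SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region;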
Apache Spark:
It's a platform that handles all the process-intensive tasks like batch
processing, interactive or iterative real-time processing, graph conversions,
and visualization.
It consumes in-memory resources, and is hence faster than MapReduce in terms of
optimization.
Spark is best suited for real-time data, whereas Hadoop (MapReduce) is best
suited for structured data or batch processing; hence both are used in most
companies interchangeably.
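To make that contrast concrete, here is a small, hedged sketch using Spark's Java API to count matching lines from a file in HDFS (the path and the search string are made up; it assumes the standard JavaSparkContext API):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkLineCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("LineCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Load a text file (hypothetical HDFS path) as a distributed collection of lines.
            JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");
            // The pipeline is evaluated in memory; count the lines that match.
            long errors = lines.filter(line -> line.contains("ERROR")).count();
            System.out.println("Lines containing ERROR: " + errors);
        }
    }
}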
Apache HBase:
It's a NoSQL database which supports all kinds of data and is thus capable of
handling anything within a Hadoop database. It provides the capabilities of
Google's Bigtable, and is thus able to work on Big Data sets effectively.
At times when we need to search for or retrieve the occurrences of something
small in a huge database, the request must be processed within a short span of
time. At such times, HBase comes in handy, as it gives us a tolerant way of
storing limited data.
3. Hadoop File APIs: Hadoop provides programming APIs (such as Java or Python)
to interact with HDFS programmatically. You can use these APIs to read data from
external sources and write it into HDFS or vice versa. For example, you can use the
HDFS API to write data from an application directly to HDFS or read data from
HDFS and process it in your application.
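As a minimal sketch of that idea using Hadoop's Java FileSystem API (the path and file contents below are made up for illustration; the cluster settings are assumed to come from the usual configuration files on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // handle to HDFS (or the local FS in tests)
        Path out = new Path("/user/demo/hello.txt");   // hypothetical target path
        try (FSDataOutputStream stream = fs.create(out, true)) {  // true = overwrite if present
            stream.writeBytes("Hello HDFS\n");         // write application data straight into HDFS
        }
        fs.close();
    }
}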
- Apache Sqoop: Sqoop is a tool used for efficiently transferring structured data
between Hadoop and relational databases. It supports importing data from databases
into HDFS or exporting data from HDFS back to databases.
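A typical Sqoop import is driven from the command line; the sketch below is illustrative (the host, database, table, and target directory are made up), using Sqoop's standard import options:

sqoop import \
  --connect jdbc:mysql://dbhost:3306/salesdb \
  --username analyst -P \
  --table orders \
  --target-dir /user/demo/orders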
Input in MapReduce:
1. Input Data: The input data for a MapReduce job is typically stored in the Hadoop
Distributed File System (HDFS) or any other compatible file system accessible by
Hadoop. The input data can be a collection of files, directories, or other data sources.
Each input data source is divided into input splits, which are processed by individual
map tasks.
2. Input Format: The input format specifies how the input data is read and split into
input records for processing by the map tasks. Hadoop provides various input
formats for different types of input data, such as TextInputFormat for plain text files,
SequenceFileInputFormat for binary files, or custom input formats tailored for
specific data formats (e.g., CSV, JSON).
3. RecordReader: The RecordReader is responsible for reading and parsing the input
data splits into key-value pairs, known as input records. It defines how to interpret
the data within each input split and generates key-value pairs to be processed by the
map tasks. The RecordReader provides the input records to the map function.
Output in MapReduce:
1. Mapper Output: The map tasks process the input records and generate
intermediate key-value pairs as their output. Each map task can produce multiple
key-value pairs. The output of the map tasks is collected and sorted by the
MapReduce framework based on the keys.
2. Partitioner: The Partitioner determines which reducer task will receive each
intermediate key-value pair. It partitions the map output based on the keys, ensuring
that all key-value pairs with the same key go to the same reducer. The partitioner
uses the hash value of the key to make the partitioning decision (a small sketch of
this hashing rule appears after this list).
3. Reducer Input: The intermediate key-value pairs generated by the mappers are
sent to the reducer tasks as their input. Each reducer receives a subset of the
intermediate data based on the partitioning performed by the partitioner. Reducers
process the data and generate the final output.
4. Output Format: The output format defines how the final output of the MapReduce
job is written. Hadoop provides various output formats such as TextOutputFormat
for writing text files, SequenceFileOutputFormat for binary files, or custom output
formats based on specific requirements. The output format determines the
organization and structure of the final output.
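As a sketch of the hash-based partitioning rule mentioned in point 2 above (this mirrors the well-known default behavior, but the class below is illustrative rather than Hadoop's actual source):

public class HashPartitionSketch {
    // Pick one of numReduceTasks reducer indexes for a given key.
    static int getPartition(String key, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative, then take the remainder.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // With 3 reducers, every occurrence of the same key lands on the same reducer index.
        System.out.println(getPartition("Exception A", 3));
        System.out.println(getPartition("Exception A", 3)); // same index as the line above
    }
}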
It's important to note that the MapReduce framework abstracts away the
complexities of data distribution, parallel processing, and fault tolerance, handling
the distribution of input data to mappers, the collection and sorting of mapper
outputs, and the distribution of intermediate data to reducers automatically.
Developers need to focus on implementing the map and reduce functions, while the
framework takes care of the input and output aspects.
10. HADOOP ARCHITECTURE?
As we all know, Hadoop is a framework written in Java that utilizes a large cluster of
commodity hardware to maintain and store big-sized data. Hadoop works on the
MapReduce programming algorithm that was introduced by Google. Today lots of
big-brand companies are using Hadoop in their organizations to deal with big data,
e.g. Facebook, Yahoo, Netflix, eBay, etc. The Hadoop architecture mainly consists
of 4 components:
MapReduce
HDFS(Hadoop Distributed File System)
YARN(Yet Another Resource Negotiator)
Common Utilities or Hadoop Common
1. MapReduce
MapReduce is nothing but an algorithm or a data structure that is based on the
YARN framework. The major feature of MapReduce is that it performs the
distributed processing in parallel in a Hadoop cluster, which is what makes
Hadoop work so fast. When you are dealing with Big Data, serial processing is no
longer of any use.
MapReduce has mainly 2 tasks which are divided phase-wise:
In the first phase, Map is utilized, and in the next phase, Reduce is utilized.
Here, the input is provided to the Map() function, then its output is used as
the input to the Reduce() function, and after that we receive our final output.
Let's understand what Map() and Reduce() do.
An input is provided to Map(); since we are working with Big Data, the input is
a set of data blocks. The Map() function breaks these DataBlocks into tuples,
which are nothing but key-value pairs. These key-value pairs are then sent as
input to Reduce(). The Reduce() function combines these tuples (key-value pairs)
based on their key value, forms a set of tuples, and performs some operation
like sorting, a summation-type job, etc., which is then sent to the final output
node. Finally, the output is obtained.
The data processing is always done in the Reducer, depending upon the business
requirement of that industry. This is how first Map() and then Reduce() are
utilized one by one.
Let's understand the Map Task and Reduce Task in detail.
Reduce Task:
Shuffle and Sort: The task of the Reducer starts with this step. The process in
which the Mapper generates the intermediate key-value pairs and transfers them
to the Reducer task is known as Shuffling. Using the shuffling process, the
system can sort the data using its key value.
Shuffling begins once some of the map tasks are done, which is why it is a
faster process: it does not wait for the completion of every task performed by
the Mapper.
Reduce: The main function or task of Reduce is to gather the tuples generated
from Map and then perform some sorting and aggregation sort of process on those
key-value pairs depending on their key element.
OutputFormat: Once all the operations are performed, the key-value pairs are
written into a file with the help of a record writer, each record on a new
line, and the key and value in a space-separated manner.
2. HDFS
HDFS (Hadoop Distributed File System) has two main components:
NameNode (Master)
DataNode (Slave)
NameNode: The NameNode works as a Master in a Hadoop cluster that guides the
DataNodes (Slaves). The NameNode is mainly used for storing the metadata, i.e.
the data about the data. The metadata can be the transaction logs that keep
track of the user's activity in a Hadoop cluster.
The metadata can also be the name of the file, its size, and the information
about the location (block number, block IDs) of the DataNodes, which the
NameNode stores to find the closest DataNode for faster communication. The
NameNode instructs the DataNodes with operations like delete, create,
replicate, etc.
DataNode: DataNodes work as Slaves. DataNodes are mainly utilized for storing
the data in a Hadoop cluster; the number of DataNodes can be from 1 to 500 or
even more than that. The more DataNodes there are, the more data the Hadoop
cluster will be able to store. So it is advised that DataNodes should have a
high storage capacity to store a large number of file blocks.
High Level Architecture Of Hadoop
File Block In HDFS: Data in HDFS is always stored in terms of blocks. A single
file is divided into multiple blocks of size 128 MB, which is the default and
can also be changed manually.
Let's understand this concept of breaking a file into blocks with an example.
Suppose you have uploaded a file of 400 MB to your HDFS; this file gets divided
into blocks of 128 MB + 128 MB + 128 MB + 16 MB = 400 MB, meaning 4 blocks are
created, each of 128 MB except the last one. Hadoop doesn't know, and doesn't
care, what data is stored in these blocks, so it considers the final file block
as a partial record, as it has no idea regarding it. In the Linux file system,
the size of a file block is about 4 KB, which is very much less than the default
size of file blocks in the Hadoop file system. As we all know, Hadoop is mainly
configured for storing large-sized data, on the order of petabytes; this is what
makes the Hadoop file system different from other file systems, as it can be
scaled. Nowadays file blocks of 128 MB to 256 MB are used in Hadoop.
Replication In HDFS: Replication ensures the availability of the data.
Replication means making a copy of something, and the number of times you make
a copy of that particular thing can be expressed as its Replication Factor. As
we have seen in File Blocks, HDFS stores the data in the form of various blocks;
at the same time, Hadoop is also configured to make copies of those file blocks.
By default, the Replication Factor for Hadoop is set to 3, which is
configurable, meaning you can change it manually as per your requirement. In the
above example we made 4 file blocks, which means that 3 copies of each file
block are kept, i.e. a total of 4 × 3 = 12 blocks, for backup purposes.
This is because, for running Hadoop, we use commodity hardware (inexpensive
system hardware) which can crash at any time; we are not using a supercomputer
for our Hadoop setup. That is why we need a feature in HDFS that can make copies
of the file blocks for backup purposes; this is known as fault tolerance.
One thing we also need to note is that, after making so many replicas of our
file blocks, we use a lot of extra storage, but for big brand organizations the
data is far more important than the storage, so nobody worries about this extra
storage.
You can configure the Replication factor in your hdfs-site.xml file.
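For example, a minimal hdfs-site.xml entry for the standard dfs.replication property might look like this (the value shown is just an illustration):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value> <!-- number of copies kept for each file block -->
  </property>
</configuration>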
Rack Awareness: A rack is nothing but the physical collection of nodes in our
Hadoop cluster (maybe 30 to 40). A large Hadoop cluster consists of many racks.
With the help of this rack information, the NameNode chooses the closest
DataNode to achieve maximum performance while performing read/write operations,
which reduces the network traffic.
HDFS Architecture
3. YARN (Yet Another Resource Negotiator)
YARN is the framework on which MapReduce works; it performs job scheduling and
resource management across the cluster. Its main features are:
Multi-Tenancy
Scalability
Cluster-Utilization
Compatibility
4. Hadoop Common (Common Utilities)
Hadoop Common, or the common utilities, is nothing but our Java library and
Java files, or we can say the Java scripts, that we need for all the other
components present in a Hadoop cluster. These utilities are used by HDFS, YARN,
and MapReduce for running the cluster. Hadoop Common assumes that hardware
failure in a Hadoop cluster is common, so it needs to be handled automatically
in software by the Hadoop framework.
What is MapReduce?
MapReduce was once the only method through which the data stored
in the HDFS could be retrieved, but that is no longer the case. Today,
there are other query-based systems such as Hive and Pig that are
used to retrieve data from the HDFS using SQL-like statements.
How MapReduce Works
At the crux of MapReduce are two functions: Map and Reduce. They are sequenced
one after the other.
The Map function takes input from the disk as <key,value> pairs, processes them,
and produces another set of intermediate <key,value> pairs as output.
The Reduce function also takes inputs as <key,value> pairs, and produces
<key,value> pairs as output.
The types of keys and values differ based on the use case. All inputs and outputs are stored
in the HDFS. While the map is a mandatory step to filter and sort the initial data, the reduce
function is optional.
Mappers and Reducers are the Hadoop servers that run the Map and Reduce functions
respectively. It doesn’t matter if these are the same or different servers.
Map
The input data is first split into smaller blocks. Each block is then assigned to a
mapper for processing.
For example, if a file has 100 records to be processed, 100 mappers can run together
to process one record each. Or maybe 50 mappers can run together to process two
records each. The Hadoop framework decides how many mappers to use, based on
the size of the data to be processed and the memory block available on each mapper
server.
Reduce
After all the mappers complete processing, the framework shuffles and sorts the
results before passing them on to the reducers. A reducer cannot start while a
mapper is still in progress. All the map output values that have the same key
are assigned to a single reducer, which then aggregates the values for that key.
A combiner is an optional step that performs a local reduce on each mapper's
output before it is shuffled. This makes shuffling and sorting easier, as there
is less data to work with. Often, the combiner class is set to the reducer class
itself, due to the cumulative and associative nature of the reduce function.
However, if needed, the combiner can be a separate class as well.
Partition is the process that translates the <key, value> pairs resulting from mappers
to another set of <key, value> pairs to feed into the reducer. It decides how the data
has to be presented to the reducer and also assigns it to a particular reducer.
The default partitioner determines the hash value for the key, resulting from the
mapper, and assigns a partition based on this hash value. There are as many
partitions as there are reducers. So, once the partitioning is complete, the data from
each partition is sent to a specific reducer.
A MapReduce Example
Consider an ecommerce system that receives a million requests every day to process
payments. There may be several exceptions thrown during these requests such as
"payment declined by a payment gateway," "out of inventory," and "invalid
address." A developer wants to analyze last four days' logs to understand which
exception is thrown how many times.
Example Use Case
The objective is to isolate use cases that are most prone to errors, and to take
appropriate action. For example, if the same payment gateway is frequently throwing
an exception, is it because of an unreliable service or a badly written interface? If the
"out of inventory" exception is thrown often, does it mean the inventory calculation
service has to be improved, or does the inventory stocks need to be increased for
certain products?
The developer can ask relevant questions and determine the right course of action.
To perform this analysis on logs that are bulky, with millions of records,
MapReduce is an apt programming model. Multiple mappers can process these logs
simultaneously: one mapper could process a day's log or a subset of it based on the
log size and the memory block available for processing in the mapper server.
Map
For simplification, let's assume that the Hadoop framework runs just four
mappers: Mapper 1, Mapper 2, Mapper 3, and Mapper 4.
The value input to a mapper is one record of the log file. The key could be a
text string such as "file name + line number." The mapper then processes each
record of the log file to produce key-value pairs. Here, we will just use a
filler for the value, namely '1'. The output from the mappers looks like this:
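Illustratively, the mapper output is a stream of pairs of the form (Exception A, 1), (Exception B, 1), (Exception A, 1), and so on, one pair per log record that raised an exception.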
Combine
Partition
After this, the partitioner allocates the data from the combiners to the reducers. The
data is also sorted for the reducer.
If there were no combiners involved, the input to the reducers will be as below:
Here, the example is a simple one, but when there are terabytes of data involved, the
combiner process’ improvement to the bandwidth is significant.
Reduce
Now, each reducer just calculates the total count of the exceptions as:
The data shows that Exception A is thrown more often than others and requires more
attention. When there are more than a few weeks' or months' of data to be processed
together, the potential of the MapReduce program can be truly exploited.
Java code
// Driver configuration using the classic org.apache.hadoop.mapred API; the driver, mapper,
// and reducer class names (ExceptionCount, Map, Reduce) are assumed from the example above.
JobConf conf = new JobConf(ExceptionCount.class);
conf.setJobName("exception-count");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
conf.setCombinerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input logs in HDFS
FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output directory for the counts
JobClient.runJob(conf);
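For completeness, a minimal, hedged sketch of the Map and Reduce classes referenced by this driver, using the same classic API (the way the exception name is pulled out of a log line is a simplifying assumption; the usual imports from org.apache.hadoop.io, org.apache.hadoop.mapred, java.io.IOException, and java.util.Iterator are required):

public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Simplifying assumption: the exception name follows the last ':' on the line.
        String line = value.toString();
        String exception = line.substring(line.lastIndexOf(':') + 1).trim();
        output.collect(new Text(exception), ONE);   // emit (exception, 1)
    }
}

public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();             // add up the 1s for this exception type
        }
        output.collect(key, new IntWritable(sum));  // total count per exception type
    }
}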
13. HBASE?
What is HBase
HBase is an open-source, sorted-map data store built on Hadoop. It is
column-oriented and horizontally scalable.
It is based on Google's Bigtable. It has a set of tables which keep data in
key-value format. HBase is well suited for sparse data sets, which are very
common in big data use cases. HBase provides APIs enabling development in
practically any programming language. It is a part of the Hadoop ecosystem that
provides random real-time read/write access to data in the Hadoop File System.
Why HBase
o An RDBMS gets exponentially slower as the data becomes large
o It expects data to be highly structured, i.e. able to fit in a well-defined schema
o Any change in schema might require downtime
o For sparse datasets, there is too much overhead in maintaining NULL values
Features of HBase
o Horizontally scalable: You can add any number of columns anytime.
o Automatic failover: Automatic failover is a resource that allows a system
administrator to automatically switch data handling to a standby system in the
event of a system compromise.
o Integration with the Map/Reduce framework: All the commands and Java code
internally implement Map/Reduce to do the task, and it is built over the Hadoop
Distributed File System.
o It is a sparse, distributed, persistent, multidimensional sorted map, which is
indexed by row key, column key, and timestamp.
o It is often referred to as a key-value store or column-family-oriented
database, or as storing versioned maps of maps.
o Fundamentally, it's a platform for storing and retrieving data with random
access.
o It doesn't care about datatypes (you can store an integer in one row and a
string in another for the same column).
o It doesn't enforce relationships within your data.
What is HBase?
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an
open-source project and is horizontally scalable.
HBase is a data model similar to Google's Bigtable, designed to provide quick random
access to huge amounts of structured data. It leverages the fault tolerance provided by
the Hadoop File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to data
in the Hadoop File System.
One can store the data in HDFS either directly or through HBase. A data consumer
reads/accesses the data in HDFS randomly using HBase. HBase sits on top of the Hadoop
File System and provides read and write access.
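As a minimal, hedged sketch of that random read/write access through HBase's Java client API (the table name, column family, and values below are made up for illustration, and the connection settings are assumed to come from hbase-site.xml on the classpath):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {   // hypothetical table
            // Write one cell: row key "row1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}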
Benefits of Hive
FAST
Hive is designed to quickly handle petabytes of data using batch processing.
FAMILIAR
Hive provides a familiar, SQL-like interface that is accessible to non-programmers.
SCALABLE
Hive is easy to distribute and scale based on your needs.
15. PIG IN DETAIL?
If the shape of the object is rounded, has a depression at the top, and is red
in color, then it will be labeled as Apple.
If the shape of the object is a long curving cylinder having a green-yellow
color, then it will be labeled as Banana.
Now suppose that, after training on the data, you give the machine a new,
separate fruit (say a banana) from the basket and ask it to identify it.
Since the machine has already learned from the previous data, it now has to use
that knowledge wisely. It will first classify the fruit by its shape and color,
confirm the fruit name as BANANA, and put it in the Banana category. Thus the
machine learns from training data (the basket containing fruits) and then
applies that knowledge to test data (the new fruit).
Supervised learning is classified into two categories of algorithms:
Classification: A classification problem is when the output variable is a
category, such as "red" or "blue", or "disease" or "no disease".
Regression: A regression problem is when the output variable is a real
value, such as "dollars" or "weight".
Supervised learning deals with or learns with “labeled” data. This implies that some
data is already tagged with the correct answer.
Types:-
Regression
Logistic Regression
Classification
Naive Bayes Classifiers
K-NN (k nearest neighbors)
Decision Trees
Support Vector Machine
Advantages:-
Supervised learning allows collecting data and produces data output from
previous experiences.
Helps to optimize performance criteria with the help of experience.
Supervised machine learning helps to solve various types of real-world
computation problems.
It performs classification and regression tasks.
It allows estimating or mapping the result to a new sample.
We have complete control over choosing the number of classes we want in
the training data.
Disadvantages:-
Classifying big data can be challenging.
Training for supervised learning needs a lot of computation time. So, it
requires a lot of time.
Supervised learning cannot handle all complex tasks in Machine Learning.
Computation time is vast for supervised learning.
It requires a labelled data set.
It requires a training process.
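As a toy illustration of the classification idea above, here is a nearest-neighbor (k-NN with k = 1) sketch; the fruit features, their numeric encoding, and the example values are all made up for illustration:

import java.util.List;

public class FruitNearestNeighbor {
    // One labeled training example: numeric features (roundness, redness) plus a label.
    record Fruit(double roundness, double redness, String label) {}

    // Return the label of the closest labeled example (squared Euclidean distance).
    static String classify(double roundness, double redness, List<Fruit> training) {
        Fruit best = null;
        double bestDist = Double.MAX_VALUE;
        for (Fruit f : training) {
            double d = Math.pow(f.roundness() - roundness, 2) + Math.pow(f.redness() - redness, 2);
            if (d < bestDist) { bestDist = d; best = f; }
        }
        return best.label();
    }

    public static void main(String[] args) {
        // Labeled ("supervised") training data: apples are round and red, bananas are long and yellow.
        List<Fruit> basket = List.of(
                new Fruit(0.9, 0.9, "Apple"),
                new Fruit(0.2, 0.1, "Banana"));
        // A new fruit that is long and not red should be labeled Banana.
        System.out.println(classify(0.25, 0.15, basket));
    }
}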
Unsupervised learning
Unsupervised learning is the training of a machine using information that is neither
classified nor labeled and allowing the algorithm to act on that information without
guidance. Here the task of the machine is to group unsorted information according to
similarities, patterns, and differences without any prior training of data.
Unlike supervised learning, no teacher is provided, which means no training will
be given to the machine. Therefore the machine is left to find the hidden
structure in unlabeled data by itself.
For instance, suppose it is given an image containing both dogs and cats, which
it has never seen before.
Thus the machine has no idea about the features of dogs and cats, so we can't
categorize them as 'dogs and cats'. But it can categorize them according to
their similarities, patterns, and differences; i.e., we can easily categorize
the picture into two parts. The first part may contain all pictures having dogs
in them, and the second part may contain all pictures having cats in them. Here
the machine hasn't learned anything beforehand, which means there is no training
data or examples.
It allows the model to work on its own to discover patterns and information that was
previously undetected. It mainly deals with unlabelled data.
Unsupervised learning is classified into two categories of algorithms:
Clustering: A clustering problem is where you want to discover the
inherent groupings in the data, such as grouping customers by purchasing
behavior.
Association: An association rule learning problem is where you want to
discover rules that describe large portions of your data, such as people that
buy X also tend to buy Y.
Types of Unsupervised Learning:-
Clustering
1. Exclusive (partitioning)
2. Agglomerative
3. Overlapping
4. Probabilistic
Clustering Types:-
1. Hierarchical clustering
2. K-means clustering
3. Principal Component Analysis
4. Singular Value Decomposition
5. Independent Component Analysis
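As a rough, illustrative sketch of the k-means idea listed above (one-dimensional data, k = 2, a crude initialization, and a fixed iteration count; all values are made up):

import java.util.Arrays;

public class KMeans1D {
    public static void main(String[] args) {
        double[] points = {1.0, 1.5, 2.0, 8.0, 8.5, 9.0};             // unlabeled data
        double[] centroids = {points[0], points[points.length - 1]};  // crude initial centroids
        int[] assignment = new int[points.length];

        for (int iter = 0; iter < 10; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            for (int i = 0; i < points.length; i++) {
                assignment[i] =
                        Math.abs(points[i] - centroids[0]) <= Math.abs(points[i] - centroids[1]) ? 0 : 1;
            }
            // Update step: move each centroid to the mean of the points assigned to it.
            for (int c = 0; c < 2; c++) {
                double sum = 0;
                int count = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assignment[i] == c) { sum += points[i]; count++; }
                }
                if (count > 0) centroids[c] = sum / count;
            }
        }
        System.out.println("Centroids: " + Arrays.toString(centroids));
        System.out.println("Cluster of each point: " + Arrays.toString(assignment));
    }
}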
Supervised vs. Unsupervised Machine Learning:
Parameters               | Supervised machine learning                                                                  | Unsupervised machine learning
Computational complexity | Simpler method                                                                               | Computationally complex
Algorithms used          | Linear and Logistic regression, Random Forest, Support Vector Machine, Neural Network, etc. | K-Means clustering, Hierarchical clustering, Apriori algorithm, etc.
Model                    | We can test our model.                                                                       | We cannot test our model.
Social media analytics is the ability to gather and find meaning in data gathered from
social channels to support business decisions — and measure the performance of
actions based on those decisions through social media.
Social media analytics is broader than metrics such as likes, follows, retweets, previews,
clicks, and impressions gathered from individual channels. It also differs from reporting
offered by services that support marketing campaigns such as LinkedIn or Google
Analytics.
Social media analytics uses specifically designed software platforms that work similarly
to web search tools. Data about keywords or topics is retrieved through search queries
or web ‘crawlers’ that span channels. Fragments of text are returned, loaded into a
database, categorized and analyzed to derive meaningful insights.
Social media analytics includes the concept of social listening. Listening is monitoring
social channels for problems and opportunities. Social media analytics tools typically
incorporate listening into more comprehensive reporting that involves listening and
performance analysis.
Why is social media analytics important?
IBM points out that with the prevalence of social media: “News of a great product can
spread like wildfire. And news about a bad product — or a bad experience with a
customer service rep — can spread just as quickly. Consumers are now holding
organizations to account for their brand promises and sharing their experiences with
friends, co-workers and the public at large.”
Social media analytics helps companies address these experiences and put them to
use. These insights can be used not only to make tactical adjustments, like
addressing an angry tweet, but also to help drive strategic decisions. In fact,
IBM finds social media analytics is now “being brought into the core discussions
about how businesses develop their strategies.”
Typically, a data set will be established to support the goals, topics, parameters and
sources. Data is retrieved, analyzed and reported through visualizations that make it
easier to understand and manipulate.
These steps are typical of a general social media analytics approach that can be made
more effective by capabilities found in social media analytics platforms.
18. MOBILE ANALYTICS?
Mobile analytics captures data from mobile app, website, and web app
visitors to identify unique users, track their journeys, record their
behavior, and report on the app’s performance. Similar to traditional web
analytics, mobile analytics are used to improve conversions, and are the
key to crafting world-class mobile experiences. Data points commonly captured
include:
Page views
Visits
Visitors
Source data
Strings of actions
Location
Device information
Login / logout
Custom event data
Companies use this data to figure out what users want in order to deliver a more
satisfying user experience.
With mobile analytics data, product and marketing teams can create positive
feedback loops. As they update their site or app, launch campaigns, and release
new features, they can A/B test the impact of these changes upon their audience.
Based on how audiences respond, teams can make further changes which yield
even more data and lead to more testing. This creates a virtuous cycle which
polishes the product. Mobile apps and sites that undergo this process are far
more effective at serving their users' needs.