Unit 1: Big Data
Structured Data
• Any data that can be stored, accessed and processed in a fixed format is termed 'structured' data.
• Over time, techniques have been developed for working with this kind of data (where the format is well known in advance)
and for deriving value out of it.
• Foreseeing the issues of today:
– when the size of such data grows to a huge extent, typical sizes are in the range of multiple
zettabytes.
• Do you know?
• 10^21 bytes, i.e., one billion terabytes, form 1 zettabyte.
– That is why the name 'Big Data' is given; imagine the challenges involved in its storage and
processing.
• Do you know?
– Data stored in a relational database management system is one example of 'structured' data.
• An 'Employee' table in a database is an example of structured data:

Employee_ID | Employee_Name   | Gender | Department | Salary_In_lacs
2365        | Rajesh Kulkarni | Male   | Finance    | 650000
3398        | Pratibha Joshi  | Female | Admin      | 650000
7465        | Shushil Roy     | Male   | Admin      | 500000
7500        | Shubhojit Das   | Male   | Finance    | 500000
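To make this concrete, here is a minimal sketch, using Python's built-in sqlite3 module, of how such fixed-format data can be stored and queried. The table mirrors the example above; the in-memory database is only for illustration.

import sqlite3

# Structured data: the schema (format) is declared up front.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Employee (
    Employee_ID INTEGER, Employee_Name TEXT,
    Gender TEXT, Department TEXT, Salary_In_lacs INTEGER)""")
conn.execute("INSERT INTO Employee VALUES (2365, 'Rajesh Kulkarni', 'Male', 'Finance', 650000)")
conn.execute("INSERT INTO Employee VALUES (3398, 'Pratibha Joshi', 'Female', 'Admin', 650000)")

# Because the format is known in advance, querying is straightforward.
for row in conn.execute("SELECT Employee_Name FROM Employee WHERE Department = 'Finance'"):
    print(row[0])  # Rajesh Kulkarni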
Unstructured Data
• Any data with unknown form or structure is classified as unstructured data.
• In addition to the size being huge,
– unstructured data poses multiple challenges in terms of processing it to derive value.
– A typical example of unstructured data is
• a heterogeneous data source containing a combination of simple text files, images, videos, etc.
• Nowadays organizations have a wealth of data available with them but, unfortunately,
– they don't know how to derive value out of it, since this data is in its raw form or unstructured format.
Example of Unstructured data
– The output returned by 'Google Search'
Semi-structured Data
• Semi-structured data can contain both forms of data.
• Semi-structured data appears structured in form,
– but it is actually not defined with, e.g., a table definition as in a relational DBMS.
• An example of semi-structured data is
– data represented in an XML file.
• Personal data stored in an XML file:
<records>
  <rec>
    <name>Prashant Rao</name>
    <sex>Male</sex>
    <age>35</age>
  </rec>
  <rec>
    <name>Seema R.</name>
    <sex>Female</sex>
    <age>41</age>
  </rec>
  <rec>
    <name>Satish Mane</name>
    <sex>Male</sex>
    <age>29</age>
  </rec>
  <rec>
    <name>Subrato Roy</name>
    <sex>Male</sex>
    <age>26</age>
  </rec>
  <rec>
    <name>Jeremiah J.</name>
    <sex>Male</sex>
    <age>35</age>
  </rec>
</records>
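A short sketch of how such semi-structured records could be processed in Python with the standard-library ElementTree parser (it assumes the listing above is saved to disk; the file name records.xml is hypothetical):

import xml.etree.ElementTree as ET

# Parse the XML listing above (hypothetical file name: records.xml).
tree = ET.parse("records.xml")

# The tags describe the data, but no schema was declared up front -
# hence "semi-structured".
for rec in tree.getroot().findall("rec"):
    name = rec.findtext("name")
    age = int(rec.findtext("age"))
    print(f"{name} is {age} years old")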
Characteristics of Big Data OR the 3 Vs of Big Data
• Three characteristics of Big Data (the 3 Vs):
1) Volume – data quantity
2) Velocity – data speed
3) Variety – data types
Big Data Platform
A big data platform is a type of IT solution that combines the features and capabilities of several big
data applications and utilities within a single solution.
It is an enterprise-class IT platform that enables organizations to develop, deploy, operate
and manage a big data infrastructure/environment.
A Big Data Platform is an integrated IT solution for Big Data management which combines several
software systems, software tools and hardware to provide an easy-to-use system to enterprises.
It is a single one-stop solution for all the Big Data needs of an enterprise, irrespective of its size and data
volume. A Big Data Platform is an enterprise-class IT solution for developing, deploying and managing Big
Data.
There are several open-source and commercial Big Data Platforms in the market with varied features
which can be used in a Big Data environment.
A big data platform generally consists of big data storage, servers, databases, big data management,
business intelligence and other big data management utilities.
It also supports custom development, querying and integration with other systems.
The primary benefit of a big data platform is to reduce the complexity of multiple vendors/
solutions into one cohesive solution.
Features of Big Data Platform
Here are the most important features of any good Big Data Analytics Platform:
a) A Big Data platform should be able to accommodate new platforms and tools based on business
requirements, because business needs can change due to new technologies or changes in business
processes.
b) It should support linear scale-out.
c) It should have the capability for rapid deployment.
d) It should support a variety of data formats.
e) The platform should provide data analysis and reporting tools.
f) It should provide real-time data analysis software.
g) It should have tools for searching through large data sets.
Big data is a term for data sets that are so large or complex that traditional data processing applications
are inadequate.
Challenges include:
• Analysis
• Capture
• Data curation
• Search
• Sharing
• Storage
• Transfer
• Visualization
• Querying
• Updating
List of Big Data Platforms:
a) Hadoop
b) Cloudera
c) Amazon Web Services
d) Hortonworks
e) MapR
f) IBM Open Platform
g) Microsoft HDInsight
h) Intel Distribution for Apache Hadoop
i) Datastax Enterprise Analytics
j) Teradata Enterprise Access for Hadoop
k) Pivotal HD
a) Hadoop
What is Hadoop?
Hadoop is an open-source, Java-based programming framework and server software which is used to
store and analyze data with the help of hundreds or even thousands of commodity servers in a clustered
environment.
Hadoop is designed to store and process large datasets extremely fast and in a fault-tolerant way.
Hadoop uses HDFS (Hadoop Distributed File System) for storing data on a cluster of commodity computers. If any
server goes down, it knows how to replicate the data, so there is no loss of data even on hardware failure.
Hadoop is an Apache-sponsored project and it consists of many software packages which run on top
of the Apache Hadoop system.
Top Hadoop-based Commercial Big Data Analytics Platforms
Hadoop provides the set of tools and software that form the backbone of a Big Data analytics system.
b) Cloudera
Cloudera is one of the first commercial Hadoop-based Big Data Analytics Platforms offering Big Data
solutions.
Its product range includes Cloudera Analytic DB, Cloudera Operational DB, Cloudera Data Science &
Engineering and Cloudera Essentials.
All these products are based on Apache Hadoop and provide real-time processing and analytics of
massive data sets.
d) Hortonworks
Hortonworks uses 100% open-source software without any proprietary software. Hortonworks was
the first to integrate support for Apache HCatalog.
Hortonworks is a Big Data company based in California.
The company develops and supports applications for Apache Hadoop.
Website: https://hortonworks.com/
e) MapR
MapR is another Big Data platform which uses the Unix file system for handling data.
It does not use HDFS, and the system is easy to learn for anyone familiar with Unix.
This solution integrates Hadoop, Spark, and Apache Drill with a real-time data processing feature.
Website: https://mapr.com
f) IBM Open Platform
Website: https://www.ibm.com/analytics/us/en/technology/hadoop/
g) Microsoft HDInsight
Microsoft HDInsight is also based on the Hadoop distribution, and it is a commercial Big Data
platform from Microsoft.
Microsoft is a software giant which develops the Windows operating system for desktop
and server users.
This is the big Hadoop distribution offering which runs on Windows and Azure environments.
It offers customized, optimized, open-source Hadoop-based analytics clusters which use Spark, Hive,
MapReduce, HBase, Storm, Kafka and R Server, running on the Hadoop system in Windows/Azure
environments.
Website: https://azure.microsoft.com/en-in/services/hdinsight/
Open-Source Big Data Platforms
There are various open-source Big Data Platforms which can be used for Big Data handling and
data analytics in a real-time environment.
Both small and big enterprises can use these tools for managing their enterprise data in order to get
the best value from it.
i) Apache Hadoop
Apache Hadoop is a Big Data platform and software package which is an Apache-sponsored project.
Under the Apache Hadoop project, various other software is being developed which runs on top of
the Hadoop system to provide enterprise-grade data management and analytics solutions to enterprises.
Apache Hadoop is an open-source, distributed file system which provides a data processing and analysis
engine for analyzing large sets of data.
Hadoop can run on Windows, Linux and OS X operating systems, but it is mostly used on Ubuntu
and other Linux variants.
ii) MapReduce
The MapReduce engine was originally written by Google; it is the system which enables
developers to write programs that can run in parallel on hundreds or even thousands of computer nodes to process
vast data sets.
After processing the job on the different nodes, it combines the results and returns them to the program
which executed the MapReduce job.
This software is platform independent and runs on top of the Hadoop ecosystem. It can process
tremendous amounts of data at very high speed in a Big Data environment.
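To illustrate the map/reduce model described above, here is a toy, single-process Python simulation of the classic word-count flow. A real MapReduce job would distribute the map and reduce steps across many nodes; the input strings here are invented.

from collections import defaultdict

# Toy input: each string stands for one line of a large input file.
documents = ["big data needs big tools", "data tools for big data"]

# Map: turn every line into (word, 1) pairs. Each line is handled
# independently, which is what lets real jobs run on many nodes.
mapped = [(word, 1) for line in documents for word in line.split()]

# Shuffle: group the pairs by key; the MapReduce framework does this
# between the map and reduce phases.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine the values for each key into the final result.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 3, 'data': 3, 'needs': 1, 'tools': 2, 'for': 1}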
iii) GridGain
GridGain is another software system for parallel processing of data, just like MapReduce. GridGain is
an alternative to Apache MapReduce.
GridGain is used for processing in-memory data and is based on the Apache Ignite framework.
GridGain is compatible with Hadoop HDFS and runs on top of the Hadoop ecosystem.
The enterprise version of GridGain can be purchased from the official GridGain website, while the free
version can be downloaded from the GitHub repository.
Website: https://www.gridgain.com/
iv) HPCC Systems
HPCC Systems stands for "High-Performance Computing Cluster"; the system was developed by
LexisNexis Risk Solutions.
According to the company, this software is much faster than Hadoop and can be used in a cloud
environment.
HPCC Systems is developed in C++ and compiled into binary code for distribution.
HPCC Systems is an open-source, massively parallel processing system which is installed on a cluster to
process data in real time.
It requires the Linux operating system and runs on commodity servers connected by a high-speed
network.
It is scalable from one node to thousands of nodes to provide performance and scalability.
Website: https://hpccsystems.com/
v) Apache Spark
Apache Spark is software that runs on top of Hadoop and provides an API for real-time, in-memory
processing and analysis of large data sets stored in HDFS.
It stores the data in memory for faster processing.
Apache Spark runs programs up to 100 times faster in memory and 10 times faster on disk compared to
MapReduce.
Apache Spark exists to speed up the processing and analysis of big data sets in a Big Data environment.
Apache Spark is being adopted very quickly by businesses to analyze their data sets to get the real value of
their data.
Website: http://spark.apache.org/
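A minimal PySpark sketch of this kind of in-memory processing (it assumes a working Spark installation with the pyspark package; the input path input.txt is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Load a text file into a distributed dataset; "input.txt" is a placeholder.
lines = spark.sparkContext.textFile("input.txt")

# Transformations are lazy and run in parallel, largely in memory.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.take(10))  # first ten (word, count) pairs
spark.stop()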
viii) SAMOA
SAMOA (Scalable Advanced Massive Online Analysis) is an open-source platform for mining big data streams.

Challenges of Conventional Systems
Conventional computing functions logically with a set of rules and calculations, while neural
computing can function via images, pictures, and concepts.
Conventional computing is often unable to manage the variability of data obtained in the real
world.
On the other hand, neural computing, like our own brains, is well suited to situations that have no
clear algorithmic solution and is able to manage noisy, imprecise data. This allows it to excel in
those areas that conventional computing often finds difficult.
The following challenges have been dominant for conventional systems in real-time
scenarios:
1) Uncertainty of the data management landscape
2) The Big Data talent gap
3) Getting data into the big data platform
4) Need for synchronization across data sources
5) Getting important insights through the use of Big Data analytics
What is Intelligent Data Analysis (IDA)?
Intelligent Data Analysis (IDA) is the process of extracting useful information and knowledge from data using methods drawn from statistics, machine learning and artificial intelligence.
Knowledge Acquisition
The process of eliciting, analyzing, transforming, classifying, organizing and
integrating knowledge, and representing that knowledge in a form that can be used in a
computer system.
Knowledge in a domain can be expressed as a number of rules.
A Rule:
A formal way of specifying a recommendation, directive, or strategy, expressed as
"IF premise THEN conclusion" or "IF condition THEN action".
How to discover rules hidden in the data?
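As a toy illustration, a rule of this form can be written directly as code (the attribute names and threshold below are invented for the example). Discovering such rules automatically from data is what techniques like decision trees and association-rule mining do.

# One rule, written directly as code:
# IF income > 50000 AND age < 30 THEN segment = "young professional"
def classify(record):
    if record["income"] > 50000 and record["age"] < 30:  # premise / condition
        return "young professional"                      # conclusion / action
    return "other"

print(classify({"income": 72000, "age": 27}))  # young professional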
NATURE OF DATA
Data
Data is a set of values of qualitative or quantitative variables; restated, pieces of data
are individual pieces of information.
Data is measured, collected, reported, and analyzed, whereupon it can be visualized using
graphs or images.
Properties of Data
To examine the properties of data, we refer to the various definitions of data.
These definitions reveal the following properties of data:
a) Amenability of use
b) Clarity
c) Accuracy
d) Essence
e) Aggregation
f) Compression
g) Refinement
a) Amenability of use: From the dictionary meaning of data, it is learnt that data are facts used in
deciding something. In short, data are meant to be used as a base for arriving at definitive
conclusions.
b) Clarity: Data are a crystallized presentation. Without clarity, the meaning desired to be
communicated will remain hidden.
c) Accuracy: Data should be real, complete and accurate. Accuracy is thus an essential
property of data.
d) Essence: Large quantities of data are collected, and they have to be compressed and refined.
Data so refined can present the essence, or derived qualitative value, of the matter.
e) Aggregation: Aggregation is cumulating or adding up.
f) Compression: Large amounts of data are always compressed to make them more meaningful,
i.e., compressed to a manageable size. Graphs and charts are some examples of compressed data.
g) Refinement: Data require processing or refinement. When refined, they are capable of leading
to conclusions or even generalizations. Conclusions can be drawn only when data are processed
or refined.
Types of Data:
In order to understand the nature of data, it is necessary to categorize them into various
types.
Different categorizations of data are possible.
The first such categorization may be on the basis of the disciplines in which they are generated,
e.g., the Sciences, the Social Sciences, etc.
Within each of these fields, there may be several ways in which data can be categorized into
types.
One widely used categorization is into four scales of measurement:
• Nominal
• Ordinal
• Interval
• Ratio
Each offers a unique set of characteristics, which impacts the type of analysis that can be
performed.
The distinction between the four types of scales centers on three characteristics:
1. the order of responses – whether it matters or not;
2. the distance between observations – whether it is interpretable;
3. the origin – whether there is a unique, non-arbitrary zero point.
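A toy Python illustration of how the scale restricts which operations are meaningful (all values invented):

# Nominal: only equality checks make sense.
colors = ["red", "blue", "red"]
# Ordinal: order matters, but distances between values do not.
ratings = ["low", "medium", "high"]
# Interval: differences are meaningful, but there is no true zero.
temps_c = [10, 20]
# Ratio: a true zero exists, so ratios are meaningful.
heights_cm = [90, 180]

print(temps_c[1] - temps_c[0])        # a 10-degree difference is meaningful
# temps_c[1] / temps_c[0] == 2.0 would NOT mean "twice as hot"
print(heights_cm[1] / heights_cm[0])  # 2.0: genuinely "twice as tall"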
CHARACTERISTICS OF DATA
Big Data refers to amounts of data too large to be processed by traditional data storage or
processing units. It is used by many multinational companies to process data and run the business of
many organizations. Data flow can exceed 150 exabytes per day before replication.
There are five V's of Big Data that explain its characteristics.
5 V's of Big Data
o Volume
o Veracity
o Variety
o Value
o Velocity
Volume
The name Big Data itself is related to enormous size. Big Data refers to the vast 'volume' of data
generated daily from many sources, such as business processes, machines, social media platforms,
networks, human interactions, and many more.
For example, Facebook can generate approximately a billion messages per day, the "Like" button is
clicked around 4.5 billion times, and more than 350 million new posts are uploaded each day. Big data
technologies can handle such large amounts of data.
Variety
Big Data can be structured, unstructured, or semi-structured, collected from
different sources. In the past, data would only be collected from databases and spreadsheets, but these days
data comes in an array of forms: PDFs, emails, audio, social media posts, photos, videos, etc.
Value
Value is an essential characteristic of big data. It is not just any data that we process or store; it
is the valuable and reliable data that we store, process, and analyze.
Velocity
Velocity plays an important role compared to the other characteristics. Velocity refers to the speed at which
data is created in real time. It encompasses the speed of incoming data sets, the rate of change,
and bursts of activity. A primary aspect of Big Data is to provide the demanded data rapidly.
Big data velocity deals with the speed at which data flows in from sources like application logs,
business processes, networks, social media sites, sensors, mobile devices, etc.
Veracity
Veracity refers to the trustworthiness and quality of the data, i.e., how much inconsistency and
uncertainty the data contains.
The term 'Big Data' has been in use since the early 1990s. John R. Mashey is given the credit for
making the term 'Big Data' popular [7]. Big Data is not something that is completely new or only
used in the last two decades. People have been trying to use data analysis and analytics techniques
to support their decision-making for a long time.
The evolution of Big Data can be classified into three phases, where every phase has its own
characteristics and capabilities and has contributed to the contemporary meaning of Big Data.
Phase I: Big Data originates from the domain of database management. It mostly depends on the
storage, extraction, and optimization of data that is stored in Relational Database Management
Systems (RDBMS).
Database management and data warehousing are the two core components of Big Data in the first
phase. They give a foundation to modern data analysis techniques such as database queries, online
analytical processing and standard reporting tools.
Phase II: From the early 2000s, usage of the Internet and the Web started offering unique data
collection and data analysis opportunities. Companies such as Yahoo, Amazon and eBay expanded
their online stores and started analyzing customer behavior for personalization. The HTTP-based
content on the web massively increased the amount of semi-structured and unstructured data.
Organizations now had to find new approaches and storage solutions to deal with these new data types
and analyze them effectively. In later years, the growth of social media data aggravated the need for
tools, technologies and analytics techniques that were able to extract meaningful information out
of this unstructured data.
Phase III: In the past decade, the large-scale usage of smartphones with different internet-based
applications has made it possible to analyze behavioral data (such as clicks and search queries) and
also location-based data (GPS data). Simultaneously, the rise of sensor-based internet-enabled
devices, termed the 'Internet of Things' (IoT), is making millions of TVs, thermostats, wearables
and even refrigerators generate zettabytes of data every day. This incredible growth of 'Big Data'
has started a race to extract meaningful and valuable information out of these new data sources.
This gives origin to another new term: 'Big Data Analytics'.
Table 1 gives a summary of the three phases of Big Data.

Table 1: The three phases of Big Data

Phase-I: DBMS-based, structured content
1. RDBMS & data warehousing
2. Extract Transform Load
3. Online Analytical Processing
4. Dashboards & scorecards
5. Data mining & statistical analysis

Phase-II: Web-based, unstructured content
1. Information retrieval and extraction
2. Opinion mining
3. Question answering
4. Web analytics and web intelligence
5. Social media analytics
6. Social network analysis
7. Spatial-temporal analysis

Phase-III: Mobile and sensor-based content
1. Location-aware analysis
2. Person-centered analysis
3. Context-relevant analysis
4. Mobile visualization
5. Human-Computer interaction
Steps in the Data Analysis Process
Step 1: Business Understanding
• The very first step consists of business understanding.
– Whenever any requirement occurs, we first need to determine the
business objective,
– assess the situation,
– determine the data mining goals, and then
– produce the project plan as per the requirement.
• Business objectives are defined in this phase.
Step 2: Data Exploration
• The second step consists of data understanding.
– For the further process, we need to gather initial data, describe and explore
the data, and verify data quality to ensure it contains the data we require.
– Data collected from the various sources is described in terms of its
application and the need for the project in this phase.
– This is also known as data exploration.
• This is necessary to verify the quality of the data collected.
Step 3: Deployment
• Here we need to:
– plan the deployment, monitoring and maintenance, and
– produce a final report and review the project.
• In this phase, we deploy the results of the analysis.
• This is also known as reviewing the project.
ANALYSIS VS REPORTING
Reporting:
Once data is collected, it is organized using tools such as graphs and tables.
The process of organizing this data is called reporting.
Reporting translates raw data into information.
Reporting helps companies monitor their online business and be alerted when data falls
outside of expected ranges.
Good reporting should raise questions about the business from its end users.
Analysis:
Analysis is the process of taking the organized data and analyzing it.
This helps users gain valuable insights into how businesses can improve their performance.
Analysis transforms data and information into insights.
The goal of analysis is to answer questions by interpreting the data at a deeper level and
providing actionable recommendations.
Conclusion:
Reporting shows us "what is happening".
Analysis focuses on explaining "why it is happening" and "what we can do about it".
What is Analysis?
• The process of exploring data and reports
– in order to extract meaningful insights,
– which can be used to better understand and improve business performance.
What is Reporting?
• Reporting is
– “the process of organizing data
– into informational summaries
– in order to monitor how different areas of a business are performing.”
Analysis                | Reporting
Provides what is needed | Provides what is asked for
Is typically customized | Is typically standardized
Involves a person       | Does not involve a person
Is extremely flexible   | Is fairly inflexible
Reports are like robots: they monitor and alert you. Analysis is like a parent: it can figure
out what is going on (hungry, dirty diaper, no pacifier, teething, tired, ear
infection, etc.).
Reporting and analysis can go hand-in-hand:
Reporting alone provides limited context about what is happening in the data. Context
is critical to good analysis.
Reporting translates raw data into information.
Reporting usually raises the question: What is happening?
Analysis transforms the data into insights: Why is it happening? What can you do
about it?