BDA Unit 1
Syllabus:
UNIT I:
Introduction: Introduction to big data: Introduction to Big Data Platform,
Challenges of Conventional Systems, Intelligent data analysis, Nature of Data,
Analytic Processes and Tools, Analysis vs Reporting.
1. INTRODUCTION TO BIG DATA PLATFORM
For example: when you visit a website, it might store your IP address; that is data. In return, it might add a cookie in your browser, marking that you visited the website; that too is data. Your name is data; your age is data.
• What is Data?
– The quantities, characters, or symbols on which operations are performed by a
computer,
– which may be stored and transmitted in the form of electrical signals and
– recorded on magnetic, optical, or mechanical recording media.
• 3 Actions on Data
– Capture
– Transform
– Store
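As a toy illustration of these three actions, here is a minimal Python sketch (the log format and field names are invented for the example, not taken from the notes):

import json

def capture():
    # Capture: raw data as it might arrive from a web server log (hypothetical format).
    return ["203.0.113.7,2024-01-01,/home", "198.51.100.2,2024-01-01,/about"]

def transform(raw_lines):
    # Transform: parse each comma-separated line into a structured record.
    records = []
    for line in raw_lines:
        ip, date, page = line.split(",")
        records.append({"ip": ip, "date": date, "page": page})
    return records

def store(records, path="visits.json"):
    # Store: persist the transformed records to disk.
    with open(path, "w") as f:
        json.dump(records, f, indent=2)

store(transform(capture()))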
Big Data
• Big Data may well be the Next Big Thing in the IT world.
• Big data burst upon the scene in the first decade of the 21st century.
• The first organizations to embrace it were online and startup firms.
• Firms like Google, eBay, LinkedIn, and Facebook were built around big data from the
beginning.
• Like many new information technologies, big data can bring about:
– dramatic cost reductions,
– substantial improvements in the time required to perform a computing task, or
– new product and service offerings.
Examples of Big Data
Types Of Big Data
Structured Data
• Any data that can be stored, accessed and processed in the form of a fixed format is termed 'structured' data.
• Over time, techniques have been developed for working with this kind of data (where the format is well known in advance) and for deriving value out of it.
• Foreseeing issues of today:
– the size of such data is growing to a huge extent, with typical sizes being in the range of multiple zettabytes.
• Do you know?
– 10^21 bytes, i.e., one billion terabytes, equal one zettabyte.
– That is why the name Big Data is given; imagine the challenges involved in its storage and processing.
• Do you know?
– Data stored in a relational database management system is one example of 'structured' data.
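As a small, hedged illustration of structured data, the sketch below stores records with a fixed, known-in-advance schema in a relational table (using Python's built-in sqlite3; the table name and columns are invented for the example):

import sqlite3

# Structured data: a fixed relational schema, known in advance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO person (name, age) VALUES (?, ?)",
                 [("Satish Mane", 29), ("Subrato Roy", 26), ("Jeremiah J.", 35)])

# Because the format is fixed, querying and deriving value is straightforward.
for row in conn.execute("SELECT name, age FROM person WHERE age > 27"):
    print(row)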
Unstructured Data
• Any data with unknown form or structure is classified as unstructured data.
• In addition to the size being huge,
– un-structured data poses multiple challenges in terms of its processing for
deriving value out of it.
– A typical example of unstructured data is
• a heterogeneous data source containing a combination of simple text files,
images, videos etc.
• Nowadays organizations have a wealth of data available with them but, unfortunately,
– they don't know how to derive value out of it, since this data is in its raw form or unstructured format.
Semi-structured Data
• Semi-structured data contains elements of both forms: it is not stored in the rigid, fixed format of a relational table, but it does carry tags or markers that separate and label the fields.
• A typical example is personal data stored in an XML file:

<rec>
<name>Satish Mane</name>
<sex>Male</sex>
<age>29</age>
</rec>
<rec>
<name>Subrato Roy</name>
<sex>Male</sex>
<age>26</age>
</rec>
<rec>
<name>Jeremiah J.</name>
<sex>Male</sex>
<age>35</age>
</rec>
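A short sketch of how such semi-structured records might be processed in Python with the standard-library XML parser (the wrapping <records> root element is an assumption added here only so the fragment parses as one document):

import xml.etree.ElementTree as ET

# The <records> root is added to make the fragment well-formed XML.
xml_fragment = """
<records>
  <rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
  <rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
  <rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
</records>
"""

root = ET.fromstring(xml_fragment)
for rec in root.findall("rec"):
    # Tags label each field, but no rigid table schema is enforced.
    print(rec.findtext("name"), rec.findtext("sex"), rec.findtext("age"))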
Storing Big Data
Processing the stored big data typically involves:
– Creating the components of Hadoop MapReduce jobs
– Distributing data processing across server farms
– Executing Hadoop MapReduce jobs
– Monitoring the progress of job flows
Sources of big data:
• Users
• Applications
• Systems
• Sensors
Considerations when adopting big data:
• Organizations can easily be overwhelmed:
– they need the right people solving the right problems.
• Costs escalate too fast:
– it isn't necessary to capture 100% of the data.
• Many sources of big data raise privacy concerns, addressed through:
– self-regulation
– legal regulation
1.1.2 Basics of Big Data Platform
A Big Data platform is an IT solution that combines the features and capabilities of several big data applications and utilities into a single packaged solution for managing and analyzing Big Data.
It is an enterprise-class IT platform that enables an organization to develop, deploy, operate and manage a big data infrastructure/environment.
Features of a Big Data platform:
a) A Big Data platform should be able to accommodate new platforms and tools based on business requirements, because business needs can change due to new technologies or changes in business processes.
b) It should support linear scale-out.
c) It should have the capability for rapid deployment.
d) It should support a variety of data formats.
e) The platform should provide data analysis and reporting tools.
f) It should provide real-time data analysis software.
g) It should have tools for searching through large data sets.
Big data is a term for data sets that are so large or complex that traditional data processing
applications are inadequate.
Challenges include:
– Analysis
– Capture
– Data curation
– Search
– Sharing
– Storage
– Transfer
– Visualization
– Querying
– Updating
– Information privacy
The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set.
Accuracy in big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction and reduced risk.
Big data usually includes data sets with sizes beyond the ability of commonly used
software tools to capture, curate, manage, and process data within a tolerable elapsed
time. Big data "size" is a constantly moving target.
Big data requires a set of techniques and technologies with new forms of integration to
reveal insights from datasets that are diverse, complex, and of a massive scale.
1.1.2.3 List of Big Data Platforms
a) Hadoop
b) Cloudera
c) Amazon Web Services
d) Hortonworks
e) MapR
f) IBM Open Platform
g) Microsoft HDInsight
h) Intel Distribution for Apache Hadoop
i) DataStax Enterprise Analytics
j) Teradata Enterprise Access for Hadoop
k) Pivotal HD
a) Hadoop
What is Hadoop?
Hadoop is an open-source, Java-based programming framework and server software used to store and analyze data with the help of hundreds or even thousands of commodity servers in a clustered environment.
Hadoop is designed to store and process large datasets extremely fast and in a fault-tolerant way.
Hadoop uses HDFS (Hadoop Distributed File System) for storing data on a cluster of commodity computers. If any server goes down, it knows how to replicate the data, so there is no loss of data even in the event of hardware failure.
Hadoop is an Apache-sponsored project, and it consists of many software packages that run on top of the Apache Hadoop system.
Top Hadoop-based Commercial Big Data Analytics Platforms
Hadoop provides the set of tools and software that forms the backbone of a Big Data analytics system.
The Hadoop ecosystem provides the necessary tools and software for handling and analyzing Big Data.
On top of the Hadoop system, many applications can be developed and plugged in to provide an ideal solution for Big Data needs.
b) Cloudera
Cloudera is one of the first commercial Hadoop-based Big Data Analytics Platforms offering a Big Data solution.
Its product range includes Cloudera Analytic DB, Cloudera Operational DB, Cloudera Data Science & Engineering and Cloudera Essentials.
All these products are based on Apache Hadoop and provide real-time processing and analytics of massive data sets.
Website: https://www.cloudera.com
d) Hortonworks
Hortonworks uses 100% open-source software without any proprietary software, and it was the first to integrate support for Apache HCatalog.
Hortonworks is a Big Data company based in California. The company develops and supports applications for Apache Hadoop.
The Hortonworks Hadoop distribution is 100% open source and enterprise-ready.
e) MapR
MapR is another Big Data platform, which uses the Unix file system for handling data.
It does not use HDFS, and the system is easy to learn for anyone familiar with the Unix system.
This solution integrates Hadoop, Spark, and Apache Drill with a real-time data processing feature.
Website: https://mapr.com
f) IBM Open Platform
Website: https://www.ibm.com/analytics/us/en/technology/hadoop/
g) Microsoft HDInsight
Microsoft HDInsight is also based on the Hadoop distribution, and it is a commercial Big Data platform from Microsoft.
Microsoft is a software giant engaged in the development of the Windows operating system for desktop and server users.
HDInsight is a big Hadoop distribution offering which runs in the Windows and Azure environments.
It offers customized, optimized, open-source Hadoop-based analytics clusters using Spark, Hive, MapReduce, HBase, Storm, Kafka and R Server, running on the Hadoop system in the Windows/Azure environment.
Website: https://azure.microsoft.com/en-in/services/hdinsight/
h) Intel Distribution for Apache Hadoop
Website: http://www.intel.com/content/www/us/en/software/intel-distribution-for-apache-hadoop-software-solutions.html
i) DataStax Enterprise Analytics
It provides powerful indexing, search, analytics and graph functionality in the Big Data system.
It supports advanced indexing and searching features.
It comes with a powerful integrated analytics system.
It provides multi-model support in the platform: it supports key-value, tabular, JSON/document and graph data formats. Powerful search features enable users to get the required data in real time.
Website: http://www.datastax.com/
j) Teradata Enterprise Access for Hadoop
Teradata offers Teradata Aster and Hadoop as part of its package solution.
Website: http://www.teradata.com
k) Pivotal HD
Pivotal HD is another Hadoop distribution, which includes the database tool Greenplum and the analytics platform Gemfire.
Website: https://pivotal.io/
There are various open-source Big Data platforms which can be used for Big Data handling and data analytics in real-time environments.
Both small and big enterprises can use these tools to manage their enterprise data and get the best value from it.
i) Apache Hadoop
Apache Hadoop is a Big Data platform and software package which is an Apache-sponsored project.
Under the Apache Hadoop project, various other software packages are being developed which run on top of the Hadoop system to provide enterprise-grade data management and analytics solutions.
Apache Hadoop is an open-source, distributed file system which provides a data processing and analysis engine for analyzing large sets of data.
Hadoop can run on Windows, Linux and OS X operating systems, but it is mostly used on Ubuntu and other Linux variants.
ii) MapReduce
The MapReduce engine was originally written by Google; it is the system which enables developers to write programs which can run in parallel on hundreds or even thousands of computer nodes to process vast data sets.
After processing the job on the different nodes, it combines the results and returns them to the program which executed the MapReduce job.
This software is platform-independent and runs on top of the Hadoop ecosystem. It can process tremendous amounts of data at very high speed in a Big Data environment.
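The MapReduce idea can be sketched in a few lines of Python (a local, single-process word-count simulation; on a real Hadoop cluster the map and reduce functions would run in parallel across many nodes, and the shuffle step would be done by the framework):

from collections import defaultdict

def map_phase(document):
    # Map: emit (word, 1) pairs for each word in an input split.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Reduce: combine all counts emitted for the same key.
    return (word, sum(counts))

documents = ["big data needs big tools", "data tools process data"]

# Shuffle: group intermediate pairs by key (the framework does this on a cluster).
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

results = [reduce_phase(word, counts) for word, counts in grouped.items()]
print(sorted(results))  # e.g. [('big', 2), ('data', 3), ...]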
iii) GridGain
GridGain is another software system for parallel processing of data, just like MapReduce.
GridGain is an alternative to Apache MapReduce.
GridGain is used for the processing of in-memory data, and it is based on the Apache Ignite framework.
GridGain is compatible with Hadoop HDFS and runs on top of the Hadoop ecosystem.
The enterprise version of GridGain can be purchased from the official GridGain website, while the free version can be downloaded from its GitHub repository.
Website: https://www.gridgain.com/
v) Apache Storm
Apache Storm is software for real-time computing and distributed processing.
It is free and open-source software developed at the Apache Software Foundation. It is a real-time, parallel processing engine.
Apache Storm is highly scalable and fault-tolerant, and it supports almost all programming languages. Its use cases include:
– Realtime analytics
– Online machine learning
– Continuous computation
– Distributed RPC
– ETL
– and all other places where real-time processing is required.
Apache Storm is used by Yahoo, Twitter, Spotify, Yelp, Flipboard and many other data giants.
Website: http://storm.apache.org/
viii) SAMOA
SAMOA stands for Scalable Advanced Massive Online Analysis; it is a system for mining Big Data streams.
SAMOA is open-source software distributed on GitHub, and it can also be used as a distributed machine learning framework.
Website: https://github.com/yahoo/samoa
Thus, the Big Data industry has been growing very fast (as of 2017), and companies are quickly moving their data to Big Data platforms. There is a huge requirement for Big Data skills in the job market, and many companies provide training and certifications in Big Data technologies.
*********************
Big Data Architecture :
Big data architecture is designed to handle the ingestion, processing, and analysis of data that is too
large or complex for traditional database systems.
1. Ingestion :
The ingestion layer is the very first step of pulling in raw data.
It comes from internal sources, relational databases, non-relational databases,
social media, emails, phone calls etc.
There are two kinds of ingestion (a small sketch of the two styles follows below):
Batch, in which large groups of data are gathered and delivered together.
Streaming, which is a continuous flow of data; this is necessary for real-time data analytics.
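The difference between the two ingestion styles can be sketched in Python (the record source below is simulated; a real system would read from files, queues, or a streaming platform):

import time

def batch_ingest(record_source):
    # Batch: gather a large group of records and deliver them together.
    return list(record_source)

def stream_ingest(record_source, handler):
    # Streaming: hand each record to the handler as soon as it arrives.
    for record in record_source:
        handler(record)

def simulated_source():
    # Stand-in for a live feed such as sensor readings or click events.
    for i in range(3):
        time.sleep(0.1)  # records trickle in over time
        yield {"event_id": i}

print(batch_ingest(simulated_source()))   # delivered together, after collection
stream_ingest(simulated_source(), print)  # processed one record at a time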
2. Storage :
Storage is where the converted data is stored in a data lake or warehouse and
eventually processed.
The data lake/warehouse is the most essential component of a big data
ecosystem.
It needs to contain only thorough, relevant data to make insights as valuable as
possible.
It must be efficient with as little redundancy as possible to allow for quicker
processing.
3. Analysis :
In the analysis layer, data gets passed through several tools, shaping it into actionable insights.
There are four types of analytics on big data: descriptive, diagnostic, predictive, and prescriptive.
Characteristics of Big Data:
Veracity
Veracity means how reliable the data is. There are many ways to filter or translate the data. Veracity is the process of being able to handle and manage data efficiently. Big Data is also essential in business development.
For example, Facebook posts with hashtags.
Value
Value is an essential characteristic of big data. It is not just any data that we process or store; it is valuable and reliable data that we store, process, and analyze.
Velocity
Velocity plays an important role compared to the other characteristics. Velocity refers to the speed at which data is created in real time. It encompasses the speed of incoming data sets, the rate of change, and activity bursts. A primary aspect of Big Data is to provide demanded data rapidly.
Big data velocity deals with the speed at which data flows in from sources like application logs, business processes, networks, social media sites, sensors, mobile devices, etc.
1.2. CHALLENGES OF CONVENTIONAL SYSTEMS
Big data is a huge amount of data which is beyond the processing capacity of conventional database systems to manage and analyze within a specific time interval.
Conventional computing functions logically with a set of rules and calculations, while neural computing can function via images, pictures, and concepts.
Conventional computing is often unable to manage the variability of data obtained in the real world.
Neural computing, on the other hand, like our own brains, is well suited to situations that have no clear algorithmic solution and is able to manage noisy, imprecise data. This allows it to excel in those areas that conventional computing often finds difficult.
Big Data vs. Conventional Data:
– Big Data: the aggregated or sampled or filtered data. Conventional: raw transactional data.
– Big Data: used for reporting, basic analysis, and text mining; advanced analytics is only at a starting stage in big data. Conventional: used for reporting, advanced analysis, and predictive modeling.
– Big Data: analysis needs both programming skills (such as Java) and analytical skills. Conventional: analytical skills are sufficient; advanced analysis tools don't require expert programming skills.
The following challenges dominate in the case of conventional systems in real-time scenarios:
1) Because big data is continuously expanding, new companies and technologies are being developed every day. A big challenge for companies is to find out which technology works best for them without introducing new risks and problems.
2) The talent gap that exists in the industry: while Big Data is a growing field, there are very few experts available, because big data is a complex field and people who understand its complexity and intricate nature are few and far between.
3) Getting data into the big data platform: data is increasing every single day. This means that companies have to tackle a limitless amount of data on a regular basis. The scale and variety of data available today can overwhelm any data practitioner, which is why it is important to make data accessibility simple and convenient for brand managers and owners.
These challenges can be grouped into three categories:
1. Data
2. Process
3. Management
1. Data Challenges
Volume
• Social media plays a key role: Twitter generates 7+ terabytes (TB) of data every day; Facebook, 10 TB.
• Mobile devices play a key role as well: there were an estimated 6 billion mobile phones in 2011.
• The challenge is how to deal with the size of Big Data.
Variety
• A lot of this data is unstructured, or has a complex structure that's hard to represent in rows and columns.
2. Processing
More than 80% of today's information is unstructured, and it is typically too big to manage effectively.
Today, companies are looking to leverage a lot more data from a wider variety of
sources both inside and outside the organization.
Things like documents, contracts, machine data, sensor data, social media, health
records, emails, etc. The list is endless really.
3. Management
• A lot of this data is unstructured, or has a complex structure that's hard to represent in rows and columns.
b) Visualization helps organizations perform analyses and make decisions much more rapidly, but the challenge is going through the sheer volumes of data and accessing the level of detail needed, all at high speed.
c) The challenge only grows as the degree of granularity increases. One possible solution is hardware: some vendors are using increased memory and powerful parallel processing to crunch large volumes of data extremely quickly.
d) Understanding the data:
It takes a lot of understanding to get data into the right shape so that you can use visualization as part of data analysis.
Visual analytics enables organizations to take raw data and present it in a meaningful way that generates the most value. However, when used with big data, visualization is bound to lead to some challenges.
**************
1.3. INTELLIGENT DATA ANALYSIS
Intelligent Data Analysis (IDA) is one of the hot topics in the fields of artificial intelligence and information science.
IDA is:
– used for extracting useful information from large quantities of online data, and for extracting desirable knowledge or interesting patterns from existing databases;
– the distillation of information that has been collected, classified, organized, integrated, abstracted and value-added;
– at a level of abstraction higher than the data and information on which it is based, so that it can be used to deduce new information and new knowledge.
Goal:
The goal of intelligent data analysis is to extract useful knowledge; the process demands a combination of extraction, analysis, conversion, classification, organization, reasoning, and so on.
1.3.2 Uses / Benefits of IDA
Intelligent Data Analysis provides a forum for the examination of issues related to the research
and applications of Artificial Intelligence techniques in data analysis across a variety of
disciplines and the techniques include (but are not limited to):
Data Visualization
Data pre-processing (fusion, editing, transformation, filtering, sampling)
Data Engineering
Database mining techniques, tools and applications
Use of domain knowledge in data analysis
Big Data applications
Evolutionary algorithms
Machine Learning (ML)
Neural nets
Fuzzy logic
Statistical pattern recognition
Knowledge Filtering and
Post-processing
Why IDA?
The multidimensionality of problems calls for methods of adequate and deep data processing and analysis.
Example of IDA:
A sample of examinees who died from cardiovascular diseases during the study period, with class 2 meaning they were ill (drug treatment, positive clinical and laboratory findings).
Knowledge Acquisition:
The process of eliciting, analyzing, transforming, classifying, organizing and integrating knowledge, and representing that knowledge in a form that can be used in a computer system.
Illustration of IDA using the See5 tool:
application.names – lists the classes to which cases may belong and the attributes used to describe each case.
Attributes are of two types: discrete attributes have a value drawn from a set of possibilities, and continuous attributes have numeric values.
application.data – provides information on the training cases from which See5 will extract patterns.
The entry for each case consists of one or more lines that give the values for all attributes.
application.names – example:
gender: M,F
activity: 1,2,3
age: continuous
smoking: No,Yes
…
application.data – example:
M,1,59,Yes,0,0,0,0,119,73,103,86,247,87,15979,?,?,?,1,73,2.5
M,1,66,Yes,0,0,0,0,132,81,183,239,?,783,14403,27221,19153,23187,1,73,2.6
M,1,61,No,0,0,0,0,130,79,148,86,209,115,21719,12324,10593,11458,1,74,2.5
… …
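A hedged sketch of reading such a .data record in Python; in See5's format a '?' marks an unknown attribute value, and the first four attribute names below follow the application.names example above (the remaining fields are not named in the notes):

def parse_data_line(line):
    # Split a See5 .data record; '?' denotes a missing value.
    values = []
    for field in line.strip().split(","):
        if field == "?":
            values.append(None)              # unknown value
        else:
            try:
                values.append(float(field))  # continuous attribute
            except ValueError:
                values.append(field)         # discrete attribute (e.g. M, Yes)
    return values

line = "M,1,59,Yes,0,0,0,0,119,73,103,86,247,87,15979,?,?,?,1,73,2.5"
record = parse_data_line(line)
print(record[:4])  # ['M', 1.0, 59.0, 'Yes'] -> gender, activity, age, smoking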
Results – example:
-> class 2 [0.750]
Sensitivity = 0.97, Specificity = 0.81
Sensitivity = 0.98, Specificity = 0.90
1.4 NATURE OF DATA
1.4.1 INTRODUCTION
Data
Properties of Data
For examining the properties of data, reference may be made to the various definitions of data. These definitions reveal that the following are the properties of data:
a) Amenability of use
b) Clarity
c) Accuracy
d) Essence
e) Aggregation
f) Compression
g) Refinement
a) Amenability of use: From the dictionary meaning of data it is learnt that data are facts
used in deciding something. In short, data are meant to be used as a base for arriving at
definitive conclusions.
b) Clarity: Data are a crystallized presentation. Without clarity, the meaning desired to be
communicated will remain hidden.
c) Accuracy: Data should be real, complete and accurate. Accuracy is thus, an essential
property of data.
d) Essence: Large quantities of data are collected and they have to be compressed and refined. Data so refined can present the essence, or derived qualitative value, of the matter.
e) Aggregation: Aggregation is cumulating or adding up.
f) Compression: Large amounts of data are always compressed to make them more meaningful, i.e., compressed to a manageable size. Graphs and charts are some examples of compressed data.
g) Refinement: Data require processing or refinement. When refined, they are capable of
leading to conclusions or even generalizations. Conclusions can be drawn only when
data are processed or refined.
In order to understand the nature of data it is necessary to categorize them into various
types.
Different categorizations of data are possible.
The first such categorization may be on the basis of disciplines, e.g., Sciences, Social
Sciences, etc. in which they are generated.
Within each of these fields, there may be several ways in which data can be categorized into
types.
One common categorization, based on the scale of measurement, distinguishes four types of data:
Nominal
Ordinal
Interval
Ratio
Each offers a unique set of characteristics, which impacts the type of analysis that can be
performed.
The distinction between the four types of scales centers on three different characteristics:
– whether the order of the values matters;
– whether the distance (interval) between values is meaningful; and
– whether the scale has a true zero.
True Zero: There is no true or real zero; an item, observation, or category cannot take a meaningful value of zero.
Appropriate statistics for ordinal scales: count, frequencies, mode
Displays: histograms or bar charts
True Zero: There is a true zero.
Income is a classic example of a ratio scale:
We can convert or transform our data from ratio to interval to ordinal to nominal.
However, we cannot convert or transform our data from nominal to ordinal to interval
to ratio.
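The one-way nature of this conversion can be illustrated with the income example (a small Python sketch; the bin boundaries are arbitrary choices for illustration):

# Ratio data (income) can be downgraded to ordinal categories,
# but ordinal labels can never be turned back into exact incomes.
incomes = [18000, 42000, 95000, 150000]

def to_ordinal(income):
    # Illustrative, arbitrary bin boundaries.
    if income < 25000:
        return "low"
    elif income < 100000:
        return "middle"
    else:
        return "high"

print([to_ordinal(x) for x in incomes])  # ['low', 'middle', 'middle', 'high']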
Ordinal or ranked data provide comparative amounts.
Example:
1st Place 2nd Place 3rd Place
Example (measured amounts on interval/ratio scales): 60 degrees, 12.5 feet, 80 miles per hour
In this case, 93% of all hospitals have lower patient satisfaction scores than Eastridge Hospital, and 31% have lower satisfaction scores than Westridge Hospital.
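A percentile rank of this kind can be computed with a short sketch (the satisfaction scores below are made-up stand-ins, not the actual hospital data):

def percentile_rank(score, all_scores):
    # Percentage of scores strictly below the given score.
    below = sum(1 for s in all_scores if s < score)
    return 100.0 * below / len(all_scores)

# Hypothetical satisfaction scores for ten hospitals.
scores = [55, 60, 62, 64, 67, 70, 73, 75, 80, 88]
print(percentile_rank(88, scores))  # 90.0 -> 90% of hospitals score lower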
Thus, the nature of data and its value have a great influence on the insights that can be derived from it.
***********************
1.5 ANALYTIC PROCESSES AND TOOLS
The steps in the big data analytics process are:
1. Deployment
2. Business Understanding
3. Data Exploration
4. Data Preparation
5. Data Modeling
6. Data Evaluation
Step 1: Deployment
Step 2: Business Understanding
– To proceed further, we need to gather initial data, describe and explore the data, and verify data quality to ensure it contains the data we require.
Step 3: Data Exploration
– Data collected from the various sources is described in terms of its application and the need for the project in this phase.
Step 4: Data Preparation
– We need to select data as per the need, clean it, and construct it to get useful information.
– Data is selected, cleaned, and integrated into the format finalized for the analysis in this phase.
Step 5: Data Modeling
– We need to select a modeling technique, generate a test design, build a model, and assess the model built (see the sketch below).
– The data model is built to analyze relationships between various selected objects in the data; test cases are built for assessing the model, and the model is tested and implemented on the data in this phase.
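The modeling step can be sketched with scikit-learn (a minimal example on synthetic data; the library choice and the decision-tree technique are assumptions for illustration, not prescribed by the notes):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the data prepared in the previous phase.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Generate the test design: hold out part of the data for assessment.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Select a modeling technique and build the model.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Assess the model on the held-out test cases.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))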
• How is data stored and indexed?
– In high-performance, schema-free databases (e.g. MongoDB); a short sketch follows below.
• What operations are performed on data?
– Analytic / semantic processing
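As a hedged illustration of schema-free storage, here is a short pymongo sketch (it assumes a MongoDB server running locally and uses invented database, collection, and field names):

from pymongo import MongoClient

# Connect to a locally running MongoDB instance (an assumption of this sketch).
client = MongoClient("mongodb://localhost:27017")
events = client["bda_demo"]["events"]

# Schema-free: documents in the same collection need not share the same fields.
events.insert_one({"user": "anita", "action": "click", "page": "/home"})
events.insert_one({"user": "ravi", "action": "purchase", "amount": 499.0})

# Simple analytic operations: count and fetch matching documents.
print(events.count_documents({"user": "anita"}))
for doc in events.find({"action": "click"}):
    print(doc)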
Thus, BDA tools are used throughout the development of BDA applications.
******************
What is Analysis?
• The process of exploring data and reports in order to extract meaningful insights, which can be used to better understand and improve business performance.
• What is Reporting?
• Reporting is "the process of organizing data into informational summaries in order to monitor how different areas of a business are performing."
1.6.3 CONTRAST BETWEEN ANALYSIS AND REPORTING
Analysis                | Reporting
Provides what is needed | Provides what is asked for
Is typically customized | Is typically standardized
Involves a person       | Does not involve a person
Is extremely flexible   | Is fairly inflexible
Reports are like robots: they monitor and alert you. Analysis is like a parent: it can figure out what is going on (hungry, dirty diaper, no pacifier, teething, tired, ear infection, etc.).
Reporting and analysis can go hand in hand:
Reporting provides little or no context about what is happening in the data; context is critical to good analysis.
Reporting translates raw data into information.
Reporting usually raises the question: What is happening?
Analysis transforms data into insights: Why is it happening? What can you do about it?
Thus, analysis and reporting complement each other, each being used according to the need and the context.
*****************