INTRODUCTION TO BIG DATA ANALYTICS
Author : FU
Date : Mar-2022
Objectives
After studying this chapter, the student should be able to
explain the following key concepts:
What is meant by Big Data
Why advanced analytics are needed
How Data Science differs from Business Intelligence (BI)
What new roles are needed for the new Big Data ecosystem
Content
1. Big Data Overview
2. State of the Practice in Analytics
3. Key Roles for the New Big Data Ecosystem
4. Examples of Big Data Analytics
1. Big Data Overview – Where does Big Data come from?
Data is created constantly, and at an ever-increasing rate. Mobile phones,
social media, imaging technologies, and other sources create new data that
must be stored somewhere for some purpose. Devices and sensors automatically
generate diagnostic information that needs to be stored and processed in real time.
Merely keeping up with this huge influx of data is difficult. Analyzing
vast amounts of it is even more challenging, especially when the data does
not conform to traditional data structures.
The challenges of the data deluge present the opportunity to transform business,
government, science, and everyday life.
Several industries led the way in the ability to gather and exploit data:
o Credit card companies
o Mobile phone companies
o Companies such as LinkedIn and Facebook
1. Big Data Overview – What is Big Data?
Three attributes stand out as defining Big Data characteristics:
o Huge volume of data
o Complexity of data types and structures (variety of new data sources, formats,
and structures)
o Speed of new data creation and growth
Big Data is sometimes described as having 3 Vs:
o volume,
o variety,
o and velocity
Another definition of Big Data comes from a 2011 McKinsey Global
Institute report:
o Big Data is data whose scale, distribution, diversity, and/or timeliness
require the use of new technical architectures and analytics to enable
insights that unlock new sources of business value.
1. Big Data Overview – What methods and tools are used for Big Data analysis?
Big Data cannot be efficiently analyzed using only traditional databases
or methods.
It requires new tools and technologies to store and manage it and to
realize its benefits.
These new tools and technologies enable the creation, manipulation, and
management of large datasets and the storage environments that house them.
McKinsey’s definition of Big Data implies that organizations will
need new data architectures and analytic sandboxes, new
tools, new analytical methods, and an integration of multiple
skills into the new role of the data scientist
1. Big Data Overview – What’s Driving the Data Deluge?
The Big Data deluge has several sources. The rate of data
creation is accelerating, driven by many of the items in Figure 1-1.
1. Big Data Overview - Fastest-growing sources of Big Data
Social media and genetic sequencing are
among the fastest-growing sources of Big Data:
o Social media data
In 2012, Facebook users posted 700 status updates per second
worldwide.
Facebook constructs social graphs to analyze user data.
o Genetic sequencing
Genetic sequencing and human genome mapping provide a detailed
understanding of genetic makeup and lineage.
The health care industry is looking toward these advances to help
predict which illnesses a person is likely to get in their lifetime and
to take steps to avoid these maladies or reduce their impact through
the use of personalized medicine and treatment.
1.1 Data Structures
Big Data comes in many forms, including structured and unstructured data
(financial data, text files, multimedia files, and genetic mappings).
Most Big Data is unstructured or semi-structured, which requires
different techniques and tools to process and analyze.
1.1 Data Structures – Data types
Structured data: Data containing a defined data type,
format, and structure
Semi-structured data: Textual data files with a discernible
pattern that enables parsing (such as Extensible Markup
Language [XML] data files that are self-describing and defined
by an XML schema)
Quasi-structured data: Textual data with erratic data formats
that can be formatted with effort, tools, and time (for instance,
web clickstream data that may contain inconsistencies in data
values and formats)
Unstructured data: Data that has no inherent structure,
which may include text documents, PDFs, images, and video
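The four data types above differ mainly in how much work is needed before the data can be analyzed. The following is a minimal Python sketch of that difference, not a definitive implementation. All sample records, field names, and the regular expression are hypothetical and were invented for illustration.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Structured: fixed columns with known types (hypothetical sales records).
structured = "order_id,amount\n1001,25.50\n1002,14.00\n"
rows = list(csv.DictReader(io.StringIO(structured)))
total = sum(float(row["amount"]) for row in rows)

# Semi-structured: self-describing XML, where the tags name the fields.
semi = "<orders><order id='1001'><amount>25.50</amount></order></orders>"
xml_amounts = [float(order.findtext("amount"))
               for order in ET.fromstring(semi).iter("order")]

# Quasi-structured: clickstream-like lines with inconsistent formats.
# A tolerant pattern recovers what it can and skips the rest.
clicks = [
    "2022-03-01 10:15:02 GET /products?id=7",
    "03/01/2022 10:15 get /cart",
    "corrupted entry",
]
pattern = re.compile(r"\b(get|post)\s+(/\S*)", re.IGNORECASE)
paths = []
for line in clicks:
    match = pattern.search(line)
    if match:
        paths.append(match.group(2))

# Unstructured: free text has no fields, so only generic processing applies.
note = "Customer called about a late delivery and asked for a refund."
word_count = len(note.split())

print(total, xml_amounts, paths, word_count)
```

The structured and semi-structured cases parse directly with standard-library readers, while the quasi-structured case needs a hand-written pattern and still loses one record. This is the "effort, tools, and time" mentioned in the definition.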
1.2 Analyst Perspective on Data Repositories
2. State of the Practice in Analytics
Business problems provide many opportunities for
organizations to become more analytical and data driven, as
shown in Table 1-2
2.1 BI versus Data Science - What is BI?
There are several ways to compare these two groups of analytical techniques.
BI tends to provide reports, dashboards, and queries on business
questions for the current period or in the past. These questions are
closed-ended and explain current or past behavior.
BI systems are used to answer questions related to quarter-to-date
revenue and progress toward quarterly targets, and to understand how much
of a given product was sold in a prior quarter or year.
BI provides hindsight and some insight and generally answers
questions related to “when” and “where” events occurred
BI problems tend to require highly structured data organized in rows
and columns for accurate reporting
2.1 BI versus Data Science - What is Data Science?
Data Science tends to use disaggregated data in a more
forward-looking, exploratory way, focusing on analyzing the
present and enabling informed decisions about the future.
It is more exploratory in nature and may use scenario
optimization to deal with more open-ended questions,
focusing on questions related to “how” and “why”
events occur.
Data Science projects tend to use many types of data
sources, including large or unconventional datasets
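The contrast between hindsight and forward-looking analysis can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The quarterly revenue figures are hypothetical, and the straight-line trend is one simple forward-looking technique chosen for brevity, not a method prescribed by this chapter.

```python
# Hypothetical quarterly revenue figures (illustrative only).
revenue = {"Q1": 100.0, "Q2": 110.0, "Q3": 120.0, "Q4": 130.0}

# BI-style questions: what was total revenue, and which quarter was best?
total_revenue = sum(revenue.values())
best_quarter = max(revenue, key=revenue.get)

# Data Science-style question: what might the next quarter look like?
# Fit a least-squares line y = intercept + slope * x through past quarters.
xs = list(range(1, len(revenue) + 1))
ys = list(revenue.values())
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
next_quarter_forecast = intercept + slope * (n + 1)

print(total_revenue, best_quarter, next_quarter_forecast)
```

The first two results summarize what already happened, which is what a report or dashboard shows. The forecast is an estimate about the future, so it should be read together with its assumptions (here, that the past trend continues).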
2.1 BI versus Data Science - Summary
2.2 Analytical Architecture
FIGURE 1-9 Typical analytic architecture
2.3 Drivers of Big Data
Data now comes from multiple sources:
o Medical information
o Photos and video
o Video surveillance
o Mobile devices
o Smart devices
o Nontraditional IT devices
FIGURE 1-10 Data evolution and the rise of Big Data sources
2.4 Emerging Big Data Ecosystem and a New Approach to Analytics
As the new ecosystem takes shape, there are four main groups of
players within this interconnected web:
o Data devices
o Data collectors
o Data aggregators
o Data users and buyers
FIGURE 1-11 Emerging Big Data ecosystem
3. Key Roles for the New Big Data Ecosystem (1)
FIGURE 1-12 Key roles of the new Big Data ecosystem
3. Key Roles for the New Big Data Ecosystem (2)
Three recurring sets of activities for data scientists
o Reframe business challenges as analytics challenges
o Design, implement, and deploy statistical models and data mining
techniques on Big Data
o Develop insights that lead to actionable recommendations
Main skills required of data scientists
o Quantitative skill
o Technical aptitude
o Skeptical mind-set and critical thinking
o Curious and creative
o Communicative and collaborative
FIGURE 1-13 Profile of a Data Scientist
4. Examples of Big Data Analytics
One example is the U.S. retailer Target. Charles Duhigg’s book The Power
of Habit discusses how Target used Big Data and advanced analytical
methods to drive new revenue. After analyzing consumer purchasing
behavior, Target’s statisticians determined that the retailer made a great deal
of money from three main life-event situations:
o Marriage, when people tend to buy many new products
o Divorce, when people buy new products and change their spending habits
o Pregnancy, when people have many new things to buy and have an urgency to buy them
Hadoop represents another example of Big Data innovation, this time in IT
infrastructure. Apache Hadoop is an open source framework that allows
companies to process vast amounts of information in a highly parallelized way.
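The idea of processing data in a highly parallelized way can be illustrated with a small map-and-reduce word count in plain Python. This is a sketch of the general pattern only and does not use Hadoop or its API. The function names, the sample documents, and the use of a thread pool as a stand-in for a cluster of machines are all assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """Map step: count the words in one chunk of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def word_count(lines, workers=2):
    # Split the input into roughly equal chunks, one per worker.
    size = max(1, -(-len(lines) // workers))  # ceiling division
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    # Each chunk is processed independently; a real cluster would run
    # these steps on separate machines rather than threads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge_counts, partials, Counter())


documents = ["big data big insights", "data science", "big ecosystem"]
totals = word_count(documents)
print(totals.most_common(2))
```

Because each chunk is counted without reference to the others, adding workers (or machines) lets the same code handle more data. Only the final merge needs to see all partial results.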
Social media, e.g. LinkedIn (250 million user accounts), represents a
tremendous opportunity to leverage social and professional interactions to
derive new insights.
Summary
Big Data comes from myriad sources, including social media,
sensors, the Internet of Things, video surveillance, and many
other sources of data.
As organizations evolve their processes and see the opportunities from
Big Data, they move beyond traditional BI activities, such as using data to
populate reports and dashboards, and toward Data Science-driven
projects that attempt to answer more open-ended and
complex questions.
Exploiting the opportunities that Big Data presents requires new data
architectures, including analytic sandboxes, new ways of working, and
people with new skill sets.
Organizations need to build Data Science teams, but a growing talent gap
makes finding and hiring data scientists in a timely manner difficult.