Big Data Analytics_AAM_Unit 1

Uploaded by

fattestbully
Copyright
© © All Rights Reserved

Course Outcomes

After completing this course, students should be able to:
• CO1: Understand the significance, structure and sources of big data
• CO2: Assess avenues for analytical scalability
• CO3: Comprehend stream computing and applications
• CO4: Apply the different clustering techniques
• CO5: Use different frameworks and visualization techniques
Unit I
Introduction To Big Data: What Is Big Data? Is The
"Big" Part Or The "Data" Part More Important? How Is
Big Data Different? How Is Big Data More Of The
Same? Risks Of Big Data -Why You Need To Tame
Big Data -The Structure Of Big Data- Exploring Big
Data, Most Big Data Doesn't Matter- Filtering Big
Data Effectively -Mixing Big Data With Traditional
Data- The Need For Standards-Today's Big Data Is
Not Tomorrow's Big Data. Web Data: The Original
Big Data -Web Data Overview -What Web Data
Reveals -Web Data In Action? A Cross-Section Of Big
Data Sources And The Value They Hold.
Unit II
Data Analysis: Evolution Of Analytic
Scalability – Convergence – Parallel
Processing Systems – Cloud Computing –
Grid Computing – Map Reduce – Enterprise
Analytic Sandbox – Analytic Data Sets –
Analytic Methods – Analytic Tools – Cognos –
MicroStrategy – Pentaho. Analysis
Approaches – Statistical Significance –
Business Approaches – Analytic Innovation –
Traditional Approaches – Iterative
Unit III
Mining Data Streams : Introduction To
Streams Concepts, Stream Data Model And
Architecture, Stream Computing, Sampling
Data In A Stream, Filtering Streams,
Counting Distinct Elements In A Stream,
Estimating Moments, Counting Ones In A
Window, Decaying Window, Real-Time
Analytics Platform (RTAP) Applications, Case
Studies, Real Time Sentiment Analysis,
Stock Market Predictions.
Unit IV
Frequent Itemsets And Clustering :
Mining Frequent Itemsets – Market Basket
Model – Apriori Algorithm – Handling Large
Data Sets In Main Memory – Limited Pass
Algorithm – Counting Frequent Itemsets In A
Stream – Clustering Techniques –
Hierarchical – K- Means – Clustering High
Dimensional Data – CLIQUE And PROCLUS –
Frequent Pattern Based Clustering Methods
– Clustering In Non-Euclidean Space –
Clustering For Streams And Parallelism.
Unit V
Frameworks And Visualization :
MapReduce – Hadoop, Hive, MapR – Sharding
– NoSQL Databases – S3 – Hadoop Distributed
File Systems – Visualizations - Visual Data
Analysis Techniques, Interaction Techniques;
Systems And Applications:
Unit I
Introduction To Big Data
What is Big Data?

According to a study reported in the literature:
•Every day, we create 2.5 quintillion bytes of data (1 quintillion is 10^18).
•So much that 90% of the data in the world today has been created
in the last two years alone.
•This data comes from everywhere: sensors used to gather climate
information, posts to social media sites, digital pictures and videos,
purchase transaction records, and cell phone GPS signals etc.
According to another study:
•From the beginning of recorded time until 2003, 5 billion
gigabytes of data were created.
•In 2011, the same amount was created every two days.
•In 2013, the same amount was created every 10 minutes.
•In 2015, the same amount or more was being generated every 10 minutes.
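To put these figures on one scale, the claims above can be converted into an average creation rate. This is only a rough sketch: it assumes decimal units (1 gigabyte = 10^9 bytes) and takes the quoted intervals at face value.

```python
# Convert "5 billion gigabytes per interval" into an average rate.
# Assumption: decimal units (1 GB = 10**9 bytes, 1 TB = 10**12 bytes).
GB = 10**9
TB = 10**12
AMOUNT_BYTES = 5 * 10**9 * GB  # 5 billion gigabytes = 5 exabytes

intervals_seconds = {
    "2011 (every two days)": 2 * 24 * 60 * 60,
    "2013 (every 10 minutes)": 10 * 60,
}


def rate_tb_per_second(amount_bytes, seconds):
    """Average creation rate in terabytes per second."""
    return amount_bytes / seconds / TB


rates = {
    label: rate_tb_per_second(AMOUNT_BYTES, seconds)
    for label, seconds in intervals_seconds.items()
}

for label, rate in rates.items():
    print(f"{label}: about {rate:,.0f} TB per second")

# Two days contain 288 ten-minute slots, so the implied speed-up is 288x.
speedup = (
    intervals_seconds["2011 (every two days)"]
    / intervals_seconds["2013 (every 10 minutes)"]
)
print(f"Implied speed-up from 2011 to 2013: {speedup:.0f}x")
```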
•Advances in communications, computation, and
storage have created huge collections of data, having
information of value to business, science, government
and society.
•Example: Search engine companies such as Google, Yahoo!, and
Microsoft have created an entirely new business by capturing the
information freely available on the World Wide Web and providing it
to people in useful ways. (SOCIAL NETWORKING)
•These companies collect trillions of bytes of data every day and provide NEW
SERVICES such as satellite images, driving directions, and image retrieval.
•The societal benefits of these services are well appreciated: they have
transformed how people find and make use of information on a daily
basis.
•Big data can be used in a wide variety of areas, including business, health care,
science, and defence.

Example: Health care (AKA HEALTH INFORMATICS)
•Modern medicine collects huge amounts of information
about patients through imaging technology (CAT scans, MRI),
genetic analysis (DNA microarrays), and other forms of diagnostic
equipment.
•By applying analytics to data sets for large numbers of patients,
medical researchers are gaining fundamental insights into the
GENETIC AND ENVIRONMENTAL CAUSES OF DISEASES,
and creating more effective means of diagnosis.
•Recently, a Hollywood star underwent surgery to prevent cancer. [who]
According to a McKinsey report published in the US:
•140,000-190,000 workers with “knowledge of big data analytics”
will be needed in the US alone. (2014)
•Furthermore, 1.5 million managers will need to become data-literate.
•Many agencies, media houses, and scientific communities across the
world have identified big data as an important research area.
GENESIS: The Beginning
•Like it or not, a massive amount of data will be
coming your way soon.
•Perhaps it has reached you already.
•Perhaps you’ve been wrestling with it for a while,
trying to figure out how to store it for later access,
address its mistakes and imperfections, or classify
it into structured categories.
KNOW THE DIFFERENCE BETWEEN BIG DATA AND BIG DATA MANAGEMENT (BDM)
As the author Bill Franks puts it:
•There may soon be not only a flood of data, but also a flood of
books on big data.
•Most of these big-data books will be about the
management of big data:
◦ How to wrestle it into a database or data warehouse.
◦ How to structure and categorize unstructured data.
•If you find yourself reading a lot about Hadoop or
MapReduce or various approaches to data
warehousing, you are reading about big data management.
BDM
• BDM is, of course, important work. No matter how much
data you have of whatever quality, it won’t be much good
unless you get it into an environment and format in which it
can be accessed and analyzed.
• BDM alone won’t get you very far. You also have to analyze
and act on it for data of any size to be of value.
• Just as traditional database management tools didn’t
automatically analyze transaction data from traditional
systems, Hadoop and MapReduce won’t automatically
interpret the meaning of data from web sites, gene
mapping, image analysis, or other sources of big data.
WHAT IT MEANS TO US: [APPLICATION]

You receive an EMAIL: it contains an offer for a complete
personal computer system. It seems like the retailer read your mind,
since you were exploring computers on their web site just a few hours prior.

As you drive to the store to buy the computer bundle, you get an offer for a
discounted coffee from the coffee shop you are getting ready to drive past.
It says that since you’re in the area, you can get 10% off if you stop by in the
next 20 minutes.

As you drink your coffee, you receive an apology from the
manufacturer of a product that you complained about yesterday on your
Facebook page, as well as on the company’s web site. …

Finally, once you get back home, you receive notice of a gadget upgrade
available for purchase in your favorite online video game.

Etc.
DATA SOURCES
• The explosion of new and powerful data sources such as Facebook,
Twitter, LinkedIn, and YouTube contributes immensely to
big data and research.
• Advanced analytics will have a great impact.
• To stay competitive, it is imperative that organizations
aggressively pursue capturing and analyzing these new data
sources to gain the insights that they offer.
• Ignoring big data will put an organization at risk and cause it to
fall behind the competition.
• Analytic professionals have a lot of work to do! It won’t be easy
to incorporate big data alongside all the other data that has
been used for analysis for years.
Big Data?
• 500 Million Tweets sent each day!
• More than 4 Million Hours of content uploaded to
YouTube every day!
• 3.6 Billion Instagram Likes each day.
• 4.3 BILLION Facebook messages posted daily!
• 5.75 BILLION Facebook likes every day.
• 40 Million Tweets shared each day!
• 6 BILLION daily Google Searches!

And don’t think that, with these increases in social media,
email is going away any time soon! According to The
Radicati Group, 205 BILLION EMAILS were sent each day
in 2015, and by 2019 that number will increase by 20% to
246 billion emails each day!
WHAT IS BIG DATA?

•There is no consensus in the marketplace as to how to
define big data!

•Def#1: “Big data exceeds the reach of commonly
used hardware environments and software tools to
capture, manage, and process it within a tolerable
elapsed time for its user population.”
[terabytemagazine article]

•Def#2: “Big data refers to data sets whose size is
beyond the ability of typical database software tools to
capture, store, manage and analyze.” [McKinsey Global
Institute]

•Def#3: The “big” in big data also refers to several other
characteristics of a big data source. These aspects
include volume, variety, velocity, and veracity.
Volume:
• The sheer volume of data being stored today is
exploding.
• In the year 2000, 800,000 petabytes (PB) of data were
stored in the world.
• We expect this number to reach 35 zettabytes (ZB) by
2020. Twitter alone generates more than 7 terabytes
(TB) of data every day, Facebook 10 TB etc.
Variety: “Variety Is the Spice of Life”

•The volume associated with the Big Data
phenomenon brings along new challenges for data
centres trying to deal with it: its variety.
•With the explosion of sensors and smart devices, as
well as social collaboration technologies, data in an
enterprise has become complex, because it includes
not only traditional relational data
but also raw, semi-structured, and unstructured data
from web pages, web log files (including click-stream
data), search indexes, social media forums, e-mail,
documents, sensor data from active and passive
systems, and so on.
Velocity: How Fast Is Fast?
•Velocity is the speed at which the data is flowing.
•The increase in RFID sensors and other information
streams has led to a constant flow of data at a pace
that has made it impossible for traditional systems to
handle.
•Competition can mean identifying a trend, problem, or
opportunity only seconds, or even microseconds, before
someone else.

•In traditional processing, you can think of running
queries against relatively static data.
•For example, the query “Show me all people living
in City X” would result in a single result set to
be used as a warning list for an incoming weather
pattern.

•With streams computing [IBM], you can execute a
process similar to a continuous query that identifies
people who are currently in “City X,” but you get
continuously updated results, because location
information from GPS data is refreshed in real time.

•Big Data requires that you perform analytics
against the volume and variety of data while it is
still in motion, not just after it is at rest.
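The contrast between a one-off query on data at rest and a continuous query on data in motion can be sketched in plain Python. This does not use any IBM streams API; the function names, the people, and the GPS feed below are hypothetical and serve only as an illustration.

```python
# Static query: one pass over a snapshot of data at rest.
def people_in_city(snapshot, city):
    return {person for person, location in snapshot.items() if location == city}


# Continuous query: consume location updates one at a time and
# emit a refreshed result set after every update.
def continuous_people_in_city(updates, city):
    current = {}  # latest known location per person
    for person, location in updates:
        current[person] = location
        yield {p for p, loc in current.items() if loc == city}


# Hypothetical GPS feed: (person, city) pairs in arrival order.
gps_updates = [
    ("asha", "City X"),
    ("ben", "City Y"),
    ("ben", "City X"),
    ("asha", "City Z"),
]

# The static query sees only the state at the moment it runs.
snapshot = {"asha": "City X", "ben": "City Y"}
static_result = people_in_city(snapshot, "City X")
print("static:", sorted(static_result))

# The continuous query keeps producing updated answers.
streaming_results = list(continuous_people_in_city(gps_updates, "City X"))
for step, result in enumerate(streaming_results, start=1):
    print("update", step, sorted(result))
```

The static result goes stale as soon as someone moves, while the generator re-evaluates the answer on every incoming update.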
Veracity: (Non-reliable Data)

•There is volume, velocity and variety.
•There is big data hype, but there is also unreliability
in the data.
•How effective will these data be?
•Example: Product Branding, Image Branding, Image
assignation

In addition, a couple of other V’s are also suggested:


What’s Big Data?
No single definition; here is one from Wikipedia:

• Big data is the term for a collection of data sets so large
and complex that it becomes difficult to process using on-hand
database management tools or traditional data
processing applications.
• The challenges include capture, curation, storage, search,
sharing, transfer, analysis, and visualization.
• The trend to larger data sets is due to the additional
information derivable from analysis of a single large set of
related data, as compared to separate smaller sets with
the same total amount of data, allowing correlations to be
found to “spot business trends, determine quality of
research, prevent diseases, link legal citations, combat
crime, and determine real-time roadway traffic conditions.”
Big Data: 3V’s
Volume (Scale)

• Data Volume
◦ 44x increase from 2009 to 2020
◦ From 0.8 zettabytes to 35 ZB
• Data volume is increasing
exponentially
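The "44x" figure and the word "exponentially" can be checked with a little arithmetic. The sketch below assumes constant year-over-year growth between the two quoted endpoints (0.8 ZB in 2009, 35 ZB in 2020) and decimal units; real growth need not be that smooth.

```python
import math

# Figures quoted on the slide.
start_zb, end_zb = 0.8, 35.0
years = 2020 - 2009

# Unit check (decimal units): 800,000 PB is the same as 0.8 ZB.
PB, ZB = 10**15, 10**21
assert 800_000 * PB == 8 * ZB // 10

overall_factor = end_zb / start_zb             # total growth over the period
annual_factor = overall_factor ** (1 / years)  # constant yearly multiplier
doubling_years = math.log(2) / math.log(annual_factor)

print(f"Overall growth: {overall_factor:.2f}x")
print(f"Implied annual growth: {(annual_factor - 1) * 100:.0f}% per year")
print(f"Implied doubling time: {doubling_years:.1f} years")
```

Under this assumption the stored volume grows by roughly two fifths each year, which is what "doubling about every two years" amounts to.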

[Figure: Exponential increase in collected/generated data]
• 12+ TBs of tweet data every day
• ? TBs of data every day
• 25+ TBs of log data every day
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones worldwide
• 100s of millions of GPS-enabled devices sold annually
• 2+ billion people on the Web by end of 2011
• 76 million smart meters in 2009… 200M by 2014
Maximilien Brice, © CERN
CERN’s Large Hadron Collider (LHC) generates 15 PB
The Earthscope
• The Earthscope is the world's
largest science project. Designed
to track North America's geological
evolution, this observatory records
data over 3.8 million square miles,
amassing 67 terabytes of data. It
analyzes seismic slips in the San
Andreas fault, sure, but also the
plume of magma underneath
Yellowstone and much, much more.
(http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)
Variety (Complexity)
• Relational Data (Tables/Transaction/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
◦ Social Network, Semantic Web (RDF), …
• Streaming Data
◦ You can only scan the data once
• A single application can be generating/collecting many types of data
• Big Public Data (online, weather, finance, etc.)
To extract knowledge, all these types of data need to be linked
together
A Single View to the Customer

[Figure: the Customer at the centre, linked to Social Media, Banking, Finance, Our Known History, Gaming, Entertain, and Purchase data]
Velocity (Speed)

• Data is being generated fast and needs to be processed fast
• Online Data Analytics
• Late decisions → missed opportunities
• Examples
◦ E-Promotions: based on your current location, your
purchase history, and what you like → send promotions right now
for the store next to you
◦ Healthcare monitoring: sensors monitoring your activities
and body → any abnormal measurements require immediate
reaction
Real-time/Fast Data

• Mobile devices (tracking all objects all the time)
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)
• Progress and innovation are no longer hindered by the ability to collect data
• But by the ability to manage, analyze, summarize, visualize, and discover
knowledge from the collected data in a timely manner and in a scalable fashion
Real-Time Analytics/Decision Requirement
[Figure: real-time decisions centred on the Customer]
• Product Recommendations that are Relevant & Compelling
• Influence Behavior
• Learning why Customers Switch to competitors and their offers; in time to Counter
• Friend Invitations to join a Game or Activity that expands business
• Improving the Marketing Effectiveness of a Promotion while it is still in Play
• Preventing Fraud as it is Occurring & preventing more proactively
Variability:
•It is often confused with variety.
Example:
•Say you have a bakery that sells 10 different breads.
That is variety. Now imagine you go to that bakery
three days in a row and every day you buy the same
type of bread, but each day it tastes and smells
different. That is variability.
•Variability is thus very relevant in performing
sentiment analyses.
•Variability means that the meaning
is changing (rapidly).
Some Make it 4V’s
Visualization

•This is the hard part of big data.
•It means making that vast amount of data comprehensible in
a manner that is easy to understand and read.
•It does not mean ordinary graphs or pie charts. It
means complex graphs that can include many variables
of data while still remaining understandable and
readable.
•Telling a complex story in a graph is very difficult but
also extremely crucial.
•Luckily there are more and more big data startups
appearing that focus on this aspect and in the end,
VALUE

•Data in itself is not valuable at all.

•The value is in the analyses done on that data and
how the data is turned into information and
eventually into knowledge.

•The value is in how organisations will use that data
and turn themselves into information-centric
companies that rely on insights derived
from data analyses for their decision-making.
IS THE “BIG” PART OR THE “DATA” PART MORE
IMPORTANT?
•What is the most important part of the term big data?
Is it (1) the “big” part, (2) the “data” part, (3) both, or
(4) neither?

•As with any source of data, big or small, the power of
big data comes from:
++ What is done with that data?
++ How is it analyzed?
++ What actions are taken based on the
findings?
++ How is the data used to make changes to a
business?

•People are led to believe that just because big data
has high volume, velocity, and variety, it is somehow
better or more important than other data.
•Many big data sources have a far higher percentage
of useless or low-value content than virtually any
other data source.

•By the time big data is trimmed down to what you
actually need, it may not even be so big any more.

In Summary:
•Whether it stays big or ends up being
small when you’re done processing it,
the size isn’t important.

•It’s what you do with it.

HOW IS BIG DATA DIFFERENT?
The majority of big data sources have the following features:

1. Big data is often automatically generated by a
machine.

• Instead of a person being involved in creating new
data, it’s generated purely by machines in an
automated way. If you think about traditional data
sources, there was always a person involved.

• For example: Consider retail or bank transactions,
telephone call detail records, product shipments, or
invoice payments. All of those involve a person
doing something in order for a data record to be
generated.
2. Big data is typically an entirely new source of data. It
is not simply an extended collection of existing data.

• For example, with the use of the Internet, customers
can now execute a transaction with a bank or
retailer online. But the transactions they execute are
not fundamentally different transactions from what
they would have done traditionally.

• They’ve simply executed the transactions through a
different channel.

• An organization may capture web transactions, but
they are really just more of the same old
transactions that have been captured for years.

• However, capturing browsing behaviors as
customers execute a transaction creates
fundamentally new data.
3. Many big data sources are not designed to be
friendly. In fact, some of the sources aren’t designed at
all!

• Example: Text streams from a social media site.
(There is no way to ask users to follow certain
standards of grammar, or sentence ordering, or
vocabulary.)

• It will be difficult to work with such data at best and
very, very ugly at worst.

• Most traditional data sources were designed up-front
to be friendly.

• Systems used to capture transactions provide data
in a clean, preformatted template that makes the
data easy to load and use.
4. A substantial amount of big data streams may not have
much value. In fact, much of the data may even be close
to worthless.

• Example: Within a web log, there is information that
is very powerful. There is also a lot of information that
doesn’t have much value at all.

• It is necessary to weed through and pull out the
valuable and relevant pieces.

• Traditional data sources were defined up-front to be
100 percent relevant.
Example: Weblog (1)
Example: Weblog (2)
HOW IS BIG DATA MORE OF THE SAME?

•The same thing that existed in the past is out in a new
form.

•In many ways, big data doesn’t pose any problems
that your organization hasn’t faced before.

•Taming new, large data sources that push the current
limits of scalability is an ongoing theme in the world of
analytics.

Fig: Data Mining Process
RISKS OF BIG DATA
1. An organization will be so overwhelmed with big
data that it won’t make any progress.

[The key here is to get the right people. You need the
right people attacking big data and attempting to solve
the right kinds of problems.]

2. Cost escalates too fast as too much big data is
captured before an organization knows what to do
with it.

[It is not necessary to go for it all at once and capture
100 percent of every new data source.
What is necessary is to start capturing samples of the
new data sources to learn about them. Using those initial
samples, experimental analysis can be performed to
determine what is truly important within each source.]
3. Perhaps the biggest risk with many sources of big data
is privacy.
• If everyone in the world were good and honest,
then we wouldn’t have to worry much about privacy.
• There have also been high-profile cases of major
organizations getting into trouble for having
ambiguous or poorly defined privacy policies.
Example: In April 2013, LivingSocial, a daily-deals site
partly owned by Amazon, announced that the names,
email addresses, birth dates and encrypted passwords of
more than 50 million customers worldwide had been
stolen by hackers.
•Poorly defined policies have led to data being used in ways that consumers
didn’t understand or support, causing a backlash.

•Organizations should explain how they will keep data
secure and how they will use it if they expect customers to accept their data
being captured and analyzed.
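One way to "start capturing samples" of a new data source (risk 2 above) without storing all of it is reservoir sampling, which keeps a fixed-size uniform random sample of a stream whose length is not known in advance. The technique is brought in here as an illustration and is not prescribed by the slides; the event names are hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (index + 1),
            # replacing a randomly chosen element of the reservoir.
            j = rng.randint(0, index)
            if j < k:
                sample[j] = item
    return sample


# Hypothetical event stream, generated lazily so it is never held in memory.
events = (f"event-{i}" for i in range(100_000))
sample = reservoir_sample(events, k=5, seed=42)
print(sample)
```

Only the k sampled records are stored at any time, so the cost of learning about a source stays flat however large the source grows.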
WHY YOU NEED TO TAME BIG DATA

•Many organizations have done little with big data.

•E-commerce industries have already started; there, analyzing
big data is already standard.

•Today, other organizations have a chance to get ahead of the pack.

•Within a few years, any organization that isn’t
analyzing big data will be late to the game and will be
stuck playing catch-up for years to come.

•The time to start taming big data is now.

THE STRUCTURE OF BIG DATA

•Big data is often described as unstructured.

•Most traditional data sources are in the fully structured
realm.

•Structured data is in a pre-defined format, with no variation of the
format from day to day or from update to update.

•Unstructured data

•Semi-structured data

• Example: Web logs

What is the difference
between Data Mining and
Web Mining?
Machine Learning: Classification,
Clustering etc.

Semantic approach: Statistics, NLP etc.

FILTERING BIG DATA EFFECTIVELY
•The biggest challenge with big data may not be the
analytics you do with it, but the extract, transform, and
load (ETL) processes you have to build to get it ready
for analysis. (PART OF 90 %)
•Analytic processes may require filters on the front end
to remove portions of a big data stream when it first
arrives. Also there will be other filters along the way as
the data is processed.
•For example, when working with a web log, a rule
might be to filter out up front any information on
browser versions or operating systems. Such data is
rarely needed except for operational reasons.
•Later in the process, the data may be filtered to
specific pages or user actions that need to be examined
for the business issues to be addressed.
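The two filtering stages described above (dropping browser and operating system details up front, then narrowing to specific pages later) can be sketched as a small pipeline. The tab-separated log layout, field names, and URLs below are assumptions made for illustration; real web logs differ by server and configuration.

```python
# Hypothetical tab-separated web log: timestamp, user, url, browser, os.
RAW_LOG = """\
2013-04-01T10:00:01\tu1\t/home\tFirefox 19\tWindows 7
2013-04-01T10:00:09\tu1\t/product/42\tFirefox 19\tWindows 7
2013-04-01T10:00:15\tu2\t/home\tSafari 6\tOS X
2013-04-01T10:01:02\tu1\t/cart/add\tFirefox 19\tWindows 7
"""

FIELDS = ("timestamp", "user", "url", "browser", "os")
DROP_UP_FRONT = {"browser", "os"}  # rarely needed outside operations


def parse(lines):
    """Turn each non-empty log line into a dict keyed by field name."""
    for line in lines:
        if line.strip():
            yield dict(zip(FIELDS, line.rstrip("\n").split("\t")))


def front_end_filter(records):
    """Stage 1: remove fields that are not needed for analysis."""
    for record in records:
        yield {k: v for k, v in record.items() if k not in DROP_UP_FRONT}


def page_filter(records, prefixes):
    """Stage 2: keep only the pages relevant to the business question."""
    for record in records:
        if record["url"].startswith(prefixes):
            yield record


stage1 = list(front_end_filter(parse(RAW_LOG.splitlines())))
stage2 = list(page_filter(stage1, ("/product/", "/cart/")))

for record in stage2:
    print(record)
```

Because each stage is a generator, records flow through one at a time, which mirrors filtering a stream as it arrives rather than after it has all been stored.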
Example-1
<HTML>
<TITLE>
<BODY>
Sachin is a former Indian cricketer and captain, widely regarded
as one of the greatest batsmen of all time. Sachin took up
cricket at the age of eleven, made his Test debut on 15
November 1989 against Pakistan in Karachi at the age of
sixteen, and went on to represent Mumbai domestically and
India internationally for close to twenty-four years. Sachin is the
only player to have scored one hundred international centuries,
the first batsman to score a double century in a
One Day International, the holder of the record for the number
of runs in both ODI and Test cricket, and the only player to
complete more than 30,000 runs in international cricket
</BODY>
</TITLE>
</HTML>
Example 2 :Opinion Analysis
Step 1: Sample text
excellent phone, excellent service . i am a
business user who heavily depend on mobile
service ….,,, there is much which has been said
in other reviews about the features of this
phone.
Step 2: Remove delimiters from input file
excellent phone excellent service i am a
business user who heavily depend on mobile
Step 3: Subject the text to parts of speech
tagger
Example: JJ excellent NN phone JJ excellent NN
service FW i VBP am DT a NN business NN
user WP who RB heavily VBP depend IN on JJ
mobile NN service EX there VBZ is JJ much
WDT which VBZ has VBN been VBN said IN in JJ
other NNS reviews IN about DT the NNS
features IN of DT this NN phone
Step 4: Extract feature
Step 4: Approaches
•Supervised approach
•Unsupervised approach

Step 5: Results:
• Positive opinion
• Negative opinion
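Steps 1, 2 and 5 above can be sketched in a few lines. This is a simplified illustration, not the method used on the slides: it skips the part-of-speech tagging of Step 3, and the tiny opinion lexicon is hypothetical. It corresponds loosely to an unsupervised, lexicon-based approach.

```python
import re

# Step 1: sample text from the slide.
SAMPLE = (
    "excellent phone, excellent service . i am a business user who "
    "heavily depend on mobile service ….,,, there is much which has "
    "been said in other reviews about the features of this phone."
)

# Hypothetical, very small opinion lexicon used only for this illustration.
POSITIVE = {"excellent", "good", "great"}
NEGATIVE = {"poor", "bad", "terrible"}


def remove_delimiters(text):
    """Step 2: replace punctuation with spaces and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text).split())


def classify(text):
    """Step 5: compare counts of positive and negative lexicon words."""
    words = remove_delimiters(text).lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


cleaned = remove_delimiters(SAMPLE)
print(cleaned)
print(classify(SAMPLE))
```

A supervised approach would instead train a classifier on reviews that have already been labelled positive or negative.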
•The complexity of the rules and the magnitude of the
data being removed or kept at each stage will vary by data source
and by business problem.
•The load processes and filters that are put on top of
big data are absolutely critical. Without getting those
correct, it will be very difficult to succeed.
•Traditional structured data doesn’t require as much
effort in these areas since it is specified, understood,
and standardized in advance.
•With big data, it is necessary to specify, understand,
and standardize it as part of the analysis process in
many cases.

Example: Application of filtering to websites to derive
knowledge
MIXING BIG DATA WITH TRADITIONAL DATA

•Perhaps the most exciting thing about big data isn’t
what it will do for a business by itself. It’s what it will
do for a business when combined with an
organization’s other data.
Example:
1. Browsing history, for example, is very powerful.
[Knowing how valuable a customer is and what they
have bought in the past across all channels makes
web data even more powerful by putting it in a larger
context].

2. Smart-grid data is very powerful for a utility


company. [Knowing the historical billing patterns of
customers, their dwelling type, and other factors
makes data from a smart meter even more powerful
by putting it in a larger context.]
3. The text from customer service online chats and e-mails
is powerful.
[Knowing the detailed product specifications of the
products being discussed, the sales data related to those products, and
historical product defect information makes that text
data even more powerful by putting it in a larger
context.] - Amazon Recommendation system
4. Enterprise Data Warehouses (EDWs) have become
such a widespread corporate tool not just because they centralize a
bunch of data marts to save hardware and software
costs.
•An EDW adds value by allowing different data sources to
intermix and enhance one another.
•With an EDW, it is possible to analyze customer and
employee data together since they are in one location. They are no
longer completely separate.
•This is why it is critically important that organizations
don’t develop a big data strategy that is distinct from
their traditional data strategy.

To succeed, it is necessary to plan not just how to
capture and analyze big data by itself, but also how to
use it in combination with other corporate data.
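A minimal sketch of what "putting web data in a larger context" can look like: browsing events (big data) are joined with a customer table (traditional data) so that repeated interest from high-value customers stands out. All records, field names, and values below are hypothetical.

```python
from collections import Counter

# Traditional data: hypothetical customer master records.
customers = {
    "c1": {"name": "Asha", "lifetime_value": 5200.0},
    "c2": {"name": "Ben", "lifetime_value": 140.0},
}

# Big data: hypothetical browsing events (customer id, product viewed).
browsing_events = [
    ("c1", "laptop"),
    ("c2", "laptop"),
    ("c1", "laptop"),
    ("c3", "phone"),  # visitor with no customer record yet
]


def combine(events, customer_table):
    """Join per-customer view counts with what is already known about them."""
    views = Counter(events)  # (customer_id, product) -> number of views
    combined = []
    for (customer_id, product), count in views.items():
        profile = customer_table.get(customer_id)  # None if unknown
        combined.append({
            "customer_id": customer_id,
            "product": product,
            "views": count,
            "lifetime_value": profile["lifetime_value"] if profile else None,
        })
    # Put high-value customers with repeated interest first.
    combined.sort(
        key=lambda r: (r["lifetime_value"] or 0.0, r["views"]),
        reverse=True,
    )
    return combined


combined = combine(browsing_events, customers)
for row in combined:
    print(row)
```

Neither source alone gives this ranking: the events do not say who is valuable, and the customer table does not say who is currently interested.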
a. Data Mart

b. Data Warehouse
Hierarchy of Enterprise Data
THE NEED FOR STANDARDS
•Will big data continue to be a wild west of crazy
formats, unconstrained streams, and lack of definition?

•Probably not. Over time, standards will be developed.

•Many semi-structured data sources will become more
structured over time, and individual organizations will
fine-tune their big data feeds to be friendlier for
analysis.

•Example:
• SQL or similar language : usage with Big Data
• Formats, Interfaces to support interoperability
across distributed applications
• Web semantics: XML, OWL etc., with Big Data
• Cloud computing – Big data
TODAY’S BIG DATA IS NOT TOMORROW’S BIG DATA

•There is no specific, universal definition in terms of
what qualifies as big data.

•Rather, big data is defined in relative terms tied to
available technology and resources.

•As a result, what counts as big data to one company
or industry may not count as big data to another.

•A large e-commerce company is going to have a
much “bigger” definition of big data than a small manufacturer will.

•What qualifies as big data will necessarily change
over time as the tools and techniques to handle it
evolve alongside raw storage size and processing power.
•Household demographic (population) files with hundreds
of fields and millions of customers were huge and tough
to manage a decade or two ago.

•Now such data fits on a thumb drive and can be
analyzed by a low-end laptop.

•Transactional data in the retail, telecommunications,
and banking industries was very big and hard to handle
even a decade ago.
•What we are intimidated by today won’t be so scary a
few years down the road.

Example 1:

• Clickstream data from the web may be a standard,
easily handled data source in 10 years.
Clickstream: The trail left by users as they click their way
through a website.

Click-path optimization: Using clickstream analysis,
businesses can collect and analyze data to see which pages web
visitors are visiting and in what order.

Market basket analysis: The benefit of basket analysis for
marketers is that it can give them a better understanding of
aggregate customer purchasing behavior.

Next best product analysis: Helps marketers see what
products customers tend to buy together.

Website resource allocation: Clickstream data analysis tells
marketers which paths on the site are hot and which ones are
not.

Customization: Personalize the user experience and convert
more web visitors from browsers to buyers.
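As a small illustration of click-path analysis, the sketch below groups clicks into sessions and counts page-to-page transitions to show which paths are "hot". The clickstream records and URLs are hypothetical, and a production system would also need timestamps and session time-outs.

```python
from collections import Counter

# Hypothetical clickstream: (session id, page) pairs in time order.
clicks = [
    ("s1", "/home"), ("s1", "/search"), ("s1", "/product"),
    ("s2", "/home"), ("s2", "/product"),
    ("s3", "/home"), ("s3", "/search"), ("s3", "/product"), ("s3", "/cart"),
]


def sessionize(click_events):
    """Group pages by session, preserving the order in which they were clicked."""
    sessions = {}
    for session_id, page in click_events:
        sessions.setdefault(session_id, []).append(page)
    return sessions


def transition_counts(sessions):
    """Count how often visitors move directly from one page to another."""
    counts = Counter()
    for pages in sessions.values():
        counts.update(zip(pages, pages[1:]))
    return counts


paths = sessionize(clicks)
counts = transition_counts(paths)

for (source, target), n in counts.most_common():
    print(f"{source} -> {target}: {n}")
```

The same counting idea, applied to products in a basket instead of pages in a session, is the starting point for market basket analysis.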
2. Actively processing every e-mail, customer service
chat, and social media comment may become a
standard practice for most organizations.

As we tame the current generation of big data streams,
other even bigger data sources are going to come along
and take their place.

1. Imagine web browsing data that expands to
include millisecond-level eyeball and mouse
movement so that every tiny detail of a user’s
navigation is captured, instead of just what was
clicked on. This is another order of big.
2. Imagine video game telemetry data being
upgraded to go beyond every button pressed or
movement made.

3. Imagine RFID (radio frequency identification)
information being available for every single individual
item in every single store, distribution facility, and
manufacturing plant globally.

4. Imagine capturing and translating to text every
conversation anyone has with a customer service or
sales line. Add to that all the associated e-mails,
Web Data: The Original Big Data
•Wouldn’t it be great to:

1. understand customer intent instead
of just customer action?

2. understand each customer’s thought
processes to determine whether they make a
purchase or not?
•It was virtually impossible to get insights into such topics in
the past.
•Today, such topics can be addressed with the use of
detailed web data.
•Organizations across a number of industries have
integrated detailed, customer-level behavioral data
sourced from a web site into their enterprise analytics
environments.
•However, for most organizations web integration
means only the inclusion of online transactions.

•Traditional web analytics vendors provide
operational reporting (everyday tasks) on click-through
rates, traffic sources, and metrics based
only on web data.

•However, detailed web behavior data was not
historically leveraged outside of web reporting.

Is it possible to understand users better? How?
WEB DATA OVERVIEW

•Organizations have talked about a 360-degree view of
their customers for years.
•What it really meant is that the organization has as full
a view of its customers as possible considering the
technology and data available at that point in time.
•However, the finish line is always moving. Just when
you think you have finally arrived, the finish line moves
farther out again.
•A few decades ago, companies were at the top of their
game if they had the names and addresses of their
customers and they were able to append demographic
information (location & population) to those names
through the then-new third-party data enhancement
services.
•Eventually, cutting-edge companies started to have
basic recency, frequency, and monetary value (RFM)
metrics attached to customers. Such metrics look at
when a customer last purchased (recency), how often
they have purchased (frequency), and how much they
spent (monetary value).
•In the past 10 to 15 years, virtually all businesses
started to collect and analyze the detailed transaction
histories of their customers.
•This led to an explosion of analytical power and a much
deeper understanding of customer behavior.
•Many organizations are still frozen at the transactional
history stage.
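
The recency, frequency, and monetary value (RFM) metrics mentioned above can be computed directly from a transaction history. The following is a minimal sketch, not a reference implementation: the customer IDs, dates, and amounts are made-up sample data, and the function name `rfm` is an assumption introduced for illustration.

```python
from datetime import date

# Hypothetical transactions: (customer_id, purchase_date, amount).
transactions = [
    ("C1", date(2024, 1, 5), 40.0),
    ("C1", date(2024, 3, 20), 25.0),
    ("C2", date(2023, 11, 2), 300.0),
]


def rfm(transactions, as_of):
    """Return {customer_id: (recency_days, frequency, monetary)}."""
    summary = {}
    for customer_id, purchased_on, amount in transactions:
        # Start from the current purchase if the customer is new.
        last, count, total = summary.get(customer_id, (purchased_on, 0, 0.0))
        summary[customer_id] = (max(last, purchased_on), count + 1, total + amount)
    return {
        cid: ((as_of - last).days, count, total)
        for cid, (last, count, total) in summary.items()
    }


metrics = rfm(transactions, as_of=date(2024, 4, 1))
for customer_id, (recency, frequency, monetary) in metrics.items():
    print(customer_id, recency, frequency, monetary)
```

Recency is measured here in days since the last purchase, relative to a chosen reference date; other units or reference points are equally valid.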

•Today, while this transactional view is still important,
many companies incorrectly assume that it remains the
closest view possible to a 360-degree view of their
customers.
•Today, organizations need to collect from newly
evolving big data sources related to their customers
from a variety of extended and newly emerging touch
points such as web browsers, mobile applications,
kiosks, social media sites, and more.
•Just as transactional data enabled a revolution in power
of computation and depth of analysis, so too do these
new data sources enable taking analytics to a new level.
What Are You Missing? (with Traditional Data)
•Have you ever stopped to think about what happens if
only the transactions generated by a web site are
captured?
A study reveals: 95 percent of browsing sessions do not
result in a basket being created. Of the remaining 5 percent,
only about half, or 2.5 percent, actually begin the checkout
process. And of that 2.5 percent, only two-thirds, or 1.7
percent, actually complete a purchase.
•What this means is that information is missing on more
than 98 percent of web sessions if only transactions
are tracked.
•For every purchase transaction, there might be dozens
or hundreds of specific actions taken on the site to get
to that sale. That information needs to be collected and
analyzed alongside the final sales data.
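
The funnel arithmetic quoted above can be checked in a few lines. This is only a sketch of the calculation using the percentages stated in the slide; the function name `funnel` and its parameter names are assumptions made for illustration.

```python
def funnel(no_basket=0.95, begin_checkout=0.5, complete=2 / 3):
    """Return the share of all sessions reaching each funnel stage."""
    basket = 1 - no_basket               # sessions that create a basket
    checkout = basket * begin_checkout   # sessions that begin checkout
    purchase = checkout * complete       # sessions that end in a purchase
    return basket, checkout, purchase


basket, checkout, purchase = funnel()
missing = 1 - purchase  # sessions invisible if only transactions are tracked

print(f"basket: {basket:.1%}, checkout: {checkout:.1%}, purchase: {purchase:.1%}")
print(f"sessions with no transaction record: {missing:.1%}")
```

Because only about 1.7 percent of sessions end in a purchase, tracking transactions alone leaves more than 98 percent of sessions unobserved.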
Imagine the Possibilities (Organizations are trying to
know)
•Imagine knowing everything customers do as they go
through the process of doing business with your
organization.
•Not just what they buy, but what they are thinking
about buying along with what key decision criteria they
use.
•Such knowledge enables a new level of understanding
about your customers and a new level of interaction with
your customers.
Example:
1. Imagine you are a retailer. Imagine walking through
the store with customers and recording every place they go,
every item they look at, every item they pick up, every item
they put in the cart and take back out. Imagine knowing
whether they read nutritional information, if they look at
laundry instructions, if they read the promotional
brochure on the shelf, or if they look at other information
2. Imagine you are a telecom company. Imagine being
able to identify every phone model, rate plan, data
plan, and accessory that customers considered before
making a final decision.

What is the difference between traditional analytics and
new scalable analytics?
What Data Should Be Collected and from Where?
•Any action that a customer takes while interacting with
an organization should be captured if it is possible to
capture it, from web sites, kiosks, social media, mobile
apps, etc.
•A wide range of events can be captured, such as: Purchases,
Requesting, Product views, Forwarding a link, Shopping
basket additions, Posting a comment, Watching a video,
Registering for a webinar, Accessing a download,
Executing a search, Reading / writing a review, etc.
What about privacy? (How is Flipkart handling this?)

•Privacy is a big issue today and may become an even
bigger issue as time passes.
•Need to respect not just formal legal restrictions, but
also what your customers will view as appropriate.
•Faceless Customer: (identity of customer masked in
data stores)
An arbitrary identification number that is not
personally identifiable can be matched to each unique
customer based on a logon, cookie, or similar piece of
information. This creates what might be called a
“faceless” customer record.
•It is the patterns across faceless customers that
matter, not the behavior of any specific customer
•With today’s database technologies, it is possible to
enable analytic professionals to do analysis without
having any ability to identify the individuals involved.

•This can remove many privacy concerns.
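
The "faceless" customer idea can be sketched as a lookup that swaps each identifying key (a logon or cookie) for an arbitrary number before the data reaches analysts. This is an illustrative sketch only: the class name `FacelessIdRegistry`, the sample keys, and the event names are all assumptions, and a real system would also need to store the mapping securely and apart from the analytic data.

```python
import itertools


class FacelessIdRegistry:
    """Maps identifying keys (logon, cookie) to arbitrary numeric IDs."""

    def __init__(self):
        self._ids = {}
        self._counter = itertools.count(1)

    def faceless_id(self, key):
        # Assign the next arbitrary number the first time a key is seen.
        if key not in self._ids:
            self._ids[key] = next(self._counter)
        return self._ids[key]


registry = FacelessIdRegistry()

# Hypothetical raw events: (identifying key, action).
events = [
    ("alice@example.com", "product_view"),
    ("cookie-7f3a", "search"),
    ("alice@example.com", "basket_add"),
]

# Analysts only ever see the faceless version.
faceless_events = [(registry.faceless_id(key), action) for key, action in events]
print(faceless_events)
```

The same customer always receives the same number, so patterns across sessions remain visible even though the identity is not.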

•Many organizations are in fact identifying and
targeting specific customers as a result of such
analytics.
•Organizations have presumably put in place privacy
policies, including opt-out options, and are careful to
follow them.
What Web Data Reveals
1. Shopping Behaviors:
A good starting point to understand shopping behavior
is identifying:
•How customers come to a site, begin shopping and
their page navigation.
•What search engine do they use?
•What specific search terms are entered?
•Do they use a bookmark they created previously?
•Analytic professionals can take this information and
look for patterns in terms of which search terms, search
engines, and referring sites are associated with higher
•One very useful capability of web data is identifying
product sets that are of interest to a customer before
they make a purchase.

•For example, consider a customer who views
computers, backup disks, printers, and monitors. It is
likely the customer is considering a complete PC system
upgrade.
•Offer a package right away that contains the specific
mix of items the customer has browsed.
•Do not wait until after customers purchase the
computer and then offer generic bundles of accessories.
•A customized bundle offer is more powerful than a
generic one. [study says]
2. Customer Purchase Paths and Preferences
•It is possible to explore and identify the ways
customers arrive at their buying decisions by watching
how they navigate a site.
•It is also possible to gain insight into their preferences.
Consider, for example, an airline:
•An airline can tell a number of things about
preferences based on the ticket that is booked.
•For example:
1. How far in advance was the ticket booked?
2. What fare class was booked?
3. Did the trip span a weekend or not?

•This is all useful, but an airline can get even more from
•An airline can identify customers who value
convenience (Such customers typically start searches for
specific times and direct flights only.)
•Airlines can also identify customers who value price first
and foremost and are willing to consider many flight
options to get the best price.

•Based on search patterns, airlines can also tell whether
customers value deals or specific destinations.
•Example: Does the customer research all of the special
deals available and then choose one for the trip? Or does
the customer look at a certain destination and pay what
is required to get there?
•For example, a college student may be open to any
number of vacation destinations and will take the one
with the best deal. On the other hand, a customer who
visits family on a regular basis will only be interested in
flying to where the family is.
3. Research Behaviors
•Understanding how customers utilize the research
content on a site can lead to tremendous insights into
how to interact with each individual customer, as well as
how different aspects of the site do or do not add value in
driving sales.
For example, consider an online store selling
clothes: sarees, Zovi shirts
•Another way to use web data to understand customers’
research patterns is to identify which of the pieces of
information offered on a site are valued by the customer
base overall and by the best customers specifically.
•How often do customers look at previews (glance),
additional photos (thumbnails / regular), technical
specs, or reviews before making a purchase?
•Combining session data with other data helps reveal when
customers buy: on the same day or the next day.
4. Feedback Behaviors
•Where is the feedback expressed?
•Is it relevant? Biased?
•Does it matter?
Web Data in Action
•What an organization knows about its customers is
never the complete picture.

•It is always necessary to make assumptions based on
the information available.
•If there is only a partial view, the full view can often be
extrapolated accurately enough to get the job done.
•It is also possible that the missing information paints
a totally different picture than expected.
•In the cases where the missing information differs
from the assumptions, it is possible to make
suboptimal, if not totally wrong, decisions.
•A very common marketing example is predicting the
next best offer for a customer. Of all the available options,
which single offer should next be suggested to a customer
to maximize the chances of success?
•Can web behavior data help?

Case 1: BANK
•Mr. Kumar has an account with
PNB………………………………….etc. with relevant
information.
•What is the best offer you can send via e-mail?
•Would it ever occur to the bank to send a promotional offer
on a mortgage or housing loan? With web data, the bank now
knows what to discuss with Mr. Kumar.
Case 2: Dominos
•Traditional data they get is:
• Historical purchases
• Marketing campaign and response history
•With web data:
• The effort leads to major changes in the promotional
efforts versus the traditional approach, providing the
following results:
• A decrease in total mailings
• A reduction in total catalog promotions pages
• A materially significant increase in total revenues
• Question: With an example, justify how web data
contributes to better promotional benefits as against
traditional data.
Attrition Modelling
•In the telecommunications sector, for example, companies
have invested massive amounts of time and effort to
create, enhance, and perfect “churn” models (models that
try to identify customers who are about to leave).
•Churn models flag those customers most at risk of
cancelling their accounts so that action can be taken
proactively to prevent them from doing so.
•Management of customer churn has been, and remains,
critical to understanding patterns of customer usage and
profitability.

Example:
•Mrs. Smith, a customer of telecom provider “AIR”,
goes to Google and types “How do I cancel my Provider
AIR contract?” (web data).
•Company analysts may or may not have seen her
usage dropping.
•It would take weeks to months to identify such a
change in usage pattern anyway.
•By capturing Mrs. Smith’s actions on the web, Provider
“AIR” is able to move more quickly to avert losing Mrs.
Smith.
Response Modelling
•Many models are created to help predict the choice a
customer will make when presented with a (Data set)
request for action.

•Models typically try to predict which customers will
make a purchase, or accept an offer, or click on an
e-mail link.
•For such models, a technique called logistic regression
is often used. These models are usually referred to as
response models or propensity models.
•The main difference between this and an attrition model:
a churn model predicts negative behavior, while a
purchase or response model predicts positive behavior.
WORKING
•When using a response or propensity model, all
customers are scored and ranked by likelihood of taking
action.
•Then, appropriate segments (groups) are created based
on those ranks in order to reach out to the customers.

•In theory, every customer has a unique score. In
practice, since only a small number of variables define
most models, many customers end up with identical or
nearly identical scores.
•Example: customers who are not very frequent or
high-spending buyers.
•In many cases, many customers can end up in big
groups with very similar or very low scores.
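
The scoring and ranking step described above can be illustrated with a tiny logistic model. This is a sketch under stated assumptions, not a fitted model: the intercept, the coefficients, the two variable names, and the four customers are all invented for illustration. It shows how a model with few variables leaves several customers with identical scores.

```python
import math
from collections import Counter

# Hypothetical coefficients; a real model would be fitted to historical data.
INTERCEPT = -3.0
COEFFICIENTS = {"frequent_buyer": 1.2, "high_spender": 0.8}


def score(customer):
    """Logistic response score: estimated probability of taking action."""
    z = INTERCEPT + sum(COEFFICIENTS[name] * customer[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical customers described only by two 0/1 variables.
customers = {
    "A": {"frequent_buyer": 1, "high_spender": 1},
    "B": {"frequent_buyer": 0, "high_spender": 0},
    "C": {"frequent_buyer": 0, "high_spender": 0},
    "D": {"frequent_buyer": 1, "high_spender": 0},
}

scores = {cid: score(features) for cid, features in customers.items()}
ranked = sorted(scores, key=scores.get, reverse=True)  # most likely first
ties = Counter(round(s, 6) for s in scores.values())   # how many share a score

print(ranked)
print(ties)
```

Customers B and C share the same low score because the model has nothing to tell them apart, which is the situation the bullets above describe.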
•Web data can help greatly increase differentiation
among customers.
For example, consider a scenario (the score can increase or
decrease by delta x):
•Customer 1 has never browsed your site.
•Customer 2 viewed the product category featured in the
offer within the past month.
•Customer 3 viewed the specific product featured in the
offer within the past month.
•Customer 4 browsed the specific product featured three
• When asked about the value of incorporating web
data, a director of marketing from a multichannel
American specialty retailer replied, “It’s like printing
money!”
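
The differentiation effect in the scenario above can be sketched by adjusting an identical base score with a web-behavior delta. The delta values, the behavior labels, and the base score below are assumptions made purely for illustration; in practice the web variables would more likely be fed into the model itself rather than added afterwards.

```python
# Hypothetical adjustments; the slide only says the score moves by some delta.
WEB_DELTAS = {
    "never_browsed": 0.0,
    "viewed_category_past_month": 0.05,
    "viewed_product_past_month": 0.10,
}


def adjusted_score(base_score, web_behavior):
    """Shift a base response score by a web-behavior delta, kept within [0, 1]."""
    return min(1.0, max(0.0, base_score + WEB_DELTAS[web_behavior]))


# Three customers that a traditional model scores identically.
base_score = 0.05
behaviors = {
    "Customer 1": "never_browsed",
    "Customer 2": "viewed_category_past_month",
    "Customer 3": "viewed_product_past_month",
}

adjusted = {name: adjusted_score(base_score, b) for name, b in behaviors.items()}
for name, value in adjusted.items():
    print(name, round(value, 2))
```

After the adjustment the three customers no longer share a score, so they can be ranked and targeted differently.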
Customer Segmentation (Grouping): Study

•What is segmentation?
•How was segmentation done traditionally?
•Web data also enables segmentation of customers
based on their typical browsing patterns.
(Seminar/Project topic on assessing browsing pattern of
users)
•Such segmentation will provide a completely different
view of customers than traditional demographic or sales-
based segmentation schemas.

•Assignment: To create dreamers segment and identify


Example:
•Consider a segment called the Dreamers that has been
derived purely from browsing behavior.
Who are they?
•Dreamers repeatedly put an item in their basket, but
then abandon it. Dreamers often add and abandon the
same item many times.
This may be especially true for a high-value item like a
TV or computer. It should be possible to identify the
segment of people that does this repeatedly.
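
Identifying the Dreamers segment from browsing events might look like the sketch below. The event log, the action names, and the threshold of three abandonments are assumptions for illustration; the slide does not specify how many repeats qualify.

```python
from collections import Counter

# Hypothetical event log: (customer_id, item, action).
events = [
    ("C1", "tv-55", "add"), ("C1", "tv-55", "abandon"),
    ("C1", "tv-55", "add"), ("C1", "tv-55", "abandon"),
    ("C1", "tv-55", "add"), ("C1", "tv-55", "abandon"),
    ("C2", "cable", "add"), ("C2", "cable", "purchase"),
    ("C3", "laptop", "add"), ("C3", "laptop", "abandon"),
]


def find_dreamers(events, min_abandons=3):
    """Return customers who abandoned the same item at least min_abandons times."""
    abandons = Counter(
        (customer_id, item)
        for customer_id, item, action in events
        if action == "abandon"
    )
    return {cid for (cid, _item), n in abandons.items() if n >= min_abandons}


dreamers = find_dreamers(events)
print(dreamers)
```

Counting per customer and item, rather than per customer alone, matches the description of adding and abandoning the same item many times.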
•So, what is the outcome of this “Dreamers” segment?
1. What is it that the customers are abandoning?
•Perhaps a customer is looking at a high-end TV that is
quite expensive, or a phone, camera, etc.
•Is price the issue? From the past data, we get to know
that the customer often aims too high and later will buy a
less-expensive product than the one that was abandoned
repeatedly.
Action Plan
•Sending an e-mail pointing to less-expensive options or
other varieties of high-end TVs.
2. Get to know the abandoned-basket statistics, which
can help organizations identify prospective customers who
are abandoning baskets.
[Helps analysts report results such as 97% of
customers abandoning their baskets. It also gives insight
into procedural aspects, such as the unavailability of
services like COD, credit card payment, etc.]
Assessing Advertising Results
•Assessing paid search and online advertising results
is another high-impact analysis enabled with
customer-level web behavior data.
•Traditional web analytics provide high-level
summaries such as total clicks, number of searches,
cost per click, keywords leading to the most clicks,
page position statistics, etc.
•Most focus on a single web channel.
•This means that all statistics are based only on what
happened during the single session generated from
the search or ad click.
•Once a customer leaves the web site and the web session
ends, the scope of the analysis is complete.
•There is no attempt to account for past or future visits
in the statistics.
•By incorporating customers’ browsing data and
extending the view to other channels as well, it is
possible to assess search and advertising results at a
much deeper level.
For example:
• How many sales did the first click generate in the following days/weeks?
• Are certain web sites drawing more customers from referred
sites?
• Cross-channel analysis: how are sales doing after
information about the channel was provided on the web via an ad or
search?
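
The first question above, how many sales follow an ad click in the subsequent days or weeks, can be sketched as a simple window count. All data here is invented, and the function name `sales_after_click` and the window lengths are assumptions; the sketch only illustrates looking beyond the single session in which the click happened.

```python
from datetime import date, timedelta

# Hypothetical data: first ad click per customer, and purchases on any channel.
ad_clicks = {"C1": date(2024, 5, 1), "C2": date(2024, 5, 3)}
purchases = [
    ("C1", date(2024, 5, 1)),   # same day as the click
    ("C1", date(2024, 5, 9)),   # a later visit
    ("C2", date(2024, 6, 20)),  # long after the click
    ("C3", date(2024, 5, 2)),   # customer never clicked the ad
]


def sales_after_click(ad_clicks, purchases, window_days):
    """Count purchases made within window_days after a customer's ad click."""
    window = timedelta(days=window_days)
    return sum(
        1
        for customer_id, bought_on in purchases
        if customer_id in ad_clicks
        and ad_clicks[customer_id] <= bought_on <= ad_clicks[customer_id] + window
    )


same_day = sales_after_click(ad_clicks, purchases, window_days=0)
two_weeks = sales_after_click(ad_clicks, purchases, window_days=14)
print(same_day, two_weeks)
```

A session-only view would credit the ad with one sale here, while the two-week view credits it with two.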
CROSS SECTION OF BIG DATA SOURCES AND THE VALUE THEY HOLD
CASE STUDY
1. AUTO INSURANCE: THE VALUE OF TELEMATICS DATA
•Telematics involves putting a sensor, or black box, into
a car to capture information about what’s happening
with the car. This black box can measure any number of
things depending on how it is configured.
•It can monitor speed, mileage driven, or if there has
been any heavy braking.
•Telematics data helps insurance companies better
understand customer risk levels and set insurance rates.
•If privacy concerns are ignored and it is taken to the
extreme, a telematics device could keep track of
everywhere a car went, when it was there, how fast it
was going, and what features of the car were in use.
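
As an illustration of what can be derived from a telematics feed, the sketch below flags heavy braking from a series of speed readings. The readings, the one-second sampling, and the threshold of 10 km/h per second are assumptions chosen for the example, not values from the source or from any insurer.

```python
# Hypothetical readings: (seconds since trip start, speed in km/h).
readings = [(0, 60), (1, 60), (2, 45), (3, 30), (4, 30), (5, 28)]


def heavy_braking_events(readings, threshold_kmh_per_s=10.0):
    """Return the timestamps at which speed dropped faster than the threshold."""
    events = []
    for (t0, v0), (t1, v1) in zip(readings, readings[1:]):
        deceleration = (v0 - v1) / (t1 - t0)  # km/h lost per second
        if deceleration >= threshold_kmh_per_s:
            events.append(t1)
    return events


braking = heavy_braking_events(readings)
total_seconds = readings[-1][0] - readings[0][0]
print(braking, total_seconds)
```

Each flagged interval is reported separately, so one long braking manoeuvre can produce several consecutive timestamps; a count of such events per trip could then become one input to a risk assessment.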
2. MULTIPLE INDUSTRIES: THE VALUE OF TEXT DATA

•Text is one of the biggest and most common sources of
big data. Just imagine how much text is out there.
•There are e-mails, text messages, tweets, social media
postings, instant messages, real-time chats, and audio
recordings that have been translated into text.
•Text data is one of the least structured and largest
sources of big data in existence today.
•Luckily, a lot of work has been done already to tame text
data and utilize it to make better business decisions.
•Text mining approaches have their own
advantages/disadvantages.
•Here, we will focus on how to use the results, not
how to produce them.
•For example, once the sentiment of a customer’s e-
mail is identified, it is possible to generate a variable
that tags the customer’s sentiment as negative or
positive. That tag is now a piece of structured data that
can be fed into an analytics process.
•Creating structured data out of unstructured text is
often called information extraction.
•Another example, assume that we’ve identified which
specific products a customer commented about in his
or her communications with our company.
•We can then generate a set of variables that identify
the products discussed by the customer. Those
variables are again metrics that are structured and can
be used for analysis purposes.
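
Turning free text into structured variables, as described above, can be sketched with simple keyword matching. This is a deliberately naive stand-in for a real text mining method: the word lists, the product names, the extra "neutral" label, and the function name `extract_variables` are all assumptions made for illustration.

```python
import re

# Hypothetical keyword lists standing in for a real sentiment method.
NEGATIVE_WORDS = {"cancel", "broken", "terrible", "refund"}
POSITIVE_WORDS = {"great", "love", "thanks", "excellent"}
PRODUCTS = ("router", "modem", "phone")


def extract_variables(text):
    """Turn one customer message into a row of structured variables."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    negative = len(words & NEGATIVE_WORDS)
    positive = len(words & POSITIVE_WORDS)
    if negative > positive:
        sentiment = "negative"
    elif positive > negative:
        sentiment = "positive"
    else:
        sentiment = "neutral"  # assumption: label used when neither side dominates
    row = {"sentiment": sentiment}
    for product in PRODUCTS:
        row[f"mentions_{product}"] = int(product in words)
    return row


row = extract_variables("My router is broken, I want a refund!")
print(row)
```

The resulting row contains only a sentiment tag and 0/1 product flags, so it can be joined to other customer data like any structured record.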
3. MULTIPLE INDUSTRIES: THE VALUE OF TIME AND LOCATION DATA
•With the advent of global positioning systems (GPS),
personal GPS devices, and cellular phones, time and
location information is a growing source of data.
•A wide variety of services and applications, from
Google Places to Facebook Places, are centered on
registering where a person is at a given point in time.
•Cell phone applications can record your location and
movement on your behalf.
•Cell phones can even provide a fairly accurate location
using cell tower signals, if a phone is not formally
GPS-enabled.
•For example, there are applications that allow you to track
the exact routes you travel when you exercise, how
long the routes are, and how long it takes you to
complete the routes.

•The fact is, if you carry a cell phone, you can keep a
record of everywhere you’ve been. You can also open
up that data to others if you choose.
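
The exercise-tracking example above, route length and time taken, can be sketched from timestamped GPS points using the haversine great-circle formula. The sample points and the function names are assumptions for illustration, and the Earth is treated as a sphere of radius 6371 km, which is an approximation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def route_summary(points):
    """points: list of (seconds, lat, lon) ordered by time."""
    distance_km = sum(
        haversine_km(a[1], a[2], b[1], b[2]) for a, b in zip(points, points[1:])
    )
    duration_s = points[-1][0] - points[0][0]
    return distance_km, duration_s


# Hypothetical track: three fixes along the equator, ten minutes apart.
track = [(0, 0.0, 0.0), (600, 0.0, 0.01), (1200, 0.0, 0.02)]
distance_km, duration_s = route_summary(track)
print(round(distance_km, 3), duration_s)
```

Summing the distances between consecutive fixes gives the route length, and the first and last timestamps give the time taken.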
