Big Data Analytics_AAM_Unit 1
applications
CO4: Apply the different clustering
techniques
CO5: Use different Frameworks and
Visualization techniques
Unit I
Introduction To Big Data: What Is Big Data? Is The
"Big" Part Or The "Data" Part More Important? How Is
Big Data Different? How Is Big Data More Of The
Same? Risks Of Big Data -Why You Need To Tame
Big Data -The Structure Of Big Data- Exploring Big
Data, Most Big Data Doesn't Matter- Filtering Big
Data Effectively -Mixing Big Data With Traditional
Data- The Need For Standards-Today's Big Data Is
Not Tomorrow's Big Data. Web Data: The Original
Big Data -Web Data Overview -What Web Data
Reveals -Web Data In Action? A Cross-Section Of Big
Data Sources And The Value They Hold.
Unit II
Data Analysis: Evolution Of Analytic
Scalability – Convergence – Parallel
Processing Systems – Cloud Computing –
Grid Computing – Map Reduce – Enterprise
Analytic Sand Box – Analytic Data Sets –
Analytic Methods – Analytic Tools – Cognos –
Microstrategy - Pentaho. Analysis
Approaches – Statistical Significance –
Business Approaches – Analytic Innovation –
Traditional Approaches – Iterative
Unit III
Mining Data Streams : Introduction To
Streams Concepts, Stream Data Model And
Architecture, Stream Computing, Sampling
Data In A Stream, Filtering Streams,
Counting Distinct Elements In A Stream,
Estimating Moments, Counting Ones In A
Window, Decaying Window, Real Time
Analytics Platform (RTAP) Applications, Case
Studies, Real Time Sentiment Analysis,
Stock Market Predictions.
Unit IV
Frequent Itemsets And Clustering :
Mining Frequent Itemsets - Market Basket
Model – Apriori Algorithm – Handling Large
Data Sets In Main Memory – Limited Pass
Algorithm – Counting Frequent Itemsets In A
Stream – Clustering Techniques –
Hierarchical – K-Means – Clustering High
Dimensional Data – CLIQUE And PROCLUS –
Frequent Pattern Based Clustering Methods
– Clustering In Non-Euclidean Space –
Clustering For Streams And Parallelism.
Unit V
Frameworks And Visualization :
MapReduce – Hadoop, Hive, MapR – Sharding
– NoSQL Databases - S3 - Hadoop Distributed
File System – Visualizations - Visual Data
Analysis Techniques, Interaction Techniques;
Systems And Applications:
Unit I
Introduction To Big Data
What is Big Data?
As you drive to the store to buy the computer bundle, you get an offer for a
discounted coffee from the coffee shop you are getting ready to drive past.
It says that since you’re in the area, you can get 10% off if you stop by in the
next 20 minutes.
Finally, once you get back home, you receive notice of a gadget upgrade
available for purchase in your favorite online video game.
Etc.
DATA SOURCES
• The explosion of new and powerful data sources such as Facebook,
Twitter, LinkedIn, and YouTube contributes immensely to
big data and research.
• Advanced analytics will have a great impact.
• To stay competitive, it is imperative that organizations
aggressively pursue capturing and analyzing these new data
sources to gain the insights that they offer.
• Ignoring big data will put an organization at risk and cause it to
fall behind the competition.
• Analytic professionals have a lot of work to do! It won’t be easy
to incorporate big data alongside all the other data that has
been used for analysis for years.
Big Data?
500 Million Tweets sent each day!
More than 4 Million Hours of content uploaded to
YouTube every day!
3.6 Billion Instagram Likes each day.
4.3 BILLION Facebook messages posted daily!
5.75 BILLION Facebook likes every day.
40 Million Tweets shared each day!
6 BILLION daily Google Searches!
Data Volume
◦ 44x increase from 2009 to 2020
◦ From 0.8 zettabytes (ZB) to 35 ZB
Data volume is increasing
exponentially
Exponential increase in
collected/generated data
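The growth figures quoted above can be checked with a few lines of arithmetic. This is only a sketch: it assumes growth compounds at a constant rate each year between 2009 and 2020, which the slide does not state.

```python
# Figures quoted on the slide: 0.8 ZB in 2009, 35 ZB projected for 2020.
start_zb = 0.8
end_zb = 35.0
years = 2020 - 2009

# Overall multiple between the two years (the slide rounds this to 44x).
growth_factor = end_zb / start_zb

# Assumption: the same percentage growth every year (compound growth).
annual_rate = growth_factor ** (1 / years) - 1

print(f"overall growth: {growth_factor:.2f}x over {years} years")
print(f"implied compound annual growth: {annual_rate:.1%}")
```

Under that assumption, the data volume grows by roughly 40% a year, which is what "exponential" means here: a constant percentage increase rather than a constant absolute increase.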
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones worldwide
• 12+ TBs of tweet data every day
• 100s of millions of GPS-enabled devices sold annually
• ? TBs of data every day
• 25+ TBs of log data every day
• 2+ billion people on the Web by end 2011
• 76 million smart meters in 2009… 200M by 2014
Maximilien Brice, © CERN
CERN’s Large Hadron Collider (LHC) generates 15 PB
The Earthscope
• The Earthscope is the world's
largest science project. Designed
to track North America's geological
evolution, this observatory records
data over 3.8 million square miles,
amassing 67 terabytes of data. It
analyzes seismic slips in the San
Andreas fault, sure, but also the
plume of magma underneath
Yellowstone and much, much more.
(http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)
Variety (Complexity)
Relational Data (Tables/Transaction/Legacy
Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data
◦ Social Network, Semantic Web (RDF), …
Streaming Data
◦ You can only scan the data once
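Because a stream can be scanned only once, any statistic has to be updated incrementally as each element arrives, without storing the stream. The sketch below illustrates the idea with a running count, mean, and maximum; `sensor_readings` is a hypothetical stand-in for a live feed, not something from the slides.

```python
from typing import Iterable, Iterator, Tuple


def running_stats(stream: Iterable[float]) -> Iterator[Tuple[int, float, float]]:
    """Yield (count, mean, max) after each element, in a single pass."""
    count = 0
    mean = 0.0
    largest = float("-inf")
    for x in stream:
        count += 1
        # Incremental mean: adjust the old mean by the new element's share.
        mean += (x - mean) / count
        largest = max(largest, x)
        yield count, mean, largest


def sensor_readings() -> Iterator[float]:
    """Hypothetical feed; each value is produced once and then gone."""
    for value in [3.0, 5.0, 4.0, 8.0]:
        yield value


final = None
for final in running_stats(sensor_readings()):
    pass  # each element is seen exactly once and never revisited

print(final)
```

Memory use stays constant no matter how long the stream runs, which is the property that one-pass stream algorithms rely on.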
[Figure: a view of the Customer built from Social Media, Banking, Finance, Gaming, Entertain, Purchase, and Our Known History]
Velocity (Speed)
Mobile devices
(tracking all objects all the time)
[Figure: real-time actions around the Customer]
• Product recommendations that are relevant and compelling
• Learning why customers switch to competitors and their offers, in time to counter
• Influencing behavior
• Friend invitations to join a game or activity that expands business
• Improving the marketing effectiveness of a promotion while it is still in play
• Preventing fraud as it is occurring, and preventing more proactively
Variability:
•It is often confused with variety.
Example:
•Say you have a bakery that sells 10 different breads.
That is variety. Now imagine you go to that bakery
three days in a row and every day you buy the same
type of bread, but each day it tastes and smells
different. That is variability.
•Variability means that the meaning
is changing (rapidly).
•Variability is thus very relevant in performing
sentiment analysis, where the same word can carry a
different meaning depending on context.
Some Make it 4V’s
Visualization
In Summary:
•Whether it stays big or whether it ends up being
small when you’re done processing it,
[The key here is to get the right people. You need the
right people attacking big data and attempting to solve
the right kinds of problems]
•Unstructured Data
Step 5: Results:
• Positive opinion
• Negative opinion
•The complexity of the rules and the magnitude of the
data being removed or kept at each stage will vary by
data source and by business problem.
•The load processes and filters that are put on top of
big data are absolutely critical. Without getting those
correct, it will be very difficult to succeed.
•Traditional structured data doesn’t require as much
effort in these areas since it is specified, understood,
and standardized in advance.
•With big data, it is necessary to specify, understand,
and standardize it as part of the analysis process in
many cases.
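The staged filtering described above can be sketched as a sequence of rules applied one after another, reporting how much data survives each stage. The records, field names, and rules below are hypothetical illustrations; real rules depend on the data source and the business problem.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Stage = Tuple[str, Callable[[Record], bool]]


def run_filters(records: List[Record], stages: List[Stage]) -> List[Record]:
    """Apply each filter stage in order and report how many records survive."""
    kept = records
    for name, keep in stages:
        before = len(kept)
        kept = [r for r in kept if keep(r)]
        print(f"{name}: kept {len(kept)} of {before}")
    return kept


# Hypothetical web-log records; the field names are illustrative only.
raw = [
    {"customer_id": 17, "event": "view_product", "is_bot": False},
    {"customer_id": None, "event": "view_product", "is_bot": False},
    {"customer_id": 42, "event": "heartbeat", "is_bot": False},
    {"customer_id": 42, "event": "add_to_cart", "is_bot": True},
    {"customer_id": 8, "event": "add_to_cart", "is_bot": False},
]

# Hypothetical rules, ordered from the cheapest check to the most specific.
stages = [
    ("drop bot traffic", lambda r: not r["is_bot"]),
    ("drop unidentified visitors", lambda r: r["customer_id"] is not None),
    ("keep business events", lambda r: r["event"] in {"view_product", "add_to_cart"}),
]

useful = run_filters(raw, stages)
```

Printing the count at each stage makes it visible where most of the data is discarded, which helps when a rule turns out to be too aggressive or too lax.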
b. Data Warehouse
Hierarchy of Enterprise Data
THE NEED FOR STANDARDS
•Will big data continue to be a wild west of crazy
formats, unconstrained streams, and lack of definition?
•Example:
• SQL or a similar language: usage with big data
• Formats, Interfaces to support interoperability
across distributed applications
• Web semantics: XML, OWL etc., with Big Data
• Cloud computing – Big data
TODAY’S BIG DATA IS NOT TOMORROW’S BIG DATA
Example 1:
•This is all useful, but an airline can get even more from
•An airline can identify customers who value
convenience. (Such customers typically start searches for
specific times and direct flights only.)
•Airlines can also identify customers who value price first
and foremost and are willing to consider many flight
options to get the best price.
•Does it matter?
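The two customer types described above can be told apart with simple rules over search behavior. The sketch below is illustrative only: the `SearchSession` fields and the `many_options` threshold are assumptions, not something an airline is known to use.

```python
from dataclasses import dataclass


@dataclass
class SearchSession:
    """Hypothetical summary of one customer's flight search."""
    customer_id: str
    direct_only: bool      # searched for direct flights only
    specific_time: bool    # started with a specific departure time
    options_viewed: int    # number of flight options considered


def segment(session: SearchSession, many_options: int = 10) -> str:
    """Label a session as convenience-driven or price-driven.

    `many_options` is an assumed cut-off for "willing to consider
    many flight options"; it would need tuning on real data.
    """
    if session.direct_only and session.specific_time:
        return "convenience"
    if session.options_viewed >= many_options:
        return "price"
    return "unclassified"


sessions = [
    SearchSession("A", direct_only=True, specific_time=True, options_viewed=2),
    SearchSession("B", direct_only=False, specific_time=False, options_viewed=25),
    SearchSession("C", direct_only=False, specific_time=True, options_viewed=3),
]
labels = {s.customer_id: segment(s) for s in sessions}
print(labels)
```

Sessions that match neither rule stay unclassified rather than being forced into a segment, since browsing behavior alone may not reveal what the customer values.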
Web Data in Action
•What an organization knows about its customers is
never the complete picture.
•If there is only a partial view, the full view can often be
extrapolated accurately enough to get the job done.
Case 1: BANK
• Mr. Kumar has an account with
PNB … etc., with relevant
information.
Example :
•Mrs. Smith, as a customer of telecom Provider “AIR”,
goes to Google and types “How do I cancel my Provider
AIR contract?” (Web Data).
• Company analysts may or may not have seen her
usage dropping.
•What is segmentation?
•How was segmentation done traditionally?
•Web data also enables segmentation of customers
based on their typical browsing patterns.
(Seminar/Project topic on assessing browsing pattern of
users)
•Such segmentation will provide a completely different
view of customers than traditional demographic or sales-
based segmentation schemas.
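As a minimal illustration of segmenting customers by browsing pattern, the sketch below labels each customer by the type of page they visit most often. The clickstreams and page-type names are hypothetical, and a real project would more likely cluster the full browsing profiles (for example with K-Means) than take a single dominant type.

```python
from collections import Counter
from typing import Dict, List


def browsing_profile(page_views: List[str]) -> Dict[str, float]:
    """Share of a customer's page views that fall in each page type."""
    counts = Counter(page_views)
    total = sum(counts.values())
    return {page_type: n / total for page_type, n in counts.items()}


def dominant_segment(page_views: List[str]) -> str:
    """Label a customer by the page type they visit most often."""
    if not page_views:
        return "no activity"
    profile = browsing_profile(page_views)
    return max(profile, key=profile.get)


# Hypothetical clickstreams: one list of page types per customer.
clickstreams = {
    "cust_1": ["reviews", "reviews", "product", "reviews"],
    "cust_2": ["deals", "deals", "product"],
    "cust_3": ["product", "checkout", "product"],
    "cust_4": [],
}

segments = {cid: dominant_segment(views) for cid, views in clickstreams.items()}
print(segments)
```

Nothing here uses age, income, or purchase history, which is why a browsing-based view of customers can differ so much from demographic or sales-based segmentation.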
•The fact is, if you carry a cell phone, you can keep a
record of everywhere you’ve been. You can also open
up that data to others if you choose.