Unit 1 - BD - Introduction To Big Data
Strictly for internal circulation (within KIIT) and reference only. Not for outside circulation without permission
(Slide graphic: motivations for studying Big Data - business data, big data science, real-time analytics, and job market usability. Prerequisite: Database Management Systems.)
Course Contents
1. Introduction to Big Data (6 hrs)
Importance of Data, Characteristics of Data, Analysis of Unstructured Data, Combining Structured and Unstructured Sources. Introduction to Big Data Platform – Challenges of conventional systems – Web data – Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting – Modern data analytic tools, Types of Data, Elements of Big Data, Big Data Analytics, Data Analytics Lifecycle.
2. Big Data Technology Foundations (8 hrs)
Exploring the Big Data Stack, Data Sources Layer, Ingestion Layer, Storage Layer, Physical Infrastructure Layer, Platform Management Layer, Security Layer, Monitoring Layer, Analytics Engine, Visualization Layer, Big Data Applications, Virtualization. Introduction to Streams Concepts – Stream data model and architecture – Stream Computing, Sampling data in a stream – Filtering streams, Counting distinct elements in a stream.
Course Learning Outcomes:
Apply the data analytics life-cycle approach.
Classify and examine the data under the big data stack and associated technologies.
Evaluate big data technologies to analyze big data and create models.
Grading:
Internal assessment – 30 marks
2 quizzes = 2.5 X 2 = 5 marks
2 group assignments = 2.5 X 2 = 5 marks
Research Paper Presentation = 10 marks
Mini Project = 10 marks
Mid-Term exam - 20 marks
End-Term exam - 50 marks
Data
A representation of information, knowledge, facts, concepts or instructions
which are being prepared or have been prepared in a formalized manner.
Data is either intended to be processed, is being processed, or has been
processed.
It can be in any form stored internally in a computer system or computer
network or in a person’s mind.
Since the mid-1900s, people have used the word data to mean computer
information that is transmitted or stored.
Data is the plural of datum (a Latin word meaning something given), a single
piece of information. In practice, however, people use data as both the
singular and plural form of the word.
It must be interpreted, by a human or a machine, to derive meaning.
It is present in homogeneous as well as heterogeneous sources.
The need of the hour is to understand, manage, process, and take the data through analysis to draw valuable insights.
Data → Information → Knowledge → Actionable Insights
Importance of Data
The ability to analyze and act on data is increasingly important to businesses. It might be part of a study helping to cure a disease, boost a company’s revenue, understand and interpret market trends, study customer behavior, or make financial decisions.
The pace of change requires companies to be able to react quickly to changing demands from customers and environmental conditions. Although prompt action may be required, decisions are increasingly complex as companies compete in a global marketplace.
Managers may need to understand high volumes of data before they can make the necessary decisions.
Relevant data creates strong strategies - Opinions can turn into great hypotheses, and those hypotheses are just the first step in creating a strong strategy. It can look something like this: “Based on X, I believe Y, which will result in Z”.
Relevant data strengthens internal teams.
Relevant data quantifies the purpose of the work.
Characteristics of Data
(Slide graphic, summarized:)
Semi-structured data: web data in the form of XML, cookies, JSON, and other markup languages. Characteristics: inconsistent structure; self-describing (label/value pairs); schema information is blended with the data values.
Unstructured data: chats and text messages, text both internal and external to the organization, mobile data, social media data, images, audios, and videos.
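To make the “self-describing label/value pair” idea concrete, here is a minimal sketch (a hypothetical JSON record, parsed with Python’s standard library):

```python
import json

# Semi-structured data: the record carries its own labels, but different
# records may have different fields (inconsistent structure).
record = '{"user": "alice", "age": 30, "tags": ["sports", "music"]}'

data = json.loads(record)         # label/value pairs become a dict
print(data["user"], data["age"])  # fields are addressed by label, not position
```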
Challenges associated with Unstructured data
Working with unstructured data poses certain challenges, which are as follows:
Identifying the unstructured data that can be processed
Sorting, organizing, and arranging unstructured data in different sets and formats
Combining and linking unstructured data in a more structured format to derive any logical conclusions out of the available information
Costing in terms of storage space and the human resources needed to deal with the exponential growth of unstructured data
Data Analysis of Unstructured Data
The complexity of unstructured data lies within the language that created it. Human language is quite different from the language used by machines, which prefer structured information. Unstructured data analysis refers to the process of analyzing data objects that do not follow a predefined data model and/or are unorganized. It is the analysis of any data that is stored over time within an organizational data repository without any intent for its orchestration, pattern, or categorization.
Big Data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
Source: Gartner IT Glossary
(Slide graphic: structured data, semi-structured data, and unstructured data together make up Big Data.)
Process challenges:
Capturing data
Aligning data from different sources
Transforming data into a form suitable for data analysis
Modeling data (mathematical modeling, simulation)
Management challenges:
Security
Privacy
Governance
Ethical issues
Elements of Big Data
In most big data circles, these are called the four V’s: volume, variety, velocity, and veracity.
(One might consider a fifth V, value.)
Volume - refers to the incredible amounts of data generated each second from social media, cell phones, cars, credit cards, M2M sensors, photographs, video, etc. The vast amounts of data have in fact become so large that they can no longer be stored and analyzed using traditional database technology. Instead, distributed systems are used, where parts of the data are stored in different locations and brought together by software.
Variety - defined as the different types of data that digital systems now use. Data today looks very different from data of the past. New and innovative big data technology now allows structured and unstructured data to be harvested, stored, and used simultaneously.
Velocity - refers to the speed at which vast amounts of data are being generated, collected, and analyzed. Every second of every day, data is increasing. Not only must it be analyzed, but the speed of transmission and access to the data must also remain instantaneous to allow for real-time access. Big data technology allows data to be analyzed while it is being generated, without ever putting it into databases.
Veracity - is the quality or trustworthiness of the data. Just how accurate is all this data? For example, think about all the Twitter posts with hashtags, abbreviations, typos, etc., and the reliability and accuracy of all that content.
Elements of Big Data cont’d
Value - refers to the ability to transform a tsunami of data into business value. Having endless amounts of data is one thing, but unless it can be turned into value it is useless.
Refer to the Appendix for data volumes.
More data for analysis will result in greater analytical accuracy and greater confidence in the decisions based on the analytical findings. This would entail a greater positive impact in terms of enhancing operational efficiencies, reducing cost and time, innovating on new products and services, and optimizing existing services.
More data
A human being lives in a social environment and gains knowledge and experience through communication. Today, communication is not restricted to meeting in person. The Internet and mobile devices have made communication and sharing of data possible across the globe. Social networking sites such as Twitter, Facebook, and LinkedIn produce data from people. Social Network Analysis (SNA) is the analysis performed on the data obtained from social media. As such data is generated in huge volumes, it results in the formation of a Big Data pool.
Every minute (indicative figures):
Instagram users share 1 M pieces of content
YouTube users upload 72 hrs of new video
Email users send 200 M messages
Amazon generates over $80,000 in online sales
Twitter users send over 300,000 tweets
Facebook users share 2.5 M pieces of content
The following are the areas in which decision-making processes are influenced by social network data:
Business Intelligence: It is a data analysis process to convert a raw dataset into meaningful information by using different techniques and tools for boosting business performance. This system allows a company to collect, store, access, and analyze data to add value to its decision making.
Product design and development: With the increasing popularity of social media and the growing volume of data every second, organizations competing to make it big in the market must not only identify and extract the information relevant to their company, products, and services but also comprehend and respond to that information on a continuous basis. By listening to what customers want, by understanding where the gaps in the offering are, and so on, organizations can take the right steps in the direction of their product development and offerings. In this way, social network data can help organizations improve product development and services, making sure that customers ultimately get the products and services they want.
Descriptive - What’s happening in my business?
•Comprehensive, accurate and historical data
•Effective visualisation
Diagnostic - Why is it happening?
•Ability to drill down to the root cause
•Ability to isolate all confounding information
Predictive - What’s likely to happen?
•Decisions are automated using algorithms and technology
•Historical patterns are being used to predict specific outcomes using algorithms
Prescriptive - What do I need to do?
•Recommended actions and strategies based on champion/challenger strategy outcomes
•Applying advanced analytical algorithms to make specific recommendations
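As a rough illustration of how the first three approaches map onto code, here is a minimal sketch (hypothetical daily-sales numbers; pandas assumed available):

```python
import pandas as pd

# Hypothetical daily sales for one store.
sales = pd.Series([120, 135, 128, 90, 142, 138, 95],
                  index=pd.date_range("2024-01-01", periods=7))

# Descriptive: what's happening? Summarize comprehensive historical data.
print(sales.describe())

# Diagnostic: why is it happening? Drill down to the outlying days.
print(sales[sales < sales.mean() - sales.std()])

# Predictive: what's likely to happen? Project the next value from the
# historical pattern (a naive 3-day moving average stands in for a real model).
print(sales.rolling(3).mean().iloc[-1])
```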
Historical data can be quite large. There might be a need to process huge amounts of data many times a day as it gets updated continuously; therefore, volume is mapped to history. Variety is pervasive: input data, insights, and decisions can span a variety of forms, hence variety is mapped to all three. High-velocity data might have to be processed to help real-time decision making, and velocity plays across descriptive, predictive, and prescriptive analytics when they deal with present data. Predictive and prescriptive analytics create data about the future. That data is uncertain by nature, and its veracity is in doubt; therefore, veracity is mapped to predictive and prescriptive analytics when they deal with the future.
It goes without saying that the world of big data requires new levels of scalability. As the
amount of data organizations process continues to increase, the same old methods for
handling data just won’t work anymore. Organizations that don’t update their
technologies to provide a higher level of scalability will quite simply choke on big data.
Luckily, there are multiple technologies available that address different aspects of the
process of taming big data and making use of it in analytic processes.
Traditional Analytics Architecture
(Slide graphic: data is extracted from Database 1, Database 2, Database 3, …, Database n into a dedicated analytic server.)
In an in-database environment, the processing stays in the database where the data
has been consolidated. The user’s machine just submits the request; it doesn’t do
heavy lifting.
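A minimal sketch of the contrast (a hypothetical orders table; sqlite3 stands in for the consolidated database):

```python
import sqlite3

# Toy database standing in for the consolidated warehouse (hypothetical schema).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (region TEXT, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)",
               [("east", 10.0), ("west", 20.0), ("east", 5.0)])

# Traditional approach: extract all rows, then aggregate on the user's machine.
rows = db.execute("SELECT region, amount FROM orders").fetchall()
totals = {}
for region, amount in rows:
    totals[region] = totals.get(region, 0.0) + amount

# In-database approach: ship the request, let the database do the heavy lifting.
in_db = db.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region").fetchall()

print(totals, dict(in_db))  # same answer; only the work's location differs
```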
(Slide graphic: a one-terabyte table is split into 100-gigabyte chunks.)
An MPP (massively parallel processing) system breaks the job into pieces and allows the different sets of CPU and disk to run the process concurrently.
(Slide graphic: a single-threaded process vs. a parallel process.)
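The same divide-and-conquer idea in miniature, using ordinary Python with the standard multiprocessing module as a toy stand-in for an MPP database:

```python
from multiprocessing import Pool

def process_chunk(chunk):
    """Work done independently on one 'disk' of data (here: a simple sum)."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))            # the "one-terabyte table", in miniature
    n = 10
    chunks = [data[i::n] for i in range(n)]  # break the job into pieces

    with Pool(processes=n) as pool:          # each worker plays one CPU/disk pair
        partials = pool.map(process_chunk, chunks)

    print(sum(partials))                     # combine the partial results
```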
Analysis vs. Reporting
Reporting: The process of organizing data into informational summaries in order to monitor how different areas of a business are performing.
Analysis: The process of exploring data and reports in order to extract meaningful insights, which can be used to better understand and improve business performance.
Difference between Reporting and Analysis:
Reporting translates raw data into information. Analysis transforms data
and information into insights.
Reporting helps companies to monitor their online business and be alerted
to when data falls outside of expected ranges. Good reporting should raise
questions about the business from its end users. The goal of analysis is to
answer questions by interpreting the data at a deeper level and providing
actionable recommendations.
In summary, reporting shows you what is happening while analysis focuses
on explaining why it is happening and what you can do about it.
Big data analytics is the process of extracting useful information by analyzing different types of big data sets. It is used to discover hidden patterns and outliers, unearth trends and unknown correlations, and surface other useful information for the benefit of faster decision making.
Big Data Application in different Industries
(Slide graphic omitted.)
Big Data Analytics isn’t:
“One-size-fits-all” traditional RDBMS built on shared disk and memory
Only used by huge online companies
Meant to replace the data warehouse
Source: https://en.wikipedia.org/wiki/Symmetric_multiprocessing
Terminologies used in Big Data cont’d
Parallel Systems: A parallel database system is a tightly coupled system. The processors cooperate for query processing. The user is unaware of the parallelism, since he/she has no access to a specific processor of the system.
(Slide graphic: multiple users connected through a network to processors P1, P2, and P3.)
Next, the client requests that v1 be written to S1. Since the system is available, S1 must respond. Since the network is partitioned, however, S1 cannot replicate its data to S2. This phase of execution is called α1.
(Slide graphic: S1’s value changes from v0 to v1 and the write is acknowledged, while S2 keeps v0.)
Next, the client issues a read request to S2. Again, since the system is available, S2 must respond, and since the network is partitioned, S2 cannot update its value from S1. It returns v0. This phase of execution is called α2.
(Slide graphic: S1 holds v1, S2 holds v0, and S2 answers the read with v0.)
S2 returns v0 to the client after the client has already written v1 to S1. This is inconsistent.
We assumed that a consistent, available, partition-tolerant system existed, but we have just shown that for any such system there exists an execution in which the system acts inconsistently. Thus, no such system exists.
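To make the argument concrete, here is a minimal sketch that replays the two phases (toy classes, not any real database API):

```python
# Toy model of two replicas separated by a network partition.
# S1 and S2 are hypothetical stand-ins for the servers in the proof.

class Server:
    def __init__(self, name, value):
        self.name = name
        self.value = value
        self.reachable = set()  # servers this one can replicate to

    def write(self, value):
        self.value = value              # must respond: system is "available"
        for peer in self.reachable:     # replication is impossible under partition
            peer.value = value

    def read(self):
        return self.value               # must respond: system is "available"

s1, s2 = Server("S1", "v0"), Server("S2", "v0")
# Network partition: s1.reachable and s2.reachable stay empty.

s1.write("v1")        # phase α1: write accepted, replication impossible
stale = s2.read()     # phase α2: S2 must answer, returns the old value

print(stale)          # "v0" -- the read is inconsistent with the write
```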
After the data analysis has been performed and the results have been presented, the final step of the Big Data Lifecycle is to use the results in practice.
The utilization of analysis results is dedicated to determining how and where the processed data can be further utilized to leverage the results of the big data project.
Depending on the nature of the analysis problems being addressed, it
is possible for the analysis results to produce “models” that
encapsulate new insights and understandings about the nature of the
patterns and relationships that exist within the data that was
analyzed.
A model may look like a mathematical equation or a set of rules.
Models can be used to improve business process logic and
application system logic, and they can form the basis of a new system
or software program.
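As a hypothetical illustration (not from the course material), the “model” an analysis produces can be as simple as a fitted equation or a rule derived from it:

```python
import numpy as np

# Hypothetical historical data: advertising spend vs. sales.
spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sales = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# The "model" the analysis produces is just an equation: sales = a*spend + b.
a, b = np.polyfit(spend, sales, 1)
print(f"model: sales = {a:.2f} * spend + {b:.2f}")

# The same model can also be expressed as a business rule.
def approve_campaign(planned_spend, target_sales):
    """Rule derived from the model: approve if predicted sales meet the target."""
    return a * planned_spend + b >= target_sales

print(approve_campaign(4.5, 9.0))
```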
Summary
Detailed Lessons
Importance of Data, Characteristics of Data, Analysis of Unstructured Data, Combining
Structured and Unstructured Sources. Introduction to Big Data Platform – Challenges of
conventional systems – Web data – Evolution of Analytic scalability, analytic processes
and tools, Analysis vs reporting – Modern data analytic tools, Types of Data, Elements
of Big Data, Big Data Analytics, Data Analytics Lifecycle.
Appendix
Data Mining: Data mining is the process of looking for hidden, valid, and
potentially useful patterns in huge data sets. Data Mining is all about
discovering unsuspected/previously unknown relationships amongst the
data. It is a multi-disciplinary skill that uses machine learning, statistics,
AI and database technology.
Natural Language Processing (NLP): NLP gives the machines the ability
to read, understand and derive meaning from human languages.
Text Analytics (TA): TA is the process of extracting meaning out of text.
For example, this can be analyzing text written by customers in a
customer survey, with the focus on finding common themes and trends.
The idea is to be able to examine the customer feedback to inform the
business on taking strategic action, in order to improve customer
experience.
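A minimal sketch of the idea, using hypothetical survey responses and simple word counting rather than a full text-analytics pipeline:

```python
from collections import Counter
import re

# Hypothetical customer-survey responses.
responses = [
    "Delivery slow but support helpful",
    "Great support, slow delivery",
    "Slow delivery again; please fix delivery",
]

# Tokenize and count terms to surface common themes.
words = Counter()
for text in responses:
    words.update(re.findall(r"[a-z]+", text.lower()))

print(words.most_common(3))  # [('delivery', 4), ('slow', 3), ('support', 2)]
```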
Noisy text analytics: It is a process of information extraction whose goal
is to automatically extract structured or semi-structured information
from noisy unstructured text data.
Appendix cont…
ETL: ETL is short for extract, transform, load, three database functions that are
combined into one tool to pull data out of one database and place it into another
database.
Extract is the process of reading data from a database. In this stage, the data is
collected, often from multiple and different types of sources.
Transform is the process of converting the extracted data from its previous form
into the form it needs to be in so that it can be placed into another database.
Transformation occurs by using rules or lookup tables or by combining the data
with other data.
Load is the process of writing the data into the target database.
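A compact sketch of the three steps (hypothetical table names and conversion rate; sqlite3 in-memory databases keep the example self-contained):

```python
import sqlite3

# Set up a throwaway source and target database (hypothetical schema).
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE sales (amount_usd REAL)")
src.executemany("INSERT INTO sales VALUES (?)", [(10.0,), (25.5,)])

tgt = sqlite3.connect(":memory:")
tgt.execute("CREATE TABLE sales_eur (amount_eur REAL)")

# Extract: read the data from the source database.
rows = src.execute("SELECT amount_usd FROM sales").fetchall()

# Transform: convert the extracted data using a rule (assumed exchange rate).
RATE = 0.9
converted = [(amount * RATE,) for (amount,) in rows]

# Load: write the transformed data into the target database.
tgt.executemany("INSERT INTO sales_eur VALUES (?)", converted)
tgt.commit()

print(tgt.execute("SELECT * FROM sales_eur").fetchall())
```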