Big Data Analytics (18CS72)
Module 1
Introduction
When we talk about social media platforms, millions of users connect on a daily basis.
Information is shared whenever users use a social media platform or any other website, so the
question arises: how is this huge amount of data handled, and through what medium or
tools is it processed and stored? This is where Big Data comes into the picture.
Above figure shows the data usage and growth. As size and complexity increase, the
proportion of unstructured data types also increases.
An example of a traditional tool for structured data storage and querying is an RDBMS.
The volume, velocity, and variety (3Vs) of data require the use of a number of programs and tools
for analyzing and processing data at high speed.
Big Data requires new tools for the processing and analysis of large volumes of data, for example
NoSQL (Not only SQL) databases and Hadoop technology.
Big Data:
Definitions of Data:
The Web is a large-scale source of information and data present on web servers.
Web is a part of the internet that stores web data in the form of documents and other web
resources.
Web data is the data present on web servers in the form of text, images, videos, audios and
multimedia files for web users.
Internet applications, including web sites, web services, web portals, e-mails and chats,
consume web data.
Examples of web data are Wikipedia, Google maps, McGraw-Hill Connect and YouTube.
Classification of Data
Data can be classified as : Structured Data
Semi-structured data
Unstructured data
Structured Data
Structured data is one of the types of data; by structured data, we mean data that can be
processed, stored and retrieved in a fixed format.
It refers to highly organized information that can be readily and easily stored and accessed
from a database by simple search engine algorithms.
Structured data conforms to and associates with data schemas and data models.
For instance, the employee table in a company database will be structured, as the employee
details, their job positions, their salaries, etc., will be present in an organized manner.
Structured data enables insert, delete, update and append operations, indexing to enable faster retrieval,
scalability, and transaction processing that follows the ACID rules.
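The following is a minimal sketch in Python using the built-in sqlite3 module (the employee table, column names and values are hypothetical) to illustrate how structured data supports insert, update, indexing and ACID-style transactions:

```python
import sqlite3

# In-memory database used only for illustration; any RDBMS behaves similarly.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Fixed schema: every row has the same typed columns.
cur.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, position TEXT, salary REAL)")
cur.execute("CREATE INDEX idx_position ON employee(position)")  # index for faster retrieval

# Insert and update run inside a transaction and become durable on commit (ACID).
cur.execute("INSERT INTO employee VALUES (?, ?, ?, ?)", (1, "Asha", "Analyst", 55000.0))
cur.execute("UPDATE employee SET salary = ? WHERE emp_id = ?", (60000.0, 1))
conn.commit()

for row in cur.execute("SELECT name, position, salary FROM employee"):
    print(row)
conn.close()
```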
Unstructured data
Unstructured data refers to the data that lacks any specific form or structure.
This makes it very difficult and time-consuming to process and analyze unstructured data.
Email, .CSV, .TXT, mobile data, social media posts and image files are examples of
unstructured data.
Semi-structured data
Semi-structured data is the third type of data. It contains both of the formats mentioned
above, that is, structured and unstructured data.
Such data does not associate with data models, such as the relational database and table models.
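For example, a minimal sketch in Python using the standard json module (the records are made up for illustration) shows how semi-structured data such as JSON carries keys and tags, yet individual records need not share one fixed schema:

```python
import json

# Two records with overlapping but not identical fields: no fixed table schema.
raw = '''[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "phones": ["98450-00000", "98860-11111"]}
]'''

records = json.loads(raw)
for rec in records:
    # Keys are inspected per record instead of relying on a rigid schema.
    print(rec["id"], rec.get("email", "no email"), rec.get("phones", []))
```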
Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective,
innovative forms of information processing for enhanced insight and decision making.
A collection of data sets so large or complex that traditional data processing applications are
inadequate.
Big data refers to data sets whose size is beyond the ability of typical database software tools
to capture, store, manage and analyze.
a) Variety
Variety of Big Data refers to structured, unstructured, and semi-structured data that is
gathered from multiple sources.
In the early days, data could only be collected from spreadsheets and databases; today, data
comes in an array of forms such as emails, PDFs, photos, videos, audio, social media posts and
much more. Variety is one of the important characteristics of big data.
b) Velocity
Velocity essentially refers to the speed at which data is being created in real time. In a
broader perspective, it comprises the rate of change, the linking of incoming data sets at varying
speeds, and activity bursts.
c) Volume
We already know that Big Data indicates huge volumes of data being generated on a
daily basis from various sources such as social media platforms, business processes, machines,
networks, human interactions, etc. Such a large amount of data is stored in data warehouses.
d) Veracity
Veracity refers to the quality of data captured, which can vary greatly, affecting its accurate analysis.
Transaction data and Business process(BP) data, such as credit card transactions, flight
bookings, etc.
Customer master data, such as data for facial recognition, date of birth, locality, salary
specification.
Machine generated data, such as machine to machine or Internet of Things data and data
from sensors, trackers, web logs.
Human-generated data, such as biometric data, human-machine interaction data and e-mail
records with a mail server; humans also record their experiences as data.
Below table gives various classification methods for data and Big Data.
* Storage of huge data volumes, data distribution, high-speed networks and high-performance
computing.
* Application scheduling using open-source, reliable, scalable, distributed file systems,
distributed databases, and parallel and distributed computing systems such as Hadoop or Spark.
* Open source tools which are scalable, elastic and provide virtualized environment.
* Data mining and analytics, data retrieval, data reporting, Machine learning Big data tools.
Processing complex applications with large datasets needs hundreds of computing nodes.
Processing this much distributed data within a short time and at minimum cost is
problematic.
Big Data can co-exist with traditional data stores. Traditional data stores use RDBMS tables or
a data warehouse.
Big data processing and analytics requires scaling up and scaling out, both vertical and
horizontal computing resources.
Scalability enables increase or decrease in the capacity of data storage, processing and
analytics.
Scaling up: Designing the algorithm according to the architecture that uses resources
efficiently.
Horizontal scalability: Increasing the number of systems working in coherence and scaling out
the work load.
Scaling out: Scaling out means using more resources and distributing the processing and storage
tasks in parallel.
The easiest way to scale up and scale out execution of analytics software is to implement it
on a bigger machine with more CPUs for greater volume, velocity, variety and complexity
of data.
Alternative ways for scaling up and scaling out the processing of analytics software are
Massively Parallel Processing (MPP) platforms, cloud, grid, cluster and distributed
computing software.
Above table gives requirements of processing and analyzing big, large and small to medium
datasets on distributed computing nodes.
Cloud Computing
Cloud computing is a type of Internet based computing that provides shared processing
resources and data to the computers and other devices on demand.
One of the best approaches for data processing is to perform parallel and distributed
computing in a cloud computing environment.
Grid computing refers to distributed computing in which a group of computers from several
locations are connected with each other to achieve a common task.
A group of computers that might be spread over remote locations comprises a grid. A grid is used for a
variety of purposes.
Drawbacks: a single point of failure, and the load on capacity increases while sharing the resources.
2. Cluster Computing
Volunteer Computing
Volunteers provide computing resources to projects of importance that use the resources to do
distributed computing or storage.
Issues: heterogeneity of the volunteered computers, drop-outs from the network over time, their
sporadic availability, and incorrect results at volunteers.
Big Data architecture is the logical and/or physical layout of how big data will be stored,
accessed and managed within a Big Data or IT environment.
Above figure shows the logical layers and the functions which are considered in Big Data
Architecture.
Five vertically aligned text boxes on the left side of the above figure show the layers:
1. Identification of data sources
2. Data ingestion, acquisition and pre-processing
3. Data storage
4. Data processing
5. Data consumption
Ingestion and ETL processes run either in real time, which means the data is stored and used as
generated, or in batches.
Data storage considerations include type, format, compression, incoming data frequency, querying patterns
and consumption requirements for L4 or L5.
Data storage uses the Hadoop Distributed File System (HDFS) or NoSQL data stores such as HBase,
Cassandra and MongoDB.
Data processing uses software such as MapReduce, Hive, Pig, Spark and Spark Streaming.
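A minimal sketch of Layer 4 processing with Spark is given below (it assumes PySpark is installed and that an input file named web_logs.txt exists; both the file name and the application name are illustrative):

```python
from pyspark.sql import SparkSession

# Start a local Spark session; on a cluster the master URL and resources would differ.
spark = SparkSession.builder.appName("L4Processing").getOrCreate()

# Classic MapReduce-style word count expressed as Spark RDD transformations.
lines = spark.sparkContext.textFile("web_logs.txt")   # hypothetical input file
counts = (lines.flatMap(lambda line: line.split())    # map each line to words
               .map(lambda word: (word, 1))           # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))      # reduce: sum counts per word

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```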
Data integration
2. Data governance, which includes establishing the processes for ensuring the
availability of data.
Data sources can be external such as sensors, trackers, web logs, computer systems logs and
feeds.
Data sources can be internal, such as databases, relational DBs, flat files, spreadsheets and mail
servers.
Data source for storage and processing can be file, database or streaming data.
Examples of structured data sources are SQL Server, MySQL, Microsoft Access database and
Oracle DBMS.
A data source should have a meaningful name, for example a name which identifies the
data stored for student grades during processing; the data source name could be
StudentName_Data_Grades.
A data dictionary enables references for access to data. The dictionary is stored at a central
location.
Microsoft applications consider two types of data sources for processing: machine sources
and file sources.
Unstructured data sources are distributed over high-speed networks. The data needs high-velocity
processing.
The sources are of file types such as .txt (text file) and .csv (comma-separated values file).
The data sources can be sensors, sensor networks, signals from machines, devices,
controllers.
Sensors are electronic devices that sense the physical environment. These devices are
used for measuring temperature, pressure and humidity.
Data Quality
Data quality refers to the state of qualitative and quantitative pieces of information.
High quality means data which enables all the required operations, analysis, decisions,
planning and knowledge discovery correctly.
Data quality can be defined with the 5 Rs: relevancy, recency, robustness, range and
reliability.
1. Data integrity
Data integrity refers to the maintenance of consistency and accuracy in data over its usable
life.
Software which stores, processes or retrieves the data should maintain the integrity of the
data.
2. Noise
3. Outliers
An outlier in data refers to data which appears not to belong to the data set.
Outliers are often the results of human data-entry errors.
4. Missing Values
5. Duplicate Values
Duplicate value implies the same data appearing two or more times in a data set.
Data Pre-processing
Data pre-processing is an important step in Layer 2 of the architecture.
In the pre-processing step, problems in the acquired data such as noise, outliers, missing values
and duplicate values are handled first (a minimal sketch of these steps follows the list below).
2. Filtering (removing noise)
3. Data cleaning and editing
4. Validation
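A minimal sketch of these pre-processing steps in Python with pandas (the column names, values and valid range are hypothetical):

```python
import pandas as pd

# Small illustrative data set with a duplicate row, a missing value and an outlier.
df = pd.DataFrame({
    "sensor_id":   [1, 1, 2, 3, 4],
    "temperature": [25.1, 25.1, None, 26.0, 980.0],   # None = missing, 980.0 = outlier
})

df = df.drop_duplicates()                                                  # remove duplicate values
df["temperature"] = df["temperature"].fillna(df["temperature"].median())  # fill missing values
df = df[df["temperature"].between(-50, 60)]                                # filter out-of-range noise/outliers

print(df)
```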
Data Cleaning
Data Enrichment
Data enrichment refers to operations or processes which refine, enhance or improve the raw
data.
Data Editing
Data editing refers to the process of reviewing and adjusting the acquired data sets.
Data Reduction
Data reduction enables the transformation of acquired information into an ordered, correct,
and simplified form.
Data Wrangling
Data wrangling refers to the process of transforming and mapping the data so that the results
from analytics are appropriate and valuable.
Key-Value pairs
Hash-key-value pairs
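A key-value pair maps a unique key to a value; hashing the key decides where the pair is kept, which is how hash maps and NoSQL key-value stores locate data quickly. A minimal sketch in Python (the keys and values are made up):

```python
# A Python dict is a hash-based key-value store: the key is hashed to find its slot.
student_grades = {
    "CS101:rollno_23": "A",
    "CS102:rollno_23": "B+",
}

# Insert or update by key, and retrieve in (amortised) constant time.
student_grades["CS103:rollno_23"] = "A+"
print(hash("CS101:rollno_23"))                 # hash value used internally for placement
print(student_grades.get("CS102:rollno_23"))   # look up a value by its key
```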
Above figure shows the data pre-processing, data mining, analysis, visualization and data
store.
Cloud Services
The cloud offers various services (IaaS, PaaS, SaaS).
These services can be accessed through a cloud client (client application), such as a web
browser or an SQL client.
Below figure shows the data store export from machines, files, computers, web servers and
web services.
BigQuery
Google Cloud Platform provides a cloud service called BigQuery.
The data exports from a table or partition schema as JSON, CSV or Avro files from data
sources after the pre-processing.
The data store first pre-processes data from machine and file data sources.
The BigQuery cloud service includes permissions and roles such as bigquery.tables.create,
bigquery.dataEditor and bigquery.dataOwner.
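A minimal sketch of querying BigQuery from Python is shown below (it assumes the google-cloud-bigquery client library and valid Google Cloud credentials; the project, dataset and table names are hypothetical):

```python
from google.cloud import bigquery

# The client picks up credentials from the environment (e.g., a service account key).
client = bigquery.Client(project="my-demo-project")       # hypothetical project id

query = """
    SELECT student_name, grade
    FROM `my-demo-project.grades_dataset.student_grades`  -- hypothetical table
    LIMIT 10
"""

# Run the query job and iterate over the result rows.
for row in client.query(query).result():
    print(row["student_name"], row["grade"])
```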
Below figure shows the BigQuery cloud service at Google cloud platform.
SQL
SQL is mainly used for updating and viewing database tables.
SQL can be embedded within other languages using SQL modules.
a) Create schema: which contains the description of the objects (base tables, views)
created by users.
b) Create catalog: which consists of a set of schemas that describe the database.
e) Data Control Language (DCL): commands that control a database and provide
administrative privileges.
The tables store the data in a rows-and-columns format.
During the storing process, data management involves privacy, security and data integration.
A set of keys is used to access the data fields, for example primary key and foreign key.
c) Should be “location independent” which means the user is unaware of where the data
is located, and it is possible to move the data from one physical location to another
without affecting the user.
A column-format in-memory table allows faster data retrieval when only a few columns of a table
need to be selected during query processing or aggregation.
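A minimal sketch in plain Python (the employee records are illustrative) contrasting a row layout with a column layout; aggregating one column touches far less data in the columnar form, which is why such queries are faster:

```python
# Row format: each record is stored together; reading one field still walks whole records.
rows = [
    {"emp_id": 1, "name": "Asha", "salary": 55000.0},
    {"emp_id": 2, "name": "Ravi", "salary": 62000.0},
]
total_row_format = sum(r["salary"] for r in rows)

# Column format: each column is stored contiguously; an aggregate reads only the needed column.
columns = {
    "emp_id": [1, 2],
    "name":   ["Asha", "Ravi"],
    "salary": [55000.0, 62000.0],
}
total_col_format = sum(columns["salary"])

print(total_row_format, total_col_format)   # both print 117000.0
```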
Online Analytical Processing (OLAP) of real-time transactions is fast when using in-memory
column-format tables.
Online Analytical Processing is used for the viewing and visualization of data.
A row-format in-memory table allows much faster data processing during OLTP (Online
Transaction Processing).
After pre-processing and cleaning, the data is integrated into a data warehouse.
An enterprise data server uses data from several distributed sources which store data using
various technologies.
All the data is merged using an integration tool (refer to the above figure).
2. Business Intelligence(BI)
5. E-commerce
NoSQL databases store semi-structured data. A Big Data store uses NoSQL.
Features of NoSQL
* May relax the ACID rules (Atomicity, Consistency, Isolation, Durability) during data store
transactions.
Below figures show the co-existence of data at a server: RDBMS with NoSQL, and Big Data
at Hadoop, Spark and Mesos, along with various data sources for Big Data, examples of their
usages and the tools used, respectively.
The data generates at high speed; because of this, Big Data requires large resources such as MPPs,
cloud and parallel processing.
2) Massive parallelism
3) Custom development
4) Querying
Hadoop
A Big Data platform consists of Big Data storage, servers, and data management and
Business Intelligence (BI) software.
Storage can deploy HDFS, or NoSQL data stores such as HBase, MongoDB and Cassandra.
Hadoop is an open-source framework for the storage of large amounts of data.
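A minimal sketch of placing a local file into HDFS from Python (it assumes a running Hadoop installation with the hdfs command on the PATH; the file and directory names are hypothetical):

```python
import subprocess

# Create an HDFS directory and copy a local file into it using the standard hdfs CLI.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo/input"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "web_logs.txt", "/user/demo/input/"], check=True)

# List the directory to confirm that the file is stored in HDFS.
subprocess.run(["hdfs", "dfs", "-ls", "/user/demo/input"], check=True)
```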
Mesos
Mesos v0.9 is a resource-management platform which enables the sharing of a cluster of nodes by
multiple frameworks.
Application, Machine learning algorithms, analytics tools use Big Data Stack(BDS) at a
cloud service.
Analysis brings the correct order, structure and meaning to the collection of data.
Data Analytics Definition
Data analytics is the statistical and mathematical analysis of data that predicts future possibilities.
Phases in Analytics
1) Descriptive analytics
Descriptive analytics is a statistical method that is used to search and summarize historical
data in order to identify patterns or meaning.
Data aggregation and data mining are two techniques used in descriptive analytics to
discover historical data. Data is first gathered and sorted by data aggregation in order to
make the datasets more manageable by analysts.
Data mining describes the next step of the analysis and involves a search of the data to
identify patterns and meaning. Identified patterns are analyzed to discover the specific ways
that learners interacted with the learning content and within the learning environment.
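A minimal sketch of descriptive analytics in Python with pandas (the learner records and column names are made up), showing data aggregation that summarizes historical performance per course:

```python
import pandas as pd

# Hypothetical historical records of learners and their course scores.
history = pd.DataFrame({
    "learner": ["A", "A", "B", "B", "C"],
    "course":  ["ML", "DB", "ML", "DB", "ML"],
    "score":   [78, 85, 62, 70, 91],
})

# Data aggregation: summarize past performance per course to surface patterns.
summary = history.groupby("course")["score"].agg(["count", "mean", "min", "max"])
print(summary)
```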
Advantages:
Quickly and easily report on the Return on Investment (ROI) by showing how performance has improved.
Identify gaps and performance issues early - before they become problems.
Identify specific learners who require additional support, regardless of how many students are enrolled.
Analyze the value and impact of course design and learning resources.
2) Predictive analytics
Predictive Analytics is a statistical method that utilizes algorithms and machine learning to
identify trends in data and predict future behaviors.
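A minimal sketch of predictive analytics in Python with scikit-learn (it assumes scikit-learn is installed; the features, labels and their meaning are made up): a model is fitted on past learner activity and then predicts whether a new learner is on track.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical past data: [hours_spent, modules_completed] and whether training finished on time.
X_train = [[2, 1], [5, 3], [8, 6], [1, 0], [7, 5], [3, 2]]
y_train = [0, 1, 1, 0, 1, 0]                 # 0 = fell behind, 1 = finished on time

model = LogisticRegression()
model.fit(X_train, y_train)

# Predict for a new learner's current progress.
print(model.predict([[4, 3]]))               # predicted class: 0 = at risk, 1 = on track
print(model.predict_proba([[4, 3]]))         # class probabilities for a dashboard indicator
```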
For online learning specifically, predictive analytics is often found incorporated in the
Learning Management System (LMS), but can also be purchased separately as specialized
software. For the learner, predictive forecasting could be as simple as a dashboard located on
the main screen after logging in to access a course.
By analyzing data from past and current progress, visual indicators in the dashboard could be
provided to signal whether the employee is on track with training requirements.
Advantages:
Personalize the training needs of employees by identifying their gaps, strengths, and
weaknesses; specific learning resources and training can be offered to support individual
needs.
Retain talent by tracking and understanding employee career progression, forecasting what skills
and learning resources would best benefit their career paths, and knowing what skills to develop.
Support employees who may be falling behind or not reaching their potential by offering additional
learning resources and support.
Simplified reporting and visuals keep everyone updated when predictive forecasting is required.
Prescriptive analytics
Cognitive Analytics
The above diagram illustrates the reference model for traditional and Big Data analytics.
Storing large amounts of data requires well-proven techniques to calculate, plan and
analyze.
First, we need to understand whether the gathered data is able to help in obtaining the following
results or not:
1) Cost reduction
2) Time reduction
4) Knowledge discovery
BDAS, the Berkeley Data Analytics Stack, is an open-source data analytics stack for complex computations on Big Data.
It supports efficient, large-scale, in-memory data processing and thus enables user
applications to achieve three fundamental processing requirements: accuracy, time and
cost.
BDAS consists of data processing, data management and resource management layers.
1) Applications such as AMP-Genomics and Carat run at the BDAS. Data-processing software
components provide in-memory processing, which processes the data efficiently across
the frameworks. AMP stands for Berkeley's Algorithms, Machines and People
laboratory.
Data is important for most aspects of marketing, sales and advertising.
Customer Value Analytics (CVA) is the tool for analyzing what a customer really needs.
CVA makes this possible for leading marketers such as Amazon and Flipkart.
* The large volume and velocity of Big Data provide greater opportunities, but also result in
some risks, such as less accurate and lower-quality data.
Big Data analytics in health care uses the following data sources: clinical records, pharmacy
records, electronic medical records, and diagnosis logs and notes.
Value-based and customer-centric health care: cost-effective patient care by improving
healthcare quality using the latest knowledge, and usage of electronic health and medical records.
Health care Internet of Things: creates unstructured data. The data enables the
monitoring of device data for patient parameters, such as glucose, BP and ECGs.
Prevention of fraud, waste and abuse: uses Big Data predictive analytics and helps
resolve excessive or duplicate claims in a systematic manner.
Patient Real time monitoring: Uses machine learning algorithms which process real time
events.
Big Data offers the potential to transform medicine and the health care system.
Aggregating a large volume and variety of information from multiple sources, from DNA,
proteins and metabolites to cells, tissues and organs, helps in understanding the biology of
diseases.
Advertising on digital media needs optimization; too much usage also has a negative effect.