Big Data Analytics Module 1

The document provides an overview of Big Data Analytics, highlighting the need for advanced tools to handle the massive volume, velocity, and variety of data generated from social media and other sources. It defines different types of data (structured, semi-structured, unstructured) and discusses the characteristics and classifications of Big Data, as well as techniques for data storage, processing, and management. Additionally, it covers the architecture for Big Data systems and the importance of data quality and governance in analytics.


BIGDATA ANALYTICS (18CS72)

Module 1
Introduction

Need of Big data


 In today’s era, numerous social applications are being developed, which results in data increasing massively every day.

 When we talk about social media platforms, millions of users connect on a daily basis.
Information is shared whenever users use a social media platform or any other website, so the
question arises: how is this huge amount of data handled, and through what medium or tools is
the data processed and stored? This is where Big Data comes into the picture.

Fig: Evolution of Big Data and its characteristics

 The above figure shows data usage and growth. As size and complexity increase, the
proportion of unstructured data types also increases.

 An example of a traditional tool for structured data storage and querying is an RDBMS.

 The volume, velocity and variety (3Vs) of data require a number of programs and tools
for analyzing and processing data at high speed.

 Big Data requires new tools for processing and analyzing large volumes of data, for example
NoSQL (Not only SQL) and Hadoop technology.


Big Data:
Definitions of Data:

 Data is a piece of information, usually in the form of facts or statistics.

 Data is information that can be stored and used by a computer program.

 Data is information presented in numbers, letters or other forms.

 Data is information from a series of observations, measurements or facts.

Definition of Web data:

 The Web is a large-scale collection of information and data present on web servers.

 Web is a part of the internet that stores web data in the form of documents and other web
resources.

 Web data is the data present on web servers in the form of text, images, videos, audios and
multimedia files for web users.

 Internet applications, including web sites, web services, web portals, emails and chats,
consume web data.

 Examples of web data are Wikipedia, Google maps, McGraw-Hill Connect and YouTube.

Classification of Data
Data can be classified as:

 Structured data

 Semi-structured data

 Unstructured data

Structured Data
 Structured data is one of the types of data; by structured data we mean data that can be
processed, stored and retrieved in a fixed format.

 It refers to highly organized information that can be readily and easily stored and accessed
from a database by simple search engine algorithms.

 Structured data conforms to and associates with data schemas and data models.


 For instance, the employee table in a company database will be structured, as the employee
details, their job positions, their salaries, etc., will be present in an organized manner.

 Structured data enables insert, delete, update and append operations, indexing for faster
retrieval, scalability, and transaction processing that follows the ACID rules.
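A minimal sketch of working with such structured data, using Python's built-in sqlite3 module; the employee table, columns and values below are assumptions for illustration only:

import sqlite3

conn = sqlite3.connect(":memory:")          # in-memory database for the example
cur = conn.cursor()

# Fixed schema: every row follows the same organized format.
cur.execute("""CREATE TABLE employee (
                   emp_id   INTEGER PRIMARY KEY,
                   name     TEXT NOT NULL,
                   position TEXT,
                   salary   REAL)""")

# Insert, update and indexed retrieval are straightforward on structured data.
cur.execute("INSERT INTO employee VALUES (1, 'Asha', 'Analyst', 55000)")
cur.execute("CREATE INDEX idx_position ON employee(position)")   # faster retrieval
cur.execute("UPDATE employee SET salary = 60000 WHERE emp_id = 1")

for row in cur.execute("SELECT name, position, salary FROM employee"):
    print(row)

conn.commit()
conn.close()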

Unstructured data

 Unstructured data refers to the data that lacks any specific form or structure.

 This makes it very difficult and time-consuming to process and analyze unstructured data.

 Email, .CSV, .TXT, mobile data, social media posts and image files are examples of
unstructured data.

Semi-structured data

 Semi-structured data is the third type of data. It contains both the formats mentioned above,
that is, structured and unstructured data.

 Such data does not associate with data models such as the relational database and table models.
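A small illustration of semi-structured data: JSON records carry their own tags and keys but need not follow one fixed relational schema. The records below are invented for the example and parsed with Python's json module:

import json

records = [
    '{"user": "a01", "age": 21, "interests": ["cricket", "music"]}',
    '{"user": "a02", "location": {"city": "Mysuru"}}',   # different fields, nested value
]

for rec in records:
    doc = json.loads(rec)                     # self-describing key/value structure
    print(doc.get("user"), doc.get("age", "unknown"))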

Big Data Definitions

 Big Data is high-volume, high-velocity and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and decision making.

 A collection of data sets so large or complex that traditional data processing applications are
inadequate.

 Big Data refers to data sets whose size is beyond the ability of typical database software tools
to capture, store, manage and analyze.


Big data Characteristics

Characteristics of Big Data, called the 3Vs (a fourth V is also used), are:

a) Variety

 Variety of Big Data refers to structured, unstructured, and semi-structured data that is
gathered from multiple sources.

 In the early days, data could only be collected from spreadsheets and databases; today data
comes in an array of forms such as emails, PDFs, photos, videos, audio, social media posts,
and much more. Variety is one of the important characteristics of Big Data.

b) Velocity

 Velocity essentially refers to the speed at which data is being created in real time. In a
broader perspective, it comprises the rate of change, linking of incoming data sets at varying
speeds, and activity bursts.

c) Volume

 Volume is one of the characteristics of big data.

We already know that Big Data indicates huge volumes of data being generated on a
daily basis from various sources such as social media platforms, business processes, machines,
networks, human interactions, etc. Such a large amount of data is stored in data warehouses.

d) Veracity


 Veracity refers to the quality of the data captured, which can vary greatly and affects accurate analysis.

Big Data Types


 Social network and web data, such as Facebook, Twitter, Emails and YouTube.

 Transaction data and Business process(BP) data, such as credit card transactions, flight
bookings, etc.

 Customer master data, such as data for facial recognition, date of birth, locality, salary
specification.

 Machine generated data, such as machine to machine or Internet of Things data and data
from sensors, trackers, web logs.

 Human-generated data, such as biometric data, human-machine interaction data, email
records with a mail server, and records of human experiences.

Big Data Classification:


 Big Data can be classified on the basis of its characteristics that are used for designing data
architecture for processing and analytics.

 Below table gives various classification methods for data and Big Data.


Big data Handling Techniques


Following are the techniques for Big Data storage, applications, data management, mining and
analytics:

* Storage of huge data volumes, data distribution, high-speed networks and high-performance
computing.


* Application scheduling using open-source, reliable, scalable, distributed file systems,
distributed databases, and parallel and distributed computing systems such as Hadoop or Spark.

* Open source tools which are scalable, elastic and provide virtualized environment.

* Data management using NoSQL, document databases and column-oriented databases.

* Data mining and analytics, data retrieval, data reporting, machine learning and Big Data tools.

Scalability and Parallel Processing


 Big Data needs processing of large data volumes and therefore needs intensive computation.

 Processing complex applications with large datasets needs hundreds of computing nodes.

 Processing of this much distributed data within a short time and at minimum cost is
problematic.

 Big Data can co-exist with traditional data stores. Traditional data stores use RDBMS tables or
a data warehouse.

 Big Data processing and analytics require scaling up and scaling out of both vertical and
horizontal computing resources.

 Scalability enables increase or decrease in the capacity of data storage, processing and
analytics.

Analytics Scalability to Big Data


Vertical Scalability: Scaling up the given resources and increasing the system’s analytics,
reporting and visualization capabilities. This is an additional way to solve problems of greater
complexity.

Scaling up: Designing the algorithm according to the architecture that uses resources
efficiently.

Horizontal scalability: Increasing the number of systems working in coherence and scaling out
the workload.

Scaling out: Scaling out means using more resources and distributing the processing and storage
tasks in parallel.
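A minimal sketch of scaling out in Python: the same computation is partitioned across several worker processes instead of being run on one bigger machine. The data, worker count and function below are illustrative assumptions:

from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker processes its own partition of the data in parallel.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]      # distribute the workload

    with Pool(processes=4) as pool:              # 4 workers acting like scaled-out nodes
        total = sum(pool.map(partial_sum, chunks))
    print(total)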


 The easiest way to scale up and scale out execution of analytics software is to implement it
on a bigger machine with more CPUs for greater volume, velocity, variety and complexity
of data.

 Alternative ways for scaling up and scaling out processing of analytics software are
Massively Parallel Processing (MPP) platforms, cloud, grid, clusters and distributed
computing software.

 Massively Parallel Processing platforms(MPPs)


 Scaling uses massively parallel processing systems. Many programs are so large and
complex that it is impossible to execute them on a single computer system.

 It is then required to scale up the computer system or use massively parallel
processing (MPP) platforms.

 Distributed Computing Model


 A distributed computing model uses cloud, grid and clusters, which process and analyze big
and large datasets on distributed computing nodes connected by high speed networks.

Fig: Distributed computing paradigms

 The above figure gives the requirements for processing and analyzing big, large and
small-to-medium datasets on distributed computing nodes.

 Cloud Computing


 Cloud computing is a type of Internet based computing that provides shared processing
resources and data to the computers and other devices on demand.

 One of the best approaches for data processing is to perform parallel and distributed
computing in a cloud computing environment.

 Cloud resources can be Amazon Web Services (AWS), Elastic Compute Cloud (EC2) or
Microsoft Azure.

 Cloud computing features are on-demand service, resource pooling, scalability,
accountability and broad network access.

 Cloud services can be classified into 3 types:

1. Infrastructure as a Service (IaaS): Providing access to resources such as hard disks,
network connections, database storage and data centers.

2. Platform as a Service (PaaS): Providing a runtime environment that allows developers
to build applications and services. An example is a Hadoop cloud service.

3. Software as a Service (SaaS): Providing software applications as a service to end users.

 Grid and Cluster computing


1. Grid Computing

 Grid computing refers to distributed computing, in which a group of computers from several
locations are connected with each other to achieve a common task.

 A group of computers that might be spread over remote locations comprises a grid. A grid is
used for a variety of purposes.

 A single grid is dedicated to a particular application only.

 Grid computing provides large scale resource sharing which is flexible.

 Features: Scalable, Grid forms a distributed network for resource integration.

 Drawbacks: Single point of failure; while sharing the resources, the load on them increases.

2. Cluster Computing

 A cluster is a group of computers connected by a network. The group works together to
accomplish the same task.


 Clusters are used mainly for load balancing.

 Volunteer Computing
 Volunteers provide computing resources to projects of importance that use the resources to do
distributed computing or storage.

 Volunteer computing is a distributed computing paradigm which uses the computing resources
of volunteers.

 Issues: heterogeneity of the volunteered computers, drop-outs from the network over time,
their sporadic availability, and incorrect results at volunteers.

Designing Data Architecture


 Big Data architecture is the logical and/or physical layout of how big data will be stored,
accessed and managed within a Big Data or IT environment.

 Above figure shows the logical layers and the functions which are considered in Big Data
Architecture.

 Five vertically aligned text boxes on the left side of the above figure show the layers.

 Data processing architecture consists of 5 layers:

1. Identification of data sources

2. Acquisition, ingestion, extraction, pre-processing, transformation of data

3. Data storage

4. Data processing

5. Data consumption

L1 considers the following aspects in a design:

 Amount of data needed at ingestion layer 2(L2)


 Push from L1 or pull by L2 as per the mechanism for the usages

 Source data types: database, files, web or service

 Source formats (semi-structured, unstructured, structured)

L2 considers the following aspects in a design:

 Ingestion and ETL processes either in real time, which means store and use the data as
generated, or in batches.

 Batch processing is using discrete datasets at scheduled or periodic intervals of time.

L3 considers the following aspects in a design:

 Data storage type, format, compression, incoming data frequency, querying patterns and
consumption requirements for L4 or L5.

 Data storage using Hadoop Distributed File System (HDFS) or NoSQL data stores such as
HBase, Cassandra and MongoDB.

L4 considers the following aspects in a design:

 Data processing software such as MapReduce, Hive, Pig, Spark and Spark Streaming.

 Processing in scheduled batches or real time or hybrid.

 Processing as per synchronous or asynchronous processing requirements at L5
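To illustrate what the L4 processing software does, the following is a MapReduce-style word count written in plain Python; it only sketches the map, shuffle and reduce idea behind tools such as Hadoop MapReduce or Spark, and does not use their actual APIs:

from collections import defaultdict

documents = ["big data needs new tools", "big data tools process big volumes"]

# Map: emit (word, 1) pairs from every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the emitted values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each group into a final count.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)    # e.g. {'big': 3, 'data': 2, ...}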

L5 considers the following aspects in a design:

 Data integration

 Datasets usages for reporting and visualization

 Analytics, Business Processes(Bps), Knowledge discovery.

 Export of datasets to cloud, web or other systems.

Managing Data for analysis


 Data managing means enabling, controlling, protecting, delivering and enhancing the value
of data and information assets.

 Data management functions include:


1. Data assets creation, maintenance and protection.


2. Data governance, which includes establishing the processes for ensuring the
availability of data.

3. Data architecture administration and management system.


4. Data mining and analytics algorithms
5. Data warehouse and management.
6. Data collection using the ETL process.

7. Data and application integration.

8. Managing the data quality.

9. Maintenance of business intelligence.

Data Sources, Quality, Pre-processing and Storing


Data Sources:

 Applications, programs and tools use the data.

 Data sources can be external such as sensors, trackers, web logs, computer systems logs and
feeds.

 Data sources can be internal such as databases, relational databases, flat files, spreadsheets
and mail servers.

Structured data sources:

 Data source for storage and processing can be file, database or streaming data.

 Examples of structured data sources are SQL Server, MySQL, Microsoft Access and
Oracle DBMS.

 A data source should have a meaningful name, for example a name which identifies the
stored data on student grades during processing. The data source name could be
StudentName_Data_Grades.

 A data dictionary enables references for access to data. The dictionary is stored at a central
location.

 The name of the dictionary can be UniversityStudents_DataPlusGrades.

 Microsoft applications consider two types of sources for processing: machine sources
and file sources.


1. Machine sources are present on computing nodes such as servers.

2. File sources are stored files.

 Data sources can point to:

1. A database in a specific location or in a data library of OS.

2. A specific machine in the enterprise that processes logic.

3. A data source master table.

Unstructured Data Sources

 Unstructured data sources are distributed over high-speed networks. The data needs
high-velocity processing.

 The sources are of file types such as .txt(text file), .csv(Comma separated values file)

Data Sources: Sensors, Signals and GPS

 The data sources can be sensors, sensor networks, signals from machines, devices,
controllers.

 Sensors are the electronic devices that sense the physical environment. These are the devices
which are used for measuring temperature, pressure, humidity.

Data Quality
 Data quality refers to the state of qualitative and quantitative pieces of information.

 High quality means data which enables all the required operations, analysis, decisions,
planning and knowledge discovery correctly.

 Data quality can be defined with the 5Rs: relevancy, recency, robustness, range and
reliability.

Characteristics which affect data quality

1. Data integrity

 Data integrity refers to the maintenance of consistency and accuracy in data over its usable
life.

 Software which stores, processes or retrieves the data should maintain the integrity of
the data.

2. Noise


 One of the factors affecting data quality is noise.

 Noise in data refers to data giving additional meaningless information.

 Noisy data means data carrying a large amount of additional meaningless information.

3. Outliers

 A factor which affects quality is an outlier.

 An outlier in data refers to data, which appears to not belong to the data set.

 Outliers are often the result of human data-entry errors.

4. Missing Values

 Another factor affecting data quality is missing values.

 Missing values imply data not appearing in the data set.

5. Duplicate Values

 Another factor affecting data quality is duplicate values.

 Duplicate values imply the same data appearing two or more times in a data set.

Data Pre-processing
 Data pre-processing is an important step in the layer 2 of architecture.
 In the pre-processing step, we first remove defects in the raw data such as noise, outliers,
missing values and duplicate values.

 Pre-processing is a must before data mining and data analytics.


 Data when being exported to a cloud service needs pre-processing.
 Pre-processing involves
1. Dropping out-of-range and outlier values

2. Filtering (removing noise)

3. Data cleaning and editing

4. Validation

5. ELT processing (Extract, Load, Transform)
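A hedged sketch of these pre-processing steps using the pandas library; the column names, value ranges and threshold below are assumptions made for the example:

import pandas as pd

df = pd.DataFrame({
    "temperature": [21.5, 22.0, None, 21.8, 500.0, 22.0],   # has a missing value and an outlier
    "sensor_id":   ["s1", "s2", "s2", "s3", "s4", "s2"],
})

df = df.drop_duplicates()                             # duplicate values
df = df.dropna(subset=["temperature"])                # missing values
df = df[df["temperature"].between(-40, 60)]           # drop out-of-range / outlier values
df["temperature"] = df["temperature"].round(1)        # simple cleaning/editing
print(df)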

Data Cleaning


 Refers to the process of removing or correcting incomplete, incorrect or irrelevant parts of
the data.

Data Enrichment

 Data enrichment refers to operations or processes which refine, enhance or improve the raw
data.

Data Editing

 Data editing refers to the process of reviewing and adjusting the acquired data sets.

 Editing methods are Interactive, selective, automatic, aggregating and distribution.

Data Reduction

 Data reduction enables the transformation of acquired information into an ordered, correct,
and simplified form.

 The reduction uses editing, scaling, coding, sorting and smoothening.

Data Wrangling

 It refers to the process of transforming and mapping the data so that the results from analytics
are appropriate and valuable.

Data formats used during Pre-processing

 Comma-Separated Values (CSV)

 JavaScript Object Notation (JSON)

 Tag-Length-Value (TLV)

 Key-Value pairs

 Hash-key-value pairs
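Small examples of three of these formats, built with Python's standard csv and json modules (the records are invented for illustration):

import csv, io, json

csv_text = "id,name,grade\n1,Asha,A\n2,Ravi,B\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))        # comma-separated values

as_json = json.dumps(rows)                                # JSON representation
key_value = {row["id"]: row["grade"] for row in rows}     # simple key-value pairs

print(rows[0]["name"], as_json, key_value)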

Data Store Export to Cloud


 The above figure shows data pre-processing, data mining, analysis, visualization and data
store.

 The data is exported to cloud services.

 The results integrate at the enterprise server or data warehouse.

Cloud Services
 Cloud offers various services (IaaS, PaaS, SaaS).
 These services can be accessed through a cloud client(client application), such as web
browser, SQL.

 Below figure shows the data store export from machines, files, computers, web servers and
web services.


BigQuery
 Google Cloud Platform provides a cloud service called BigQuery.
 The data exports from a table or partition schema to JSON, CSV or Avro files from data
sources after the pre-processing.

 The data store first pre-processes data from machine and file data sources.
 The BigQuery cloud service includes bigquery.tables.create, bigquery.dataEditor and
bigquery.dataOwner.

 Below figure shows the BigQuery cloud service at Google cloud platform.
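A hedged sketch of loading pre-processed CSV data into a BigQuery table and querying it, assuming the google-cloud-bigquery Python client, valid GCP credentials, and hypothetical project, dataset, table and file names:

from google.cloud import bigquery

client = bigquery.Client()                                  # uses the configured GCP credentials
table_id = "my_project.my_dataset.student_grades"           # hypothetical destination table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,                                        # infer the schema from the CSV
)

with open("grades.csv", "rb") as f:                         # pre-processed CSV export
    load_job = client.load_table_from_file(f, table_id, job_config=job_config)
load_job.result()                                           # wait for the load job to finish

rows = client.query(f"SELECT COUNT(*) AS n FROM `{table_id}`").result()
print(list(rows))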


Data Storage and Analysis


 Here we will discuss the comparison between Big Data management tools and
traditional management systems.

Data storage and Management: Traditional Systems

Data storage with Structured or Semi structured Data

 Traditional systems use structured or semi-structured data.

 Sources of structured data are databases, RDBMS tables and MySQL databases.

 Sources of semi-structured data are XML and JSON.

SQL

 Relational Database Management System(RDBMS) uses SQL.

 SQL is mainly used for updating and viewing database tables.

 SQL is also used for data modification.

 SQL can be embedded within other languages using SQL modules.

 Functions of the SQL are:

a) Create schema: Which contains the description of the objects(base tables, views)
created by users.

b) Create catalog: Which consist of a set of schemas which describes the database.


c) Data Definition Language (DDL): Includes creating, altering and dropping tables, and
establishing constraints.

d) Data Manipulation Language (DML): A user can manipulate (INSERT/UPDATE) and
access (SELECT) the data.

e) Data Control Language (DCL): Commands that control a database and provide
administrative privileges.

Large data storage using RDBMS

 RDBMS stores data in a structured format.

 Tables store the data in a rows-and-columns format.

 In the storing process, data management involves privacy, security and data integration.

 The system uses machine-generated data, business process data and business intelligence.

 A set of keys is used to access the data fields, e.g. primary key and foreign key.

 Queries are used to insert, delete, modify and retrieve fields (INSERT, DELETE, UPDATE, SELECT).
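A brief sketch of the keys and queries mentioned above, again using Python's sqlite3 module; the tables and values are illustrative assumptions:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("PRAGMA foreign_keys = ON")

cur.execute("CREATE TABLE dept (dept_id INTEGER PRIMARY KEY, dept_name TEXT)")
cur.execute("""CREATE TABLE emp (
                   emp_id  INTEGER PRIMARY KEY,                -- primary key
                   dept_id INTEGER REFERENCES dept(dept_id),   -- foreign key
                   name    TEXT)""")

cur.execute("INSERT INTO dept VALUES (10, 'Sales')")                      # INSERT
cur.execute("INSERT INTO emp VALUES (1, 10, 'Ravi')")
cur.execute("UPDATE emp SET name = 'Ravi K' WHERE emp_id = 1")            # UPDATE
print(cur.execute("SELECT e.name, d.dept_name FROM emp e "
                  "JOIN dept d USING (dept_id)").fetchall())              # SELECT
cur.execute("DELETE FROM emp WHERE emp_id = 1")                           # DELETE
conn.commit()
conn.close()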

Distributed Database Management System

 A distributed database is a collection of logically interrelated databases at multiple systems
over a computer network.

 Features of a distributed database system:

a) A collection of logically related databases.

b) Cooperation between databases in a transparent manner. Transparent means that each
user within the system may access all of the data within all the databases as if they were a
single database.

c) Should be “location independent” which means the user is unaware of where the data
is located, and it is possible to move the data from one physical location to another
without affecting the user.

In-Memory Column Formats Data

 It allows faster data retrieval when only a few columns in a table need to be selected during
query processing or aggregation.

 Data in a column are kept together in-memory in columnar format.


 Online Analytical Processing (OLAP) on real-time transactions is fast when using
in-memory column-format tables.

 Online Analytical Processing is used for viewing and visualizing data.

In-Memory Row Format Databases

 A row format in-memory allows much faster data processing during OLTP(Online
Transaction Processing).
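A toy comparison of the two layouts in plain Python, with made-up records, to show why column format suits OLAP aggregation while row format suits OLTP-style record access:

# Row format: whole records kept together (natural for OLTP single-record access).
row_store = [
    {"id": 1, "city": "Mysuru", "sales": 120},
    {"id": 2, "city": "Hubli",  "sales": 340},
    {"id": 3, "city": "Mysuru", "sales": 210},
]

# Column format: each column kept together (natural for OLAP aggregation).
column_store = {
    "id":    [1, 2, 3],
    "city":  ["Mysuru", "Hubli", "Mysuru"],
    "sales": [120, 340, 210],
}

record = next(r for r in row_store if r["id"] == 2)   # fetch one full record
total_sales = sum(column_store["sales"])              # aggregate only the needed column
print(record, total_sales)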

Enterprise Data-Store server and Data Warehouse

Steps 1 to 5 in enterprise data integration and management with Big data

 After pre-processing and cleaning, the data is integrated into a data warehouse.

 Enterprise data server use data from several distributed sources which store data using
various technologies.

 All data will be merged using an integration tool (refer to the above figure).

 Following are some of the Standardized business processes:

1. Integrating and enhancing the existing systems and processes


2. Business Intelligence(BI)

3. Data security and Integrity

4. New business services(Web services)

5. E-commerce

6. External customer service

7. Data center optimization

8. Supply chain automation

Big Data Storage


Big Data NoSQL (Not only SQL)

 NoSQL databases are considered semi-structured data stores. Big Data stores use NoSQL.

 Features of NoSQL

1. It is non-RDBMS, with flexible data models and multiple schemas:

* Class consisting of uninterpreted key/value pairs

* Class consisting of unordered keys and using JSON

* Class consisting of ordered keys and semi-structured data storage systems

* Class consisting of JSON documents (MongoDB)

* May not use a fixed table schema

* Does not use JOINs

* May relax the ACID rules (Atomicity, Consistency, Isolation, Durability) during
data store transactions.
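A hedged sketch of NoSQL's flexible document model, assuming the pymongo driver and a MongoDB server running at localhost; the database, collection and documents below are invented for illustration:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop_db"]["orders"]

# Documents in the same collection need not share a fixed table schema.
orders.insert_one({"order_id": 1, "item": "book", "qty": 2})
orders.insert_one({"order_id": 2, "item": "pen", "tags": ["office", "blue"], "gift": True})

print(orders.find_one({"order_id": 2}))
client.close()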

Coexistence of Big Data, NoSQL and Traditional Data Stores

 The figures below show the co-existence of data at the server (RDBMS with NoSQL) and
Big Data at Hadoop, Spark and Mesos, along with various data sources for Big Data,
examples of their usage and the tools used.


Big Data Platforms


 A Big Data platform supports large data sets and large volumes of data.

 The data generates at high speed; because of this, Big Data requires large resources of MPPs,
cloud and parallel processing.

 Big data platform should provide tools and services for:

1) Storage, processing and analysis.

2) Developing, deploying, operating and managing a Big Data environment

3) Custom development

4) Querying

5) The traditional as well as big data techniques

 Requirements for capturing Big data are as follows:

1) New innovative non traditional methods of storage

2) Massive parallelism

3) Huge volume of data store

4) High speed of networks

5) High performance computing, optimization and tuning

Hadoop

 A Big Data platform consists of Big Data storage, servers, and data management and
Business Intelligence (BI) software.

 Storage can deploy HDFS, or NoSQL data stores such as HBase, MongoDB and Cassandra.

 Hadoop is an open-source framework for the storage of large amounts of data.

 Hadoop is a scalable, reliable parallel computing platform.

 It manages a Big Data distributed database system.

 Below figure shows the Hadoop based Big Data Environment.


Mesos

 Mesos v0.9 is a resource management platform which enables sharing of a cluster of nodes by
multiple frameworks.

 It is compatible with an open analytics stack [data processing (Hive, Hadoop), data
management (HDFS)].

Big Data Stack

 A stack consists of a set of software components and data store units.

 Application, Machine learning algorithms, analytics tools use Big Data Stack(BDS) at a
cloud service.


Big Data Analytics


 DBMS and RDBMS manage the traditional databases.
 Analysis needs the pre-processing of the raw data and gives the information for decision
making.

 Analysis brings the correct order, structure and meaning to the collection of data.

Data Analytics Definition

 Statistical and mathematical data analysis that predicts the future possibilities.

 “Analysis of data is a process of inspecting, cleaning, transforming and modelling data with
the goal of discovering useful information, suggesting conclusions and supporting decision
making”

Phases in Analytics

There are 4 phases in analytics:

1) Descriptive analytics

 Descriptive analytics is a statistical method that is used to search and summarize historical
data in order to identify patterns or meaning.

 Data aggregation and data mining are two techniques used in descriptive analytics to
discover historical data. Data is first gathered and sorted by data aggregation in order to
make the datasets more manageable by analysts.

 Data mining describes the next step of the analysis and involves a search of the data to
identify patterns and meaning. Identified patterns are analyzed to discover the specific ways
that learners interacted with the learning content and within the learning environment.
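A small sketch of these two steps with pandas: the learner records below are invented, aggregation summarizes them, and a simple threshold stands in for pattern mining:

import pandas as pd

history = pd.DataFrame({
    "learner": ["a", "a", "b", "b", "c", "c"],
    "module":  ["m1", "m2", "m1", "m2", "m1", "m2"],
    "score":   [72, 80, 55, 48, 90, 94],
})

# Data aggregation: group and summarize the historical records.
summary = history.groupby("learner")["score"].agg(["mean", "min", "max"])

# Simple "mining" of the summary: flag learners whose average is below a threshold.
needs_support = summary[summary["mean"] < 60]
print(summary, needs_support, sep="\n")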

Advantages:

 Quickly and easily report on the Return on Investment (ROI) by showing how performance

achieved business or target goals.

 Identify gaps and performance issues early - before they become problems.

 Identify specific learners who require additional support, regardless of how many students
or employees there are.

 Identify successful learners in order to offer positive feedback or additional resources.

 Analyze the value and impact of course design and learning resources.

2) Predictive analytics

 Predictive Analytics is a statistical method that utilizes algorithms and machine learning to
identify trends in data and predict future behaviors.

 For online learning specifically, predictive analytics is often found incorporated in the
Learning Management System (LMS), but can also be purchased separately as specialized
software. For the learner, predictive forecasting could be as simple as a dashboard located on
the main screen after logging in to access a course.

 Analyzing data from past and current progress, visual indicators in the dashboard could be
provided to signal whether the employee was on track with training requirements.
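A hedged sketch of such a prediction, assuming scikit-learn; the progress features, labels and the meaning of "on track" below are invented for illustration:

from sklearn.linear_model import LogisticRegression

# Features: [hours_logged, modules_completed]; label: 1 = met training requirements.
X_past = [[2, 1], [10, 4], [6, 3], [1, 0], [8, 4], [3, 1]]
y_past = [0, 1, 1, 0, 1, 0]

model = LogisticRegression().fit(X_past, y_past)

# Current progress of two employees -> a dashboard-style "on track" indicator.
X_now = [[4, 2], [9, 5]]
print(model.predict(X_now))        # e.g. [0 1]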

Advantages:

 Personalize the training needs of employees by identifying their gaps, strengths, and
weaknesses; specific learning resources and training can be offered to support individual
needs.

 Retain talent by tracking and understanding employee career progression and forecasting
what skills and learning resources would best benefit their career paths. Knowing what skills
employees need also benefits the design of future training.

 Support employees who may be falling behind or not reaching their potential by offering
intervention support before their performance puts them at risk.

 Simplified reporting and visuals that keep everyone updated when predictive forecasting
is required.

3) Prescriptive analytics

 Prescriptive analytics is a statistical method used to generate recommendations and make
decisions based on the computational findings of algorithmic models.

 Generating automated decisions or recommendations requires specific and unique
algorithmic models and clear direction from those utilizing the analytical technique. A
recommendation cannot be generated without knowing what to look for or what problem is
desired to be solved. In this way, prescriptive analytics begins with a problem.

4) Cognitive Analytics

 Enables derivation of additional value and helps undertake better decisions.

Traditional and Big Data analytics architecture reference model

 The above diagram illustrates the reference model for traditional and Big Data analytics.

 Storing large amounts of data requires well-proven techniques to calculate, plan and
analyze.

Berkeley Data Analytics Stack(BDAS)


 First we need to understand whether the gathered data is able to help in obtaining the following
results or not:

1) Cost reduction

2) Time reduction

3) New product planning

4) Knowledge discovery

5) Smart decision making

 BDAS is an open source data analytics stack for complex computations on Big data.

 It supports efficient, large-scale in-memory data processing and thus enables user
applications to achieve three fundamental processing requirements: accuracy, time and
cost.

 BDAS consists of Data processing, Data management and resource management layers.

1) Applications such as AMP-Genomics and Carat run at the BDAS. Data processing software
components provide in-memory processing, which processes the data efficiently across
the frameworks. AMP stands for Berkeley’s Algorithms, Machines and People
Laboratory.

2) Data processing combines batch, Streaming and interactive computations.

3) The resource management software component provides for sharing the infrastructure
across various frameworks.

Big Data Analytics Applications and Case Studies


 Big Data analytics finds use in many applications such as social networks, social media,
cloud applications, public and commercial web sites, scientific experiments, simulators and
many more.

1) Big Data in Marketing and Sales

 Data are important for most aspects of marketing, sales and advertising.

 Customer Value (CV) depends on three factors: quality, price and service.

 Customer Value Analytics(CVA) is the tool for analyzing what a customer really needs.

 CVA makes this possible for leading marketers such as Amazon, Flipkart, etc.

1) Big Data Analytics in Detecting Marketing Frauds

* To prevent financial losses, we need to detect frauds.


* Features or ways of detecting and preventing frauds are:

- Collection of the existing data at the enterprise or at the data warehouses

- Using multiple sources of data and connecting with many other applications

- Providing multiple sources

- Analyzing the structured data

2) Big Data Risks

* The large volume and velocity of Big Data provide greater opportunities but also result
in some risks, such as less accurate and lower-quality data.

3) Big Data credit Risk management

* Financial institutions include banks and loan systems.

* Financial institutions are interested in

- Identifying high credit rating business groups and individuals

- Identifying risk involved before lending money

- Identifying industrial sectors with greater risks.

- Identifying types of employees

2) Big Data and Health Care

 Big Data analytics in health care uses the following data sources: clinical records, pharmacy
records, electronic medical records, and diagnosis logs and notes.

 Healthcare analytics using Big data can facilitate the following:

* Provisioning of value-based and customer-centric health care

* Utilizing the IoT for health care

* Improving health care outcomes

* Monitoring patients in real time.

 Value-based and customer-centric health care: Cost-effective patient care by improving
healthcare quality using the latest knowledge, and usage of electronic health and medical records.


 Health care Internet of Things: Creates unstructured data. The data enables monitoring of
device data for patient parameters such as glucose, BP and ECGs.

 Prevention of fraud, waste and abuse: Uses Big Data predictive analytics and helps
resolve excessive or duplicate claims in a systematic manner.

 Patient Real time monitoring: Uses machine learning algorithms which process real time
events.

3) Big Data in Medicine

 Big data offers potential to transform medicine and health care system.

 Aggregating a large volume and variety of information from multiple sources, from DNA,
proteins and metabolites to cells, tissues and organs, can help in understanding the biology of
diseases.

4) Big Data in Advertising

 The impact of big data is tremendous on the digital advertising industry.

 The digital advertising industry sends advertisements using SMS, e-mails, WhatsApp, LinkedIn
and other mediums.

 Success from advertisements depends on data collection, analysis and mining.

 Advertising on digital media needs optimization. Too much usage also has a negative effect.

 Phone calls, e-mails and SMS-based advertisements can be a nuisance if sent without
appropriate research on the potential targets.

