Data Analysis - Unit 1

The document outlines a course on Data Analytics, detailing its structure, prerequisites, and the significance of data analytics in various sectors. It covers different types of data (structured, semi-structured, and unstructured), characteristics of data, and the importance of big data, including its 5Vs: Volume, Velocity, Variety, Veracity, and Value. Additionally, it describes the data analytics process, modern tools, and a case study on improving employee engagement through data analysis.

UNIT I

Data Analytics

Course Instructor
Dr. Himanshu Rai
Data Analytics (KIT-601)
 Full Credit Course
4 Credits
150 marks (External: 100, Internal: 50)

 Syllabus Page No: 23 (slide 5)


Data Analytics Lab (KIT-651)
 1 Credit
50 marks (External: 25, Internal: 25)
Prerequisite
 Statistics
 Mean, Median, Mode, Quartiles
 Standard Deviation
 Probability distributions
 Matrix operations

 Vector Algebra
 Dot & cross product of vectors (a short sketch follows below)
Introduction
 Nearly every sector now generates enormous quantities of data that can
provide useful insights into the field; over the last ten years this has led
to a surge in the data market.
 Collecting data alone is not enough: to gain decision-making insights, the
compiled data must be analyzed. Data analytics helps organizations and
businesses gain insight into the enormous amount of information they
need for further production and growth.
What is Data?
 Data is a collection of facts, such
as numbers, words,
measurements, observations or
just descriptions of things.
Why?
Classification of Data
Structured Data
 Structured data is data whose elements are addressable for
effective analysis.
 It has been organized into a formatted repository that is
typically a database.
 Today, structured data is the most processed form in application
development and the simplest to manage. Example: relational data.
Structured Data
Examples Of Structured Data
An 'Employee' table in a database
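As a hedged illustration of relational (structured) data, the sketch below builds a tiny 'Employee' table with Python's built-in sqlite3 module; the column names and rows are assumptions, not taken from the slides.

import sqlite3

# In-memory relational database holding a small 'Employee' table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (emp_id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?)",
    [(1, "Asha", "Sales", 52000.0), (2, "Ravi", "IT", 61000.0)],
)

# Because every element is addressable, analysis is a simple query
for row in conn.execute("SELECT name, salary FROM Employee WHERE dept = 'IT'"):
    print(row)
conn.close()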
Semi-Structured data
 Semi-structured data is a form of structured data that does
not obey the tabular structure of data models associated with
relational databases or other forms of data tables, but
nonetheless contains tags or other markers to separate
semantic elements and enforce hierarchies of records and
fields within the data.
 With some processing, you can store it in a relational database.
Example: XML data.
Semi-Structured data
 Examples Of Semi-structured Data
Personal data stored in an XML file:
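A minimal sketch of such semi-structured personal data, assuming Python's standard xml.etree.ElementTree module; the tag names and values are illustrative, but they show how tags mark the semantic elements and hierarchy.

import xml.etree.ElementTree as ET

# Tags mark the semantic elements even though there is no fixed table schema
xml_data = """
<people>
  <person>
    <name>Asha Verma</name>
    <email>asha@example.com</email>
  </person>
  <person>
    <name>Ravi Kumar</name>
    <email>ravi@example.com</email>
  </person>
</people>
"""

root = ET.fromstring(xml_data)
for person in root.findall("person"):
    print(person.findtext("name"), person.findtext("email"))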
Unstructured data
 Unstructured data is data that is not organized in a predefined manner
and does not have a predefined data model.
 For unstructured data, there are alternative platforms for storing and
managing it.
 It is increasingly prevalent in IT systems and is used by
organizations in a variety of business intelligence and analytics
applications.
Example: Word, PDF, Text, Media logs.
Unstructured data
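Because unstructured data has no predefined model, even a simple question such as "which words occur most often?" needs processing first. A minimal sketch in plain Python, on an invented text snippet:

from collections import Counter
import re

# Free text from a log, document, or message has no predefined data model
text = "Support ticket: login failed. User retried login, login still failed."

# Tokenize and count words to impose some structure for analysis
words = re.findall(r"[a-z]+", text.lower())
print(Counter(words).most_common(3))  # e.g. [('login', 3), ('failed', 2), ...]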
Differences
Characteristics of Data
The nine characteristics that define data are:
1. Accuracy and Precision
2. Completeness and Comprehensiveness
3. Reliability and Consistency
4. Relevance
5. Timeliness
6. Objectivity
7. Granularity
8. Availability and Accessibility
9. Confidentiality
Characteristics of Data
1. Accuracy: Data should be accurate, meaning that it is a true
representation of the real-world phenomenon it is intended to measure.

2. Completeness: Data should be complete, meaning that it contains all
the necessary information required for the analysis or interpretation of
the phenomenon.

3. Consistency: Data should be consistent, meaning that it is free from
contradictions or errors that might lead to invalid conclusions.

4. Relevance: Data should be relevant, meaning that it is directly related to
the research question or problem being investigated.
Characteristics of Data
5. Timeliness: Data should be timely, meaning that it is current and
up-to-date and has been collected within an appropriate time frame.

6. Objectivity: Data should be objective, meaning that it is free from bias
or subjectivity that might influence the interpretation or analysis of the
data.

7. Granularity: Data should have an appropriate level of detail or
granularity to support the analysis or interpretation of the phenomenon
being studied.
Characteristics of Data
8. Accessibility: Data should be easily accessible, meaning that it is
available in a format and location that can be easily accessed by the
people who need it.

9. Confidentiality: Data should be kept confidential and secure to protect
the privacy and security of the individuals or organizations that are the
source of the data.
Introduction to Big Data Platform

 Big Data is a collection of data that is huge in volume, yet growing
exponentially with time.
 It is data of such large size and complexity that no traditional data
management tool can store or process it efficiently.
Examples of Big Data

 Stock Exchange: The New York Stock Exchange generates about one
terabyte of new trade data per day.
 Social Media: Statistics show that 500+ terabytes of new data are
ingested into the databases of the social media site Facebook every day.
This data is mainly generated from photo and video uploads, message
exchanges, comments, etc.
 Jet Engine: A single jet engine can generate 10+ terabytes of data in 30
minutes of flight time. With many thousands of flights per day, data
generation reaches many petabytes.
Big Data's 5 Vs
VOLUME
 The name "Big Data" itself refers to a size that is enormous. The size of
the data plays a very crucial role in determining the value that can be
derived from it.
 Whether particular data can actually be considered Big Data or not
depends upon the volume of the data.
VELOCITY
 The term "velocity" refers to the speed of generation of data. How fast
the data is generated and processed to meet demands determines the
real potential of the data.
 Big Data velocity deals with the speed at which data flows in from
sources like business processes, application logs, networks, social media
sites, sensors, mobile devices, etc. The flow of data is massive and
continuous.
VARIETY
 Variety refers to heterogeneous sources and the nature of data, both
structured and unstructured.
 In earlier days, spreadsheets and databases were the only sources of
data considered by most applications.
 Nowadays, data in the form of emails, photos, videos, monitoring
devices, PDFs, audio, etc. is also considered in analysis applications.
 This variety of unstructured data poses certain issues for storing,
mining, and analyzing data.
VERACITY
 Veracity refers to the quality of the data.
 We may have all the data, but inconsistencies and uncertainty in the
data are a major challenge.
VALUE
 Bulk data that holds no value is of no use to organizations.
 It needs to be converted into something valuable from which
information can be extracted.
What is Data Analytics?
 Data analytics is the science of analyzing raw data in order to
make conclusions about that information. Many of the
techniques and processes of data analytics have been
automated into mechanical processes and algorithms that
work over raw data for human consumption.
 Data analytics techniques can reveal trends and metrics that
would otherwise be lost in the mass of information. This
information can then be used to optimize processes to
increase the overall efficiency of a business or system.
The Process in Data Analysis
Data analysis involves several steps:
1. The first step is to determine the data requirements or how the data is
grouped. Data may be separated by age, demographic, income, or gender. Data
values may be numerical or divided by category.
2. The second step is collecting the data. This can be done through a variety of
sources such as computers, online sources, cameras, environmental sources,
or through personnel.
3. Once the data is collected, it must be organized so it can be analyzed.
Organization may take place in a spreadsheet or other software that can
handle statistical data.
4. The data is then cleaned up before analysis: it is scrubbed and checked to
ensure there is no duplication or error and that it is not incomplete. This step
helps correct any errors before the data goes on to a data analyst to be
analyzed, as sketched below.
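A minimal sketch of steps 3 and 4 (organizing and cleaning), assuming Python with pandas; the column names and values are illustrative only.

import pandas as pd

# Step 3: organize collected records into a tabular structure
raw = pd.DataFrame({
    "age":    [25, 25, None, 41],
    "income": [32000, 32000, 54000, 61000],
    "gender": ["F", "F", "M", "M"],
})

# Step 4: clean before analysis - remove duplicates and incomplete rows
clean = raw.drop_duplicates().dropna()
print(clean)

# The data can now be grouped, e.g. by gender, for analysis
print(clean.groupby("gender")["income"].mean())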
Why Data Analytics Matters
1. Data analytics is important because it helps businesses optimize their
performance. Implementing it into the business model means companies
can reduce costs by identifying more efficient ways of doing business and
by storing large amounts of data.
2. A company can also use data analytics to make better business decisions
and to help analyze customer trends and satisfaction, which can lead to
new and better products and services.
3. Data analytics helps a business optimize its performance.
Need for Data Analytics
Evolution of Analytic Scalability

Table: Measurement of Data Sizes

MPP (Massively Parallel Processing)

Figure: Execution of tasks in MPP
Evolution of Technologies
Data Analysis Process
Analytics Process Model
Reporting vs Analytics
Big Data Tools
 Hadoop
 Hive
 HBase
 Sqoop
Modern Data Analytics Tools
 R Programming
 Tableau Public
 Python
 SAS
 Apache Spark
 Excel
 RapidMiner
Modern Data Analytical Tools
Applications
1. Security
2. Transportation
3. Agriculture
4. Fast internet allocation
5. Banking
6. Interaction with customers
7. Planning of cities
8. Healthcare
Data Analytics Lifecycle
Phase 1: Discovery
1. Learning the Business Domain
2. Resources
3. Framing the Problem
4. Identifying Key Stakeholders
5. Interviewing the Analytics Sponsor
6. Developing Initial Hypotheses
7. Identifying Potential Data Sources
Phase 2: Data Preparation
 Steps to explore, preprocess, and condition data prior to modeling and
analysis.
 This phase requires the presence of an analytic sandbox; the team
performs extract, load, and transform (ELT) to get data into the sandbox
(a small loading sketch follows below).
 Data preparation tasks are likely to be performed multiple times and not
in a predefined order.
 Several tools commonly used for this phase are Hadoop, Alpine Miner,
OpenRefine, etc.
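A minimal sketch of the load-then-transform pattern into a sandbox, assuming Python with pandas and an in-memory SQLite database standing in for the analytic sandbox; the table and column names are invented for illustration.

import sqlite3
import pandas as pd

# Stand-in for the analytic sandbox (an in-memory database here)
sandbox = sqlite3.connect(":memory:")

# Load: copy the raw extract into the sandbox unchanged
raw = pd.DataFrame({
    "event_date": ["2023-01-05", None, "2023-01-07"],
    "idea_id":    [101, 102, 103],
})
raw.to_sql("raw_events", sandbox, if_exists="replace", index=False)

# Transform inside the sandbox, leaving the source systems untouched
conditioned = pd.read_sql(
    "SELECT * FROM raw_events WHERE event_date IS NOT NULL", sandbox
)
conditioned.to_sql("events_conditioned", sandbox, if_exists="replace", index=False)
print(conditioned)
sandbox.close()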
Phase 3: Model Planning
The team explores the data to learn about the relationships between
variables and subsequently selects key variables and the most suitable
models.
 Assess the structure of the data: this dictates the tools and analytic
techniques for the next phase.
 Ensure the analytic techniques enable the team to meet the business
objectives and accept or reject the working hypotheses.
 Determine if the situation warrants a single model or a series of
techniques as part of a larger analytic workflow.
 Research and understand how other analysts have approached the same
or similar kinds of problems.
 Several tools commonly used for this phase are MATLAB, Statistica,
and R.
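A minimal sketch of exploring relationships between variables to shortlist model inputs, assuming Python with pandas; the variables and values are illustrative only.

import pandas as pd

df = pd.DataFrame({
    "ad_spend":  [10, 20, 30, 40, 50],
    "visits":    [110, 190, 320, 405, 510],
    "region_id": [1, 2, 1, 2, 1],
})

# Pairwise correlations hint at which variables relate to each other
print(df.corr())

# Variables strongly correlated with the target are candidate model inputs
print(df.corr()["visits"].sort_values(ascending=False))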
Phase 4: Model Building
 Execute the models defined in Phase 3.
 Develop datasets for training, testing, and production.
 Develop the analytic model on training data and test it on the test data
(a minimal workflow sketch follows below).
 Questions to consider:
 Does the model appear valid and accurate on the test data?
 Does the model output/behavior make sense to the domain experts?
 Do the parameter values make sense in the context of the domain?
 Is the model sufficiently accurate to meet the goal?
 Does the model avoid intolerable mistakes?
 Are more data or inputs needed?
 Will the kind of model chosen support the runtime environment?
 Is a different form of the model required to address the business
problem?
 Free or open-source tools – R and PL/R, Octave, WEKA.
 Commercial tools – MATLAB, Statistica.
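A minimal sketch of the train/test workflow referred to above, assuming Python with scikit-learn on one of its bundled sample datasets; it illustrates the pattern rather than the specific tools named in the slides.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Develop separate datasets for training and testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit the analytic model on training data only
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Check whether the model appears valid and accurate on held-out test data
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))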
Phase 5: Communicate Results
 After executing the model, the team needs to compare the outcomes of
modeling to the criteria established for success and failure.
 The team considers how best to articulate the findings and outcomes to
the various team members and stakeholders, taking into account caveats
and assumptions.
 The team should identify key findings, quantify the business value, and
develop a narrative to summarize and convey the findings to
stakeholders.
Phase 6: Operationalize
 In this last phase, the team communicates the
benefits of the project more broadly and sets up
a pilot project to deploy the work in a controlled
way
 Risk is managed effectively by undertaking a small-scope pilot
deployment before a wide-scale rollout
 During the pilot project, the team may need to
execute the algorithm more efficiently in the
database rather than with in-memory tools like
R, especially with larger datasets
 To test the model in a live setting, consider
running the model in a production environment
for a discrete set of products or a single line of
business
 Monitor model accuracy and retrain the model
if necessary
 Free or open-source tools – Octave, WEKA, SQL,
MADlib.
Case Study: Global Innovation Network
and Analysis (GINA)
 In 2012 EMC’s new director wanted to improve the company’s
engagement of employees across the global centers of
excellence (GCE) to drive innovation, research, and university
partnerships
 This project was created to accomplish the following:
 Store formal and informal data
 Track research from global technologists
 Mine the data for patterns and insights to improve the team’s
operations and strategy
Phase 1: Discovery
 Team members and roles
 Business user, project sponsor, project manager – Vice President from
the Office of the CTO
 BI analyst – person from IT
 Data engineer and DBA – people from IT
 Data scientist – distinguished engineer
Phase 1: Discovery
 The data fell into two categories
 Five years of idea submissions from internal innovation
contests
 Minutes and notes representing innovation and research
activity from around the world
 Hypotheses grouped into two categories
 Descriptive analytics of what is happening to spark further
creativity, collaboration, and asset generation
 Predictive analytics to advise executive management of
where it should be investing in the future
Phase 2: Data Preparation
 Set up an analytics sandbox.
 Discovered that certain data needed conditioning and normalization and
that missing datasets were critical.
 The team recognized that poor-quality data could impact subsequent
steps.
 They discovered that many names were misspelled and that there were
problems with extra spaces.
 These seemingly small problems had to be addressed, as sketched below.
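A minimal sketch of the kind of conditioning described above (trimming stray spaces and reconciling misspelled names), assuming Python with pandas; the names and corrections are invented for illustration.

import pandas as pd

submissions = pd.DataFrame({
    "submitter": ["  Asha Verma", "asha verma ", "Raví Kumar", "Ravi Kumar"],
    "ideas":     [3, 1, 2, 4],
})

# Strip stray whitespace and normalize capitalization
submissions["submitter"] = submissions["submitter"].str.strip().str.title()

# Map known misspellings to a canonical form before aggregating
corrections = {"Raví Kumar": "Ravi Kumar"}
submissions["submitter"] = submissions["submitter"].replace(corrections)

print(submissions.groupby("submitter")["ideas"].sum())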
Phase 3: Model Planning

 The study included the following considerations:
 Identify the right milestones to achieve the goals.
 Trace how people move ideas from each milestone toward the goal.
 Track ideas that die and others that reach the goal.
 Compare times and outcomes using a few different methods.
Phase 4: Model Building
 Several analytic methods were employed:
 NLP on textual descriptions
 Social network analysis using R and RStudio
 Developed social graphs and visualizations

Phase 4: Model Building
Figure: Social graph of top innovation influencers
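The slides note that the GINA team built its social graphs with R and RStudio; purely as an analogue, the sketch below computes a simple influence measure with Python and NetworkX on an invented collaboration network.

import networkx as nx

# Edges represent people who collaborated on an idea (invented example data)
collaborations = [
    ("Asha", "Ravi"), ("Asha", "Mei"), ("Asha", "Liam"),
    ("Ravi", "Mei"), ("Liam", "Noor"),
]
G = nx.Graph()
G.add_edges_from(collaborations)

# Degree centrality: people connected to many others are candidate influencers
centrality = nx.degree_centrality(G)
for person, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(person, round(score, 2))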
Phase 5: Communicate Results
 The study was successful in identifying hidden innovators.
 It found a high density of innovators in Cork, Ireland.
 The CTO's office launched longitudinal studies.
Phase 6: Operationalize
 Deployment was not really discussed.
 Key findings:
 More data are needed in the future.
 Some data were sensitive.
 A parallel initiative needs to be created to improve basic BI activities.
 A mechanism is needed to continually reevaluate the model after
deployment.
