Data Analysis - Unit 1

The document outlines a course on Data Analytics, detailing its structure, prerequisites, and the significance of data analytics in various sectors. It covers different types of data (structured, semi-structured, and unstructured), characteristics of data, and the importance of big data, including its 5Vs: Volume, Velocity, Variety, Veracity, and Value. Additionally, it describes the data analytics process, modern tools, and a case study on improving employee engagement through data analysis.

UNIT I

Data Analytics

Course Instructor
Dr. Himanshu Rai
Data Analytics (KIT-601)
 Full Credit Course
4 Credits
150 marks (External: 100, Internal: 50)

 Syllabus Page No: 23 (slide 5)


Data Analytics Lab (KIT-651)
 1 Credit
50 marks (External: 25, Internal: 25)
Prerequisite
 Statistics
 Mean, Median, Mode, Quartiles
 Standard Deviation
 Probability distributions
 Matrix operations

 Vector Algebra
 Dot & cross product of vectors (a short sketch follows below)
Introduction
 Nearly every sector now generates enormous quantities of data that can
provide useful insights into the field; over the last ten years this has led
to a surge in the data market.
 Collecting data alone is not enough: to gain decision-making insights, the
compiled data must be analyzed. Data analytics helps organizations and
businesses gain insight into the enormous amount of information they
need for further production and growth.
What is Data?
 Data is a collection of facts, such
as numbers, words,
measurements, observations or
just descriptions of things.
Why?
Classification of Data
Structured Data
 Structured data is data whose elements are addressable for
effective analysis.
 It has been organized into a formatted repository that is
typically a database.
 Today, structured data is the most processed form in application
development and the simplest to manage. Example: relational data.
Structured Data
Examples Of Structured Data
An 'Employee' table in a database
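As a hedged illustration of relational (structured) data, the sketch below builds a tiny 'Employee' table with Python's built-in sqlite3 module; the column names and rows are assumptions, not taken from the slides.

import sqlite3

# In-memory relational database holding a small 'Employee' table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (emp_id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?)",
    [(1, "Asha", "Sales", 52000.0), (2, "Ravi", "IT", 61000.0)],
)

# Because every element is addressable, analysis is a simple query
for row in conn.execute("SELECT name, salary FROM Employee WHERE dept = 'IT'"):
    print(row)
conn.close()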
Semi-Structured data
 Semi-structured data is a form of structured data that does
not obey the tabular structure of data models associated with
relational databases or other forms of data tables, but
nonetheless contains tags or other markers to separate
semantic elements and enforce hierarchies of records and
fields within the data.
 With some processing, you can store it in a relational database.
Example: XML data.
Semi-Structured data
 Examples Of Semi-structured Data
Personal data stored in an XML file:
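A minimal sketch of such semi-structured personal data, assuming Python's standard xml.etree.ElementTree module; the tag names and values are illustrative, but they show how tags mark the semantic elements and hierarchy.

import xml.etree.ElementTree as ET

# Tags mark the semantic elements even though there is no fixed table schema
xml_data = """
<people>
  <person>
    <name>Asha Verma</name>
    <email>asha@example.com</email>
  </person>
  <person>
    <name>Ravi Kumar</name>
    <email>ravi@example.com</email>
  </person>
</people>
"""

root = ET.fromstring(xml_data)
for person in root.findall("person"):
    print(person.findtext("name"), person.findtext("email"))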
Unstructured data
 Unstructured data is data that is not organized in a predefined manner
and does not have a predefined data model.
 For unstructured data, there are alternative platforms for storing and
managing it.
 It is increasingly prevalent in IT systems and is used by
organizations in a variety of business intelligence and analytics
applications.
Example: Word, PDF, Text, Media logs.
Unstructured data
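Because unstructured data has no predefined model, even a simple question such as "which words occur most often?" needs processing first. A minimal sketch in plain Python, on an invented text snippet:

from collections import Counter
import re

# Free text from a log, document, or message has no predefined data model
text = "Support ticket: login failed. User retried login, login still failed."

# Tokenize and count words to impose some structure for analysis
words = re.findall(r"[a-z]+", text.lower())
print(Counter(words).most_common(3))  # e.g. [('login', 3), ('failed', 2), ...]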
Differences
Characteristics of Data
The nine characteristics that define data are:
1. Accuracy and Precision
2. Completeness and Comprehensiveness
3. Reliability and Consistency
4. Relevance
5. Timeliness
6. Objectivity
7. Granularity
8. Availability and Accessibility
9. Confidentiality
Characteristics of Data
1. Accuracy: Data should be accurate, meaning that it is a true
representation of the real-world phenomenon it is intended to measure.

2. Completeness: Data should be complete, meaning that it contains all
the necessary information required for the analysis or interpretation of
the phenomenon.

3. Consistency: Data should be consistent, meaning that it is free from
contradictions or errors that might lead to invalid conclusions.

4. Relevance: Data should be relevant, meaning that it is directly related to
the research question or problem being investigated.
Characteristics of Data
5. Timeliness: Data should be timely, meaning that it is current and
up-to-date and has been collected within an appropriate time frame.

6. Objectivity: Data should be objective, meaning that it is free from bias
or subjectivity that might influence the interpretation or analysis of the
data.

7. Granularity: Data should have an appropriate level of detail or
granularity to support the analysis or interpretation of the phenomenon
being studied.
Characteristics of Data
8. Accessibility: Data should be easily accessible, meaning that it is
available in a format and location that can be easily accessed by the
people who need it.

9. Confidentiality: Data should be kept confidential and secure to protect
the privacy and security of the individuals or organizations that are the
source of the data.
Introduction to Big Data Platform

 Big Data is a collection of data that is huge in volume, yet growing
exponentially with time.
 It is data of such large size and complexity that no traditional data
management tool can store or process it efficiently.
Examples of Big Data

 Stock Exchange: The New York Stock Exchange generates about one
terabyte of new trade data per day.
 Social Media: Statistics show that 500+ terabytes of new data are
ingested into the databases of the social media site Facebook every day.
This data is mainly generated from photo and video uploads, message
exchanges, comments, etc.
 Jet Engine: A single jet engine can generate 10+ terabytes of data in 30
minutes of flight time. With many thousands of flights per day, data
generation reaches many petabytes.
Big Data's 5 Vs
VOLUME
 The name "Big Data" itself refers to a size that is enormous. The size of
the data plays a very crucial role in determining the value that can be
derived from it.
 Whether particular data can actually be considered Big Data or not
depends upon the volume of the data.
VELOCITY
 The term "velocity" refers to the speed of generation of data. How fast
the data is generated and processed to meet demands determines the
real potential of the data.
 Big Data velocity deals with the speed at which data flows in from
sources like business processes, application logs, networks, social media
sites, sensors, mobile devices, etc. The flow of data is massive and
continuous.
VARIETY
 Variety refers to heterogeneous sources and the nature of data, both
structured and unstructured.
 In earlier days, spreadsheets and databases were the only sources of
data considered by most applications.
 Nowadays, data in the form of emails, photos, videos, monitoring
devices, PDFs, audio, etc. is also considered in analysis applications.
 This variety of unstructured data poses certain issues for storing,
mining, and analyzing data.
VERACITY
 Veracity refers to the quality of the data.
 We may have all the data, but inconsistencies and uncertainty in the
data are a major challenge.
VALUE
 Bulk data that holds no value is of no use to organizations.
 It needs to be converted into something valuable from which
information can be extracted.
What is Data Analytics?
 Data analytics is the science of analyzing raw data in order to
make conclusions about that information. Many of the
techniques and processes of data analytics have been
automated into mechanical processes and algorithms that
work over raw data for human consumption.
 Data analytics techniques can reveal trends and metrics that
would otherwise be lost in the mass of information. This
information can then be used to optimize processes to
increase the overall efficiency of a business or system.
The Process in Data Analysis
Data analysis involves several steps:
1. The first step is to determine the data requirements or how the data is
grouped. Data may be separated by age, demographic, income, or gender. Data
values may be numerical or divided by category.
2. The second step is collecting the data. This can be done through a variety of
sources such as computers, online sources, cameras, environmental sources,
or through personnel.
3. Once the data is collected, it must be organized so it can be analyzed.
Organization may take place in a spreadsheet or other software that can
handle statistical data.
4. The data is then cleaned up before analysis: it is scrubbed and checked to
ensure there is no duplication or error and that it is not incomplete. This step
helps correct any errors before the data goes on to a data analyst to be
analyzed, as sketched below.
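A minimal sketch of steps 3 and 4 (organizing and cleaning), assuming Python with pandas; the column names and values are illustrative only.

import pandas as pd

# Step 3: organize collected records into a tabular structure
raw = pd.DataFrame({
    "age":    [25, 25, None, 41],
    "income": [32000, 32000, 54000, 61000],
    "gender": ["F", "F", "M", "M"],
})

# Step 4: clean before analysis - remove duplicates and incomplete rows
clean = raw.drop_duplicates().dropna()
print(clean)

# The data can now be grouped, e.g. by gender, for analysis
print(clean.groupby("gender")["income"].mean())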
Why Data Analytics Matters
1. Data analytics is important because it helps businesses optimize their
performance. Implementing it into the business model means companies
can reduce costs by identifying more efficient ways of doing business and
by storing large amounts of data.
2. A company can also use data analytics to make better business decisions
and to help analyze customer trends and satisfaction, which can lead to
new and better products and services.
3. Data analytics helps a business optimize its performance.
Need for Data Analytics
Evolution of Analytic Scalability

Table: Measurement of Data Sizes

MPP (Massively Parallel Processing)

Figure: Execution of tasks in MPP
Evolution of Technologies
Data Analysis Process
Analytics Process Model
Reporting vs Analytics
Big Data Tools
 Hadoop
 Hive
 HBase
 Sqoop
Modern Data Analytics Tools
 R Programming
 Tableau Public
 Python
 SAS
 Apache Spark
 Excel
 RapidMiner
Modern Data Analytical Tools
Applications
1. Security
2. Transportation
3. Agriculture
4. Fast internet allocation
5. Banking
6. Interaction with customers
7. Planning of cities
8. Healthcare
Data Analytics Lifecycle
Phase 1: Discovery
1. Learning the Business Domain
2. Resources
3. Framing the Problem
4. Identifying Key Stakeholders
5. Interviewing the Analytics Sponsor
6. Developing Initial Hypotheses
7. Identifying Potential Data Sources
Phase 2: Data Preparation
 Steps to explore, preprocess, and condition data prior to modeling and
analysis.
 This phase requires the presence of an analytic sandbox; the team
performs extract, load, and transform (ELT) to get data into the sandbox
(a small loading sketch follows below).
 Data preparation tasks are likely to be performed multiple times and not
in a predefined order.
 Several tools commonly used for this phase are Hadoop, Alpine Miner,
OpenRefine, etc.
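A minimal sketch of the load-then-transform pattern into a sandbox, assuming Python with pandas and an in-memory SQLite database standing in for the analytic sandbox; the table and column names are invented for illustration.

import sqlite3
import pandas as pd

# Stand-in for the analytic sandbox (an in-memory database here)
sandbox = sqlite3.connect(":memory:")

# Load: copy the raw extract into the sandbox unchanged
raw = pd.DataFrame({
    "event_date": ["2023-01-05", None, "2023-01-07"],
    "idea_id":    [101, 102, 103],
})
raw.to_sql("raw_events", sandbox, if_exists="replace", index=False)

# Transform inside the sandbox, leaving the source systems untouched
conditioned = pd.read_sql(
    "SELECT * FROM raw_events WHERE event_date IS NOT NULL", sandbox
)
conditioned.to_sql("events_conditioned", sandbox, if_exists="replace", index=False)
print(conditioned)
sandbox.close()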
Phase 3: Model Planning
The team explores the data to learn about the relationships between
variables and subsequently selects key variables and the most suitable
models.
 Assess the structure of the data: this dictates the tools and analytic
techniques for the next phase.
 Ensure the analytic techniques enable the team to meet the business
objectives and accept or reject the working hypotheses.
 Determine if the situation warrants a single model or a series of
techniques as part of a larger analytic workflow.
 Research and understand how other analysts have approached the same
or similar kinds of problems.
 Several tools commonly used for this phase are MATLAB, Statistica,
and R.
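A minimal sketch of exploring relationships between variables to shortlist model inputs, assuming Python with pandas; the variables and values are illustrative only.

import pandas as pd

df = pd.DataFrame({
    "ad_spend":  [10, 20, 30, 40, 50],
    "visits":    [110, 190, 320, 405, 510],
    "region_id": [1, 2, 1, 2, 1],
})

# Pairwise correlations hint at which variables relate to each other
print(df.corr())

# Variables strongly correlated with the target are candidate model inputs
print(df.corr()["visits"].sort_values(ascending=False))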
Phase 4: Model Building
 Execute the models defined in Phase 3.
 Develop datasets for training, testing, and production.
 Develop the analytic model on training data and test it on the test data
(a minimal workflow sketch follows below).
 Questions to consider:
 Does the model appear valid and accurate on the test data?
 Does the model output/behavior make sense to the domain experts?
 Do the parameter values make sense in the context of the domain?
 Is the model sufficiently accurate to meet the goal?
 Does the model avoid intolerable mistakes?
 Are more data or inputs needed?
 Will the kind of model chosen support the runtime environment?
 Is a different form of the model required to address the business
problem?
 Free or open-source tools – R and PL/R, Octave, WEKA.
 Commercial tools – MATLAB, Statistica.
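A minimal sketch of the train/test workflow referred to above, assuming Python with scikit-learn on one of its bundled sample datasets; it illustrates the pattern rather than the specific tools named in the slides.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Develop separate datasets for training and testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit the analytic model on training data only
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Check whether the model appears valid and accurate on held-out test data
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))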
Phase 5: Communicate Results
 After executing the model, the team needs to compare the outcomes of
modeling to the criteria established for success and failure.
 The team considers how best to articulate the findings and outcomes to
the various team members and stakeholders, taking into account caveats
and assumptions.
 The team should identify key findings, quantify the business value, and
develop a narrative to summarize and convey the findings to
stakeholders.
Phase 6: Operationalize
 In this last phase, the team communicates the
benefits of the project more broadly and sets up
a pilot project to deploy the work in a controlled
way
 Risk is managed effectively by undertaking a small-scope pilot
deployment before a wide-scale rollout
 During the pilot project, the team may need to
execute the algorithm more efficiently in the
database rather than with in-memory tools like
R, especially with larger datasets
 To test the model in a live setting, consider
running the model in a production environment
for a discrete set of products or a single line of
business
 Monitor model accuracy and retrain the model
if necessary
 Free or open-source tools – Octave, WEKA, SQL,
MADlib.
Case Study: Global Innovation Network
and Analysis (GINA)
 In 2012 EMC’s new director wanted to improve the company’s
engagement of employees across the global centers of
excellence (GCE) to drive innovation, research, and university
partnerships
 This project was created to accomplish the following:
 Store formal and informal data
 Track research from global technologists
 Mine the data for patterns and insights to improve the team’s
operations and strategy
Phase 1: Discovery
 Team members and roles
 Business user, project sponsor, project manager – Vice President from
the Office of the CTO
 BI analyst – person from IT
 Data engineer and DBA – people from IT
 Data scientist – distinguished engineer
Phase 1: Discovery
 The data fell into two categories
 Five years of idea submissions from internal innovation
contests
 Minutes and notes representing innovation and research
activity from around the world
 Hypotheses grouped into two categories
 Descriptive analytics of what is happening to spark further
creativity, collaboration, and asset generation
 Predictive analytics to advise executive management of
where it should be investing in the future
Phase 2: Data Preparation
 Set up an analytics sandbox.
 Discovered that certain data needed conditioning and normalization and
that missing datasets were critical.
 The team recognized that poor-quality data could impact subsequent
steps.
 They discovered that many names were misspelled and that there were
problems with extra spaces.
 These seemingly small problems had to be addressed, as sketched below.
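A minimal sketch of the kind of conditioning described above (trimming stray spaces and reconciling misspelled names), assuming Python with pandas; the names and corrections are invented for illustration.

import pandas as pd

submissions = pd.DataFrame({
    "submitter": ["  Asha Verma", "asha verma ", "Raví Kumar", "Ravi Kumar"],
    "ideas":     [3, 1, 2, 4],
})

# Strip stray whitespace and normalize capitalization
submissions["submitter"] = submissions["submitter"].str.strip().str.title()

# Map known misspellings to a canonical form before aggregating
corrections = {"Raví Kumar": "Ravi Kumar"}
submissions["submitter"] = submissions["submitter"].replace(corrections)

print(submissions.groupby("submitter")["ideas"].sum())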
Phase 3: Model Planning

 The study included the following considerations:
 Identify the right milestones to achieve the goals.
 Trace how people move ideas from each milestone toward the goal.
 Track ideas that die and others that reach the goal.
 Compare times and outcomes using a few different methods.
Phase 4: Model Building
 Several analytic methods were employed:
 NLP on textual descriptions
 Social network analysis using R and RStudio
 Developed social graphs and visualizations

Phase 4: Model Building
Figure: Social graph of top innovation influencers
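The slides note that the GINA team built its social graphs with R and RStudio; purely as an analogue, the sketch below computes a simple influence measure with Python and NetworkX on an invented collaboration network.

import networkx as nx

# Edges represent people who collaborated on an idea (invented example data)
collaborations = [
    ("Asha", "Ravi"), ("Asha", "Mei"), ("Asha", "Liam"),
    ("Ravi", "Mei"), ("Liam", "Noor"),
]
G = nx.Graph()
G.add_edges_from(collaborations)

# Degree centrality: people connected to many others are candidate influencers
centrality = nx.degree_centrality(G)
for person, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(person, round(score, 2))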
Phase 5: Communicate Results
 The study was successful in identifying hidden innovators.
 It found a high density of innovators in Cork, Ireland.
 The CTO's office launched longitudinal studies.
Phase 6: Operationalize
 Deployment was not really discussed.
 Key findings:
 More data are needed in the future.
 Some data were sensitive.
 A parallel initiative needs to be created to improve basic BI activities.
 A mechanism is needed to continually reevaluate the model after
deployment.
