
DIGITAL NOTES

Course Title : DATA ANALYTICS


Course Code : R22MBA19
Course (Year/Semester) : MBA II Year I semester
Course Type : Core
Course Credits : 4

Course Objectives:

 To explain the concepts of data analytics.
 To provide an understanding of digital data.
 To impart knowledge of data visualization and big data analytics.
 To comprehend the gathering, cleaning, and organizing of data from various sources.
 To understand the role played by various statistical tools and techniques.

Course Outcomes:
Students will be able:
 To gain a comprehensive understanding of data terminologies and concepts.
 To collect, clean, and preprocess data for analysis.
 To create compelling visual representations of data to aid in decision-making processes.
 To implement data analytics projects, demonstrating project management, teamwork, and
problem-solving skills.
 To develop proficiency in statistical analysis and apply various techniques to extract
meaningful insights from data.
---------------------------------------------------------------------------------------------------------------------
UNIT – I: INTRODUCTION TO DATA ANALYTICS

Introduction: Meaning of Data Analytics- Need of Data Analytics- Business Analytics vs. Data
Analytics - Categorization of Data Analytical Models. Data Scientist vs. Data Engineer vs.
Data Analyst- Role of Data Analyst- Data Analytics in Practice.

INTRODUCTION:
Data analytics converts raw data into actionable insights. It includes a range of tools,
technologies, and processes used to find trends and solve problems by using data. Data
analytics can shape business processes, improve decision-making, and foster business growth.
One common form of analysis helps describe or summarize quantitative data by presenting statistics.
For example, descriptive statistical analysis could show the sales distribution across a group of
employees and the average sales figure per employee; such analysis answers the question,
"What happened?"
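To make this concrete, here is a minimal sketch of descriptive statistics in Python, assuming a small,
hypothetical table of per-employee sales (the names and figures are invented for illustration):

```python
# Minimal descriptive-statistics sketch with pandas on hypothetical sales data.
import pandas as pd

sales = pd.DataFrame({
    "employee": ["Asha", "Ravi", "Meena", "John"],   # hypothetical employees
    "sales":    [120000, 95000, 134000, 88000],      # hypothetical sales figures
})

print(sales["sales"].describe())                     # count, mean, std, min, quartiles, max
print("Average sales per employee:", sales["sales"].mean())
```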

MEANING:
Data Analysis is the process of systematically applying statistical and/or logical techniques to
describe and illustrate, condense and recap, and evaluate data.

NEED OF DATA ANALYTICS:


The use of data analytics in product development gives a reliable understanding of future
requirements. The company will understand the current market situation of its product and
can use the techniques to develop new products as per market requirements.
The ability to make data-driven decisions can give organizations a competitive edge in
their markets. Data analysts are essential for leveraging the power of data: they turn data
into meaningful insights that can drive better decision-making.
Data analytics is important because it helps businesses optimize their performance.
Implementing it into the business model means companies can reduce costs by
identifying more efficient ways of doing business and by storing large amounts of data.

 Improved Decision-Making – When supporting data is available in favor of a decision,
it can be implemented with a much higher probability of success. For example, if a
certain decision or plan has already led to better outcomes, there will be little doubt
about implementing it again.
 Better Customer Service – Churn modeling is the best example of this: we try to
predict or identify what leads to customer churn and change those things accordingly,
so that customer attrition stays as low as possible, which is a very important factor in
any organization.
 Efficient Operations – Data analytics helps us understand what the situation demands
and what should be done to get better results, so that we can streamline our processes,
which in turn leads to efficient operations.
 Effective Marketing – Market segmentation techniques are implemented to identify
the marketing techniques that will help increase sales and leads, resulting in effective
marketing strategies.
How Data Analytics Will Help a Business Grow:

1. Analysis of business value Chain:

There are companies that will help you find insights into the value chains that already exist
in your organization, and this is done through data analytics.

2. Industry knowledge:

Industry knowledge is another thing you will be able to comprehend once you get into data
analytics; it shows how you can run your business in the near future and what the economy
already has its hands on. That is how you can avail the benefit before anyone else.

3. Seeing the opportunities:

The economy keeps changing, and keeping pace with dynamic trends is very important, but
at the same time profit-making is what an organization aims for most of the time. Data
analytics gives us analyzed data that helps us see opportunities ahead of time, which is
another way of unlocking more options.
BUSINESS ANALYTICS vs. DATA ANALYTICS

Business Analytics: Business analytics is a data-driven method used by organizations to gain


valuable insights into their business operations, improve decision-making processes, and
enhance overall performance. It involves the use of statistical analysis, predictive modeling,
data mining, and quantitative analysis to interpret data and make informed business decisions.

Importance of Business Analytics:

Informed Decision Making: Business analytics provides data-driven insights, enabling


organizations to make informed decisions based on evidence rather than intuition.

Competitive Advantage: Analyzing market trends, customer behavior, and competitors’


strategies can help businesses gain a competitive edge by identifying opportunities and threats in
the market.

Improved Efficiency: By analyzing operational data, organizations can identify inefficiencies


and optimize processes, leading to cost savings and improved productivity.

Customer Insights: Business analytics helps in understanding customer preferences, behavior,


and feedback, allowing businesses to tailor their products and services to meet customer needs
effectively.

Risk Management: Analyzing historical and current data can help businesses identify potential
risks and develop strategies to mitigate them, reducing the likelihood of losses.

Innovation: By analyzing data, businesses can identify patterns and trends that can lead to
innovative product or service offerings, improving their position in the market.

Data Analytics:
Data analytics is the process of examining, cleaning, transforming, and interpreting complex
datasets to extract valuable insights, draw conclusions, and support decision-making. It involves
various techniques from statistics, mathematics, and computer science to uncover patterns,
trends, correlations, and other useful information within the data. The primary goal of data
analytics is to gain a deeper understanding of data, solve problems, and aid in strategic decision-
making.
DATA ANALYTICS

Analytics is the discovery and communication of meaningful patterns in data. Especially
valuable in areas rich with recorded information, analytics relies on the simultaneous
application of statistics, computer programming, and operations research to quantify
performance. Analytics often favors data visualization to communicate insight.
Firms commonly apply analytics to business data to describe, predict, and improve
business performance. Areas within analytics include predictive analytics, enterprise decision
management, etc. Since analytics can require extensive computation (because of big data), its
algorithms and software harness the most current methods in computer science.
Data analytics aims to get actionable insights resulting in smarter decisions and better business
outcomes.
It is critical to design and build a data warehouse or Business Intelligence (BI) architecture that
provides a flexible, multi-faceted analytical ecosystem, optimized for efficient ingestion and
analysis of large and diverse data sets.

What is Data Analytics?


In this new digital world, data is being generated in enormous amounts, which opens up new
paradigms. Because we have high computing power as well as large amounts of data, we can
use this data for data-driven decision making. The main benefit of data-driven decisions is
that they are based on observing past trends that have already produced beneficial results.
In short, we can say that data analytics is the process of manipulating data to extract useful
trends and hidden patterns which can help us derive valuable insights to make business
predictions.

TYPES / CATEGORIES/ MODELS OF DATA ANALYTICS


There are four major types of data analytics:
1. Predictive (forecasting)
2. Descriptive (business intelligence and data mining)
3. Prescriptive (optimization and simulation)
4. Diagnostic analytics
 Predictive Analytics

Predictive analytics turns data into valuable, actionable information. It uses data to determine
the probable outcome of an event or the likelihood of a situation occurring. Predictive
analytics draws on a variety of statistical techniques from modeling, machine learning, data
mining, and game theory that analyze current and historical facts to make predictions about
future events. Techniques used for predictive analytics include:
 Linear Regression
 Time Series Analysis and Forecasting
 Data Mining
Basic cornerstones of predictive analytics:
 Predictive modeling
 Decision analysis and optimization
 Transaction profiling
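As a rough illustration of the linear regression technique listed above, the following sketch fits a model
on hypothetical historical data (the ad-spend and sales figures are invented) and uses it to forecast a
future value:

```python
# Minimal predictive-analytics sketch: linear regression on hypothetical data.
import numpy as np
from sklearn.linear_model import LinearRegression

ad_spend = np.array([[10], [20], [30], [40], [50]])   # hypothetical ad spend
sales    = np.array([105, 210, 290, 410, 500])        # hypothetical observed sales

model = LinearRegression().fit(ad_spend, sales)
print("Predicted sales for ad spend of 60:", model.predict([[60]])[0])
```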
 Descriptive Analytics
Descriptive analytics looks at data and analyzes past events for insight into how to approach
future events. It looks at past performance and understands that performance by mining
historical data to understand the causes of success or failure in the past. Almost all management
reporting, such as sales, marketing, operations, and finance, uses this type of analysis.
A descriptive model quantifies relationships in data in a way that is often used to classify
customers or prospects into groups. Unlike a predictive model that focuses on predicting the
behavior of a single customer, descriptive analytics identifies many different relationships
between customers and products.
Common examples of Descriptive analytics are company reports that provide historic
reviews like:
 Data Queries
 Reports
 Descriptive Statistics
 Data dashboard
 Prescriptive Analytics
Prescriptive analytics automatically synthesizes big data, mathematical science, business rules,
and machine learning to make a prediction and then suggests a decision option to take
advantage of the prediction.
Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions that
benefit from the predictions and showing the decision maker the implications of each decision
option. Prescriptive analytics not only anticipates what will happen and when it will happen but
also why it will happen. Further, prescriptive analytics can suggest decision options on how to
take advantage of a future opportunity or mitigate a future risk, and illustrate the implications of
each decision option.
For example, Prescriptive Analytics can benefit healthcare strategic planning by using
analytics to leverage operational and usage data combined with data of external factors such as
economic data, population demography, etc.
 Diagnostic Analytics
In this analysis, we generally use historical data to answer a question or solve a problem. We
try to find dependencies and patterns in the historical data of the particular problem.
Companies go for this analysis because it gives great insight into a problem, and they also keep
detailed information at their disposal; otherwise, data collection may turn out to be separate
for every problem and would be very time-consuming. Common techniques used
for diagnostic analytics are:
 Data discovery
 Data mining
 Correlations
For example, diagnostic analytics can help transportation businesses cut expenses and speed up
delivery times.

Data Scientist vs. Data Engineer vs. Data Analyst

 Data Analyst
Most entry-level professionals interested in getting into a data-related job start off as Data
analysts. Qualifying for this role is as simple as it gets. All you need is a bachelor’s degree and
good statistical knowledge. Strong technical skills would be a plus and can give you an edge
over most other applicants. Other than this, companies expect you to understand data handling,
modeling and reporting techniques along with a strong understanding of the business.

 Data Engineer
A Data Engineer either acquires a master's degree in a data-related field or gathers a good amount of
experience as a Data Analyst. A Data Engineer needs to have a strong technical background with
the ability to create and integrate APIs. They also need to understand data pipelining and
performance optimization.

 Data Scientist

A Data Scientist is the one who analyzes and interprets complex digital data. While there are several
ways to get into a data scientist's role, the most seamless one is by acquiring enough experience
and learning the various data scientist skills. These skills include advanced statistical analysis, a
complete understanding of machine learning, data conditioning, etc.

For a better understanding of these professionals, let’s dive deeper and understand their required
skill-sets.
ROLE OF DATA ANALYST

Common responsibilities for Data Analysts include extracting data using special tools and
software, responding to data-related queries, setting up processes to make data more efficient,
analyzing and interpreting trends from the data, and reporting trends to add business value.

1. Data Collection:

Data Gathering: Data analysts collect data from various sources, including surveys, databases,
and web sources.

Data Cleaning: They clean and preprocess data to remove inconsistencies, errors, and ensure it's
ready for analysis.

2. Data Analysis:

Exploratory Data Analysis (EDA): They perform EDA to understand the data's structure,
relationships, and distributions.

Statistical Analysis: Utilize statistical methods to analyze data and extract meaningful insights.

Predictive Analysis: Build predictive models using statistical techniques and machine learning
algorithms.

3. Data Visualization:
Data Visualization: Present data findings through visual means such as charts, graphs, and
dashboards, making complex data more accessible to stakeholders.

Dashboard Creation: Develop interactive dashboards that allow users to explore data on their
own.

4. Interpreting Results:

Interpretation: Analyze the results of data analyses and translate them into actionable business
insights.

Recommendations: Provide recommendations based on data to support decision-making


processes.

5. Communication:

Reporting: Prepare and deliver reports and presentations to both technical and non-technical
stakeholders.

Storytelling: Frame data insights into compelling stories that can be easily understood by people
without a technical background.

6. Continuous Learning:

Stay Updated: Keep up-to-date with the latest tools, technologies, and techniques in data
analysis and data science.

Skill Enhancement: Continuously enhance programming skills, statistical knowledge, and


domain expertise.

7. Problem-Solving:

Identify Problems: Work with stakeholders to identify business problems that can be solved
using data.

Innovative Solutions: Develop innovative solutions using data analysis methods to address
business challenges.

8. Ethics and Privacy:

Data Ethics: Adhere to ethical standards and guidelines when working with sensitive or private
data.

Data Privacy: Ensure compliance with data privacy regulations and protect the confidentiality of
data.

9. Collaboration:
Collaborative Work: Collaborate with data engineers, data scientists, and other professionals to
achieve common goals.

Cross-functional Collaboration: Work closely with different departments to understand their


data needs and provide relevant insights.

DATA ANALYTICS IN PRACTICE.

Data analytics has a wide range of applications across various industries and sectors. A roadmap
for getting started and some practice resources are given first, followed by the key areas where
data analytics is commonly applied.

How to Get Started in Data Analytics – A Roadmap for Beginners

• Step 1: Get to Know the Role of a Data Analyst.

• Step 2: Explore Job Requirements for Data Analyst Roles.

• Step 3: Get Comfortable with Math and Statistics.

• Step 4: Master Excel for Data Analysis.

• Step 5: Master SQL for Data Extraction.

WHERE TO PRACTICE DATA ANALYTICS:

• SQL Bolt.

• Excel Practice Online.

• Tableau.

• Python Tutorial.

• Data Analysis With Python (YouTube)

• Insights From Data With Big Query.

• Statistics - A Full University Course on Data Science Basics

 Business Intelligence: Data analytics helps businesses make informed decisions by


analyzing past data and predicting future trends. It's used for market research,
understanding customer behavior, and improving operational efficiency.
 Healthcare: In healthcare, data analytics is used for predictive analytics, patient care
management, fraud detection, and drug development. Analyzing patient data can also
lead to more personalized and effective treatments.
 Finance: Financial institutions use data analytics for fraud detection, risk management,
customer insights, and algorithmic trading. It helps in analyzing market trends and
making investment decisions.
 Marketing: Data analytics is essential in digital marketing for customer segmentation,
retargeting, and analyzing the effectiveness of marketing campaigns. Marketers use it to
understand customer preferences and optimize their strategies accordingly.
 E-commerce: E-commerce companies analyze customer behavior data to improve user
experience, personalize product recommendations, optimize pricing strategies, and
manage inventory effectively.
 Manufacturing and Supply Chain: Data analytics is used in optimizing supply chains,
predicting equipment failures, ensuring quality control, and demand forecasting. It helps
in streamlining production processes and reducing costs.
 Telecommunications: Telecommunication companies use data analytics to monitor
network performance, improve customer service, detect fraud, and optimize bandwidth
allocation.
 Education: Data analytics is employed in education for student performance analysis,
personalized learning experiences, and predicting student outcomes. It helps educators
tailor their teaching methods to individual student needs.
 Government and Public Policy: Governments use data analytics for various purposes
such as optimizing public transportation, predicting disease outbreaks, improving public
safety, and analyzing crime patterns. It also aids in policy-making decisions based on
data-driven insights.
 Sports Analytics: Sports teams and organizations use data analytics for performance
analysis, injury prevention, player scouting, and fan engagement. It helps in optimizing
team strategies and improving player performance.
 Human Resources: HR departments use data analytics for talent acquisition, employee
engagement analysis, workforce planning, and predicting employee turnover. It helps in
creating a more productive and satisfied workforce.
 Environmental Analysis: Environmental agencies use data analytics to monitor climate
change, analyze pollution patterns, and optimize resource management. It assists in
making informed decisions to address environmental challenges.

These applications demonstrate the versatility and significance of data analytics in


modern society, impacting various aspects of our lives and businesses.
UNIT – II: DEALING WITH DIGITAL DATA AND DATA SCIENCE

Data: Introduction to Digital Data- Types of Digital Data - Data Collection- Data
Preprocessing- Data Preprocessing Elements: Data Quality, Data Cleaning, and Data
Integration - Data Reduction-Data Transformation-Data Discretization.

Data Science Project Life Cycle: Business Requirement- Data Acquisition- Data Preparation-
Hypothesis and Modeling- Evaluation and Interpretation- Deployment- Operations-
Optimization- Applications for Data Science.

INTRODUCTION

Digital data is the electronic representation of information in a format or language that machines
can read and understand. In more technical terms, digital data is a binary format of information
that's converted into a machine-readable digital format. The power of digital data is that any
analog inputs, from very simple text documents to genome sequencing results, can be
represented with the binary system.

 Whenever you send an email, read a social media post, or take pictures with your digital
camera, you are working with digital data.
 In general, data can be any character or text, numbers, voice messages, SMS, WhatsApp
messages, pictures, sound, and video.

MEANING:

Any data that can be processed by a digital computer and stored as sequences of 0s and 1s
(binary language) is known as digital data.

 Digital data can represent a wide range of information, including text, images, videos,
sound, and more. Different types of data are encoded using specific digital formats. For
instance:

 Text: Text characters are represented using character encoding standards like ASCII or
Unicode.
 Images: Images are represented as a grid of pixels, each pixel being represented by
binary numbers indicating its color.
 Audio: Sound waves are converted into digital signals through a process called sampling,
where the amplitude of the sound wave is measured at regular intervals and converted
into binary numbers.
 Video: Videos are a sequence of images displayed rapidly. Each frame of the video is
represented as a digital image.
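A small Python sketch of the first point above, showing how text characters become digital (binary)
data through UTF-8 encoding:

```python
# How text becomes digital data: encoding characters to bytes (binary).
text = "Data"
encoded = text.encode("utf-8")                      # UTF-8 byte representation
bits = " ".join(f"{byte:08b}" for byte in encoded)  # the same bytes as 0s and 1s

print(list(encoded))   # e.g. [68, 97, 116, 97] -- ASCII/UTF-8 code points
print(bits)
```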
TYPES OF DIGITAL DATA

 Unstructured Data: This is data which does not conform to a data model or is not in a form
which can be used easily by a computer program. About 80-90% of an organization's data is in
this format (e.g., memos, chat-room transcripts, PowerPoint presentations, images, videos, letters,
research and white papers, the body of an email, etc.).
 Semi-Structured Data: This is data which does not conform to a data model but has some
structure. However, it is not in a form which can be used easily by a computer program; e.g.,
emails, XML, and markup languages like HTML.
 Structured Data: This is data which is in an organized form (e.g., in rows and columns) and
can be easily used by a computer program. Relationships exist between entities of data, such as
classes and their objects. Data stored in databases is an example of structured data.

[Figure: approximate share of organizational data (structured 10%, semi-structured 10%,
unstructured 80%), followed by example figures of structured, unstructured, and semi-structured data.]

DATA COLLECTION

Data collection is the process of gathering and measuring information on variables of interest, in
a systematic and organized manner. It is a fundamental step in the research process and is crucial
for making informed decisions, conducting analyses, and drawing conclusions. Here are the key
aspects of data collection:

1. Define the Objectives:

Clearly define the purpose of the data collection. What do you want to achieve? What questions
are you trying to answer?

2. Choose Data Sources:

Determine the sources from which data will be collected. Sources can be primary (collected
firsthand) or secondary (already existing data).

3. Select Data Collection Methods:


Surveys and Questionnaires: Structured sets of questions distributed to respondents.

Interviews: Direct one-on-one or group discussions with participants.

Observations: Systematically watching and recording behavior.

Experiments: Controlled tests to collect data under specific conditions.

Sensor Data: Gathering data from various sensors (temperature, GPS, etc.).

Web Scraping: Extracting data from websites (a brief sketch follows this list).

Focus Groups: Discussions with a selected group of people about a specific topic.
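As a hedged illustration of the web-scraping method listed above, the sketch below uses the requests
and BeautifulSoup libraries; the URL and page layout are hypothetical, and a real site's terms of
service and robots.txt should always be checked before scraping:

```python
# Minimal web-scraping sketch (hypothetical URL and table layout).
import requests
from bs4 import BeautifulSoup

url = "https://example.com/prices"               # hypothetical page
html = requests.get(url, timeout=10).text        # download the page HTML
soup = BeautifulSoup(html, "html.parser")        # parse it

cells = [td.get_text(strip=True) for td in soup.find_all("td")]  # extract table cells
print(cells)
```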

4. Design the Data Collection Tool:

Develop surveys, questionnaires, interview protocols, or other tools needed for data collection.
Ensure they are clear, unbiased, and appropriate for the target audience.

5. Pilot Testing:

Test the data collection tools on a small scale to identify and fix any issues with the questions or
methods.

6. Data Collection:

Implement the data collection process. This may involve administering surveys, conducting
interviews, or other chosen methods.

7. Data Recording and Storage:

Record the collected data systematically. Ensure proper storage and organization to prevent data
loss or corruption.

8. Data Validation and Cleaning:

Validate the collected data to ensure accuracy and reliability. Clean the data by removing
inconsistencies, errors, or outliers.

9. Data Analysis:

Analyze the cleaned data to identify patterns, trends, relationships, or insights.

10. Interpretation and Reporting:

Interpret the results of the data analysis. Prepare reports, charts, graphs, or presentations to
convey the findings effectively.
11. Ethical Considerations:

Ensure that data collection follows ethical guidelines, including informed consent, privacy, and
confidentiality of participants.

12. Continuous Monitoring:

Continuously monitor the data collection process to identify and address any issues promptly.

Good data collection practices are essential for ensuring the reliability and validity of the
collected information, which forms the basis for meaningful analysis and informed decision-
making.

DATA PREPROCESSING

Data preprocessing is an important step in the data mining process. It refers to the cleaning,
transforming, and integrating of data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it more suitable for the specific
data mining task.

 It is a data mining technique that involves transforming raw data into an understandable
format.

Meaning:-Data preprocessing is a crucial step in the data analysis and machine learning process.
It involves cleaning, transforming, and organizing raw data into a format that can be effectively
utilized for analysis or training machine learning models.
ELEMENTS:-
Data Quality:
Data quality measures how well a dataset meets criteria for accuracy, completeness, validity,
consistency, uniqueness, timeliness, and fitness for purpose, and it is critical to all data
governance initiatives within an organization.

Measures of Data Quality:

 Accuracy:
 Completeness
 Consistency
 Timeliness
 Believability
 Value added
 Interpretability
 Accessibility
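A minimal sketch of how two of these measures (completeness and uniqueness) might be checked with
pandas, using an invented customer table:

```python
# Simple data-quality checks: completeness and uniqueness on hypothetical records.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@x.com", None, "b@x.com", "c@x.com"],
})

completeness = df.notna().mean()                        # share of non-missing values per column
duplicates = df.duplicated(subset="customer_id").sum()  # repeated IDs violate uniqueness

print(completeness)
print("Duplicate customer_ids:", duplicates)
```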

Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is
done. It involves handling of missing data, noisy data, etc.
(a). Missing Data:
This situation arises when some data is missing in the dataset. It can be handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple
values are missing within a tuple.
2. Fill the missing values:
There are various ways to do this task. You can choose to fill the missing values manually,
by the attribute mean, or with the most probable value.
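A short pandas sketch of both options, dropping incomplete rows or filling missing values with the
attribute mean (the data frame is hypothetical):

```python
# Handling missing values: drop sparse rows or fill with the attribute mean.
import pandas as pd

df = pd.DataFrame({"age": [25, None, 31, None], "income": [40, 52, None, 47]})

dropped = df.dropna()                             # "ignore the tuples"
filled = df.fillna(df.mean(numeric_only=True))    # fill with each attribute's mean

print(filled)
```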
(b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated
due to faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into
segments of equal size and then various methods are performed to complete the task.
Each segment is handled separately. One can replace all data in a segment by its mean,
or boundary values can be used to complete the task.
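A minimal sketch of smoothing by bin means with pandas; the values and the choice of three
equal-frequency bins are illustrative only:

```python
# Smoothing noisy values by bin means: sort, split into bins, replace by the bin mean.
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]).sort_values()
bins = pd.qcut(values, q=3)                         # three equal-frequency bins
smoothed = values.groupby(bins).transform("mean")   # each value replaced by its bin's mean

print(smoothed.tolist())
```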

2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used
may be linear (having one independent variable) or multiple (having multiple
independent variables).
3. Clustering:
This approach groups similar data into clusters. Values that fall outside the clusters can be
treated as outliers (noise), though some outliers may remain undetected.

Data Integration: This involves combining data from multiple sources to create a unified
dataset. Data integration can be challenging as it requires handling data with different formats,
structures, and semantics. Techniques such as record linkage and data fusion can be used for
data integration.
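A small pandas sketch of a simple integration step, joining two hypothetical sources on a shared
customer_id key:

```python
# Integrating two hypothetical sources with a join on a shared key.
import pandas as pd

crm   = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Asha", "Ravi", "Meena"]})
sales = pd.DataFrame({"customer_id": [1, 1, 3], "amount": [250, 90, 410]})

unified = crm.merge(sales, on="customer_id", how="left")  # simple record linkage on an ID
print(unified)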

Data Reduction:
Data reduction is a crucial step in the data mining process that involves reducing the size of the
dataset while preserving the important information. This is done to improve the efficiency of
data analysis and to avoid overfitting of the model. Some common steps involved in data
reduction are:
Feature Selection: This involves selecting a subset of relevant features from the dataset.
Feature selection is often performed to remove irrelevant or redundant features from the
dataset. It can be done using various techniques such as correlation analysis, mutual
information, and principal component analysis (PCA).

Feature Extraction: This involves transforming the data into a lower-dimensional space while
preserving the important information. Feature extraction is often used when the original
features are high-dimensional and complex. It can be done using techniques such as PCA,
linear discriminant analysis (LDA), and non-negative matrix factorization (NMF).
Sampling: This involves selecting a subset of data points from the dataset. Sampling is often
used to reduce the size of the dataset while preserving the important information. It can be
done using techniques such as random sampling, stratified sampling, and systematic sampling.

Clustering: This involves grouping similar data points together into clusters. Clustering is
often used to reduce the size of the dataset by replacing similar data points with a
representative centroid. It can be done using techniques such as k-means, hierarchical
clustering, and density-based clustering.

Compression: This involves compressing the dataset while preserving the important
information. Compression is often used to reduce the size of the dataset for storage and
transmission purposes. It can be done using techniques such as wavelet compression, JPEG
compression, and gzip compression.
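To illustrate two of the reduction steps described above (random sampling and PCA-based feature
extraction), here is a minimal scikit-learn sketch on synthetic data:

```python
# Data reduction sketch: random sampling followed by PCA feature extraction.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                            # hypothetical 1000 x 10 dataset

sample = X[rng.choice(len(X), size=200, replace=False)]    # random sampling of 200 rows
X_reduced = PCA(n_components=3).fit_transform(sample)      # keep 3 principal components

print(sample.shape, "->", X_reduced.shape)
```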
Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining
process. This involves the following ways:
 Normalization:
It is done in order to scale the data values to a specified range (-1.0 to 1.0 or 0.0 to 1.0).
 Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help
the mining process.
 Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or
conceptual levels.
 Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For
example, the attribute "city" can be converted to "country".

Data Discretization: This involves dividing continuous data into discrete categories or
intervals. Discretization is often used in data mining and machine learning algorithms that
require categorical data. Discretization can be achieved through techniques such as equal-width
binning, equal-frequency binning, and clustering.
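A short pandas sketch of min-max normalization and equal-width discretization on a hypothetical
income column:

```python
# Transformation sketch: min-max normalization and equal-width discretization.
import pandas as pd

income = pd.Series([22000, 35000, 47000, 61000, 90000])

normalized = (income - income.min()) / (income.max() - income.min())  # scale to 0.0-1.0
bins = pd.cut(income, bins=3, labels=["low", "medium", "high"])       # equal-width intervals

print(normalized.round(2).tolist())
print(bins.tolist())
```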
Data Science Project Life Cycle

Earlier, data used to be much smaller and generally available in a well-structured form that we
could save effortlessly in Excel sheets and process efficiently with the help of Business
Intelligence tools.
But today we deal with enormous amounts of data: an estimated 2.5 quintillion bytes of records
are produced every day, which ultimately results in an explosion of records and data. Recent
research estimates that around 1.9 MB of data is created every second by a single individual.
So it is a very big challenge for any organization to deal with such a massive amount of data
being generated every second. Handling and evaluating this data requires some very powerful,
complex algorithms and technologies, and this is where data science comes into the picture.
The following are some primary motives for the use of data science technology:
1. It helps convert large quantities of raw and unstructured records into
meaningful insights.
2. It can assist in predictions of various kinds, such as surveys, elections, etc.
3. It also helps in automating transportation, such as developing self-driving cars, which
may be the future of transportation.
4. Companies are shifting towards data science and opting for this technology.
Amazon, Netflix, and others, which cope with huge quantities of data, use data
science algorithms for a better consumer experience.

The data science project life cycle outlines the stages and steps involved in executing a data
science project from conception to deployment. While the exact process can vary depending on
the project and organization, the following is a common framework for the data science project
life cycle:

1. Problem Definition and Understanding:

 Define the problem or business question you want to address with data science.
 Understand the domain and context of the problem.
 Determine the objectives and goals of the project.

2. Data Collection:

 Identify and gather relevant data sources.


 Clean and pre-process the data to make it suitable for analysis.
 Ensure data quality and handle missing values.
3. Exploratory Data Analysis (EDA):

 Perform data visualization and summary statistics to gain insights.


 Identify patterns, trends, and anomalies in the data.
 Formulate initial hypotheses.
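A minimal EDA sketch in pandas for this step, computing summary statistics and pairwise
correlations on an invented two-column dataset:

```python
# Quick EDA pass: summary statistics and correlations on hypothetical data.
import pandas as pd

df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],
    "sales":    [105, 210, 290, 410, 500],
})

print(df.describe())   # distributions: mean, spread, quartiles
print(df.corr())       # pairwise correlations between features
```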

4. Feature Engineering:

 Create new features or transform existing ones to improve model performance.


 Select relevant features for modelling.
 Consider techniques like one-hot encoding, scaling, and dimensionality reduction.
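As a small illustration of the one-hot encoding mentioned above, this pandas sketch converts a
hypothetical categorical "city" column into binary indicator columns:

```python
# Feature engineering sketch: one-hot encoding a categorical column.
import pandas as pd

df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Delhi"], "sales": [100, 150, 120]})
encoded = pd.get_dummies(df, columns=["city"])   # one binary column per city value

print(encoded)
```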

5. Model Selection and Training:

 Choose appropriate machine learning or statistical models.


 Split the data into training, validation, and test sets.
 Train and fine-tune models using the training data.
 Evaluate model performance using validation data.
 Iterate on model selection and hyper parameter tuning as needed.
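A minimal scikit-learn sketch of this step, splitting synthetic data, fitting a logistic regression model,
and checking validation accuracy; the dataset and model choice are illustrative only:

```python
# Training sketch: split synthetic data, fit a model, check validation accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Validation accuracy:", model.score(X_val, y_val))
```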

6. Model Evaluation:

 Assess model performance using appropriate metrics (e.g., accuracy, precision, recall,
F1-score, RMSE, etc.).
 Compare multiple models and select the best-performing one.
 Perform cross-validation to estimate model generalization.
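A short sketch of computing common classification metrics and a cross-validated accuracy estimate
with scikit-learn on synthetic data; in practice these metrics are computed on held-out test data:

```python
# Evaluation sketch: classification metrics on a held-out set plus cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import precision_score, recall_score, f1_score

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)                       # evaluate on unseen data

print("Precision:", precision_score(y_test, pred))
print("Recall:   ", recall_score(y_test, pred))
print("F1-score: ", f1_score(y_test, pred))
print("5-fold CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```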

7. Model Deployment:

 Prepare the selected model for deployment in a production environment.


 Integrate the model into the production system or application.
 Monitor the model's performance in real-world usage.
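One common (but not the only) way to prepare a model for deployment is to persist it to disk so a
production service can load it later; the sketch below uses joblib with an invented file name:

```python
# Deployment sketch: persist a trained model so a production service can load it.
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

joblib.dump(model, "churn_model.joblib")         # hypothetical artifact name
restored = joblib.load("churn_model.joblib")     # e.g. inside an API service
print(restored.predict(X[:3]))
```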

8. Communication and Reporting:

 Communicate the results and insights to stakeholders.


 Create visualizations and reports to convey findings effectively.
 Provide recommendations and insights for decision-making.

9. Documentation:
 Document the entire project, including data sources, data pre-processing steps, modelling
techniques, and results.
 Ensure code is well-documented for future reference and collaboration.

10. Maintenance and Monitoring:

 Continuously monitor the deployed model's performance.


 Re-train or re-evaluate the model periodically with new data.
 Make necessary updates and improvements to the model as the business evolves.

11. Feedback Loop:

 Collect feedback from end-users and stakeholders.


 Use feedback to refine the model and improve its performance.
 Iterate on the project as needed based on ongoing feedback.

12. Deployment and Scaling:

 If the project proves successful, consider scaling up the deployment to handle larger data
volumes or additional use cases.

13. Archiving and Documentation:

 Properly archive the project and its assets for future reference and compliance purposes.

The data science project life cycle is an iterative process, and it's important to maintain flexibility
throughout each stage as new insights and challenges may arise. Collaboration, effective
communication, and a focus on business objectives are key elements for success in data science
projects.
The data science lifecycle revolves around the use of machine learning and different analytical
strategies to produce insights and predictions from information in order to achieve a
business objective. The complete method includes a number of steps like data
cleaning, preparation, modelling, model evaluation, etc. It is a lengthy procedure and may
take quite a few months to complete. So, it is very essential to have a generic
structure to follow for each and every problem at hand. The globally recognized structure for
tackling any analytical problem is referred to as the Cross Industry Standard Process for Data
Mining, or the CRISP-DM framework.

The lifecycle of Data Science


1. Business Understanding: The complete cycle revolves around the business goal.
What will you solve if you do not have a specific problem? It is extremely important to
understand the business objective clearly, because that will be the ultimate aim of the
analysis. Only after a proper understanding can we set the precise goal of the analysis
in sync with the business objective. You need to understand whether the customer wants
to minimize losses, or whether they want to predict the price of a commodity, etc.

2. Data Understanding: After business understanding, the subsequent step is data
understanding. This involves collecting all the available data. Here you need to work
closely with the business team, as they are well aware of what data is present, what data
could be used for this business problem, and other such information. This step includes
describing the data, its structure, its relevance, and its data types. Explore the data using
graphical plots. Basically, extract any information that you can about the data by simply
exploring it.

3. Preparation of Data: Next comes the data preparation stage. This consists of steps like
selecting the applicable data, integrating the data by merging the data sets, cleaning it,
treating missing values by either eliminating or imputing them, treating inaccurate data
by removing it, and also checking for outliers using box plots and dealing with them. It
also includes constructing new data, deriving new features from existing ones, formatting
the data into the preferred structure, and removing unwanted columns and features.
Data preparation is the most time-consuming yet arguably the most important step in the
complete life cycle. Your model will only be as good as your data.

4. Exploratory Data Analysis: This step involves getting some idea about the solution
and the factors affecting it before constructing the actual model. The distribution of data
within individual variables is explored graphically using bar graphs, and relations
between different features are captured through graphical representations like scatter
plots and heat maps. Many data visualization techniques are used extensively to explore
each feature individually and in combination with other features.

5. Data Modelling: Data modelling is the heart of data analysis. A model takes
the prepared data as input and gives the desired output. This step consists of
choosing the suitable kind of model, depending on whether the problem is a
classification, regression, or clustering problem. After deciding on the model family, we
need to carefully choose and implement the algorithms to be used from within that
family. We need to tune the hyperparameters of each model to obtain the desired
performance. We also need to make sure there is the right balance between performance
and generalizability: we do not want the model to memorize the data and perform poorly
on new data.
6. Model Evaluation: Here the model is evaluated to check whether it is ready to be
deployed. The model is tested on unseen data and evaluated on a carefully thought-out
set of evaluation metrics. We also need to make sure that the model conforms to reality.
If we do not obtain a satisfactory result in the evaluation, we have to re-iterate the
complete modelling procedure until the desired level of the metrics is achieved. Any
data science solution or machine learning model, just like a human, must evolve, must
be able to improve itself with new data, and must adapt to a new evaluation metric. We
can construct more than one model for a given phenomenon, but many of them may be
imperfect. Model evaluation helps us select and build the ideal model.

7. Model Deployment: After rigorous evaluation, the model is finally deployed in
the desired format and channel. This is the last step in the data science life cycle.
Each step in the data science life cycle defined above must be worked on carefully.
If any step is performed improperly, it will affect the subsequent step and the complete
effort goes to waste. For example, if data is not collected properly, you will lose records
and you will not be able to build an ideal model. If the data is not cleaned properly, the
model will not work. If the model is not evaluated properly, it will fail in the real world.
Right from business understanding to model deployment, every step has to be given
appropriate attention, time, and effort.
Optimization in machine learning involves adjusting algorithms to better align with desired
models. It's a great way to understand and optimize data. For example, your optimization
algorithm can be trained to catch inaccuracies or inconsistencies in the system, removing the
need for you to comb through data by hand.

Real-world Applications of Data Science

1. In Search Engines

The most useful application of data science is in search engines. As we know, when we want to
search for something on the internet, we mostly use search engines like Google, Yahoo, Safari,
Firefox, etc. Data science is used to deliver searches faster.

For example, when we search for something, say "Data Structure and algorithm courses", we
get the link to GeeksforGeeks courses first. This happens because the GeeksforGeeks website is
visited most often for information regarding data structure courses and computer-related
subjects. This analysis is done using data science, and we get the top-most visited web links.

2. In Transport
Data science has also entered real-time fields such as transport, for example driverless cars. With
the help of driverless cars, it becomes easier to reduce the number of accidents.
For example, in driverless cars the training data is fed into the algorithm and, with the help
of data science techniques, the data is analyzed: the speed limits on highways, busy streets,
narrow roads, etc., and how to handle different situations while driving.
3. In Finance
Data science plays a key role in financial industries, which always face issues of fraud and risk
of losses. Thus, financial industries need to automate risk-of-loss analysis in order to carry out
strategic decisions for the company. Financial industries also use data science analytics tools to
predict the future, which allows companies to predict customer lifetime value and their stock
market moves.
For example, in the stock market, data science is a main component. Data science is used to
examine past behavior with past data, with the goal of estimating future outcomes. Data is
analyzed in such a way that it makes it possible to predict future stock prices over a set
timeframe.
4. In E-Commerce

E-commerce websites like Amazon, Flipkart, etc. use data science to create a better user
experience with personalized recommendations.

For example, when we search for something on an e-commerce website, we get suggestions
similar to our choices according to our past data, and we also get recommendations based on
the most-bought, most-rated, and most-searched products. This is all done with the help of data
science.
5. In Health Care

In the healthcare industry, data science acts as a boon. Data science is used for:

 Detecting Tumor.
 Drug discoveries.
 Medical Image Analysis.
 Virtual Medical Bots.
 Genetics and Genomics.
 Predictive Modeling for Diagnosis etc.

6. Image Recognition
Currently, data science is also used in image recognition. For example, when we upload
a photo with a friend on Facebook, Facebook suggests tagging who is in the picture. This is
done with the help of machine learning and data science. When an image is recognized, data
analysis is done on one's Facebook friends, and if a face present in the picture matches
someone's profile, Facebook suggests auto-tagging.
7. Targeting Recommendation
Targeted recommendation is one of the most important applications of data science. Whatever a
user searches for on the internet, he or she will then see related posts everywhere. This can be
explained with an example: suppose I want a mobile phone, so I search for it on Google, and
after that I change my mind and decide to buy it offline. Data science helps the companies that
pay for advertisements for that mobile: everywhere on the internet, on social media, on
websites, and in apps, I will see recommendations for the mobile phone I searched for, which
nudges me to buy it online.
8. Airline Routing Planning
With the help of data science, the airline sector is also growing; for example, it becomes
easier to predict flight delays. It also helps decide whether to fly directly to the
destination or take a halt in between: a flight can take a direct route from Delhi to the
U.S.A., or it can halt in between and then reach the destination.
9. Data Science in Gaming
In most games where a user plays against a computer opponent, data science concepts are used
with machine learning, where with the help of past data the computer improves its
performance. Many games, such as chess and EA Sports titles, use data science concepts.
10. Medicine and Drug Development
The process of creating a medicine is very difficult and time-consuming and has to be done with
full discipline because it is a matter of someone's life. Without data science, it takes a lot of
time, resources, and finance to develop a new medicine or drug, but with the help of data
science it becomes easier, because the probability of success can be determined based on
biological data and factors. Algorithms based on data science can forecast how a drug will
react in the human body without lab experiments.
11. In Delivery Logistics
Various Logistics companies like DHL, FedEx, etc. make use of Data Science. Data Science
helps these companies to find the best route for the Shipment of their Products, the best time
suited for delivery, the best mode of transport to reach the destination, etc.
12. Auto complete

The autocomplete feature is an important application of data science: the user types a few
letters or words and gets suggestions for completing the whole line. In Gmail, when we are
writing a formal mail to someone, the data science concept of autocomplete is used to offer
efficient choices for completing the whole line. The autocomplete feature is also widely used
in search engines, on social media, and in various apps.
UNIT – III: BIG DATA MANAGEMENT AND DATA VISUALIZATION

Introduction to Big Data: Evolution of Big Data concept – Features of Big Data- Big Data
Challenges-Big Data Analytics.
Introduction to Data Visualization: Data Visualization concept- Importance of data
visualization –- Structure of Visualization - Tools for Data visualization- Data Queries-Data
Dashboards- Principles of effective Data Dashboards-Applications of Data Dashboards.

Introduction to Big Data

The era of digital intelligence has led to a substantial shift towards fast data generation. With the rise
of social media and ecommerce, 2.5 quintillion bytes of data are produced every day. For example,
the New York Stock Exchange generates about one terabyte of new trade data per day. This amount
of information is not easy to process, especially insofar as 90% of it comes unstructured and
disorganized. This continuous generation of huge data volumes that are difficult to analyze using
traditional data mining techniques is called big data, and it is one of the most conspicuous
phenomena in the 21st-century world.

Evolution of Big Data concept

The concept of Big Data has evolved significantly over the years, driven by advancements in
technology, changes in data generation and consumption patterns, and the growing need for
businesses and organizations to extract meaningful insights from vast amounts of data. Here's a brief
overview of the evolution of the Big Data concept:

Early Days (Before 2000):

 The term "Big Data" wasn't widely used, and the challenges associated with handling large
volumes of data were not as prominent.
 Traditional databases and data processing techniques were primarily used for relatively
smaller datasets.
Emergence of the Term (2000-2010):

 With the proliferation of the internet, social media, and e-commerce, the volume of data
being generated increased exponentially.
 Doug Laney's famous 3Vs model (Volume, Velocity, Variety) gained prominence as a way
to characterize the challenges posed by Big Data.
 Technologies like Apache Hadoop, an open-source framework for distributed storage and
processing of large data sets, emerged.

Technological Advancements (2010-2015):


 The open-source ecosystem around Big Data continued to grow, with tools like Apache
Spark gaining popularity for faster and more flexible data processing.
 Cloud computing platforms, such as Amazon Web Services (AWS), Google Cloud Platform
(GCP), and Microsoft Azure, started offering scalable and flexible infrastructure for Big
Data processing.
 The concept of the Data Lake gained traction, allowing organizations to store vast amounts
of raw and unstructured data.
Integration with Analytics and Machine Learning (2015-2020):

 Big Data became closely intertwined with advanced analytics and machine learning, as
organizations sought to derive actionable insights from their data.
 The integration of Big Data with real-time analytics became more prevalent, allowing
organizations to make quicker decisions based on up-to-date information.
 The rise of streaming data technologies enabled the processing of data in real time.
Focus on Data Governance and Privacy (2018-present):

 As the importance of data privacy and governance increased, there was a growing emphasis
on ensuring ethical and responsible use of Big Data.
 Regulatory frameworks, such as GDPR (General Data Protection Regulation), influenced
how organizations handle and process personal data.
Beyond the 3Vs (2020s and Beyond):

 The original 3Vs model expanded to include additional characteristics like Veracity,
Variability, and Value.
 The concept of Edge Computing gained importance as organizations sought to process data
closer to the source to reduce latency and bandwidth usage.
 The use of AI and machine learning in handling and analyzing Big Data continued to grow,
allowing for more sophisticated insights.
Hybrid and Multi-Cloud Approaches (Ongoing):

 Organizations increasingly adopted hybrid and multi-cloud strategies to manage their Big
Data workloads, combining on-premises infrastructure with cloud services.
 The focus shifted towards optimizing costs, performance, and flexibility in Big Data
solutions.
 The evolution of the Big Data concept is ongoing, and it continues to be shaped by
technological advancements, changing business requirements, and the evolving landscape of
data-related challenges and opportunities.
Definition of Big Data

Big Data refers to extremely large and complex datasets that are beyond the capability of traditional
data processing tools and methods to capture, store, manage, and analyze within a reasonable
timeframe.

The concept of Big Data is often described using the "3Vs" model:
1. Volume: Refers to the sheer size of the data. Big Data involves handling datasets that are
typically measured in terabytes, petabytes, or even exabytes.
2. Velocity: Refers to the speed at which data is generated, processed, and made available for
analysis. Big Data often involves real-time or near-real-time processing to keep up with the
constant flow of data.
3. Variety: Encompasses the diversity of data types, formats, and sources. Big Data includes
structured, semi-structured, and unstructured data from various sources such as text, images,
videos, social media, sensor data, and more.

 Over time, the concept has evolved to include additional characteristics such as veracity (the
quality and reliability of the data), variability (inconsistency in data flow), value (the
importance and usefulness of insights derived from the data), and others.
 The processing and analysis of Big Data require advanced technologies and tools, including
distributed computing frameworks like Apache Hadoop and Apache Spark, NoSQL
databases, machine learning algorithms, and cloud computing infrastructure. Organizations
leverage Big Data to extract valuable insights, make data-driven decisions, and gain a
competitive advantage in various domains, including business, healthcare, finance, and
research.

 The concept is characterized by three main dimensions, often referred to as the "3Vs":
Volume, Velocity, and Variety. Additionally, other characteristics like Veracity, Variability,
and Value have been included to provide a more comprehensive understanding of the
challenges and opportunities associated with Big Data.
Here's a breakdown of the key aspects of the definition:

1. Volume:
 Definition: Refers to the vast amounts of data generated or collected by organizations. This
data could come from various sources, such as social media, sensors, transactions, and more.
 Significance: The sheer quantity of data is one of the defining features of Big Data.
Traditional databases may struggle to handle such large volumes.

2. Velocity:
 Definition: Describes the speed at which data is generated, processed, and made
available for analysis. Big Data often involves real-time or near-real-time processing to
keep up with the continuous flow of data.
 Significance: The pace at which data is generated and needs to be processed is a crucial
factor. Real-time processing enables quick decision-making based on the most current
information.

3. Variety:
 Definition: Encompasses the diversity of data types, formats, and sources. Big Data
includes structured, semi-structured, and unstructured data from a wide range of origins,
such as text, images, videos, and more.
 Significance: Big Data is not limited to structured databases but includes a variety of
data types. Handling this diversity requires flexible data processing and analytics tools.

4. Veracity:
 Definition: Refers to the quality and reliability of the data. Big Data may include data
with varying levels of accuracy and reliability, and managing this aspect is crucial for
meaningful analysis.
 Significance: Ensuring the trustworthiness of the data is essential to prevent inaccurate
or misleading insights.

5. Variability:
 Definition: Reflects the inconsistency in the data flow, which may have irregular patterns or
periodic variations. Big Data processing systems need to handle fluctuations and variations
in data flow.
 Significance: Big Data is not always consistent, and the ability to manage variations in data
flow is essential for accurate analysis.

6. Value:
 Definition: Represents the importance and usefulness of the insights derived from Big Data.
 Significance: The ultimate goal of Big Data is to extract valuable insights and actionable information that can benefit an organization in terms of decision-making, efficiency, and innovation.

7. Complexity:
Definition: Refers to the intricacy of data relationships and the challenges associated with
analyzing interconnected datasets.

Significance: Big Data solutions often deal with complex data structures, requiring
advanced analytics and tools to uncover meaningful patterns and relationships.

 Big data includes information produced by both humans and devices. Device-driven data is largely clean and organized, but of far greater interest is human-driven data, which exists in many formats and needs more sophisticated tools for proper processing and management.

 The big data collection is focused on the following types of data:


– Network data. This type of data is gathered on all kinds of networks, including social
media, information and technological networks, the Internet and mobile networks, etc.

– Real-time data. They are produced on online streaming media, such as YouTube, Twitch,
Skype, or Netflix.

– Transactional data. They are gathered when a user makes an online purchase (information on the product, time of purchase, payment methods, etc.).

– Geographic data. Location data on humans, vehicles, buildings, natural reserves, and other objects is continuously supplied by satellites.

– Natural language data. These data are gathered mostly from voice searches that can be
made on different devices accessing the Internet.

– Time series data. This type of data is related to the observation of trends and phenomena
taking place at this very moment and over a period of time, for instance, global
temperatures, mortality rates, pollution levels, etc.

– Linked data. They are built on web technologies such as HTTP, RDF, SPARQL, and URIs and are meant to enable semantic connections between databases so that computers can read them and answer semantic queries correctly.

 Challenges of Big Data

1. Storage: With vast amounts of data generated daily, the greatest challenge is storage
(especially when the data is in different formats) within legacy systems. Unstructured data
cannot be stored in traditional databases.

2. Processing: Processing big data refers to the reading, transforming, extraction, and
formatting of useful information from raw information. The input and output of information
in unified formats continue to present difficulties.

3. Security: Security is a big concern for organizations. Non-encrypted information is at risk of theft or damage by cyber-criminals. Therefore, data security professionals must balance access to data against maintaining strict security protocols.

4. Finding and Fixing Data Quality Issues

Many organizations deal with challenges related to poor data quality, but solutions are available. Common approaches include correcting information in the original data source, since repairing the source is necessary to resolve inaccuracies, and using highly accurate methods of matching records to the right person.

5. Scaling Big Data Systems

Database sharding, memory caching, moving to the cloud, and separating read-only and write-active databases are all effective scaling methods. While each of these approaches is useful on its own, combining them takes scalability to the next level.

6. Evaluating and Selecting Big Data Technologies

Companies are spending millions on new big data technologies, and the market for such tools is expanding rapidly. In recent years the IT industry has caught on to the potential of big data and analytics. The trending technologies include the following (a brief Spark sketch follows the list):

 Hadoop Ecosystem
 Apache Spark
 NoSQL Databases
 R Software
 Predictive Analytics
 Prescriptive Analytics
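
As a brief, purely illustrative sketch of one of these technologies, the following Python code uses PySpark (assuming Apache Spark and the pyspark package are installed; the tiny in-memory dataset and its column names are hypothetical) to run a distributed group-by aggregation of the kind that single-machine tools struggle with at scale.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session; in production this would run on a cluster
spark = SparkSession.builder.appName("BigDataSketch").getOrCreate()

# A tiny in-memory dataset stands in for a massive distributed file (hypothetical data)
sales = spark.createDataFrame(
    [("South", 250.0), ("North", 400.0), ("South", 150.0)],
    ["region", "amount"])

# Distributed group-by and aggregation, expressed declaratively
summary = (sales.groupBy("region")
                .agg(F.sum("amount").alias("total_sales"),
                     F.avg("amount").alias("avg_sale")))

summary.show()
spark.stop()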

7. Big Data Environments

In a big data environment, data is constantly being ingested from many sources, making it far more dynamic than a data warehouse. The people in charge of the environment can quickly lose track of where each data collection came from and what it contains.

8. Real-Time Insights
The term "real-time analytics" describes the practice of performing analyses on data as a
system is collecting it. Decisions may be made more efficiently and with more accurate
information thanks to real-time analytics tools, which use logic and mathematics to deliver
insights on this data quickly.

9. Data Validation


Before using data in a business process, its integrity, accuracy, and structure must be
validated. The output of a data validation procedure can be used for further analysis, BI, or
even to train a machine learning model.
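
As an illustration only, the following minimal pandas sketch (the table contents, column names, and thresholds are made up) performs the kind of integrity, accuracy, and structure checks described above before data is passed on for analysis or model training.

import pandas as pd

# Hypothetical order records; in practice these would come from a file or database
df = pd.DataFrame({
    "order_id": [101, 102, 102, 104],
    "customer_id": [1, 2, 2, None],
    "amount": [250.0, -10.0, 400.0, 150.0],
    "order_date": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-07"],
})

expected_columns = {"order_id", "customer_id", "amount", "order_date"}

# Structure check: are all expected columns present?
print("Missing columns:", expected_columns - set(df.columns))

# Integrity checks: null values and duplicate identifiers
print("Null counts:\n", df.isnull().sum())
print("Duplicate order_ids:", int(df["order_id"].duplicated().sum()))

# Accuracy check: values outside a plausible range
print("Non-positive amounts:", int((df["amount"] <= 0).sum()))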

10. Healthcare Challenges

Electronic health records (EHRs), genomic sequencing, medical research, wearables, and medical imaging are just a few examples of the many sources of health-related big data.
Introduction to Data Visualization:

Data visualization is the representation of information and data using charts, graphs, maps, and
other visual tools. These visualizations allow us to easily understand any patterns, trends, or outliers
in a data set.

Data visualization also presents data to the general public or specific audiences without technical
knowledge in an accessible manner. For example, the health agency in a government might provide
a map of vaccinated regions.

The purpose of data visualization is to help drive informed decision-making and to add colourful
meaning to an otherwise bland database.

 Meaning

Data visualization is the process of representing data in a visual way, such as using graphs, charts,
or maps. It's used to communicate complex information in an intuitive way, making it easier for
people to understand and analyze data.

The goal of data visualization is to make it easier to identify trends, patterns, and outliers in large
data sets. It can also be used to deliver visual reporting on the performance, operations, or general
statistics of an application, network, hardware, or IT asset.

The four pillars of data visualization are: Distribution, Relationship, Comparison, and Composition.

Some steps for data visualization include:


 Being clear on the question
 Knowing your data and starting with basic visualizations
 Identifying the key messages of the visualization and generating the most informative view
 Choosing the right chart type
 Using color, size, scale, shapes and labels to direct attention to the key message

 Benefits of data visualization

Data visualization can be used in many contexts in nearly every field, like public policy, finance,
marketing, retail, education, sports, history, and more. Here are the benefits of data visualization:

Storytelling: People are drawn to colors and patterns in clothing, arts and culture, architecture, and
more. Data is no different—colors and patterns allow us to visualize the story within the data.

Accessibility: Information is shared in an accessible, easy-to-understand manner for a variety of audiences.

Visualize relationships: It’s easier to spot the relationships and patterns within a data set when the
information is presented in a graph or chart.

Exploration: More accessible data means more opportunities to explore, collaborate, and inform
actionable decisions.

 Characteristics of Effective Graphical Visual:


 It shows or visualizes data very clearly in an understandable manner.
 It encourages viewers to compare different pieces of data.
 It closely integrates statistical and verbal descriptions of the data set.
 It grabs our interest, focuses our mind, and keeps our eyes on the message, as the human brain tends to focus on visual data more than written data.
 It also helps in identifying areas that need more attention and improvement.
 Using graphical representation, a story can be told more efficiently. Also, it requires less time to understand a picture than it takes to understand textual data.

 Tools for visualizing data


There are plenty of data visualization tools out there to suit your needs. Before committing to one,
consider researching whether you need an open-source site or could simply create a graph using
Excel or Google Charts. The following are common data visualization tools that could suit your
needs.

 Tableau
 Google Charts
 Dundas BI
 Power BI
 JupyteR
 Infogram
 ChartBlocks
 D3.js
 FusionCharts
 Grafana

 Types of data visualization

Visualizing data can be as simple as a bar graph or scatter plot, but it becomes far more powerful when used to analyze larger and more complex data sets.

Some common types of data visualization include: Bar charts, Line charts, Scatter plots, Pie charts,
Heat maps.

Here are some common types of data visualizations:

 Table: A table is data displayed in rows and columns, which can be easily created in a
Word document or Excel spreadsheet.
 Chart or graph: Information is presented in graphical form with data displayed along an x and y axis, usually with bars, points, or lines, to represent data in comparison. An infographic is a special type of chart that combines visuals and words to illustrate the data.
 Gantt chart: A Gantt chart is a bar chart that portrays a timeline and tasks specifically used
in project management.
 Pie chart: A pie chart divides data into percentages featured in “slices” of a pie, all adding
up to 100%.
 Geospatial visualization: Data is depicted in map form with shapes and colours that
illustrate the relationship between specific locations, such as a choropleth or heat map.
 Dashboard: Data and visualizations are displayed, usually for business purposes, to help
analysts understand and present data.
 Line Graph Visualization – is used to give a more in-depth view of the trend, projection, average, and growth of a metric over time. The benefit of this type of visualization is that it enables you to visualize patterns over a longer period and compare data to a previous period or goal. In addition, this is a common visualization that will be familiar and easy to read by most.
 Bar Graph Visualization – is best used to demonstrate comparisons between values. It is
especially valuable when you want to compare different Google Analytics Dimensions to
each other.

 Pie Chart Visualization – is usually used to illustrate numerical proportions through the
size of the slices. It is easy to read and ideal for demonstrating distribution.

 Table Visualization – is mostly used to list Metrics by order of importance, ranking, and so
on. It is a great option for producing a clear view and comparisons of Metric values. It can
show Total SUM, AVG, MIN, or MAX and data for multiple or just a single Metric.
 Funnel Visualization – is a great option for tracking the stages potential customers travel
through in order to become a customer.

 Number Visualization – is best for showing a simple Metric or to draw attention to one key
number. It is compact, clear and precise and is great for highlighting progress in your
Dashboard.

 Pipeline Visualization – is best used to demonstrate the evolution of lead generation through the connection of marketing and sales data. It provides a comprehensive view that not only tracks the journey of leads but also incorporates KPIs for sales representatives, enabling a deeper understanding of their contributions to the overall sales process.
 Progress Bar Visualization – is the best way of presenting the achieved progress towards
reaching a set Goal.

 Gauge Visualization – much like the progress bar, gauge visualization is best used to show
progress towards reaching a Goal but also for presenting maximum value.

 Compare Visualization – is an advanced version of the number visualization, as it is often used to compare two Metrics.
 Categories of Data Visualization
Data visualization is very critical to market research, where both numerical and categorical data can be visualized; this increases the impact of insights and helps reduce the risk of analysis paralysis. Data visualization is categorized into the following categories:

1. Numerical Data:
Numerical data is also known as quantitative data. Numerical data is any data where data generally
represents amount such as height, weight, age of a person, etc. Numerical data visualization is
easiest way to visualize data. It is generally used for helping others to digest large data sets and raw
numbers in a way that makes it easier to interpret into action. Numerical data is categorized into
two categories :

o Continuous Data –
It can take any value within a range and be measured to finer and finer levels (Example: height measurements).
o Discrete Data –
This type of data is not “continuous” and takes only distinct, countable values (Example: the number of cars or children a household has).
The type of visualization techniques that are used to represent numerical data visualization is
Charts and Numerical Values. Examples are Pie Charts, Bar Charts, Averages, Scorecards, etc.
2. Categorical Data:
Categorical data is also known as qualitative data. Categorical data is any data where data
generally represents groups. It simply consists of categorical variables that are used to represent
characteristics such as a person’s ranking, a person’s gender, etc. Categorical data visualization
is all about depicting key themes, establishing connections, and lending context. Categorical
data is classified into three categories:

Binary Data – In this, classification is based on two opposing positions (Example: agrees or disagrees).

Nominal Data – In this, classification is based on attributes (Example: Male or female).

Ordinal Data –In this, classification is based on ordering of information (Example: Timeline or
processes).

The type of visualization techniques that are used to represent categorical data is Graphics,
Diagrams, and Flowcharts. Examples are Word clouds, Sentiment Mapping, Venn Diagram, etc.
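
As a minimal sketch (the figures are made up, and matplotlib is assumed to be available), the Python code below shows one visualization for each category: a bar chart comparing categorical groups and a line chart tracking a numerical metric over time.

import matplotlib.pyplot as plt

# Hypothetical data for illustration
regions = ["North", "South", "East", "West"]      # categorical data
sales = [120, 95, 140, 80]
months = ["Jan", "Feb", "Mar", "Apr", "May"]
revenue = [200, 220, 210, 260, 300]               # numerical data over time

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart: comparison across categories
ax1.bar(regions, sales, color="steelblue")
ax1.set_title("Sales by Region (categorical)")

# Line chart: trend of a numerical metric over time
ax2.plot(months, revenue, marker="o")
ax2.set_title("Monthly Revenue (numerical)")

plt.tight_layout()
plt.show()
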
 Applications of data visualisation

 Healthcare Industries

A dashboard that visualises a patient's history can help a current or new doctor understand the patient's health. In an emergency, it can enable faster care based on the illness. Instead of sifting through hundreds of pages of information, data visualisation assists in finding trends.
Health care is a time-consuming procedure, and the majority of that time is spent evaluating prior reports. Data visualisation provides a superior selling point by boosting response time: it gives metrics that make analysis easier, resulting in a faster reaction time.

 Business intelligence
When compared to local options, cloud connection can provide the cost-effective “heavy lifting” of
processor-intensive analytics, allowing users to see bigger volumes of data from numerous sources
to help speed up decision-making.
Because such systems can be diverse, comprised of multiple components, and may use their own
data storage and interfaces for access to stored data, additional integrated tools, such as those geared
toward business intelligence (BI), help provide a cohesive view of an organization's entire data
system (e.g., web services, databases, historians, etc.).
Multiple datasets can be correlated using analytics/BI tools, which allow for searches using a
common set of filters and/or parameters. The acquired data may then be displayed in a standardised
manner using these technologies, giving logical "shaping" and better comparison grounds for end
users.

 Military
For the military, it is a matter of life and death: taking the appropriate action requires clarity of data from which actionable insights can be pulled.
The adversary is present in the field today and also poses a danger through digital warfare and cyber attacks. It is critical to collect data from a variety of sources, both structured and unstructured. The volume of data is enormous, and data visualisation technologies are essential for rapid delivery of accurate information in the most condensed form feasible. A greater grasp of past data allows for more accurate forecasting.
Dynamic data visualisation aids in a better knowledge of geography and climate, resulting in a more effective approach. The cost of military equipment and tools is extremely significant; with bar and pie charts, analysing current inventories and making purchases as needed becomes simple.

 Finance Industries
Data visualisation tools are becoming a requirement for the financial sector: for exploring and explaining data on linked customers, understanding consumer behaviour, maintaining a clear flow of information, and improving the efficiency of decision-making.
For associated organisations and businesses, data visualisation aids in identifying patterns, which in turn supports better investment strategies. Data visualisation also highlights the most recent trends, pointing to better business prospects.

 Data science
Data scientists generally create visualisations for their personal use or to communicate information
to a small group of people. Visualization libraries for the specified programming languages and
tools are used to create the visual representations.
Open source programming languages, such as Python, and proprietary tools built for
complicated data analysis are commonly used by data scientists and academics. These data
scientists and researchers use data visualisation to better comprehend data sets and spot patterns and
trends that might otherwise go undiscovered.
 Marketing
In marketing analytics, data visualisation is a boon. Visuals and reports can be used to analyse various patterns and trends, such as sales analysis, market research analysis, customer analysis, defect analysis, cost analysis, and forecasting. These studies serve as a foundation for marketing and sales.

Visual aids can help your audience grasp your main message by engaging them visually. The major advantage of visualising data is that it can communicate a point faster than a boring spreadsheet.

In B2B firms, data-driven yearly reports and presentations often don't fulfil the needs of the people viewing the information, because they fail to engage the audience in a meaningful or memorable manner. Your audience will be more interested in your facts if you present them as visual statistics and will be more inclined to act on your findings.

 Food delivery apps

When you place an order for food on your phone, it is given to the nearest delivery person. There is
a lot of math involved here, such as the distance between the delivery executive's present position
and the restaurant, as well as the time it takes to get to the customer's location.
Customer orders, delivery location, GPS service, tweets, social media messages, verbal comments,
pictures, videos, reviews, comparative analyses, blogs, and updates have all become common ways
of data transmission.
With the help of this data, users may obtain information on average wait times, delivery experiences, other records, customer service, meal taste, menu options, loyalty and reward point programmes, and product stock and inventory.

 Real estate business


Brokers and agents seldom have the time to undertake in-depth research and analysis on their own.
Showing a buyer or seller comparable home prices in their neighbourhood on a map, illustrating
average time on the market, creating a sense of urgency among prospective buyers and managing
sellers' expectations, and attracting viewers to your social media sites are all examples of common
data visualisation applications.
If a chart is difficult to understand, it is likely to be misinterpreted or disregarded. It is also critical to offer data that is as current as feasible. The market may not alter overnight, but if the data is too old, seasonal swings and other trends may be overlooked.
If you display data in a compelling and straightforward fashion, clients will be drawn to the graphics, and to you as a broker or agent, because they will perceive that you know the market.

 Education
Users may visually engage with data, answer questions quickly, make more accurate, data-informed
decisions, and share their results with others using intuitive, interactive dashboards.
Dashboards make it possible to monitor students' progress throughout the semester, allowing advisers to act quickly with outreach to failing students. When end users are given access to interactive, self-service analytic visualisations as well as ad hoc visual data discovery and exploration, quick insights become accessible to everyone – even those with little prior experience with analytics.

 E-commerce
In e-commerce, any chance to improve the customer experience should be taken. The key to running a successful internet business is getting rapid insights, and this is feasible with data visualisation because cross-referencing data reveals features that would otherwise stay hidden.
Your marketing team may use data visualisation to produce excellent content for your audience that is rich in unique information. Data may be used to produce attractive narratives through infographics, which can easily and quickly communicate findings.
Patterns can be seen throughout the data, and making them visible lets you detect them immediately. These patterns indicate a variety of consumer trends, providing knowledge that helps you attract new clients and close sales.

Conclusion
When data is visualised, it is processed more quickly. Data visualisation organises all of the
information in a way that the traditional technique would miss.
We can observe ten data visualisation applications in many sectors, and as these industries continue
to expand, the usage of data visualisations will proliferate and evolve into key assets for these
organisations.
 Database Query Definition

 A data query is a request for specific information or insights from a dataset or database. It involves
retrieving, filtering, aggregating, or manipulating data based on certain criteria or conditions. Data
queries can be simple or complex, depending on the complexity of the desired information and the
structure of the dataset.

 In the context of databases, data queries are typically expressed using query languages such as SQL
(Structured Query Language) for relational databases or specialized query languages for NoSQL
databases. These queries can range from basic operations like selecting specific columns from a table
to more complex operations involving joins, filtering, grouping, and aggregation.

 Data queries are fundamental to data analysis, reporting, and decision-making processes in various
domains such as business, finance, healthcare, and scientific research. They enable users to extract
actionable insights from large volumes of data, facilitating informed decision-making and problem-
solving.

 Types of Data Queries

There are several types of data queries, each serving different purposes and allowing users to extract
specific information or insights from a dataset. Here are some common types of data queries:
 Select Query: A select query retrieves data from one or more tables in a database. It allows
users to specify the columns they want to retrieve and may include filtering conditions to
restrict the returned rows.

 Filtering Query: Filtering queries are used to retrieve data that meets specific criteria or
conditions. These queries typically include WHERE clauses to specify the conditions that
the data must satisfy.

 Aggregate Query: An aggregate query calculates summary statistics or aggregates over a dataset. Common aggregate functions include SUM, AVG, COUNT, MIN, and MAX. Aggregate queries are often used for tasks like calculating totals, averages, or counts across groups of data.

 Join Query: A join query combines data from multiple tables based on a related column or
key. Joins are used to retrieve information that spans multiple tables, allowing users to
correlate data from different sources.

 Subquery: A subquery is a query nested within another query. Subqueries can be used to
retrieve data dynamically based on the results of another query. They are often used in
filtering conditions or as part of a larger query's logic.
 Sorting Query: Sorting queries arrange the retrieved data in a specified order, such as
ascending or descending order based on one or more columns. Sorting is useful for
organizing data for presentation or analysis.
 Grouping Query: Grouping queries group the data based on one or more columns and then
perform aggregate calculations within each group. They are used to analyze data at different
levels of granularity and to calculate summary statistics for each group.

 Insert Query: An insert query adds new data to a database table. It specifies the values to be
inserted into each column of the table.

 Update Query: An update query modifies existing data in a database table. It specifies the
changes to be made to one or more rows based on specified conditions.

 Delete Query: A delete query removes data from a database table based on specified
conditions. It deletes one or more rows that meet the specified criteria.

These are some of the common types of data queries used in database management systems and
data analysis tasks. The choice of query type depends on the specific requirements of the analysis or
operation being performed.
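
As a minimal, self-contained illustration (using Python's built-in sqlite3 module; the tables, columns, and rows below are hypothetical), the sketch shows a select query with filtering, an aggregate query with grouping and sorting, and a join across two tables.

import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database for the sketch
cur = conn.cursor()

# Hypothetical tables and rows for illustration
cur.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
cur.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, region TEXT, amount REAL)")
cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Asha"), (2, "Ravi")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                [(10, 1, "South", 250.0), (11, 2, "North", 400.0), (12, 1, "South", 150.0)])

# Select query with a filtering condition (WHERE clause)
cur.execute("SELECT order_id, amount FROM orders WHERE amount > 200")
print(cur.fetchall())

# Aggregate query with grouping and sorting
cur.execute("SELECT region, SUM(amount) AS total FROM orders GROUP BY region ORDER BY total DESC")
print(cur.fetchall())

# Join query combining data from both tables on a related key
cur.execute("""SELECT c.name, o.amount
               FROM orders o JOIN customers c ON o.customer_id = c.customer_id""")
print(cur.fetchall())

conn.close()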

 Data Dashboards

A data dashboard is an interactive analysis tool used by businesses to track and monitor the
performance of their strategies with quality KPIs. Armed with real-time data, these tools enable
companies to extract actionable insights and ensure continuous growth.

They offer users a comprehensive overview of their company’s various internal departments, goals,
initiatives, processes, or projects. These are measured through key performance indicators (KPIs),
which provide insights that will enable you to foster growth and improvement.

Online dashboards provide immediate navigable access to actionable analysis that has the power to
boost your bottom line through continual commercial evolution.

To properly define dashboards, you need to consider the fact that, without the existence of
dashboards and dashboard reporting practices, businesses would need to sift through colossal stacks
of unstructured information, which is both inefficient and time-consuming.

Alternatively, a business would have to "shoot in the dark" concerning its most critical processes,
projects, and internal insights, which is far from ideal in today’s world.

 Types of Data Dashboards:


1. Operational dashboard: Operational dashboards are designed to monitor trends and pattern
changes while helping you work on specific performance benchmarks. These dashboards are
responsive, dynamic, and effective for in-the-moment decision-making.
2. Analytical dashboard: Analytical dashboards are typically used by senior executives and
decision-makers. These types of analytical tools enable users to drill down deep into specific
pockets of information and gather a level of intelligence that aids razor-sharp trend discovery from
a top-level perspective.
3. Strategic dashboard: Strategic dashboards provide information and visualizations that enable
users to discover which initiatives work best or worst. A big data dashboard packed with
invaluable metrics, these analytical tools help assess KPIs from every angle and, ultimately,
formulate business-boosting strategies across departments.
4. Informational dashboard: Informational dashboards are usually focused on a very specific
process or branch of information and are effective for gaining useful at-a-glance data or short-term
strategic decision-making.

 Principles of effective Data Dashboards-

Creating effective data dashboards involves adhering to several principles to ensure they are
informative, actionable, and user-friendly. Here are some key principles to consider when
designing data dashboards:

1. Clarity of Purpose: Clearly define the purpose of the dashboard and the specific audience
it serves. Understanding the users' needs and objectives will help guide the design and
content of the dashboard.

2. Focus on Key Metrics: Prioritize the display of key metrics and insights that are most
relevant to the users' goals and decision-making processes. Avoid cluttering the dashboard
with unnecessary information that may distract from the main objectives.

3. Simplicity and Conciseness: Keep the dashboard design simple and concise to facilitate
easy comprehension. Use clear and intuitive visualizations that convey information
efficiently without overwhelming the user.

4. Consistent Design and Layout: Maintain a consistent design and layout throughout the
dashboard to provide a cohesive user experience. Use standardized formatting, color
schemes, and typography to enhance readability and usability.

5. Use of Visual Hierarchy: Employ visual hierarchy techniques to emphasize important information and guide users' attention to key insights. This may include varying the size, color, and placement of elements to create emphasis and hierarchy.

6. Interactive Elements: Incorporate interactive elements such as filters, drill-downs, and tooltips to enable users to explore the data and gain deeper insights. Interactivity enhances user engagement and allows for more dynamic data exploration.
7. Responsive Design: Ensure that the dashboard is responsive and adaptable to different
screen sizes and devices. A responsive design ensures that users can access and interact
with the dashboard seamlessly across desktops, tablets, and smartphones.

8. Contextualization and Storytelling: Provide context and narrative around the data to help
users interpret the insights and understand their implications. Use annotations, captions,
and narrative elements to tell a compelling story with the data.

9. Data Quality and Accuracy: Ensure the accuracy and reliability of the data presented on
the dashboard by performing thorough data validation and quality checks. Display data
sources, definitions, and any relevant caveats to promote transparency and trust.

10. Iterative Improvement: Continuously solicit feedback from users and stakeholders to
identify areas for improvement and refinement. Iterate on the dashboard design based on
user feedback and evolving requirements to ensure its ongoing effectiveness.

By following these principles, you can create data dashboards that effectively communicate
insights, support decision-making, and drive action within organizations.
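
As a purely illustrative sketch of these principles (the KPI names and figures are invented, and matplotlib is assumed), a simple static dashboard can be assembled from a small grid of panels: one highlights a single key number, one shows a trend, and one compares categories, keeping the layout simple, focused, and consistent.

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [180, 210, 240, 260]            # hypothetical figures
channels = ["Web", "Store", "Partner"]
sales = [300, 150, 90]

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Panel 1: a single key metric displayed as a large number
axes[0].text(0.5, 0.5, "New customers\n1,240", ha="center", va="center", fontsize=16)
axes[0].axis("off")

# Panel 2: trend of a key metric over time
axes[1].plot(months, revenue, marker="o")
axes[1].set_title("Revenue trend")

# Panel 3: comparison across categories
axes[2].bar(channels, sales)
axes[2].set_title("Sales by channel")

fig.suptitle("Management KPI dashboard (illustrative)")
plt.tight_layout()
plt.show()

A real interactive dashboard would add filters and drill-downs on top of this layout, but the same principles of focus and visual hierarchy apply.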

 Applications of Data Dashboards

1) Management KPI Dashboard

Our first data dashboard template is a management dashboard. It is a good example of a “higher
level” dashboard for a C-level executive. You’ll notice that this example is focused on
important management KPIs like:

 Number of new customers compared to targets
 The average revenue per customer
 Customer acquisition cost
 Gross revenue, target revenue, and last year’s revenue
2) Financial KPI Dashboard

Maintaining your financial health and efficiency is pivotal to your business's continual growth and
evolution. Without keeping your fiscal processes water-tight, your organization will essentially
become a leaky tap, draining your budgets dry right under your nose.

3) Sales Cycle Length Dashboard

The sales dashboard below is a sales manager’s dream. This dashboard example shows how long customers take, on average, to move through your funnel. It expands on this by showing how different sales managers perform compared to others.
4) IT Project Management Dashboard

Our IT project management dashboard offers a wealth of at-a-glance insights geared towards
reducing project turnaround times while enhancing efficiency in key areas, including total tickets
vs. open tickets, projects delivered on budget, and average handling times.

All KPIs complement each other, shedding light on real-time workloads, overdue tasks, budgetary trends, upcoming deadlines, and more. This perfectly balanced mix of metrics will help everyone within your IT department balance their personal workloads while gaining the insight required to put strategic enhancements into action and deliver projects with consistent success.
5) Human Resources Talent Dashboard

This dashboard description includes all-important information essential to developing a modern HR report for professionals, managers, and VPs who must compete to attract the best possible candidates and keep them in the long run.

You can also take a look at a quick overview of the hiring stats that include the time to fill,
training costs, new hires, and cost per hire. The total number of employees, monthly salary, and
vacancies will provide you with an at-a-glance overview of the talent management stats within
the first quarter of the year.
UNIT – IV: DESCRIPTIVE STATISTICAL MEASURES

Overview of using Data: Population and Samples, Measures of location, Measures of Dispersion, Measures of Variability, Measures of Association.
Probability Distribution: Probability Distribution and Data Modeling, Discrete Probability
Distribution, Continuous Probability Distribution, Random Sampling from Probability
Distribution, Data Modeling and Distribution fitting.

MEASURES OF LOCATION :

A measure of central tendency (also referred to as a measure of centre or central location) is a summary measure that attempts to describe a whole set of data with a single value that represents the middle or centre of its distribution. There are three main measures of central tendency: the mode, the median, and the mean.

AVERAGES

Averages, also known as measures of central tendency, are statistical tools used to represent a set
of data with a single value that summarizes the "center" or "typical" value of the dataset.
Averages are fundamental in various fields, including statistics, mathematics, economics, and
business, as they help in understanding and interpreting data, making comparisons, and drawing
meaningful conclusions. They provide a way to simplify complex data into a single,
representative value, making it easier to analyze and work with large datasets.

DEFINITION OF AVERAGES:

Averages are mathematical calculations that yield a single value that is considered representative
of a set of data points. They aim to provide insights into where the bulk of the data is
concentrated, helping to identify central tendencies or typical values within the dataset. There are
several types of averages, each with its method of calculation and specific use cases.

Objectives of Averaging:

The objectives of averaging are as follows:

1. Summarization: Averages simplify complex data by reducing it to a single value, making it easier to understand and interpret.
2. Comparison: Averages facilitate comparisons between different datasets, allowing for
meaningful insights and informed decision-making.
3. Central Tendency: Averages provide a measure of central tendency, helping to identify where
most data points cluster or concentrate.
4. Prediction: In some cases, averages can be used for forecasting or predicting future values based
on historical data patterns.
5. Data Analysis: Averages serve as a starting point for further data analysis, such as variance
calculations and hypothesis testing.

Types of Averages:

There are several types of averages, each with its unique characteristics and applications:

1. Mean (Arithmetic Mean):

 Definition: The mean is the most common measure of central tendency. It's calculated by adding
up all the values in a dataset and then dividing the sum by the total number of data points.
 Formula: Mean (μ) = (Sum of all values) / (Total number of values).
 Characteristics:
 Sensitive to outliers: Extreme values can greatly affect the mean.
 Appropriate for interval and ratio data (continuous data).

Application: The mean is used to calculate average sales, costs, revenues, and other financial
metrics. It helps in budgeting, forecasting, and performance analysis

2. Median:

 Definition: The median is the middle value in a dataset when it's arranged in ascending or
descending order. If there's an even number of values, the median is the average of the two
middle values.
 Characteristics:
 Not affected by outliers: The median is resistant to extreme values.
 Useful for skewed or non-normally distributed data.
 Applicable to ordinal, interval, and ratio data.
Application: The median is often used in salary studies, income distribution analysis, and
market research to understand the typical income or price range.

3. Mode:

 Definition: The mode is the value that appears most frequently in a dataset.
 Characteristics:
 Useful for nominal data (categorical data).
 A dataset can have one mode (unimodal), multiple modes (multimodal), or no mode.
Application: In marketing, it helps identify the most popular product or service. In inventory
management, it can indicate the most commonly ordered item.

4. Weighted Mean:
 Definition: The weighted mean is calculated by multiplying each value by its corresponding
weight and then summing the products. It's often used when some data points are more
significant than others.
 Formula: Weighted Mean (μ) = (Σ(xi * wi)) / Σwi, where xi is the data point, and wi is its
weight.
 Application: In financial analysis, it's used to calculate the weighted average cost of capital
(WACC) by considering the weights of debt and equity

5. Geometric Mean:

 Definition: The geometric mean is used for datasets with values that represent growth rates or
ratios. It's calculated as the nth root of the product of n values.
 Formula: Geometric Mean (μ) = (x1 * x2 * ... * xn)^(1/n).
 Characteristics:
 Appropriate for data with multiplicative relationships.
 Used in finance, biology, and other fields where ratios are important.

6. Harmonic Mean:

 Definition: The harmonic mean is calculated as the reciprocal of the arithmetic mean of the
reciprocals of the values. It's used for averaging rates or proportions.
 Formula: Harmonic Mean (μ) = n / (1/x1 + 1/x2 + ... + 1/xn), where xi is the data point.
 Characteristics:
 Appropriate for averaging rates or proportions, e.g., speed, efficiency, or time.

Each measure of location has its strengths and is suitable for different types of data and research
questions. Choosing the right measure of central tendency depends on the nature of the dataset
and the goals of the analysis. It's also common to use multiple measures in combination to gain a
more comprehensive understanding of the data's central tendency.
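
A minimal sketch using Python's built-in statistics module (the data values and weights are made up) shows how these measures are computed in practice; the weighted, geometric, and harmonic means follow the formulas given above.

import statistics as st

data = [12, 15, 15, 18, 20, 22, 95]        # hypothetical values; 95 is an outlier
weights = [1, 1, 1, 2, 2, 3, 1]            # hypothetical weights

print("Mean:", st.mean(data))              # sensitive to the outlier 95
print("Median:", st.median(data))          # resistant to the outlier
print("Mode:", st.mode(data))              # most frequent value (15)

# Weighted mean: sum(xi * wi) / sum(wi)
weighted_mean = sum(x * w for x, w in zip(data, weights)) / sum(weights)
print("Weighted mean:", weighted_mean)

print("Geometric mean:", st.geometric_mean(data))   # nth root of the product of the values
print("Harmonic mean:", st.harmonic_mean(data))     # n divided by the sum of reciprocals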

MEASURES OF DISPERSION

Introduction to Variation:

Variation is a fundamental concept in statistics and data analysis. It refers to the degree of
difference or fluctuation in data points within a dataset. Understanding variation is crucial in
various fields, including business, economics, science, and quality control, as it helps us make
sense of data, identify patterns, and make informed decisions. Studying variation allows us to
gain insights into the underlying processes and factors that affect data, ultimately leading to
improved decision-making and problem-solving.

Definition of Variation:
Variation is the extent to which data points or observations differ within a dataset. It can
manifest as differences in values, patterns, or trends. Variation can be classified into different
types, such as:

1. Common Cause Variation: This variation is inherent in any process and arises from random or
common factors. It represents the natural variability in data and is typically expected.
2. Special Cause Variation: Special cause variation results from specific, identifiable factors not
part of the normal process. It indicates unusual or unexpected deviations from the expected
pattern.

Significance of Studying Variation:

Understanding and studying variation is significant for several reasons:

1. Quality Improvement: In quality control and manufacturing, identifying and reducing variation
is critical for producing consistent and high-quality products or services.
2. Data Analysis: Variation analysis helps data analysts and researchers identify trends, anomalies,
and potential outliers within datasets, leading to more accurate conclusions.
3. Process Improvement: In business and operations management, studying variation is essential
for process improvement, efficiency optimization, and cost reduction.
4. Risk Assessment: Variation analysis is used in risk assessment to identify potential risks and
uncertainties in financial forecasts, investment portfolios, and project timelines.
5. Decision-Making: Managers and decision-makers rely on variation analysis to make informed
decisions, set realistic goals, and allocate resources effectively.

Methods of Studying Variation:

1. Range:

 Description: The range is the simplest measure of dispersion and represents the difference
between the highest and lowest values in a dataset.
 Significance: It provides a quick assessment of the spread of data. However, it's sensitive to
outliers and may not capture the full picture of dispersion in the middle of the dataset.
 Formula: Range = Maximum Value - Minimum Value

2. Variance:

 Description: Variance measures the average of the squared differences between each data point
and the mean. It quantifies the overall variability in the dataset.
 Significance: Variance takes into account all data points, giving more weight to larger
deviations. It's widely used but can be influenced by extreme values.
 Formula: Variance (σ²) = Σ(xi - μ)² / N, where xi is each data point, μ is the mean, and N is the
number of data points.
3. Standard Deviation:

 Description: The standard deviation is the square root of the variance. It provides a measure of
dispersion that is in the same unit as the data.
 Significance: Standard deviation is a common and intuitive measure of spread. It quantifies how
much individual data points deviate from the mean.
 Formula: Standard Deviation (σ) = √(Variance)

4. Coefficient of Variation (CV):

 Description: CV expresses the standard deviation as a percentage of the mean. It allows for
comparing the relative variation between datasets with different means.
 Significance: It's useful for comparing the variability of datasets with different scales or units.
 Formula: CV = (Standard Deviation / Mean) * 100

5. Interquartile Range (IQR):

 Description: IQR measures the range between the 25th and 75th percentiles (Q1 and Q3). It's
resistant to outliers.
 Significance: IQR provides a robust measure of the spread in the middle 50% of the data,
making it useful for skewed distributions.
 Formula: IQR = Q3 - Q1, where Q3 is the 75th percentile and Q1 is the 25th percentile.

6. Mean Absolute Deviation (MAD):

 Description: MAD calculates the average of the absolute differences between each data point
and the mean. It's less sensitive to outliers than the standard deviation.
 Significance: MAD provides a robust measure of dispersion that can be useful when dealing
with datasets containing extreme values.
 Formula: MAD = Σ|xi - μ| / N, where |xi - μ| represents the absolute differences.

7. Coefficient of Quartile Variation (CQV):

 Description: CQV compares the IQR to the median. It measures the relative spread of data
within the interquartile range.
 Significance: CQV helps assess the relative variability of data within the middle 50% of the
dataset.
 Formula: CQV = (IQR / Median) * 100
 These methods of studying dispersion are essential for understanding the spread, variability, and
distribution of data. The choice of method depends on the characteristics of the dataset and the
specific goals of the analysis, including the need for resistance to outliers and the desire to
express variability relative to other statistics like the mean or median.
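
The dispersion measures described above can be computed with a short Python sketch (the data values are made up); the range, coefficient of variation, and MAD use simple arithmetic, while the variance, standard deviation, and quartiles come from the statistics module.

import statistics as st

data = [12, 15, 15, 18, 20, 22, 95]          # hypothetical values

mean = st.mean(data)
print("Range:", max(data) - min(data))

# Population variance and standard deviation: sigma^2 = sum((xi - mean)^2) / N
print("Variance:", st.pvariance(data))
print("Standard deviation:", st.pstdev(data))

# Coefficient of variation: (standard deviation / mean) * 100
print("CV (%):", st.pstdev(data) / mean * 100)

# Interquartile range: Q3 - Q1
q1, _, q3 = st.quantiles(data, n=4)          # quartile cut points
print("IQR:", q3 - q1)

# Mean absolute deviation: average of |xi - mean|
mad = sum(abs(x - mean) for x in data) / len(data)
print("MAD:", mad)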

Measures of association
Methods of association, also known as measures of association or correlation analysis, are
statistical techniques used to quantify and understand the relationships between variables in a
dataset. These methods are fundamental in various fields, including statistics, social sciences,
economics, epidemiology, and data analysis. They help researchers and analysts uncover
patterns, dependencies, and interactions between variables, which in turn enables better decision-
making, prediction, and hypothesis testing.

Significance of Methods of Association:

The significance of methods of association lies in their ability to do the following (a brief correlation sketch follows this list):

 Identify Relationships: These methods help identify and quantify relationships between
variables, allowing researchers to understand how changes in one variable are associated
with changes in another.
 Predict Outcomes: By quantifying associations, researchers can make predictions about
one variable based on the values of others, which is valuable for forecasting and decision-
making.
 Hypothesis Testing: Correlation analysis is used to test hypotheses about the existence
and strength of relationships, helping researchers draw meaningful conclusions from data.
 Data Exploration: These methods are essential for exploratory data analysis, enabling
researchers to gain insights into complex datasets.
 Variable Selection: In regression modeling and machine learning, measures of
association assist in selecting relevant predictors and building effective models.
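
As a small illustration (with made-up paired observations, and NumPy assumed to be available), the Pearson correlation coefficient, one common measure of association, can be computed as follows.

import numpy as np

# Hypothetical paired observations: advertising spend and resulting sales
ad_spend = np.array([10, 15, 20, 25, 30, 35])
sales = np.array([110, 135, 160, 155, 190, 210])

# Pearson correlation coefficient: linear association on a -1 to +1 scale
r = np.corrcoef(ad_spend, sales)[0, 1]
print("Correlation between ad spend and sales:", round(r, 3))

A value close to +1 indicates a strong positive relationship, close to -1 a strong negative relationship, and close to 0 little linear association.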

Probability Distribution
A probability distribution is a fundamental concept in probability theory and statistics that
describes how the likelihood of different outcomes or events is spread or distributed over a
sample space. It provides a mathematical framework for quantifying uncertainty and making
predictions in a wide range of fields, including mathematics, science, economics, engineering,
and social sciences. Here are some elaborative notes on probability distributions:

1. Definition of Probability Distribution:


 A probability distribution is a mathematical function or model that assigns probabilities
to each possible outcome of a random experiment or random variable.
 It summarizes the behavior of a random variable, describing the likelihood of various
values it can take

Elements

1. Random Variables:
 A random variable is a variable that takes on different values based on the outcomes of a
random experiment.
 Random variables can be classified into two types: discrete and continuous.
 Discrete Random Variables: Take on a countable number of distinct values (e.g.,
the number of heads in a series of coin tosses).
 Continuous Random Variables: Can take on any value within a range (e.g., the
height of individuals in a population).
2. Probability Mass Function (PMF):
 For discrete random variables, the probability distribution is described by a Probability
Mass Function (PMF).
 The PMF provides the probabilities associated with each possible value of the random
variable.
 Properties of a PMF:
 Sum of probabilities equals 1.
 Non-negative probabilities for each value.
 P(X = x) represents the probability of the random variable X taking on the value
x.
3. Probability Density Function (PDF):
 For continuous random variables, the probability distribution is described by a
Probability Density Function (PDF).
 The PDF provides the relative likelihood of a random variable taking on a specific range
of values.
 Properties of a PDF:
 Area under the curve equals 1.
 The PDF doesn't directly give probabilities for specific values; instead, it gives
probabilities for intervals.
4. Common Probability Distributions:
 There are several well-known probability distributions, including:
 Bernoulli Distribution: Models a binary outcome (e.g., success/failure).
 Binomial Distribution: Models the number of successes in a fixed number of
Bernoulli trials.
 Poisson Distribution: Models the number of events occurring in a fixed interval
of time or space.
 Normal Distribution (Gaussian): Characterized by a bell-shaped curve and
often used for modeling continuous data.
 Exponential Distribution: Models the time between events in a Poisson process.
 Uniform Distribution: All values in a range are equally likely.
5. Moments of a Distribution:
 Moments of a probability distribution describe various characteristics of the distribution.
 The first moment is the mean (expected value), the second moment is the variance, and
the square root of the variance is the standard deviation.
6. Use in Statistics:
 Probability distributions are essential in statistical analysis, hypothesis testing, and
parameter estimation.
 They help quantify uncertainty, model real-world phenomena, and make predictions.
7. Simulation and Modeling:
 Probability distributions are used in simulations to model random events and make
predictions in various fields, such as finance, engineering, and computer science.
8. Central Limit Theorem:
 The Central Limit Theorem states that the sum or average of a large number of
independent and identically distributed random variables approaches a normal
distribution, regardless of the original distribution of the variables.
9. Applications:
 Probability distributions find applications in diverse fields, including risk assessment,
quality control, genetics, economics, and machine learning.

Understanding and working with probability distributions is crucial for making informed
decisions in situations involving uncertainty and randomness. Different situations may require
the use of specific probability distributions that best capture the underlying phenomena.
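
As a brief, illustrative sketch (using NumPy, with arbitrary parameters), random samples can be drawn from common distributions, and averaging many samples demonstrates the Central Limit Theorem: the distribution of sample means approaches a normal shape even when the underlying data are not normal.

import numpy as np

rng = np.random.default_rng(seed=42)      # seeded generator for reproducibility

# Random sampling from common distributions (parameters chosen arbitrarily)
print(rng.binomial(n=10, p=0.3, size=5))        # discrete
print(rng.poisson(lam=4, size=5))               # discrete
print(rng.normal(loc=50, scale=10, size=5))     # continuous

# Central Limit Theorem: means of many exponential samples look roughly normal
sample_means = rng.exponential(scale=2.0, size=(10000, 30)).mean(axis=1)
print("Mean of the sample means:", sample_means.mean())   # close to 2.0
print("Std of the sample means:", sample_means.std())     # close to 2 / sqrt(30)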

Applications of Probability Distributions

Probability distributions have numerous applications in the business world, where they are used
to model and analyze uncertainty, make informed decisions, and manage risks. Here are some
key applications of probability distributions in business:

1. Financial Risk Management:


 Probability distributions, particularly the normal distribution, are widely used in finance
to model asset returns, portfolio risk, and investment outcomes.
 Value at Risk (VaR) calculations rely on probability distributions to estimate potential
financial losses due to market fluctuations.
2. Insurance Pricing and Claims:
 Insurance companies use probability distributions to model the frequency and severity of
insurance claims.
 Actuaries employ distributions like the Poisson and gamma distributions to estimate the
probability of specific events, such as accidents or health-related claims.
3. Stock Price Forecasting:
 Probability distributions, such as the geometric Brownian motion model, are used in
stock price forecasting and options pricing.
 Traders and investors rely on these models to make investment decisions and assess risk.
4. Marketing and Sales:
 Marketers use probability distributions to model customer behavior, predict demand, and
optimize pricing strategies.
 Sales forecasting often involves probability distributions to estimate future sales volumes.
5. Supply Chain Management:
 Probability distributions help in forecasting demand, determining inventory levels, and
optimizing supply chain operations.
 Companies can better plan production and distribution with accurate demand predictions.
6. Credit Risk Assessment:
 Lenders and financial institutions use probability distributions to assess credit risk and
determine the probability of loan defaults.
 Credit scoring models incorporate these distributions to make lending decisions.
7. Quality Control:
 In manufacturing and service industries, probability distributions are used to monitor and
control product quality.
 Control charts and process capability analysis rely on these distributions to detect
deviations from desired quality standards.
8. Project Management:
 Project managers use probability distributions to estimate project durations, costs, and
resource requirements.
 Monte Carlo simulations can provide a range of possible project outcomes, aiding in
project planning and risk management.
9. Human Resources:
 Probability distributions are used in HR for workforce planning, talent acquisition, and
performance evaluation.
 They help in predicting employee turnover rates and optimizing staffing levels.
10. Market Research and Consumer Behavior:
 Probability distributions are used to analyze survey data, customer preferences, and
consumer behavior.
 Businesses gain insights into market trends and consumer choices through statistical
analysis.

Discrete Probability Distribution

A discrete probability distribution is a statistical distribution that describes the probabilities of all
possible outcomes or values of a discrete random variable. Discrete random variables are those
that can only take on a finite or countable number of distinct values. In a discrete probability
distribution, each possible value of the random variable is associated with a probability that sums
to 1. Here are some key characteristics and examples of discrete probability distributions:

Characteristics of Discrete Probability Distributions:

1. Countable Outcomes: Discrete random variables can take on values that are countable, such as
integers or specific categories.
2. Probability Mass Function (PMF): A discrete probability distribution is typically defined by its
Probability Mass Function (PMF). The PMF specifies the probability associated with each
possible value of the random variable.
3. Probability Sum: The sum of the probabilities for all possible outcomes must equal 1, as the
random variable must take on one of these values.
4. Disjoint Outcomes: The events corresponding to different values of the random variable are
mutually exclusive, meaning that only one of them can occur at a time.
5. Example: A common example of a discrete probability distribution is the probability distribution
for the outcome of rolling a fair six-sided die. The random variable can take on values 1, 2, 3, 4,
5, or 6, each with a probability of 1/6.

Examples of Discrete Probability Distributions:


1. Bernoulli Distribution: This is a simple discrete probability distribution that models a binary
outcome, such as success/failure or yes/no. It has two possible values, often denoted as 0 and 1,
with probabilities p and 1-p, respectively.
2. Binomial Distribution: The binomial distribution models the number of successes in a fixed
number of independent Bernoulli trials. It is characterized by two parameters: n (the number of
trials) and p (the probability of success in each trial).
3. Poisson Distribution: The Poisson distribution models the number of events occurring in a fixed
interval of time or space. It is often used for counting rare events and is characterized by the
parameter λ (lambda), which represents the average rate of occurrence.
4. Geometric Distribution: This distribution models the number of trials needed to achieve the
first success in a sequence of independent Bernoulli trials, where each trial has probability p of
success.
5. Hypergeometric Distribution: It models the probability of drawing a specific number of
successes from a finite population without replacement. Parameters include the population size,
the number of successes in the population, and the sample size.
6. Multinomial Distribution: This distribution generalizes the binomial distribution to situations
with more than two possible outcomes. It models the probabilities of observing a particular
combination of outcomes in a series of independent trials.
7. Negative Binomial Distribution: Similar to the geometric distribution, it models the number of
trials needed to achieve a specified number of successes in a sequence of independent Bernoulli
trials.

Understanding discrete probability distributions is essential in various fields, including statistics, finance, engineering, and social sciences. They provide a mathematical framework for modeling and analyzing uncertain and countable events, aiding in decision-making, risk assessment, and statistical inference.
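
A short sketch with scipy.stats (assuming SciPy is available; the parameter values are chosen only for illustration) evaluates the PMFs of the die-roll, binomial, and Poisson examples above and confirms that the probabilities over the support sum to 1.

from scipy import stats

# Fair six-sided die: discrete uniform distribution over 1..6
die = stats.randint(low=1, high=7)            # upper bound is exclusive
print([die.pmf(k) for k in range(1, 7)])      # each probability is 1/6

# Binomial: number of successes in n = 10 trials with p = 0.3
binom = stats.binom(n=10, p=0.3)
print("P(X = 3):", binom.pmf(3))
print("P(X <= 3):", binom.cdf(3))

# Poisson: number of events with average rate lambda = 4
pois = stats.poisson(mu=4)
print("P(X = 2):", pois.pmf(2))

# PMF values over the full support sum to 1 (up to floating-point rounding)
print(sum(binom.pmf(k) for k in range(11)))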

Continuous Probability Distribution


A continuous probability distribution is a statistical distribution that describes the probabilities
associated with the values of a continuous random variable. Continuous random variables can
take on any value within a specified range, and unlike discrete random variables, they have an
infinite number of possible outcomes. Continuous probability distributions are characterized by a
Probability Density Function (PDF), which represents the relative likelihood of the random
variable taking on a particular range of values. Here are some key characteristics and examples
of continuous probability distributions:

Characteristics of Continuous Probability Distributions:

1. Infinite Outcomes: Continuous random variables can take on an infinite number of values
within a specified interval. This interval is often denoted as [a, b], where 'a' and 'b' represent the
lower and upper bounds.
2. Probability Density Function (PDF): The PDF is the counterpart of the Probability Mass
Function (PMF) for discrete distributions. It defines the probability density or likelihood of the
random variable falling within a particular interval.
3. Probability Integration: Instead of summing probabilities as in discrete distributions,
probabilities in continuous distributions are calculated by integrating the PDF over a specified
range.
4. Probability at a Single Point: For continuous random variables, the probability of the variable
taking on a specific point value is technically zero. In other words, P(X = x) = 0 for any single
value 'x.'
5. Example: A common example of a continuous probability distribution is the normal distribution
(Gaussian distribution), which describes many natural phenomena and is characterized by its
bell-shaped curve.

Examples of Continuous Probability Distributions:

1. Normal Distribution: The normal distribution is widely used to model continuous data in
various fields, including finance, biology, and engineering. It is characterized by two parameters:
the mean (μ) and the standard deviation (σ).
2. Exponential Distribution: This distribution models the time between events in a Poisson
process, such as the arrival of customers at a service center or the decay of radioactive particles.
3. Uniform Distribution: The uniform distribution describes a situation where all values within a
specified interval are equally likely. It is often used for random sampling and simulation.
4. Lognormal Distribution: The lognormal distribution is used to model data that is skewed
positively and can be transformed into a normal distribution by taking the natural logarithm of
the data.
5. Weibull Distribution: Commonly used in reliability engineering, the Weibull distribution
models the time until failure of a product or system.
6. Gamma Distribution: The gamma distribution is versatile and used to model waiting times,
income distributions, and other continuous data.
7. Beta Distribution: The beta distribution is often used in Bayesian statistics to model the
probability distribution of a random variable constrained to a finite range, such as proportions or
probabilities.
8. Cauchy Distribution: The Cauchy distribution has heavy tails and is used in physics and
engineering for modeling certain physical phenomena.
9. Triangular Distribution: This distribution is often used when data is assumed to be bounded by
minimum and maximum values and follows a triangular shape.
10. Pareto Distribution: Commonly used in economics, the Pareto distribution
describes the distribution of income and wealth, where a small percentage of the population
holds the majority of resources.
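
A brief Python sketch (again assuming SciPy; the chosen parameters are illustrative) showing how interval probabilities are obtained from a continuous distribution through its PDF and CDF:

from scipy import stats

z = stats.norm(loc=0, scale=1)        # standard normal distribution

print(z.pdf(0.0))                     # density at x = 0 (a likelihood, not a probability)
print(z.cdf(1.96) - z.cdf(-1.96))     # P(-1.96 <= X <= 1.96), roughly 0.95

# For a continuous variable, P(X = x) is zero at any single point, so
# probabilities are always computed over intervals by integrating the PDF (i.e. via the CDF).

expo = stats.expon(scale=2.0)         # exponential with mean waiting time 2 (scale = 1/lambda)
print(expo.cdf(2.0))                  # P(waiting time <= 2)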

Continuous probability distributions are essential for modeling and analyzing real-world
phenomena, as they allow for the representation of continuous and uncountable data. They are
particularly valuable in fields where precision and accuracy in statistical analysis are crucial,
such as finance, engineering, and scientific research
Data Modeling

Data modeling is a fundamental process in data management, database design, and information
systems that involves creating a structured representation of data and its relationships to enable
efficient data storage, retrieval, and analysis. Here are elaborative notes on data modeling:

1. Purpose of Data Modeling:

 Data modeling serves several essential purposes:


 Data Organization: It organizes complex data into a structured format, making it easier
to manage and understand.
 Data Integrity: It enforces data integrity rules, ensuring the accuracy and consistency of
data.
 Data Retrieval: It helps optimize data retrieval by defining how data is stored and
accessed.
 Communication: It provides a visual representation of data structures, aiding
communication between stakeholders.
 Scalability: It allows for the scalability of databases as data volumes grow.

2. Types of Data Models:

 There are three main types of data models:


 Conceptual Data Model: This model represents the highest-level view of data and
relationships within an organization. It focuses on business concepts and relationships,
often using high-level diagrams like Entity-Relationship Diagrams (ERDs).
 Logical Data Model: This model defines data elements, their attributes, and the
relationships between them in detail. It is more concrete than a conceptual model and is
often used as a blueprint for database design.
 Physical Data Model: This model specifies how data is physically stored in a database
system. It includes details such as tables, columns, indexes, and storage structures. It is
used for database implementation.

3. Entities and Attributes:

 Data models typically involve entities (also called objects or concepts) and their attributes
(characteristics or properties).
 Entities are represented as tables in a relational database, and each row in a table represents an
instance of an entity.
 Attributes describe the properties or features of entities, and each attribute corresponds to a
column in a table.
4. Relationships:

 Data models depict relationships between entities. Common relationship types include one-to-
one, one-to-many, and many-to-many.
 Relationships are often represented graphically in Entity-Relationship Diagrams (ERDs) using
lines connecting related entities.
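
As an illustration of entities, attributes, and a one-to-many relationship, the following Python sketch (assuming the SQLAlchemy library; the Customer/Order tables are hypothetical examples, not part of any particular system) shows how a logical model can be turned into a physical one:

from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Customer(Base):                       # entity -> table
    __tablename__ = "customers"
    id = Column(Integer, primary_key=True)  # attribute -> column
    name = Column(String, nullable=False)
    orders = relationship("Order", back_populates="customer")

class Order(Base):                          # entity on the "many" side
    __tablename__ = "orders"
    id = Column(Integer, primary_key=True)
    amount = Column(Integer)
    customer_id = Column(Integer, ForeignKey("customers.id"))  # relationship key
    customer = relationship("Customer", back_populates="orders")

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)            # physical model: the DDL is generated here

Keeping orders in their own table linked by a foreign key, rather than repeating customer details on every order, is exactly the kind of redundancy removal that normalization (discussed next) formalizes.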

5. Normalization:

 Normalization is a process applied to data models to eliminate data redundancy and improve data
integrity.
 It involves organizing data into separate tables and linking them through relationships.
 The goal is to minimize data duplication and ensure that each piece of data is stored in only one
place in the database.

6. Data Modeling Tools:

 Various software tools are available to assist in creating and visualizing data models. These tools
include:
 ERD Tools: Specialized tools for creating Entity-Relationship Diagrams.
 Database Management Systems (DBMS): Many DBMSs offer built-in data modeling
capabilities.
 General-purpose Modeling Tools: Tools like Microsoft Visio and Lucidchart can be
used for data modeling.

7. Data Modeling Languages:

 Data models can be represented using standardized modeling languages, such as:
 Unified Modeling Language (UML): Widely used for modeling not only data but also
software systems.
 Data Modeling Notation (DMN): A specific notation for data modeling purposes.

8. Iterative Process:

 Data modeling is often an iterative process that evolves as project requirements become clearer.
 Models may undergo revisions and refinements as stakeholders' needs change.

9. Real-world Applications:

 Data modeling is essential in various domains, including database design, software development,
business analysis, and data warehousing.
 It plays a crucial role in industries such as finance, healthcare, retail, and manufacturing.

In conclusion, data modeling is a structured and systematic approach to representing data and its
relationships, ensuring data accuracy, consistency, and efficient management. It is a foundational
step in database design and information system development, facilitating effective data
organization and utilization in various applications and industries

Distribution Fitting:

Distribution fitting is a statistical technique used to identify and select the probability distribution
that best describes a given dataset. Here are short notes on distribution fitting:

1. Purpose: Distribution fitting is used to model the underlying data-generating process. It helps
statisticians and analysts understand the distribution of data and make predictions.
2. Types of Distributions: Common probability distributions used for fitting include normal
(Gaussian), exponential, Poisson, binomial, and many others. The choice of distribution depends
on the nature of the data.
3. Goodness-of-Fit Tests: Statistical tests, such as the Kolmogorov-Smirnov test or chi-square test,
are used to assess how well a chosen distribution fits the data. These tests compare observed data
with expected values from the distribution.
4. Parameters Estimation: For many distributions, there are parameters (e.g., mean and standard
deviation for a normal distribution) that need to be estimated from the data. Maximum likelihood
estimation is a common method for parameter estimation.
5. Visual Inspection: Data analysts often start by plotting the data and comparing it to the
probability distribution graphically. This helps identify deviations from the assumed distribution.
6. Real-world Applications: Distribution fitting is used in various fields, including finance (e.g.,
stock returns), biology (e.g., species population data), and quality control (e.g., defect counts).
7. Fitting Algorithms: Software tools such as R and Python (with libraries like SciPy and
statsmodels), as well as specialized statistical packages, provide functions and algorithms for
distribution fitting.
8. Limitations: Distribution fitting assumes that the data follows a specific distribution, which may
not always be the case. It's a simplifying assumption that may not capture complex data patterns.
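
A small Python sketch (assuming NumPy and SciPy, with simulated data standing in for real observations) of fitting a normal distribution by maximum likelihood and checking the fit with a Kolmogorov-Smirnov test:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=5, size=500)    # stand-in for observed data

# Maximum likelihood estimates of the parameters (mean and standard deviation).
mu_hat, sigma_hat = stats.norm.fit(data)
print(mu_hat, sigma_hat)

# Goodness-of-fit: KS test against the fitted normal distribution.
statistic, p_value = stats.kstest(data, "norm", args=(mu_hat, sigma_hat))
print(statistic, p_value)    # a large p-value means we cannot reject the fitted distribution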

In summary, data modeling and distribution fitting are essential techniques in data analysis and
statistics. Data modeling structures and simplifies data for analysis, while distribution fitting
helps identify the probability distribution that best describes the data, aiding in making informed
statistical inferences and predictions.

Random Sampling from Probability Distribution


Random sampling from a probability distribution is a fundamental concept in statistics and data
analysis. It involves selecting data points or values from a population or probability distribution
in a way that each possible value has a known or assigned probability of being selected. This
process allows statisticians and data analysts to draw conclusions, make inferences, and conduct
simulations. Here is an elaborative explanation of random sampling from a probability
distribution:

1. Probability Distribution Selection:


 Before performing random sampling, you need to choose an appropriate probability distribution
that models the data or phenomenon you are interested in. This choice depends on the nature of
the data and the problem at hand. Common distributions include the normal distribution,
exponential distribution, Poisson distribution, and more.

2. Probability Density Function (PDF) or Probability Mass Function (PMF):

 Each probability distribution is characterized by its Probability Density Function (PDF) for
continuous distributions or Probability Mass Function (PMF) for discrete distributions. The PDF
or PMF defines the probabilities associated with different values or ranges of values.

3. Population or Sample:

 You should identify whether you are dealing with a population (the entire set of data) or a
sample (a subset of the population). The methods may vary depending on this distinction.

4. Random Number Generation:

 To perform random sampling, you need a source of random numbers. Modern programming
languages and statistical software provide functions or libraries for generating random numbers
that follow a uniform distribution between 0 and 1.

5. Inverse Transform Method:

 The Inverse Transform Method is a common technique for random sampling from probability
distributions. It involves the following steps:
 Calculate the Cumulative Distribution Function (CDF) for the chosen distribution. The
CDF provides the cumulative probability up to a given value.
 Generate a random number from a uniform distribution between 0 and 1.
 Use the inverse of the CDF to map the random number to a specific value from the
chosen distribution.
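
A minimal Python sketch (assuming NumPy) of the Inverse Transform Method for the exponential distribution, whose inverse CDF is -ln(1 - u)/λ; the rate value is illustrative:

import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                                  # average rate of the exponential distribution

u = rng.uniform(0.0, 1.0, size=10_000)     # uniform random numbers between 0 and 1
samples = -np.log(1.0 - u) / lam           # map each u through the inverse CDF

print(samples.mean())                      # should be close to 1/lambda = 0.5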

6. Repeated Sampling:

 To obtain multiple random samples, repeat the random sampling process as many times as
needed. Each iteration generates one value or data point according to the chosen probability
distribution.

7. Sample Size:

 The sample size determines how many random values or data points you generate. The sample
size should be determined based on the requirements of your analysis or study.

8. Data Analysis and Interpretation:


 Once you have generated the random samples, you can perform various statistical analyses on
them. You can calculate sample statistics, construct confidence intervals, conduct hypothesis
tests, or simulate scenarios for decision-making.

9. Applications:

 Random sampling from probability distributions has wide-ranging applications in fields such as:
 Finance: For risk assessment and portfolio optimization.
 Manufacturing: To simulate production processes and quality control.
 Epidemiology: For disease modeling and forecasting.
 Engineering: In reliability analysis and system design.
 Social Sciences: To model and simulate various social phenomena.

10. Limitations: It's important to note that random sampling assumes that the chosen
probability distribution accurately represents the real-world phenomenon. If the distribution is
not a good fit, the results may not be representative or valid.

In summary, random sampling from a probability distribution is a powerful tool for generating
data points that follow a specific statistical distribution. It is a fundamental technique in statistics,
simulation, and various scientific and engineering fields, enabling data-driven decision-making
and analysis
UNIT – V: PREDICTIVE ANALYTICS

Introduction to Predictive Analytics: Karl Pearson Correlation Technique, Multiple Correlation, Spearman’s Rank Correlation.
Regression: Simple and Multiple Regression, Regression by the Method of Least Squares,
Building Good Regression Models, Regression with Categorical Independent Variables,
Linear Discriminant Analysis, One-Way and Two-Way ANOVA.

KARL PEARSON CORRELATION COEFFICIENT

Karl Pearson's correlation coefficient, often referred to as Pearson's r or simply the Pearson
correlation, is a statistical method used to measure the linear association between two continuous
variables. Here are some key points to note about Pearson's correlation technique:

1. Definition: Pearson's correlation coefficient is a statistic that quantifies the strength and direction
of the linear relationship between two continuous variables. It provides a value between -1 and 1,
where -1 indicates a perfect negative linear relationship, 1 indicates a perfect positive linear
relationship, and 0 indicates no linear relationship.
2. Formula: The formula for Pearson's correlation coefficient is given by:
r = (Σ((X - X̄)(Y - Ȳ))) / [√(Σ(X - X̄)²) * √(Σ(Y - Ȳ)²)]
Where:
 r is the Pearson correlation coefficient.
 X and Y are the individual data points.
 X̄ and Ȳ are the means of the X and Y data, respectively.
3. Interpretation:
 r = 1: Indicates a perfect positive linear relationship.
 r = 0: Suggests no linear relationship.
 r = -1: Suggests a perfect negative linear relationship.
 Values between -1 and 1 indicate the strength and direction of the linear relationship. The
closer the absolute value of r is to 1, the stronger the relationship.
4. Assumptions:
 Pearson's correlation assumes that both variables are normally distributed.
 It assumes that the relationship between the variables is linear.
 It is sensitive to outliers, so extreme data points can significantly affect the correlation
coefficient.
5. Use Cases:
 Pearson's correlation is commonly used in various fields, including statistics, economics,
social sciences, and more.
 It is used to explore and quantify relationships between variables, such as the correlation
between income and education, temperature and ice cream sales, or test scores and study
time.
6. Strengths:
 It provides a simple and intuitive measure of the linear relationship between two
variables.
 The coefficient is standardized, making it easier to compare the strength of relationships
across different studies or data sets.
 It is widely used and accepted in statistical analysis.
7. Limitations:
 Pearson's correlation only measures linear relationships, so it may not capture nonlinear
associations.
 It is sensitive to outliers and can be influenced by extreme values.
 It does not imply causation. A significant correlation does not necessarily mean that one
variable causes the other.
8. Statistical Significance:
 Hypothesis testing can be used to determine if the calculated correlation coefficient is
statistically significant. This involves testing the null hypothesis that there is no
correlation in the population.
9. Alternative Correlation Coefficients:
 If the assumptions of Pearson's correlation are not met, other correlation coefficients, like
Spearman's rank correlation or Kendall's tau, can be used to assess relationships between
variables.

Pearson's correlation is a valuable tool for analyzing and understanding the linear relationships
between continuous variables. However, it is important to be aware of its assumptions and
limitations when applying it to real-world data
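
A short Python sketch (assuming NumPy and SciPy; the hours/score numbers are made-up illustrative data) of computing Pearson's r both with scipy.stats.pearsonr and directly from the definition:

import numpy as np
from scipy import stats

hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)
score = np.array([52, 55, 61, 64, 70, 74], dtype=float)

r, p_value = stats.pearsonr(hours, score)
print(r, p_value)   # r close to +1 indicates a strong positive linear relationship

# The same coefficient from the definition: covariance / (std_x * std_y).
r_manual = np.cov(hours, score, ddof=1)[0, 1] / (hours.std(ddof=1) * score.std(ddof=1))
print(r_manual)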

Multiple correlation, in statistics, quantifies the linear relationship between a single dependent
variable and multiple independent variables. It essentially tells you how well you can predict the
dependent variable based on a linear combination of the independent variables

1. Types of Correlation:
 Positive Correlation: Both variables move in the same direction.
 Negative Correlation: Variables move in opposite directions.
 No Correlation: There is no discernible relationship.
2. Correlation Coefficient:
 The correlation coefficient quantifies the strength and direction of the relationship.
 Common coefficients include Pearson's correlation coefficient (for linear relationships)
and Spearman's rank correlation coefficient (for monotonic relationships).
3. Limitations:
 Correlation does not imply causation.
 Outliers can influence results.
 Non-linear relationships may not be captured.

Multiple Correlation:

1. Definition:
 Multiple correlation involves examining the relationship between a dependent variable
and two or more independent variables simultaneously.
2. Multiple Correlation Coefficient:
 The multiple correlation coefficient (R) measures the strength and direction of the linear
relationship between the dependent variable and a combination of independent variables.
3. Regression Analysis:
 Often used in conjunction with regression analysis, where the goal is to predict the
dependent variable using a linear combination of independent variables.
4. Interpretation:
 R² (coefficient of determination) indicates the proportion of variance in the dependent
variable explained by the independent variables.
5. Assumptions:
 Linearity: The relationships are assumed to be linear.
 Independence: Observations should be independent.
 Homoscedasticity: The variance of errors should be constant

Applications:

 Regression analysis: Multiple correlation forms the basis for linear regression, where you predict
a continuous dependent variable based on multiple independent variables.
 Multivariate analysis: It can be used in various multivariate techniques like factor analysis and
principal component analysis to understand relationships between multiple variables.
 Scientific research: It's widely used in various fields to investigate the relationships between
different factors influencing a phenomenon.
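
A brief Python sketch (assuming NumPy and statsmodels, with simulated data) showing that the multiple correlation coefficient R can be obtained as the square root of R² from an ordinary least squares fit:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=100)

X = sm.add_constant(np.column_stack([x1, x2]))   # intercept plus two predictors
model = sm.OLS(y, X).fit()

print(model.rsquared)            # proportion of variance explained (R^2)
print(np.sqrt(model.rsquared))   # multiple correlation coefficient R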

Spearman’s Rank correlation

Spearman's rank correlation, often referred to as Spearman's rho (ρ), is a non-parametric


statistical method used to assess the strength and direction of the monotonic relationship between
two variables. It's particularly useful when dealing with non-normally distributed data or data
that does not have a linear relationship. Here are comprehensive notes on Spearman's rank
correlation:

1. Definition:
 Spearman's rank correlation is a non-parametric measure of association that quantifies the degree
and direction of the monotonic relationship between two variables.
 It is often used when the data does not meet the assumptions of parametric tests, such as the
assumption of a linear relationship or normally distributed data.

2. Formula:

 To compute Spearman's rank correlation, follow these steps:


 Rank the values of both variables separately.
 Calculate the differences between the ranks of each pair of corresponding values.
 Square these differences.
 Sum the squared differences.
 Use the formula: ρ = 1 - (6Σd²) / (n(n² - 1))
 ρ represents the Spearman's rank correlation coefficient.
 d represents the differences in ranks.
 n is the number of data points.

3. Interpretation:

 Spearman's ρ ranges from -1 to 1, where -1 indicates a perfect negative monotonic relationship, 1


indicates a perfect positive monotonic relationship, and 0 indicates no monotonic relationship.
 Positive ρ values suggest that as one variable increases, the other tends to increase, and vice
versa for negative ρ values.

4. Use Cases:

 Spearman's rank correlation is used when working with data that doesn't meet the assumptions of
parametric methods.
 It is valuable for assessing relationships between ordinal, ranked, or non-normally distributed
data.
 Common applications include analyzing the relationship between customer preferences, exam
scores, or performance rankings.

5. Advantages:

 Robust to outliers: It is less influenced by extreme values, making it suitable for data with
outliers.
 No distributional assumptions: Unlike parametric tests, Spearman's rank correlation does not
assume the normality of data.
 Handles ordinal data: It is appropriate for data with ranked or ordered categories.

6. Limitations:

 Limited sensitivity: It may not be as sensitive as parametric tests to subtle monotonic


relationships.
 Does not capture linearity: It does not detect linear relationships.
 Assumes monotonicity: Spearman's correlation assumes that the relationship is strictly
monotonic, which may not always be the case.

7. Statistical Significance:

 Hypothesis testing can be used to determine if the calculated Spearman's ρ is statistically


significant. This involves testing the null hypothesis that there is no monotonic relationship in
the population.

8. Software and Tools:

 Statistical software packages such as R and Python (with libraries like SciPy or statsmodels),
as well as dedicated statistical software such as SPSS and SAS, can be used to calculate
Spearman's rank correlation.
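
A minimal Python sketch (assuming SciPy; the two rankings are invented for illustration) of computing Spearman's ρ for, say, two judges ranking the same eight items:

from scipy import stats

judge_a = [1, 2, 3, 4, 5, 6, 7, 8]
judge_b = [2, 1, 4, 3, 6, 5, 8, 7]

rho, p_value = stats.spearmanr(judge_a, judge_b)
print(rho, p_value)   # rho near +1 means the two rankings agree closely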

Spearman's rank correlation is a valuable tool when dealing with data that doesn't conform to the
assumptions of parametric tests. It provides a non-parametric measure of association that is
robust, especially in the presence of outliers or non-normally distributed data, making it widely
used in various fields for assessing relationships between variables

What is Regression Analysis?

Regression analysis is a set of statistical methods used for the estimation of relationships
between a dependent variable and one or more independent variables. It can be utilized to assess
the strength of the relationship between variables and for modeling the future relationship
between them.

Multiple Regression Definition

Multiple regression analysis is a statistical technique that analyzes the relationship between two
or more variables and uses the information to estimate the value of the dependent variables. In
multiple regression, the objective is to develop a model that describes a dependent variable y to
more than one independent variable.
Multiple Regression Formula

In linear regression, there is only one independent and dependent variable involved. But, in the
case of multiple regression, there will be a set of independent variables that helps us to explain
better or predict the dependent variable y.

The multiple regression equation is given by

y = a + b1x1 + b2x2 + …… + bkxk

where x1, x2, …, xk are the k independent variables and y is the dependent variable.
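
A short Python sketch (assuming pandas and statsmodels; the sales/adspend/price figures are invented illustrative data) of estimating the coefficients a, b1 and b2 of a multiple regression equation:

import pandas as pd
import statsmodels.formula.api as smf

data = pd.DataFrame({
    "sales":   [20, 25, 34, 30, 40, 43, 50, 54],   # dependent variable y
    "adspend": [2, 3, 5, 4, 6, 7, 8, 9],           # independent variable x1
    "price":   [10, 10, 9, 9, 8, 8, 7, 7],         # independent variable x2
})

model = smf.ols("sales ~ adspend + price", data=data).fit()
print(model.params)     # a (Intercept), b1, b2
print(model.rsquared)   # proportion of variation in y explained by x1 and x2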

Stepwise Multiple Regression

Stepwise regression is a step-by-step process that begins by developing a regression model with
a single predictor variable and then adds or deletes predictor variables one step at a time.
Stepwise multiple regression is the method to determine a regression equation that begins with a
single independent variable and adds independent variables one by one. The stepwise multiple
regression method is also known as the forward selection method because we begin with no
independent variables and add one independent variable to the regression equation at each of the
iterations. There is another method called the backward elimination method, which begins with
an entire set of variables and eliminates one independent variable at each of the iterations.

Residual: The variations in the dependent variable explained by the regression model are called
residual or error variation. It is also known as random error or sometimes just “error”. This is a
random error due to different sampling methods

Advantages of Stepwise Multiple Regression

 Only independent variables with non-zero regression coefficients are included in the
regression equation.
 The changes in the multiple standard errors of estimate and the coefficient of
determination are shown.
 The stepwise multiple regression is efficient in finding the regression equation with only
significant regression coefficients.
 The steps involved in developing the regression equation are clear.

Least Square Method Definition


The least-squares method is a crucial statistical method that is practiced to find a regression line
or a best-fit line for the given pattern. This method is described by an equation with specific
parameters. The method of least squares is generously used in evaluation and regression. In
regression analysis, this method is said to be a standard approach for the approximation of sets of
equations having more equations than the number of unknowns.

The method of least squares defines the solution for the minimization of the sum of squares of
deviations or the errors in the result of each equation. Find the formula for the sum of squares of
errors, which helps to find the variation in observed data.

The least-squares method is often applied in data fitting. The best-fit result is assumed to reduce
the sum of squared errors or residuals which are stated to be the differences between the
observed or experimental value and the corresponding fitted value given in the model.

There are two basic categories of least-squares problems:

 Ordinary or linear least squares


 Nonlinear least squares
These depend upon the linearity or nonlinearity of the residuals. The linear problems are often
seen in regression analysis in statistics. On the other hand, the non-linear problems are generally
used in the iterative method of refinement in which the model is approximated to the linear one
with each iteration.

Least Square Method Graph

In linear regression, the line of best fit is a straight line drawn through the scatter of data points.
The line is chosen by minimizing the residuals, i.e. the offsets of each point from the line. Vertical
offsets are generally used in surface, polynomial, and hyperplane fitting problems, while
perpendicular offsets are utilized in common practice.
Least Square Method Formula

The least-square method states that the curve that best fits a given set of observations, is said to
be a curve having a minimum sum of the squared residuals (or deviations or errors) from the
given data points. Let us assume that the given points of data are (x1, y1), (x2, y2), (x3, y3), …,
(xn, yn) in which all x’s are independent variables, while all y’s are dependent ones. Also,
suppose that f(x) is the fitting curve and d represents the error or deviation from each given
point.

Now, we can write:

d1 = y1 − f(x1)

d2 = y2 − f(x2)

d3 = y3 − f(x3)

…..

dn = yn – f(xn)

The least-squares explain that the curve that best fits is represented by the property that the sum
of squares of all the deviations from given values must be minimum, i.e:

Sum = Minimum Quantity

Suppose we have to determine the equation of the line of best fit for the given data, then we first
use the following formula.

The equation of least square line is given by Y = a + bX

The normal equation for ‘a’:


∑Y = na + b∑X

The normal equation for ‘b’:

∑XY = a∑X + b∑X²

By solving these two normal equations for a and b, we obtain the required trend line equation.

Thus, we get the line of best fit in the form Y = a + bX.
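
A minimal Python sketch (assuming NumPy; the x and y values are illustrative) of solving the two normal equations for a and b and cross-checking the result against numpy.polyfit:

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8], dtype=float)

n = len(x)
# Normal equations:  sum(Y) = n*a + b*sum(X)   and   sum(XY) = a*sum(X) + b*sum(X^2)
A = np.array([[n, x.sum()], [x.sum(), (x**2).sum()]])
rhs = np.array([y.sum(), (x * y).sum()])
a, b = np.linalg.solve(A, rhs)
print(a, b)                       # intercept a and slope b of Y = a + bX

slope, intercept = np.polyfit(x, y, deg=1)
print(intercept, slope)           # should match a and b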

Building Good Regression Models.


Building good regression models is an iterative process that requires careful consideration of
various factors. Here's a roadmap to guide you through the process:

1. Understand your data:

 What is the problem you're trying to solve?


 What data do you have available, and what is its quality?
 What is the relationship between the independent and dependent variables (linear, non-
linear, etc.)?

2. Data preparation:

 Clean and pre-process your data: handle missing values, outliers, and scaling.
 Explore and visualize your data to understand its distribution and potential relationships.

3. Model selection:

 Choose the appropriate regression model based on your data and the research question. Some
common choices include:
o Linear regression: for modeling linear relationships.
o Logistic regression: for modeling binary outcomes (e.g., yes/no).
o Decision trees: for non-linear relationships and complex interactions.
o Random forest: an ensemble method combining multiple decision trees for improved accuracy.

4. Model training and evaluation:

 Train your model on a subset of your data.


 Evaluate the model's performance on a separate hold-out dataset using metrics like:
o Mean squared error (MSE): for continuous variables, measures the average squared difference
between predicted and actual values.
o R-squared: for continuous variables, indicates the proportion of variance in the dependent
variable explained by the model.
o Accuracy: for classification tasks, measures the percentage of correct predictions.

5. Model improvement:

 Analyze the model's residuals and identify potential areas for improvement.
 Try different model configurations (e.g., different features, hyperparameters) and compare their
performance.
 Regularize your model to prevent overfitting (e.g., L1 or L2 regularization).

6. Model interpretation and deployment:

 Interpret the model coefficients to understand the relationships between variables.


 Validate the model's generalizability on unseen data.
 Deploy the model for prediction or decision-making purposes.
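
A compact Python sketch (assuming scikit-learn and NumPy, with simulated data) of the train-and-evaluate part of this roadmap: split the data, fit a linear regression, and report MSE and R-squared on a hold-out set:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))     # three candidate features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print(mean_squared_error(y_test, y_pred))   # MSE on the hold-out set
print(r2_score(y_test, y_pred))             # R-squared on the hold-out set

Evaluating on a hold-out set rather than on the training data is what reveals overfitting; regularized variants (e.g. ridge or lasso) can then be swapped in if the gap between training and test performance is large.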

Categorical independent variables:


Categorical independent variables, also known as categorical predictors or factors, are variables
that can take on values from a limited, often fixed, set of categories. These variables are
qualitative and do not have a natural order or numerical interpretation. Categorical variables can
be broadly classified into two types: nominal and ordinal.

1. Nominal Variables:
 Nominal variables represent categories with no inherent order or ranking. Examples
include:
 Gender: Male, Female, Non-Binary
 Color: Red, Blue, Green
 Country: USA, Canada, India
When dealing with nominal variables in regression analysis, dummy coding or one-hot encoding
is commonly used to convert these categories into numerical values that can be used in
regression models.
2. Ordinal Variables:
 Ordinal variables, on the other hand, have a meaningful order or ranking among their
categories, but the intervals between the categories are not uniform. Examples include:
 Education Level: High School, Bachelor's, Master's, PhD
 Income Bracket: Low, Medium, High
Ordinal variables can be treated similarly to nominal variables, but you might also consider
ordinal coding, where numerical values are assigned based on the order of the categories.

In regression analysis, it's important to handle categorical variables appropriately, as their


inclusion directly in a model may lead to misinterpretation or incorrect results. Here are some
considerations:

 Dummy Coding (One-Hot Encoding):


 For nominal variables, one common approach is to use dummy coding. This involves
creating binary (0/1) variables for each category. This way, the presence or absence of a
category is represented without implying any ordinal relationship.
 Ordinal Coding:
 For ordinal variables, you might use ordinal coding, where numerical values are assigned
based on the order of the categories. However, this assumes that the intervals between the
categories are equal, which may not always be the case.
 Interaction Terms:
 When working with multiple categorical variables, you may want to explore interaction
terms to account for potential interactions between them.
 Statistical Software Handling:
 Many statistical software packages, like Python's scikit-learn or R, handle categorical
variables automatically. They internally perform the necessary encoding, and you need
not manually create dummy variables.
 Be Mindful of Interpretation:
 Interpretation of coefficients for categorical variables depends on the coding scheme
used. Always be mindful of the chosen coding method when interpreting the results.
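
A short Python sketch (assuming pandas; the color/education values are illustrative) of dummy coding a nominal variable and ordinal coding an ordered one before regression:

import pandas as pd

df = pd.DataFrame({
    "color": ["Red", "Blue", "Green", "Blue"],                       # nominal
    "education": ["High School", "Bachelor's", "PhD", "Master's"],   # ordinal
})

# One-hot (dummy) encoding for the nominal variable; drop_first avoids the
# dummy-variable trap in linear models.
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)

# Ordinal coding: map the ordered categories to integers.
order = {"High School": 1, "Bachelor's": 2, "Master's": 3, "PhD": 4}
df["education_level"] = df["education"].map(order)

print(pd.concat([df, dummies], axis=1))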

Handling categorical variables appropriately is crucial for building accurate and interpretable
regression models. The choice of encoding method depends on the nature of the data and the
assumptions of the model being used

Linear Discriminant Analysis

Linear Discriminant analysis is one of the most popular dimensionality reduction techniques
used for supervised classification problems in machine learning. It is also considered a pre-
processing step for modeling differences in ML and applications of pattern classification

Whenever there is a requirement to separate two or more classes having multiple features
efficiently, the Linear Discriminant Analysis model is considered the most common technique to
solve such classification problems. For example, if we have two classes with multiple features and
try to classify them using a single feature, the classes may overlap.
To overcome this overlapping issue in the classification process, we must take additional features
into account.

Example:

Let's assume we have to classify two different classes, each having its own set of data points in a
2-dimensional (X-Y) plane.

However, it may be impossible to draw a single straight line in the 2-D plane that separates these
data points efficiently. Using Linear Discriminant Analysis, we can reduce the 2-D plane to a 1-D
axis; this technique also lets us maximize the separability between multiple classes.
How does Linear Discriminant Analysis (LDA) work?

Linear Discriminant analysis is used as a dimensionality reduction technique in machine


learning, using which we can easily transform a 2-D and 3-D graph into a 1-dimensional plane.

Let's consider an example where we have two classes in a 2-D plane with an X-Y axis, and we
need to classify them efficiently. As we have already seen in the above example, LDA enables us
to find a straight line that can completely separate the two classes of data points. Here, LDA uses
the X-Y axes to create a new axis, separating the classes with a straight line and projecting the
data onto this new axis.

Hence, we can maximize the separation between these classes and reduce the 2-D plane into 1-D.

o It maximizes the distance between means of two classes.


o It minimizes the variance within the individual class.

Using the above two conditions, LDA generates a new axis in such a way that it can maximize
the distance between the means of the two classes and minimize the variation within each class.

In other words, we can say that the new axis will increase the separation between the data points
of the two classes and plot them onto the new axis.

Why LDA?
o Logistic Regression is one of the most popular classification algorithms that performs
well for binary classification but falls short in the case of multiple classification problems
with well-separated classes. At the same time, LDA handles these quite efficiently.
o LDA can also be used in data pre-processing to reduce the number of features, just as
PCA, which reduces the computing cost significantly.
o LDA is also used in face detection algorithms. In Fisherfaces, LDA is used to extract
useful data from different faces. Coupled with eigenfaces, it produces effective results.

Drawbacks of Linear Discriminant Analysis (LDA)

LDA is specifically used to solve supervised classification problems for two or more classes,
which is not always possible using logistic regression in machine learning. However, LDA fails
in some cases where the means of the class distributions are shared, since it then cannot create a
new axis that makes the classes linearly separable.
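
A minimal Python sketch (assuming scikit-learn) of LDA used both for supervised dimensionality reduction and for classification, on the library's built-in iris dataset:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)   # project 4 features onto 2 discriminant axes

print(X_lda.shape)                # (150, 2)
print(lda.score(X, y))            # classification accuracy on the training data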

ANOVA
A key statistical test in research fields including biology, economics, and psychology, analysis of
variance (ANOVA) is very useful for analyzing datasets. It allows comparisons to be made
between three or more groups of data. Here, we summarize the key differences between these
two tests, including the assumptions and hypotheses that must be made about each type of test.

Two types of ANOVA are commonly used,


The one-way ANOVA
Two-way ANOVA.
A one-way ANOVA is a type of statistical test that compares the variance in the group means
within a sample whilst considering only one independent variable or factor. It is a hypothesis-
based test, meaning that it aims to evaluate multiple mutually exclusive theories about our data.
Before we can generate a hypothesis, we need to have a question about our data that we want an
answer to. For example, adventurous researchers studying a population of walruses might ask
“Do our walruses weigh more in early or late mating season?” Here, the independent variable or
factor (the two terms mean the same thing) is the “month of mating season”. In an ANOVA, our
independent variables are organized in categorical groups. For example, if the researchers looked
at walrus weight in December, January, February, and March, there would be four months
analyzed, and therefore four groups to the analysis.

A one-way ANOVA compares three or more than three categorical groups to establish whether
there is a difference between them. Within each group, there should be three or more
observations (here, this means walruses), and the means of the samples are compared.

Hypothesis of a one-way ANOVA


In a one-way ANOVA, there are two possible hypotheses.

 The null hypothesis (H0) is that there is no difference between the groups and equality
between means (walruses weigh the same in different months).
 The alternative hypothesis (H1) is that there is a difference between the means and
groups (walruses have different weights in different months)

Assumptions and Limitations of One -way Anova

 Normality – that each sample is taken from a normally distributed population


 Sample independence – that each sample has been drawn independently of the other
samples
 Variance equality – that the variance of data in the different groups should be the same
 Your dependent variable – here, “weight”, should be continuous – that is, measured on a
scale that can be subdivided using increments (i.e. grams, milligrams)
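
A short Python sketch (assuming SciPy; the walrus weights below are invented for illustration) of a one-way ANOVA comparing weights across three month groups:

from scipy import stats

december = [1250, 1300, 1280, 1320, 1290]
january  = [1310, 1350, 1330, 1370, 1340]
february = [1400, 1380, 1420, 1390, 1410]

f_statistic, p_value = stats.f_oneway(december, january, february)
print(f_statistic, p_value)   # a small p-value -> reject H0 that all group means are equal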

TWO -WAY ANOVA

A two-way ANOVA is, like a one-way ANOVA, a hypothesis-based test. However, in the two-
way ANOVA each sample is defined in two ways, and resultingly put into two categorical
groups. Thinking again of our walruses, researchers might use a two-way ANOVA if their
question is: “Are walruses heavier in early or late mating season and does that depend on the sex
of the walrus?” In this example, both “month in mating season” and “sex of walrus” are factors –
meaning in total, there are two factors. Once again, the data for each factor are organized into categorical groups.

The two-way ANOVA therefore examines the effect of two factors (month and sex) on a
dependent variable – in this case, weight, and also examines whether the two factors affect each
other to influence the continuous variable.

What are the assumptions and limitations of a two-way ANOVA?

 Your dependent variable – here, “weight”, should be continuous – that is, measured on a
scale that can be subdivided using increments (i.e. grams, milligrams)
 Your two independent variables – here, “month” and “sex”, should be in categorical,
independent groups.
 Sample independence – that each sample has been drawn independently of the other
samples
 Variance Equality – That the variance of data in the different groups should be the same
 Normality – That each sample is taken from a normally distributed population

Hypotheses of a two-way ANOVA

Because the two-way ANOVA considers the effect of two categorical factors, and the effect of
the categorical factors on each other, there are three pairs of null or alternative hypotheses for the
two-way ANOVA. Here, we present them for our walrus experiment, where the month of mating
season and sex are the two independent variables.

 H0: The means of all month groups are equal


 H1: The mean of at least one-month group is different

 H0: The means of the sex groups are equal


 H1: The means of the sex groups are different

 H0: There is no interaction between month and sex


 H1: There is an interaction between month and sex
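
A brief Python sketch (assuming pandas and statsmodels; the weights are invented for illustration) of a two-way ANOVA with an interaction term between month and sex:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "weight": [1250, 1300, 1280, 1400, 1380, 1420,
               900,  950,  930,  1010, 1040, 1020],
    "month":  ["early", "early", "early", "late", "late", "late"] * 2,
    "sex":    ["male"] * 6 + ["female"] * 6,
})

model = ols("weight ~ C(month) * C(sex)", data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)   # rows for C(month), C(sex), C(month):C(sex), and Residual

The three F-tests in the table correspond directly to the three pairs of hypotheses above: the main effect of month, the main effect of sex, and their interaction.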
